Pages

Ads 468x60px

Labels

Thursday, 26 June 2008

Lessons From CMMI (A Software Process Model) For BI Practitioners

Hexaware successfully completed the CMMI Level 5 re-certification recently with KPMG auditing and certifying the company’s software process to be in line with Version 1.2 of the model. This is the highest level in the Capability Maturity Model Integration model developed by Software Engineering Institute in collaboration with Carnegie Mellon. For the uninitiated, Capability Maturity Model Integration (CMMI) is a process improvement approach that provides organizations with essential elements of effective process.
Now, what has CMMI got to do with Business Intelligence?
I participated in the re-certification audit as one of the project managers and I learnt some lessons which I think would be useful for all of us as BI practitioners. The CMMI model has 22 different process areas covering close to 420 odd specific practices. Though the specifics are daunting, the ultimate goal of the model is simple to understand and there-in lies our lesson.
In the CMMI model, Maturity Levels 2 and 3 act as building blocks in creating the process infrastructure to ensure that the higher maturity levels are achievable and sustainable.The high-maturity practices (Levels 4 and 5) of the model focus on:
1) Establish Quantitative Goals in line with the business objectives
2) Measure the performance with respect to the goals using statistical tools
3) Take corrective action to bring the performance in line with the goals
4) Measure again to ensure that the action taken has contributed positively to performance improvement.
Key Lessons for BI Practitioners:
1) Single-minded focus to “close the loop” – CMMI model evaluates every project management action in the context of project goals and measures them quantitatively. Business Intelligence, ideally, should measure all actions in the context of business goals and provide the facility to compare metrics before and after the decision implementation.
2) Strong information infrastructure – Higher levels of maturity in CMMI are sustainable only if the lower maturity levels are strongly established. In the BI context, this translates to a robust architecture that makes measurements possible
3) Accuracy + Precision is the key – Controlling variation (sustainability) is as important as hitting your targets. BI in organizations is weak along the sustainability dimension. For instance, enterprises do have analytics around “How am I doing now” but not much on questions like a) How long will this growth continue? b) When will we get out of this declining trend? etc.
In a way, this post is related to one of my earlier blog on BI and Sig Sigma with the central idea being that, for enterprises to be analytics driven both numbers and processes behind those numbers are equally important. CMMI Model, in its simplest form, also has that as its core theme for achieving high process maturity in an organization.
Thanks for reading and please do share your thoughts. Read More About : CMMI

Thursday, 19 June 2008

Data Integration Challenge – Carrying Id-Date Fields from Source


Ids in Source System: Sometimes we would have been in a dilemma to decide whether to carry the identity (id) fields from source system as identity fields into the data warehouse as well. There is couple of situations which would push us to this state1.The business users are familiar with the product ids like 1211 , 1212 than by the product name it self and they need them in the target system

2.Why should we create an additional id field in the target table when I can have a unique identity for each record from the source system
What are Opensource Id fields, they are usually the surrogate keys or unique record keys like product id, customer id etc which the business might be more familiar than with their descriptions, descriptions of these ids are more found on report printouts. In general most of the source id fields would get mapped to the dimension tables.
Here are the reasons why we should carry the source id fields as it is
  1. The business is comfortable talking and analyzing in terms of ids than descriptions
  2. Having source ids fields which are usually numeric or if not of smaller length is very much lookup friendly, using ids for lookup or filter or join conditions when pulling data from source systems is much better than descriptions
  3. Source id fields enables linking of the data from the data warehouse to the source system
  4. Just consider the id as another attribute to the dimension and not as a unique identifier
Here are the reasons why we should create additional id (key) field in addition to the source id field
  1. Avoiding duplicate keys if the data to be sourced from multiple systems
  2. The source ids can merge, split, anything can happen, we would want to avoid the dependency on the source system
  3. The id field created in the data warehouse would be index friendly
  4. Having a unique id for each record in the data warehouse would help in determining the number of unique records in a much easier way
Dates in Source System: One other field that we usually confuse is the date field in the source systems. The date field present in the source record might provide the information when the record arrived in the source system and a date field generated in the data warehouse system would provide the information when a source record arrived into the data warehouse.
The data warehouse record date and the source record date can be same if the source record gets moved into the data warehouse the same day, certainly both date fields might represent different date values if there is a delay in the source data arrival into the data warehouse.
Why we need to store Source Date in the data warehouse, this need is very clear, we always perform date based analysis based on the arrival of source record in the source systems.
Why we need to generate a Data Warehouse Date, capturing the arrival of the record into the data warehouse answers queries related to audit, data growth and as well to determine what new records arrived into the warehouse which is especially useful for providing incremental extracts for downstream marts or other systems.

Read Brief About: Data Integration Challenge 

Monday, 9 June 2008

Hybrid OLAP – The Future of Information Delivery

As I get to see more Enterprise BI initiatives, it is becoming increasingly clear (atleast to me!) that when it comes to information dissemination, Hybrid Online Analytical Processing (HOLAP) is the way to go. Let me explain my position here.
As you might be aware, Relational (ROLAP), Multi-dimensional (MOLAP) and Hybrid OLAP (HOLAP) are the 3 modes of information delivery for BI systems. In an ROLAP environment, the data is stored in a relational structure and is accessed through a semantic layer (usually!). MOLAP on the other hand stores data in proprietary format providing the notion of a multi-dimensional cube to users. HOLAP combines the power of both ROLAP and MOLAP systems and with the rapid improvements made by BI tool vendors, seems to have finally arrived on the scene.
In my mind, the argument for subscribing to the HOLAP paradigm goes back to the “classic” article
by Ralph Kimball on different types of fact table grains. According to him, there are 3 types of fact tables – Transaction grained, Periodic snapshot, Accumulating snapshot and that atleast 2 of them are required to model a business situation completely. From an analytical standpoint, this means that operational data has to be analyzed along with summarized data (snapshots) for business users to take informed decisions.
Traditionally, the BI world has handled this problem in 2 ways:
1) Build everything on the ROLAP architecture. Handle the summarization either on the fly or thro’ summarized reporting tables at the database level. This is not a very elegant solution as everybody in the organization (even those analysts working with summarized information) gets penalized for the slow performance of SQL queries issued against the relational database through the semantic layer.
2) Profile users and segregate operational analysts from strategic analysts. Operational users are provided ROLAP tools while business users working primarily with summarized information are provided their “own” cubes (MOLAP) for high-performance analytics.
Both solutions are rapidly becoming passé. In many organizations now, business users wants to look at summarized information and based on what they see, needs the facility to drill down to granular level information. A good example is the case of analyzing Ledger information (Income statement & Balance Sheet) and then drilling down to Journal entries as required. All this drilling down has to happen through a common interface – either an independent BI Tool or an enterprise portal with an underlying OLAP engine.
This is the world of HOLAP and it is here to stay. The technology improvement that is making this possible is the relatively new wonder-kid, XMLA (XML for Analysis). More about XMLA in my subsequent posts.
As an example of HOLAP architecture, you can take a look at this link
to understand the integration of Essbase cubes (MOLAP at its best) with OBIEE (Siebel Analytics – ROLAP platform) to provide a common semantic model for end-user analytics.
Information Nugget: If you are interested in Oracle Business Intelligence, please do stop by at http://www.rittmanmead.com/blog/ to read his blogs. The articles are very informative and thoroughly practical.
Thanks for reading. Please do share your thoughts.
Read More Hybrid OLAP

Monday, 2 June 2008

Data Integration Challenge – Parent-Child Record Sets, Child Updates

There are certain special set of records like Loan & its Guarantor details in a banking system, each Loan record can have one or more Guarantor record. In a similar way for a services based industry Contracts & its contract Components exist, these sets can be called as parent-child records where in for one parent record like Loan we might have zero to many child records of Guarantor.
During data modeling we would have one table for the parent level record and its attribute, another separate table for the child records and its attributes.
As part of the data load process, have seen situations where a complete refresh (delete & insert) of the Child records is required whenever there is a change in certain attributes of a parent record. This requirement can be implemented in different ways; here we would look at one of the best ways to get this accomplished.
The following steps would be involved in the ETL process
  1. Read the parent-child record
  2. Determine if a change in the incoming parent record
  3. If a change has occurred then issue a delete to the particular set of child records
  4. Write corresponding incoming new child records into a flat file
  5. Once step 1 to 4 is completed for all parent records have another ETL flow that would bulk load the records from the flat file to the child table
We didn’t issue an insert with a new incoming child record after the delete because the deleted record wouldn’t have got committed and an insert can lock the table. We can issue a commit after every delete and then follow it with an insert but having a commit after each delete would be costlier, writing the inserts to the files handles this situation perfectly.
Also an option to insert first with a different key and then delete the older records would be costlier in terms of locating the records that needs to the deleted.
We could have also looked at the option of updating the records in place of deletion then we would at times end up having dead records in the child tables; the records that have been deleted in the source would still exist in the target child table, also updating a record can disturb contagious memory, deletion and insert would have the pages intact.
Read More about  Data Integration