Pages

Ads 468x60px

Labels

Tuesday, 22 July 2008

Business Intelligence Landscape Documentation – The Gold Copy

Software systems, in general, need good comprehensive documentation. This need becomes a “must-have” in case of BI systems, as these systems evolve over a period of time. My recommendation (a best practice, if you will) is to create a “Gold Copy” documentation of the enterprise BI landscape. The “Gold Copy” is distinct from the individual project documentation (this would continue to exist) and is updated with appropriate version controls over a period of time.
The “Gold Copy” would comprise of the following documents related to the Business Intelligence environment:
1) Architecture and Environment
This document has two broad sections. The first section explores the physical architecture and then attempts to explore each of the architectural components in more detail. The second section explores the environmental and infrastructure requirements for development, production and operations.
2) Data Sources
This document explores the various source systems from which data is acquired by the datawarehouse. The document attempts to provide a layman’s view of what data is being extracted and maps it to the source system data structures where they data originate from.
3) Data Models
This document builds on the two previous documents. The document attempts to map the source data onto the datawarehouse and the data mart models and provides a high-level picture of the subject orientation.
4) Reporting and OLAP
The document takes a closer look at information delivery processes and technology to the end user from the datawarehouse. The initial section explores the architectural components by way of middleware server software and hardware as well as the end user desktop tools and utilities used for this purpose. The second section looks at some of the more important reports and describes the purpose of each of them.
5) Process
This document takes a process view of data movement within the enterprise datawarehouse (EDW) starting from extraction, staging, loading the EDW and subsequently the data mart. It explores the various related aspects like mode-of-extraction, extract strategy, control-structures, re-start and recovery. The document also looks at naming conventions followed for all objects within the datawarehouse environment.
The above set of documents cover the BI landscape with its focus on 3 critical themes – Architecture Track, Data Track and Process Track. Each of these tracks have a suggested reading sequence of above mentioned documents
Architecture Track – This theme focuses entirely on components, mechanisms and modes from an architectural angle. The suggested reading sequence for this track is – Architecture and Environment, Data Models, Reporting and OLAP
Data Track – This theme focuses on data – the methods of its sourcing, its movement across the datawarehouse, the methods of its storage and logistics of its delivery to the business users. The suggested reading sequence for this track is -Sources, Data Models, Reporting and OLAP.
Process Track – This theme focuses on datawarehouse from a process perspective and explores the different aspects related to it. The suggested reading sequence for this track is – Architecture and Environment, Process, Reporting and OLAP.
I have found it extremely useful to create such documentation for enterprise wide BI systems to ensure a level of control as functional complexity increases over a period of time.
Thanks for reading. Please do share your thoughts. Read More About  Business Intelligence

Friday, 18 July 2008

Data Integration Challenge – Error Handling

Determining the error and handling the errors encountered in the process of data transformation is one of the key design aspects in building a robust data integration platform. When an error occurs how do we capture the errors and use them for effective analysis. Following are the best practices related to error handling
  1. Differentiate the error handling process into the Generic (Null, Datatype, Data format) and the Specific like the rules related to the business process. This differentiation enables to build reusable error handling code
  2.  
  3. Do not stop validations when the record fails for one of the validations; continue with the other validations on the incoming data. If we have 5 validations to be done on a record, we need to design that the incoming record is taken through all the validations, this ensures that we capture all the errors in a record in one go
  4.  
  5. Have a table Error_Info; this has the repository of all the error messages. The fields would be ErrorCode, ErrorType and the ErrorMessage. The ErrorType would carry the values ‘warning’ or ‘error’, the ErrorMessage would have a detail description of the error and the ErrorCode a numeric value which is used in place of the description.
  6.  
  7. In general each validation should have an error message, we could also see the table Error_Info as a repository of all error validations performed in the system. In case of business rules that involve multiple fields, the field ErrorMessage in the table Error_Info can have the details of the business rule applied along with the field name, we can also create an additional field Error_Category to group the error messages
  8.  
  9. Have a table Error_Details; this stores the errors captured. The fields of this table would be KeyValue, FieldName, FieldValue and ErrorCode. The KeyValue would hold the value of the primary key of the record which has an error, the FieldName would store name of the field which has an error, the FieldValue has the value of the field which has failed or is an error, the ErrorCode details the error through a link to the table Error_Info.
  10.  
  11. Write each error captured as a separate record in the table Error_Deatils i.e., if a record fails for two conditions like a NULL check on field ‘ CustomerId’ and the data format check on the field ‘Date’ then ensure we write two records one for the NULL failure and one for the data format failure
  12.  
  13. To retain the whole incoming record have a table structure Source_Datasame as the incoming data. Have a field FLAG in the Source_Data, a value of ‘1’ would say that the record has passed all the validations and ‘0’ would say that it has failed one or more validations
  14.  
In summary the whole process would be to read the incoming record, validate the data, for any validation failure assign the error_code and pipe the errors captured to the Error_Details table, once all validations completed assign the FLAG value (1- Valid record, 0-Invalid record) and insert that record into the Source_data table. Having the data structure as suggested above would enable efficient analysis of the errors captured by the business and IT team.
Read More About  Data Integration Challenge

Tuesday, 8 July 2008

Competencies for Business Intelligence Professionals

The world of BI seems to be largely driven by proficiency in tools that I was stumped during a recent workshop when we were asked to identify BI competencies. The objective of the workshop, conducted by the training wing of my company, was to identify the competencies required for different roles within our practice and also to define 5 levels (Beginner to Expert) for each of the identified competencies.
We were a team of 4 people and started listing down the areas where expertise is required to be a successful BI practice. For the first version we came up with 20 odd competencies ranging from architecture definition to tool expertise to data mining to domain expertise. This was definitely not an elegant proposition considering the fact that for each of the competencies we had to define 5 levels and also create assessment mechanisms for evaluating them. The initial list was far too big for any meaningful competency building and so we decided that we have to fit all this into a maximum of 5 buckets.
After some intense discussions and soul searching, we came up with the final list of BI competencies as given below:
2) BI Solutions
3) Data Related
4) Project / Process Management
5) Domain Expertise
BI Platform covers all tool related expertise ranging from working on the tool with guidance to being an industry authority on specific tools (covering ETL, Databases and OLAP)
BI Solutions straddles the spectrum of solutions available out-of-the-box. These solutions can be packages available with system integrators to help jump-start BI implementations at one end (For ex: Hexaware has a strong proprietary solution around HR Analytics) to the other extreme of Packaged analytics provided by major product companies (Examples are: Oracle Peoplesoft EPM, Oracle BI Applications (OBIA), Business Objects Rapid Marts etc.)
Data Related competency has ‘data’ at its epicenter. The levels here range from understanding and writing SQL Queries to Predictive Analytics / Data Mining at the other extreme. We decided to keep this as a separate bucket as this is a very critical one from BI standpoint for nobody else has so much “data” focus than the tribe of BI professionals.
Project Management covers all aspects of managing projects with specific attention to the risks and issues that can crop up during execution of Business Intelligence projects. This area also includes the assimilation and application of software quality process such as CMMI for project execution and Six Sigma for process optimization.
The fifth area was “Domain Expertise”. We decided to keep this as a separate category considering the fact that for BI to be really effective it has to be implemented in the context of that particular industry. The levels here range from being a business analyst with the ability to understand business processes across domains to being a specialist in a particular industry domain.
This list can serve as a litmus paper for all BI Professionals to rate themselves on these competencies and find ways of scaling up across these dimensions.
I found this exercise really interesting and hope the final list is useful for some of you. If you feel that there are other areas that have been missed out, please do share your thoughts.
The team involved in this exercise: Sundar, Pandian, Mohammed Rafi and I. All of us are part of the Business Intelligence and Analytics Practice at Hexaware.

Thursday, 26 June 2008

Lessons From CMMI (A Software Process Model) For BI Practitioners

Hexaware successfully completed the CMMI Level 5 re-certification recently with KPMG auditing and certifying the company’s software process to be in line with Version 1.2 of the model. This is the highest level in the Capability Maturity Model Integration model developed by Software Engineering Institute in collaboration with Carnegie Mellon. For the uninitiated, Capability Maturity Model Integration (CMMI) is a process improvement approach that provides organizations with essential elements of effective process.
Now, what has CMMI got to do with Business Intelligence?
I participated in the re-certification audit as one of the project managers and I learnt some lessons which I think would be useful for all of us as BI practitioners. The CMMI model has 22 different process areas covering close to 420 odd specific practices. Though the specifics are daunting, the ultimate goal of the model is simple to understand and there-in lies our lesson.
In the CMMI model, Maturity Levels 2 and 3 act as building blocks in creating the process infrastructure to ensure that the higher maturity levels are achievable and sustainable.The high-maturity practices (Levels 4 and 5) of the model focus on:
1) Establish Quantitative Goals in line with the business objectives
2) Measure the performance with respect to the goals using statistical tools
3) Take corrective action to bring the performance in line with the goals
4) Measure again to ensure that the action taken has contributed positively to performance improvement.
Key Lessons for BI Practitioners:
1) Single-minded focus to “close the loop” – CMMI model evaluates every project management action in the context of project goals and measures them quantitatively. Business Intelligence, ideally, should measure all actions in the context of business goals and provide the facility to compare metrics before and after the decision implementation.
2) Strong information infrastructure – Higher levels of maturity in CMMI are sustainable only if the lower maturity levels are strongly established. In the BI context, this translates to a robust architecture that makes measurements possible
3) Accuracy + Precision is the key – Controlling variation (sustainability) is as important as hitting your targets. BI in organizations is weak along the sustainability dimension. For instance, enterprises do have analytics around “How am I doing now” but not much on questions like a) How long will this growth continue? b) When will we get out of this declining trend? etc.
In a way, this post is related to one of my earlier blog on BI and Sig Sigma with the central idea being that, for enterprises to be analytics driven both numbers and processes behind those numbers are equally important. CMMI Model, in its simplest form, also has that as its core theme for achieving high process maturity in an organization.
Thanks for reading and please do share your thoughts. Read More About : CMMI

Thursday, 19 June 2008

Data Integration Challenge – Carrying Id-Date Fields from Source


Ids in Source System: Sometimes we would have been in a dilemma to decide whether to carry the identity (id) fields from source system as identity fields into the data warehouse as well. There is couple of situations which would push us to this state1.The business users are familiar with the product ids like 1211 , 1212 than by the product name it self and they need them in the target system

2.Why should we create an additional id field in the target table when I can have a unique identity for each record from the source system
What are Opensource Id fields, they are usually the surrogate keys or unique record keys like product id, customer id etc which the business might be more familiar than with their descriptions, descriptions of these ids are more found on report printouts. In general most of the source id fields would get mapped to the dimension tables.
Here are the reasons why we should carry the source id fields as it is
  1. The business is comfortable talking and analyzing in terms of ids than descriptions
  2. Having source ids fields which are usually numeric or if not of smaller length is very much lookup friendly, using ids for lookup or filter or join conditions when pulling data from source systems is much better than descriptions
  3. Source id fields enables linking of the data from the data warehouse to the source system
  4. Just consider the id as another attribute to the dimension and not as a unique identifier
Here are the reasons why we should create additional id (key) field in addition to the source id field
  1. Avoiding duplicate keys if the data to be sourced from multiple systems
  2. The source ids can merge, split, anything can happen, we would want to avoid the dependency on the source system
  3. The id field created in the data warehouse would be index friendly
  4. Having a unique id for each record in the data warehouse would help in determining the number of unique records in a much easier way
Dates in Source System: One other field that we usually confuse is the date field in the source systems. The date field present in the source record might provide the information when the record arrived in the source system and a date field generated in the data warehouse system would provide the information when a source record arrived into the data warehouse.
The data warehouse record date and the source record date can be same if the source record gets moved into the data warehouse the same day, certainly both date fields might represent different date values if there is a delay in the source data arrival into the data warehouse.
Why we need to store Source Date in the data warehouse, this need is very clear, we always perform date based analysis based on the arrival of source record in the source systems.
Why we need to generate a Data Warehouse Date, capturing the arrival of the record into the data warehouse answers queries related to audit, data growth and as well to determine what new records arrived into the warehouse which is especially useful for providing incremental extracts for downstream marts or other systems.

Read Brief About: Data Integration Challenge