Pages

Ads 468x60px

Labels

Monday 24 March 2008

Metadata 101 – For BI Practitioners

For as long as I can remember, the definition given for Metadata is “Data about Data”. We have all said this in interviews, heard it from candidates, seen it on presentations, and (almost) always nodded our heads in agreement.
In the transaction processing world, where “data-in” is the paradigm, the definition is precise. The databases store the business data in the relational format and the system tables / catalogs describe the structure of that data – the columns, type, size, etc. This data about the structure of business data is “Metadata”.
In the Business Intelligence world, that definition of metadata is incomplete. A more precise definition of metadata has two components:
Metadata in BI = “Data about Data” + “Information about Information”
The first component “Data about data” is “Technical Metadata” and is similar to the metadata in the OLTP world. Having said that, the technical metadata in BI is arguably more complex, as it not only encompasses the databases but needs to cover the ETL and Reporting tools as well. Each of the tools in the overall BI landscape has its own metadata and this data has to be looked at in a comprehensive fashion to understand data lineage etc.
Even among BI tools, there are different categories – Tools that expose its metadata completely, tools that gives an handle to its metadata thro’ pre-defined APIs and tools that do not allow any access to the metadata. Given the industry direction and the evolution of Common Warehouse Metamodel (CWM) compliance standards, it is only a matter of time before the tool architecture is designed to expose the technical metadata. CWM is a fascinating topic of its own and you can get a feel for it by visiting this website: http://www.omg.org/technology/cwm/
To me, as a BI practitioner, the second piece of the metadata puzzle is more interesting. “Information about information” aspect of metadata is “Business Metadata” and understanding it is crucial to implementing the BI vision in any enterprise.
As an analytical information consumer, there are 2 important requirements:
  1. Need direction to access the required analytical content
    Example:
    • Where can I get Sales by Product for different locations over the last 2 years?
    • Am interested in Customer related Analytics. Where do I access it?
  2. Once the content is retrieved, need guidance on how to make sense of it
    Example:
    • Report shows Forecasted Sales for next quarter in the chart. How is this value calculated?
    • Does the total inventory value displayed in the report include the Raw material inventory or does it exclude it?
Business metadata when properly organized should provide direction to both the points mentioned above.
Metadata management in BI deals with integration of technical and business data in a way that is useful for the organization. The challenge of metadata management becomes even more daunting when one considers both structured and unstructured data. Having said that, it is important for BI practitioners to understand the true nature of BI metadata and provide implementable solutions in their specific organizational context.
In my future posts, I would discuss this fascinating area of Metadata management, with its manifestation as “Technical and Business Metadata” in both structured and unstructured data domains.
Read More about  Metadata 101

Tuesday 18 March 2008

Data Integration Challenge – Initial Load – I


In a data warehouse all tables usually go through two phases of data load process they are the initial load and the incremental load. ‘History Load’ or ‘Initial Seeding/Load’ involves a one time load of the source transaction system data of the past years into the Data Management System. The process of adding only the new records (updations or insertions) to the data warehouse tables either daily or on a predefined frequency is called ‘Incremental Load‘. Also certain tables that are of small in size and largely independent set of tables which receives full data (current data + history data) as input would be loaded by means of a ‘Full Refresh‘; this involves complete delete and reload of data.

Especially code tables would usually under go a one time initial load and may not be required for a regular incremental load, incremental load is common for fact tables. Initial Load of a data warehouse system is quite a challenge in terms of getting it completed successfully within a planned timeframe. Some of the surprises or challenges faced in completing the history load are
  1. Handling invalid records
  2. Data Reconciliation
  3. System performance
  4. Catching up
Handling Invalid Records:
The occurrence of an invalid record becomes much more prominent as we process the history data which was collected into the source system much long before and the history data might not fit into the current business rules. The records from a source system can become invalid in the data warehouse due to multiple reasons like invalid domain value for a column or null value for a non null-able field or aggregate data not matching to the detail data. The ways of handling this problem effectively are
  • Determine the years of data to be loaded into the data warehouse very initially and ensure that the data profiling is performed on the sample data for all the years that has to be loaded. This ensures that most of the rules of data validation are identified up front and built as part of ETL process. In certain situations we may have to build separate data validation and transformation logic based on the year and data
  • Especially in situations like re-platforming or migrating the existing data warehouse to a new platform, even before running the data through regular ETL process we might need to load the old data into a data validation (staging) area through which the data analysis is done, cleaned and then data loaded into the data warehouse through regular ETL process
  • Design the ETL process to divert all the key values of the invalid records to a separate set of tables. In some sites we see that the customer just needs to be aware for the error records and fine if these records doesn’t get aligned into the current warehouse, but at times the invalid records are corrected and reloaded
  • For certain scenarios like aggregate data not matching to detail data, though we might always derive aggregate from detail data at times we might also generate detail data to match aggregate data
Data Reconciliation:
Once the initial load from the source system into the data warehouse has been completed we have to validate to ensure that the data has been moved in correctly.
  • Having a means of loading records in groups separated by years or any logical grouping like by customer or product would give a better control in terms of data validation. In general data validations performed are like count and sum should be tied to certain business specific validation rules like all customers from region ‘A’ belonging to division ‘1’ in the source should be classified under division ‘3’ in the current warehouse.
  • All the validations that needs to be performed after the initial load for each data group has to prepared and verified with the business team, many a times the data is validated by the business as a adhoc query process though the same can be verified by an automated ETL process by the data warehouse team
We shall discuss further on the other challenges in Part II.
Read More About: Data Integration

Friday 29 February 2008

BI Strategy – Approach based on First Principles

Business Intelligence Strategy definition is typically the first step in an organization’s endeavor to implement BI (Business Intelligence). This phase is very crucial as the overall execution direction hinges on decisions taken in this stage.
The precise approach to the BI Strategy definition includes the following steps:
  1. Business Area Identification - Identify and prioritize the business area(s) for which BI is considered. Ex: Human Resource Analytics, Supply Chain Analytics, Enterprise Performance Analytics etc.
  2. Process Mapping Document - Once the business area is identified, map out the individual processes involved in that particular domain. This can be a simple flow-chart that shows the entry and exit criteria for each sub-process.
  3. Business Questions Enumeration – Based on the subject areas involved in the business domain, enumerate the list of questions that are to be answered by the analytical layer.
  4. Data Elements Segregation – For each of the process steps, identify the data elements. These data elements, after subsequent validation (in conjunction with business questions) would translate into dimensions and facts during the data modeling stage.
  5. Data Visualization – Develop a prototype (set of screenshots) on how the data would be visualized for each business question. Business Analysts and domain experts are typically involved at this stage.
  6. BI Architecture Synopsis – At a fundamental level, BI architecture is fairly straightforward. The architecture is almost always a combination of the following processes: Extraction (E), Transformation (T), Loading (L), Cubing (C), and Analyze (Z). The number of layers, type of reporting etc. are a combination of ETLCZ components. Ex: ETLZ, ETLTLCZ, ELTZ, ELCZ are some options for BI architecture definition.
  7. Next Steps Document – The ‘Next Steps’ document would list down the other requirements of / from the analytical infrastructure. These can be points around Tool Evaluation, User profiles, Data volumes, Performance considerations, etc. Each of these requirements would translate to an assessment to be carried out before the actual construction begins.
The most common mistake is to start thinking about technology aspects before the actual business requirement is finalized. A precise definition of business questions goes a long way in designing a scalable and robust BI infrastructure. 
Read More about  BI Strategy

Monday 28 January 2008

Business Intelligence and Six Sigma

I just finished a Six Sigma project and was left wondering as to why BI practitioners are not using more of that Six Sigma power in Business Intelligence. Let me delve on this subject a bit more.
The Six Sigma project that I just completed was on “Developing a Function Point based estimation model for ETL loads”. Essentially, I was facing a lot of problems in estimating the effort for ETL (in this case, Informatica) loads that led to “Effort variances” beyond specified limits. So we kicked off a Six Sigma project that had the following DMAIC phases:
1. Define – Definition of the problem (Ex: Estimation process is out of whack)
2. Measure – We measured the effort variances before the start of the project and also set ourselves a target of where it should be.
3. Analyze – Analyzed the root-cause of the problem. The solution was to let go of the complexity based estimation that was done initially and to adapt Function points. In fact, this FP based estimation model was presented at the International Software Estimation Colloquium last year and won the Runner-up prize (http://www.qaiasia.com/Conferences/sec2007/leadership.htm)
4. Improve – Based on a pilot within the project, the Function points based linear regression model was arrived at and the team was educated on the estimation process. The improvements to the estimation process (effort variances) were measured on a regular basis.
5. Control – Periodic checks to ensure the institutionalization of the process and also fine-tune wherever necessary.
That in a nut-shell is what my Six Sigma project was all about. Basically, Six Sigma tries to improve process efficiencies by following the phases mentioned above.
Now let’s see the connection to Business Intelligence. Analytics at this stage of evolution (in majority of organizations) are being used to find the improvement area at a given point of time. The improvement area can be a problem (Ex: Trend chart showing that the Sales in the West region is dropping by 10% every quarter for the last 3 quarters) or an opportunity (Ex: Market potential for a product is huge and our share is small). BI is reasonably good at providing this information and it will only get better. But BI by itself does not enforce the process / execution rigor that is required for successful organizations.
To summarize, Six Sigma needs an improvement opportunity as the starting point for it to unleash its power to improve processes. BI generates lot of these opportunities with its DW/Reporting/Analytics components but does not enforce the process implementation rigor. I feel that there is lot of synergy in bringing both together – Six Sigma, the left hand and BI, the right hand when brought together can earn a lot of claps in the quest to create learning, performing organizations.
Just to sample the power of Six Sigma techniques, please take a look at the following link:http://www.kaushik.net/avinash/2007/01/excellent-analytics-tip-9-leverage-statistical-control-limits.html, which illustrates the use of control charts (one of Six Sigma’s potent tools) in metrics / KPI management. Fascinating!
Agree / Not Agree, Have more thoughts on this topic, this post is good / rubbish, for anything – Please do send in your comments.
Information Nugget:Having talked about execution rigor, let me recommend one of the best books I have read in that area. “Execution – The Discipline of Getting Things Done” by Larry Bossidy and Ram Charan (http://www.amazon.com/Execution-Discipline-Getting-Things-Done/dp/0609610570)

Wednesday 16 January 2008

Linking BI Technology, Process and People – A Theory of Constraints (TOC) View


With the advent of a new year, let me do a recap of what I have discussed through my 15 odd posts in 2007 and also set the direction for my thoughts in 2008.
I started with the concept (http://blogs.hexaware.com/business_intelligence/2007/06/business-intell.html) of BI Utopia in which information is available to all stakeholders at the right time, right time and in the right format. The bottomline is to help organizations compete on analytics in the marketplace. With that concept as the starting point, I explored some technology enablers like Real Time Data Integration, Data Modeling, Service Oriented architecture etc. and also some implementation enablers like Agile Framework, Calibration of DW systems and Function points based estimation for DW. In my post on Data Governance, I introduced the position of a CDO (Chief Data Officer) to drive home the point that nothing is possible (atleast in BI) without people!
To me, BI is about 3 things – Technology, Process, People. I consider these three as the holy triumvirate for successful implementation of Business Intelligence in any organization – Not only are the individual areas important by itself but the most important thing is the link between these 3 areas. Organizations that are serious about ‘Analytics’ should continuously elevate their technology, process & people capability and more importantly strengthen the link between them – afterall, any business endeavor is only as good as its weakest link.
Theory of Constraints (http://en.wikipedia.org/wiki/Theory_of_Constraints) does offer a perspective, which I feel is really useful for BI practitioners. I will explore more of this in my subsequent posts.
My direction in 2008 for posts on this blog are:
  1. Continue with my thoughts on Business Intelligence along Technology, Process and People dimensions

  2. Provide a “Theory of Constraints” based view of BI with focus on strengthening the link between the 3 dimensions mentioned above.

Almost every interesting business area – Six Sigma, Balanced Scorecard, System Dynamics, Business Modeling, Enterprise Risk, Competitive Intelligence, etc. has its relationship with BI and we will see more of this in 2008.
Please do keep reading and share your thoughts as well Business Intelligence