Tuesday, 18 March 2008

Data Integration Challenge – Initial Load – I


In a data warehouse, all tables usually go through two phases of data loading: the initial load and the incremental load. The ‘History Load’ or ‘Initial Seeding/Load’ is a one-time load of past years' data from the source transaction system into the data management system. The process of adding only new or changed records (insertions or updates) to the data warehouse tables, either daily or at a predefined frequency, is called the ‘Incremental Load’. In addition, small and largely independent tables that receive full data (current data + history data) as input are loaded by means of a ‘Full Refresh’, which involves a complete delete and reload of the data.
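To make the three load styles concrete, here is a minimal Python sketch. It rests on assumptions that are not part of the description above: the source rows carry an `updated_on` date, the warehouse table is modelled as an in-memory dictionary keyed by `id`, and all table and column names are made up for illustration.

```python
from datetime import date

def incremental_load(target, source_rows, last_loaded):
    """Apply only rows changed after last_loaded and return the new watermark."""
    watermark = last_loaded
    for row in source_rows:
        if row["updated_on"] > last_loaded:
            target[row["id"]] = dict(row)  # insert a new key or overwrite an existing one
            watermark = max(watermark, row["updated_on"])
    return watermark

def full_refresh(target, source_rows):
    """Delete everything in the target and reload it from the full source extract."""
    target.clear()
    target.update({row["id"]: dict(row) for row in source_rows})

# Initial (history) load: with the watermark at date.min every source row counts as new.
source = [
    {"id": 1, "name": "Alpha", "updated_on": date(2006, 5, 1)},
    {"id": 2, "name": "Beta", "updated_on": date(2007, 11, 20)},
]
fact_table = {}
watermark = incremental_load(fact_table, source, date.min)
rows_after_initial = len(fact_table)

# Incremental load: one update (id 2) and one insertion (id 3) since the last run.
source[1] = {"id": 2, "name": "Beta v2", "updated_on": date(2008, 3, 17)}
source.append({"id": 3, "name": "Gamma", "updated_on": date(2008, 3, 18)})
watermark = incremental_load(fact_table, source, watermark)

# Full refresh of a small (hypothetical) code table: wipe and reload.
code_table = {"X": {"id": "X", "label": "stale"}}
full_refresh(code_table, [{"id": "A", "label": "Active"}, {"id": "I", "label": "Inactive"}])
```

In this sketch the initial load is simply the incremental routine run with the earliest possible watermark. A real history load is usually a separate, bulk-oriented job, so treat this only as a way to see how the three styles differ.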

Code tables in particular usually undergo a one-time initial load and may not need a regular incremental load; incremental loads are common for fact tables. The initial load of a data warehouse system is quite a challenge to complete successfully within a planned timeframe. Some of the surprises or challenges faced in completing the history load are:
  1. Handling invalid records
  2. Data Reconciliation
  3. System performance
  4. Catching up
Handling Invalid Records:
Invalid records become much more prominent when we process history data, which was collected into the source system long ago and might not fit the current business rules. Records from a source system can become invalid in the data warehouse for multiple reasons, such as an invalid domain value for a column, a null value in a non-nullable field, or aggregate data that does not match the detail data. The ways of handling this problem effectively are:
  • Determine at the very start how many years of data are to be loaded into the data warehouse, and ensure that data profiling is performed on sample data for all the years that have to be loaded. This ensures that most of the data validation rules are identified up front and built into the ETL process. In certain situations we may have to build separate data validation and transformation logic depending on the year and the data
  • Especially in situations like re-platforming or migrating an existing data warehouse to a new platform, we might need to load the old data into a data validation (staging) area before running it through the regular ETL process; there the data is analyzed and cleaned, and only then loaded into the data warehouse through the regular ETL process
  • Design the ETL process to divert the key values of all invalid records to a separate set of tables. At some sites the customer just needs to be aware of the error records and is fine if these records don’t get loaded into the current warehouse, but at times the invalid records are corrected and reloaded
  • For certain scenarios, like aggregate data not matching the detail data, we would normally derive the aggregate from the detail data, but at times we might instead generate detail data to match the aggregate data
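The idea of diverting the keys of invalid records to a separate set of tables can be sketched roughly as below. This is only an illustration in Python: the two rules (a non-null `customer_id` and a `region` within a fixed domain), the `order_id` key and the sample rows are all assumptions, and a real ETL tool would express the same thing with its own reject or error-table mechanism.

```python
VALID_REGIONS = {"A", "B", "C"}  # hypothetical domain values for the region column

def validate(record):
    """Return the list of rule violations for one record (empty if the record is valid)."""
    errors = []
    if record.get("customer_id") is None:
        errors.append("customer_id is null")
    if record.get("region") not in VALID_REGIONS:
        errors.append("region outside domain")
    return errors

def split_records(records):
    """Route valid rows to the load set and the keys of invalid rows to a reject set."""
    loaded, rejects = [], []
    for rec in records:
        errors = validate(rec)
        if errors:
            # Keep only the key plus the reasons, so the row can be corrected and reloaded later.
            rejects.append({"order_id": rec["order_id"], "errors": errors})
        else:
            loaded.append(rec)
    return loaded, rejects

history = [
    {"order_id": 1, "customer_id": 10, "region": "A"},
    {"order_id": 2, "customer_id": None, "region": "B"},
    {"order_id": 3, "customer_id": 12, "region": "Z"},
]
loaded, rejects = split_records(history)
```

Storing the reason next to the key is a design choice of this sketch; it makes it easier to tell the business which rule each error record broke.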
Data Reconciliation:
Once the initial load from the source system into the data warehouse has been completed, we have to validate that the data has been moved in correctly.
  • Having a means of loading records in groups, separated by year or by any logical grouping such as customer or product, gives better control over data validation. In general, the validations performed are count and sum checks, and these should be tied to business-specific validation rules; for example, all customers from region ‘A’ belonging to division ‘1’ in the source should be classified under division ‘3’ in the current warehouse.
  • All the validations that need to be performed after the initial load of each data group have to be prepared and verified with the business team. Many times the data is validated by the business through ad hoc queries, though the same can be verified by an automated ETL process run by the data warehouse team
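A count-and-sum check per load group could look something like the following Python sketch. The `year` grouping, the `amount` column and the sample rows are assumptions made for the example; business-specific rules such as the division mapping would need their own checks on top of this.

```python
from collections import defaultdict

def summarize(rows, group_key, amount_key):
    """Return {group: (row_count, amount_total)} for one side of the comparison."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        bucket = totals[row[group_key]]
        bucket[0] += 1
        bucket[1] += row[amount_key]
    return {group: tuple(values) for group, values in totals.items()}

def reconcile(source_rows, target_rows, group_key="year", amount_key="amount"):
    """List the groups whose count or sum differs between source and target."""
    src = summarize(source_rows, group_key, amount_key)
    tgt = summarize(target_rows, group_key, amount_key)
    return {
        group: {"source": src.get(group), "target": tgt.get(group)}
        for group in sorted(set(src) | set(tgt))
        if src.get(group) != tgt.get(group)
    }

source_rows = [
    {"year": 2006, "amount": 100},
    {"year": 2006, "amount": 250},
    {"year": 2007, "amount": 300},
]
# The 2007 row never reached the warehouse in this made-up example.
target_rows = [
    {"year": 2006, "amount": 100},
    {"year": 2006, "amount": 250},
]
mismatches = reconcile(source_rows, target_rows)
```

Because the comparison is done per group, a mismatch points directly at the year (or customer, or product) that has to be reloaded instead of at the whole history.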
We shall discuss the other challenges in Part II.
Read More About: Data Integration

Friday, 29 February 2008

BI Strategy – Approach based on First Principles

Business Intelligence Strategy definition is typically the first step in an organization’s endeavor to implement BI (Business Intelligence). This phase is very crucial as the overall execution direction hinges on decisions taken in this stage.
The precise approach to the BI Strategy definition includes the following steps:
  1. Business Area Identification - Identify and prioritize the business area(s) for which BI is considered. Ex: Human Resource Analytics, Supply Chain Analytics, Enterprise Performance Analytics etc.
  2. Process Mapping Document - Once the business area is identified, map out the individual processes involved in that particular domain. This can be a simple flow-chart that shows the entry and exit criteria for each sub-process.
  3. Business Questions Enumeration – Based on the subject areas involved in the business domain, enumerate the list of questions that are to be answered by the analytical layer.
  4. Data Elements Segregation – For each of the process steps, identify the data elements. These data elements, after subsequent validation (in conjunction with business questions) would translate into dimensions and facts during the data modeling stage.
  5. Data Visualization – Develop a prototype (set of screenshots) on how the data would be visualized for each business question. Business Analysts and domain experts are typically involved at this stage.
  6. BI Architecture Synopsis – At a fundamental level, BI architecture is fairly straightforward. The architecture is almost always a combination of the following processes: Extraction (E), Transformation (T), Loading (L), Cubing (C), and Analyze (Z). The number of layers, the type of reporting, etc. follow from how these ETLCZ components are combined. Ex: ETLZ, ETLTLCZ, ELTZ, ELCZ are some options for BI architecture definition.
  7. Next Steps Document – The ‘Next Steps’ document would list down the other requirements of / from the analytical infrastructure. These can be points around Tool Evaluation, User profiles, Data volumes, Performance considerations, etc. Each of these requirements would translate to an assessment to be carried out before the actual construction begins.
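As a small illustration of the ETLCZ shorthand from step 6, the Python sketch below expands an architecture code into its ordered list of processes. The function name and the decision to reject unknown letters are my own assumptions; the letter-to-process mapping is the one given in the list.

```python
STAGES = {
    "E": "Extraction",
    "T": "Transformation",
    "L": "Loading",
    "C": "Cubing",
    "Z": "Analyze",
}

def describe_architecture(code):
    """Expand an ETLCZ code such as 'ELTZ' into its ordered list of process names."""
    unknown = [letter for letter in code if letter not in STAGES]
    if unknown:
        raise ValueError(f"unknown ETLCZ component(s): {unknown}")
    return [STAGES[letter] for letter in code]

staged = describe_architecture("ETLTLCZ")  # two transform/load layers before cubing
pushdown = describe_architecture("ELTZ")   # load first, transform inside the target
```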
The most common mistake is to start thinking about technology aspects before the actual business requirement is finalized. A precise definition of business questions goes a long way in designing a scalable and robust BI infrastructure. 
Read More about  BI Strategy

Monday, 28 January 2008

Business Intelligence and Six Sigma

I just finished a Six Sigma project and was left wondering why BI practitioners are not using more of that Six Sigma power in Business Intelligence. Let me delve into this subject a bit more.
The Six Sigma project that I just completed was on “Developing a Function Point based estimation model for ETL loads”. Essentially, I was facing a lot of problems in estimating the effort for ETL (in this case, Informatica) loads that led to “Effort variances” beyond specified limits. So we kicked off a Six Sigma project that had the following DMAIC phases:
1. Define – Definition of the problem (Ex: Estimation process is out of whack)
2. Measure – We measured the effort variances before the start of the project and also set ourselves a target of where it should be.
3. Analyze – Analyzed the root cause of the problem. The solution was to let go of the complexity-based estimation that was done initially and to adopt Function Points. In fact, this FP-based estimation model was presented at the International Software Estimation Colloquium last year and won the Runner-up prize (http://www.qaiasia.com/Conferences/sec2007/leadership.htm)
4. Improve – Based on a pilot within the project, the Function points based linear regression model was arrived at and the team was educated on the estimation process. The improvements to the estimation process (effort variances) were measured on a regular basis.
5. Control – Periodic checks to ensure the institutionalization of the process and also fine-tune wherever necessary.
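For readers who want to see what a function point based linear regression looks like in practice, here is a rough Python sketch. It is not the model from the project: the pilot numbers are invented (and chosen to lie on a straight line so the result is easy to check by hand), and the effort variance formula, (actual - estimated) / estimated, is one common definition that I am assuming here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def effort_variance(actual, estimated):
    """Relative deviation of actual effort from the estimate (assumed definition)."""
    return (actual - estimated) / estimated

# Hypothetical pilot data: function points per ETL load and the hours each one took.
function_points = [10, 20, 30, 40]
effort_hours = [25, 45, 65, 85]

slope, intercept = fit_line(function_points, effort_hours)
estimate_50_fp = slope * 50 + intercept        # estimate for a new 50 FP load
variance = effort_variance(110, estimate_50_fp)  # if that load actually took 110 hours
```

Tracking `variance` load by load is what the Improve and Control phases would watch: if it drifts beyond the specified limits, the regression is refitted with newer data.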
That, in a nutshell, is what my Six Sigma project was all about. Basically, Six Sigma tries to improve process efficiencies by following the phases mentioned above.
Now let’s see the connection to Business Intelligence. Analytics at this stage of evolution (in the majority of organizations) is being used to find the improvement area at a given point of time. The improvement area can be a problem (Ex: a trend chart showing that sales in the West region have dropped by 10% in each of the last 3 quarters) or an opportunity (Ex: the market potential for a product is huge and our share is small). BI is reasonably good at providing this information and it will only get better. But BI by itself does not enforce the process / execution rigor that successful organizations require.
To summarize, Six Sigma needs an improvement opportunity as the starting point for it to unleash its power to improve processes. BI generates a lot of these opportunities with its DW/Reporting/Analytics components but does not enforce the process implementation rigor. I feel that there is a lot of synergy in bringing the two together: Six Sigma, the left hand, and BI, the right hand, when brought together can earn a lot of claps in the quest to create learning, performing organizations.
Just to sample the power of Six Sigma techniques, please take a look at the following link: http://www.kaushik.net/avinash/2007/01/excellent-analytics-tip-9-leverage-statistical-control-limits.html, which illustrates the use of control charts (one of Six Sigma’s potent tools) in metrics / KPI management. Fascinating!
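To give a feel for control limits on a KPI, here is a simplified Python sketch. It assumes limits at the baseline mean plus or minus three sample standard deviations, which is a common textbook simplification (many control charts estimate sigma from moving ranges instead), and the KPI readings are invented.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3):
    """Return (lower, center, upper) limits computed from a stable baseline period."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(points, lower, upper):
    """Return the readings that fall outside the control limits."""
    return [p for p in points if p < lower or p > upper]

# Hypothetical daily KPI readings from a period considered normal.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
lower, center, upper = control_limits(baseline)

# New readings: only those outside the limits deserve an investigation.
signals = out_of_control([101, 107, 95, 92], lower, upper)
```

The point of the limits is to separate routine variation from real signals, so that a BI dashboard triggers a Six Sigma style investigation only when a reading is genuinely unusual.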
Whether you agree or disagree, have more thoughts on this topic, or think this post is good or rubbish, please do send in your comments.
Information Nugget: Having talked about execution rigor, let me recommend one of the best books I have read in that area. “Execution – The Discipline of Getting Things Done” by Larry Bossidy and Ram Charan (http://www.amazon.com/Execution-Discipline-Getting-Things-Done/dp/0609610570)

Wednesday, 16 January 2008

Linking BI Technology, Process and People – A Theory of Constraints (TOC) View


With the advent of a new year, let me do a recap of what I have discussed through my 15-odd posts in 2007 and also set the direction for my thoughts in 2008.
I started with the concept (http://blogs.hexaware.com/business_intelligence/2007/06/business-intell.html) of BI Utopia, in which information is available to all stakeholders at the right time and in the right format. The bottom line is to help organizations compete on analytics in the marketplace. With that concept as the starting point, I explored some technology enablers like Real Time Data Integration, Data Modeling, Service Oriented Architecture etc. and also some implementation enablers like Agile Framework, Calibration of DW systems and Function Points based estimation for DW. In my post on Data Governance, I introduced the position of a CDO (Chief Data Officer) to drive home the point that nothing is possible (at least in BI) without people!
To me, BI is about 3 things: Technology, Process, People. I consider these three the holy triumvirate for successful implementation of Business Intelligence in any organization. Not only are the individual areas important in themselves, but the most important thing is the link between these 3 areas. Organizations that are serious about ‘Analytics’ should continuously elevate their technology, process and people capability and, more importantly, strengthen the link between them. After all, any business endeavor is only as good as its weakest link.
Theory of Constraints (http://en.wikipedia.org/wiki/Theory_of_Constraints) does offer a perspective, which I feel is really useful for BI practitioners. I will explore more of this in my subsequent posts.
My directions for posts on this blog in 2008 are:
  1. Continue with my thoughts on Business Intelligence along Technology, Process and People dimensions

  2. Provide a “Theory of Constraints” based view of BI with focus on strengthening the link between the 3 dimensions mentioned above.

Almost every interesting business area – Six Sigma, Balanced Scorecard, System Dynamics, Business Modeling, Enterprise Risk, Competitive Intelligence, etc. has its relationship with BI and we will see more of this in 2008.
Please do keep reading and share your thoughts as well.

Thursday, 3 January 2008

BI Appliances


What is a BI Appliance?
If a data warehouse class database product, a reporting product, a data integration product, or an all-in-one software package comes pre-installed in a preconfigured hardware box, then such a “hardware + software” box is called a ‘Business Intelligence Appliance’. The very purpose of the appliance model is to hide the complexity and intricacies of the underlying software components and make the system as simple to operate as a TV.
How did the Appliance Model evolve?
As businesses gathered huge volumes of data, the demand for faster and better ways of analyzing data increased and the data warehouse evolved as a software technology; there have been continuous efforts to build software systems that are cognizant of data warehouse environments.


  • We have seen IBM and Oracle releasing their data warehouse specific database editions
  • We have seen the growth of data warehouse specific databases like RedBrick (now part of IBM), Teradata, Greenplum…
  • We have seen simple list reporting tools moving to proprietary data structures (cubes) and the emergence of the acronyms MOLAP, HOLAP, ROLAP, DOLAP
  • We have seen a whole new software market created for ETL and EII products
  • We have seen more new BI-related software categories being created, such as BAM, CPM, Metadata Management and Data Discovery, with a lot more entering the market every day

As many organizations set up their BI infrastructure or enhanced their existing BI environment with the different BI software packages they needed, they also took on different platforms and hardware, and maintaining these became daunting. Getting started with a BI project became a bigger project in itself; we needed to spend sufficient time not just on choosing the right set of BI products but also on the supported hardware, dependent software packages and the platform. No BI vendor currently addresses the complete stack of BI system needs, and this has been the driving factor for more acquisitions.
Products like Netezza (Database Appliance) and CastIron (ETL Appliance) came up with their ‘software in a box’ concept, where we can buy or rent preconfigured ‘hardware + software’ boxes, which in a way address the needs of the ‘ready to use’ BI market. Many of these boxes have Linux, open source databases, web servers, message queues and proprietary software.
The appliance-based model is not new; IBM has been renting its ‘mainframe + software’ for decades. IBM has addressed the BI market with its ‘Balanced Warehouse’, a preconfigured ‘hardware + software’ offering: its OS can be Windows, AIX or Linux, with DB2 as the database, and reporting can be DB2 Cubes, Crystal or Business Objects. HP in a similar way has come out with its Neoview platform, which is a revitalized version of the NonStop SQL database and NonStop OS.
A CIO always needs ways to shorten the application deployment cycle and reduce the maintenance burden of the servers; appliance-based products meet these KRAs (key result areas) of a CIO and are getting widely accepted.
The Future
More Appliances, Focus on Performance:
We would see more BI appliances coming to market. As the appliance model hides what’s underneath, and in many cases the details are not available, the buying focus would be more on what the products deliver than on what they have inside.
Common Appliance Standards:
Getting best-of-breed software and hardware from a single vendor would not happen. We might see both software and hardware vendors defining a set of basic standards among themselves for the appliance model. New organizations similar to “tpc.org” would also evolve to define performance standards for appliances. We might see companies similar to Dell coming up that assemble best-of-breed components and deliver a packaged BI appliance.
More Acquisitions:
The current Business Intelligence market landscape can also be interpreted as:
  1. Hardware + Software or Appliance based vendors – HP, IBM
  2. Pure software or Non-Appliance based vendors – Oracle, Microsoft, SAP

Once the current BI software consolidation settles, the next wave of consolidation would be companies like Oracle looking to add hardware companies to their portfolio.
Technology: Appliance Products
  • Database: Netezza, Teradata, DATAllegro, Dataupia
  • Data Integration: CASTIron
  • Reporting-Dashboard: Cognos NOW (Celequest LAVA)
  • Configurable Stack (with third party support): IBM Balanced Warehouse, HP Neoview