Pages

Ads 468x60px

Labels

Thursday, 16 August 2007

Business Intelligence Utopia – Enabler 2: Real Time Data Integration


Business Intelligence practitioners tend to have lot of respect and reverence for transaction processing systems (OLTP), for without them the world of analytical apps simply does not exist. That explains my previous blog in introducing the first enabler for BI Utopia – The Evolution of OLTP systems to support Operational BI.
In this post, I introduce the second enabler in the “Power of Ten” – Real Time Data Integration
Data Integration in the BI sense, is all about, extracting data from multiple source systems, transforming them using business rules and loading it back into data repositories built to facilitate analysis, reporting, etc.
Given that the raw data has to be converted to a different form more amenable for analysis & decision-making, there are 2 basic questions to be answered:
  1. From a business standpoint, how fast should the ‘data-information’ conversion happen?
  2. From a technology standpoint, how fast can the ‘data-information’ conversion happen?

Traditionally, BI being used more for strategic decision-making,  batch mode of data integration with periodicity of a day or later, was acceptable. But increasingly, businesses demand that the conversion has to happen much faster and technology has to support it. This leads to the concept of “Real Time BI” or more correctly "RightTime Data Integration"
Since the answer to the first question “How Fast” is fast becoming “as fast as possible”, the focus has shifted to the technology side. One area where I foresee a lot of activity, from a Data Warehouse architectural standpoint, is in the close interaction of messaging tools like IBM Websphere MQ etc. with data integration tools. At this point in time, though the technology is available, there aren’t too many places where messaging is embedded into the BI architectural landscape.
Bottom-line is that there is significant value gained by ensuring that raw business data is transformed to information by the BI infrastructure, as fast as possible – the limits being prescribed by business imperatives. The best explanation I have come across to explain the value of information latency is the article by Richard Hackathorn .
Active Data Warehousing is another topic closely related to Real Time Data Integration and you can get some perspective on it thro’ the blog on Decision management by James Taylor:

Wednesday, 1 August 2007

Data Integration Challenge – Identifying changes from a table by a Scratch


In scenarios when a table in the staging area or in the data warehouse needs to be queried to find the changed records (inserted or updated), we can use the Scratch table design. Scratch table is a temporary table that can be designed to hold the changes happening against a table, once the changes are noted by the required application or process then the Scratch table can be cleaned-off.

The process to capture the changes and the clean up would be designed as part of ETL process. The scenarios where to use this concept and the steps to use the Scratch table is discussed below:
Steps to use Scratch table
  1. Create a Scratch table ‘S’ of structure to hold the Primary Key column value from the table ‘T’ that needs to the monitored for changes

  2. In the ETL process that loads the table ‘T’ add the logic in such way that while inserting or updating a record into table ‘T’ we insert the Primary Key column values of the record into the Scratch table ‘S’

  3. If required while inserting the record into the Scratch table ‘S’ have a flag column that says ‘Insert’ or ‘Update’

  4. Any process that needs to find the changes would join the Scratch table ‘S’ and the table ‘T’ to pull the changed records, if it just needs the key directly access ‘S’
  5. Once the changes have been pulled and processed, have a process that would clean up the Scratch table

  6. We can also bind the Scratch table ‘S’ to be always loaded to the memory for higher performance

When to use Scratch table
  1. When we have a persistent staging area, using Scratch table would be ideal choice to move the changes to the data warehouse

  2. When the base table ‘T’ is really huge and only few changes happen

  3. When the changes (or the Primary Key values) in table ‘T’ are required by multiple processes

  4. When the changes in table ‘T’ is to be joined with other tables i.e., now the Scratch Table ‘S’ can be used as the driving table in joins with other tables which would give better performance since the Scratch table would be thinner with few records

Alternate Option: Having a flag or a timestamp column in the table ‘T’ and having an index on it. Having an index on Timestamp is costly and a bit map index on the flag column may be seen as an option, but the aspect of updating the column during updates, huge volume and in scenarios of joining with other tables this would be a disadvantage, have seen Scratch table to be a best option. Let me know the other options you have used to handle such situations…
To add more variety to your thoughts, you can read it More Data Integration Challenge

Friday, 13 July 2007

Data Integration Challenge – Capturing Changes


When we receive the data from source systems, the data file will not carry a flag indicating whether the record provided is new or has it changed. We would need to build process to determine the changes and then push them to the target table.

There are two steps to it
  1. Pull the incremental data from the source file or table

  2. Process the pulled incremental data and determine the impact of it on the target table as Insert or Update or Delete

Step 1: Pull the incremental data from the source file or table
If source system has audit columns like date then we can find the new records else we will not be able to find the new records and have to consider the complete data
For source system’s file or table that has audit columns, we would follow the below steps
  1. While reading the source records for a day (session), find the maximum value of date(audit filed) and store in a persistent variable or a temporary table
  2. Use this persistent variable value as a filter in the next day to pull the incremental data from the source table

Step 2: Determine the impact of the record on target table as Insert/Update/ Delete 
Following are the scenarios that we would face and the suggested approach
  1. Data file has only incremental data from Step 1 or the source itself provide only incremental data

    • do a lookup on the target table and determine whether it’s a new record or an existing record
    • if an existing record then compare the required fields to determine whether it’s an updated record
    • have a process to find the aged records in the target table and do a clean up for ‘deletes’

  2. Data file has full complete data because no audit columns are present

    • The data is of higher

      • have a back up of the previously received file
      • perform a comparison of the current file and prior file; create a ‘change file’ by determining the inserts, updates and deletes. Ensure both the ‘current’ and ‘prior’ file are sorted by key fields
      • have a process that reads the ‘change file’ and loads the data into the target table
      • based on the ‘change file’ volume, we could decide whether to do a ‘truncate & load’
    • The data is of lower volume

      • do a lookup on the target table and determine whether it’s a new record or an existing record
      • if an existing record then compare the required fields to determine whether it’s an updated record
      • have a process to find the aged records in the target table and do a clean up or delete


Friday, 6 July 2007

Business Intelligence Utopia – Dream to Reality: Key Enablers


In the last post, I discussed my view of BI Utopia in which information is available to all stakeholders at the right time, in the right format enabling them to make actionable decisions at both strategic & operational levels. Having said that, the BI street is not paved with gold.

I consider the following key enablers as pre-requisites to achieve true ‘Information Democracy’ in an enterprise. The “Power of Ten” includes:
  1. Proliferation of agile, modular & robust transaction processing systems.

  2. Real Time Data Integration Components

  3. Strong Data Governance structure

  4. Service Oriented Architecture

  5. Extensible Business centric Data Models

  6. Flexible business rules repositories surrounded by clean metadata/reference data environments

  7. Ability to integrate unstructured information into the BI architectural landscape

  8. Guided context-sensitive, user-oriented analytics

  9. Analytical models powered by Simulations

  10. Closed loop Business Intelligence Utopia

Each of these units comprising the “Power of Ten” is a fascinating topic on its own. We will drill-down and focus on some of the salient features of each of these areas in the coming weeks.

Friday, 29 June 2007

Data Integration Challenge – Facts Arrive Earlier than Dimension


The fact transactions that come in earlier than the dimension (master) records are not bad data, such fact records needs to be handled in our ETL process as a special case. Such situations of facts coming in before dimensions can occur quite commonly like in case of a customer opening a bank account and his transactions starting to flow into the data warehouse immediately.
But the customer id creation process from the Customer Reconciliation System can get delayed and hence the customer data would reach the data warehouse after few days.
How do we handle this scenario differs based on the business process being addressed, there could be two different requirements
  • Make the fact available and report under “In Process” category; commonly followed in financial reporting systems to enable reconciliation
  • Make the fact available only when the dimension is present,; commonly followed in status reporting systems
Requirement 1: Make the fact available and report under “In Process” category
For this requirement follow the below steps
  1. Insert into the dimension table a record that represents a default or ‘In Process’ status like in case of the banking example the Customer Dimension would have a ‘default record’ inserted that represents the information that the customer detail has not yet arrived

  2. In the ETL process while populating the Fact table, for the transactions that do not have a corresponding entry in the Dimension table, assign a default Dimension key and insert into the Fact. In the same process insert the Dimensions Lookup values into a ‘temporary’ or ‘error’ table

  3. Build an ETL process that checks the new records inserted into the Dimension table, queries the temporary table and identifies the records in facts for which the dimension key has to be updated and updates the respective fact’s dimension key

Requirement 2: Make the fact available only when the dimension is present
For this requirement follow the below steps
  1. Build an ETL process that populates the fact into a staging table

  2. Build an ETL process that pushes only the records that have a dimension value to the data warehouse tables

  3. At the end of ETL process delete all the processed records from the staging table making the other unprocessed records available to be pulled next time

You can read more about  Data Integration Challenge