
Thursday 19 June 2008

Data Integration Challenge – Carrying Id-Date Fields from Source


Ids in Source System: Sometimes we find ourselves in a dilemma over whether to carry the identity (id) fields from the source system into the data warehouse as identity fields as well. There are a couple of situations which push us into this state:
  1. The business users are more familiar with product ids like 1211 or 1212 than with the product names themselves, and they need them in the target system
  2. Why should we create an additional id field in the target table when we can have a unique identity for each record from the source system?
What are these source id fields? They are usually the surrogate keys or unique record keys like product id, customer id etc. with which the business might be more familiar than with their descriptions; the descriptions of these ids are mostly found on report printouts. In general most of the source id fields would get mapped to the dimension tables.
Here are the reasons why we should carry the source id fields as they are:
  1. The business is more comfortable talking and analyzing in terms of ids than descriptions
  2. Source id fields, which are usually numeric or at least of smaller length, are very lookup friendly; using ids in lookup, filter or join conditions when pulling data from source systems is much better than using descriptions
  3. Source id fields enable linking of the data in the data warehouse back to the source system
  4. Just consider the id as another attribute of the dimension and not as its unique identifier
Here are the reasons why we should create an additional id (key) field in addition to the source id field (a small sketch of such a load follows this list):
  1. Avoiding duplicate keys when the data is to be sourced from multiple systems
  2. Source ids can merge, split or change in any way; we want to avoid that dependency on the source system
  3. The id field created in the data warehouse would be index friendly
  4. Having a unique id for each record in the data warehouse makes it much easier to determine the number of unique records
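A minimal sketch of this approach, assuming a hypothetical product dimension: the warehouse assigns its own surrogate key while the familiar source id is kept as an ordinary attribute, and a source-system column guards against duplicate ids from multiple systems. The table and column names are illustrative, not from the post.

```python
# A sketch only: the dimension keeps the familiar source id as an attribute
# while the warehouse generates its own surrogate key.
# DIM_PRODUCT, PRODUCT_KEY, SOURCE_PRODUCT_ID etc. are illustrative names.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DIM_PRODUCT (
        PRODUCT_KEY       INTEGER PRIMARY KEY AUTOINCREMENT,  -- warehouse surrogate key
        SOURCE_SYSTEM     TEXT,     -- guards against duplicate ids from multiple systems
        SOURCE_PRODUCT_ID INTEGER,  -- the id the business knows (e.g. 1211)
        PRODUCT_NAME      TEXT
    )
""")

source_rows = [("BILLING", 1211, "Gold Plan"), ("BILLING", 1212, "Silver Plan")]
conn.executemany(
    "INSERT INTO DIM_PRODUCT (SOURCE_SYSTEM, SOURCE_PRODUCT_ID, PRODUCT_NAME) VALUES (?, ?, ?)",
    source_rows,
)

# Business users can still filter and join by the familiar source id.
print(conn.execute(
    "SELECT PRODUCT_KEY, PRODUCT_NAME FROM DIM_PRODUCT WHERE SOURCE_PRODUCT_ID = 1211"
).fetchall())
```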
Dates in Source System: One other field that often confuses us is the date field in the source systems. A date field present in the source record tells us when the record arrived in the source system, while a date field generated in the data warehouse tells us when that source record arrived in the data warehouse.
The data warehouse record date and the source record date can be the same if the source record gets moved into the data warehouse the same day; the two fields will hold different values if there is a delay in the source data arriving in the data warehouse.
Why do we need to store the source date in the data warehouse? This need is very clear: we always perform date-based analysis based on the arrival of the source record in the source systems.
Why do we need to generate a data warehouse date? Capturing the arrival of a record into the data warehouse answers queries related to audit and data growth, and also helps determine which new records arrived into the warehouse, which is especially useful for providing incremental extracts for downstream marts or other systems (see the sketch below).
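A minimal sketch of that last point, using illustrative column names (SOURCE_DATE, DW_LOAD_DATE) and an assumed watermark: the incremental extract for a downstream mart is driven by the warehouse load date, not the source date.

```python
# A sketch only: drive the incremental extract for a downstream mart off the
# warehouse load date, not the source date. SOURCE_DATE, DW_LOAD_DATE and the
# watermark handling are illustrative.
from datetime import date

payments = [
    {"payment_id": 1, "SOURCE_DATE": date(2008, 6, 16), "DW_LOAD_DATE": date(2008, 6, 17)},
    {"payment_id": 2, "SOURCE_DATE": date(2008, 6, 17), "DW_LOAD_DATE": date(2008, 6, 17)},
    {"payment_id": 3, "SOURCE_DATE": date(2008, 6, 17), "DW_LOAD_DATE": date(2008, 6, 18)},
]

last_extract_date = date(2008, 6, 17)  # watermark of the previous downstream extract

# Everything that landed in the warehouse after the watermark goes downstream,
# regardless of how old the record is in the source system.
incremental = [r for r in payments if r["DW_LOAD_DATE"] > last_extract_date]
print(incremental)  # only payment_id 3
```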

Read Brief About: Data Integration Challenge 

Monday 9 June 2008

Hybrid OLAP – The Future of Information Delivery

As I get to see more Enterprise BI initiatives, it is becoming increasingly clear (at least to me!) that when it comes to information dissemination, Hybrid Online Analytical Processing (HOLAP) is the way to go. Let me explain my position here.
As you might be aware, Relational (ROLAP), Multi-dimensional (MOLAP) and Hybrid OLAP (HOLAP) are the 3 modes of information delivery for BI systems. In an ROLAP environment, the data is stored in a relational structure and is accessed through a semantic layer (usually!). MOLAP, on the other hand, stores data in a proprietary format, providing the notion of a multi-dimensional cube to users. HOLAP combines the power of both ROLAP and MOLAP systems and, with the rapid improvements made by BI tool vendors, seems to have finally arrived on the scene.
In my mind, the argument for subscribing to the HOLAP paradigm goes back to the “classic” article by Ralph Kimball on the different types of fact table grains. According to him, there are 3 types of fact tables – Transaction grained, Periodic snapshot and Accumulating snapshot – and at least 2 of them are required to model a business situation completely. From an analytical standpoint, this means that operational data has to be analyzed along with summarized data (snapshots) for business users to take informed decisions.
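To make the grain distinction concrete, here is a small illustrative sketch; the column layouts are assumptions for illustration, not Kimball's exact designs.

```python
# Illustrative rows for the three fact table grains; column names are assumptions.

# Transaction grain: one row per individual event.
transaction_fact = [
    {"order_id": 9001, "order_date": "2008-06-01", "amount": 120.0},
    {"order_id": 9002, "order_date": "2008-06-01", "amount": 45.0},
]

# Periodic snapshot: one row per entity per period, summarizing activity.
periodic_snapshot = [
    {"account_id": 77, "month": "2008-05", "closing_balance": 1540.0},
    {"account_id": 77, "month": "2008-06", "closing_balance": 1610.0},
]

# Accumulating snapshot: one row per process instance, milestone dates filled
# in as the process moves along (None until reached).
accumulating_snapshot = [
    {"order_id": 9001, "ordered": "2008-06-01", "shipped": "2008-06-03", "delivered": None},
]
```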
Traditionally, the BI world has handled this problem in 2 ways:
1) Build everything on the ROLAP architecture. Handle the summarization either on the fly or through summarized reporting tables at the database level. This is not a very elegant solution, as everybody in the organization (even those analysts working with summarized information) gets penalized for the slow performance of SQL queries issued against the relational database through the semantic layer.
2) Profile users and segregate operational analysts from strategic analysts. Operational users are provided ROLAP tools while business users working primarily with summarized information are provided their “own” cubes (MOLAP) for high-performance analytics.
Both solutions are rapidly becoming passé. In many organizations now, business users want to look at summarized information and, based on what they see, need the facility to drill down to granular level information. A good example is the case of analyzing Ledger information (Income Statement & Balance Sheet) and then drilling down to Journal entries as required. All this drilling down has to happen through a common interface – either an independent BI tool or an enterprise portal with an underlying OLAP engine.
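A minimal sketch of that drill-through idea, under assumed names (a ledger_cube summary store and a journal_detail relational store): requests at summary grains are answered from the pre-aggregated store, while a drill-down to journal entries is routed to the relational detail.

```python
# A sketch only of HOLAP-style query routing: serve summary-level requests
# from a pre-aggregated store and route drill-downs to relational detail.
# All names (ledger_cube, journal_detail, grain labels) are illustrative.

SUMMARY_GRAINS = {"year", "quarter", "month"}   # grains held in the cube
DETAIL_GRAINS = {"journal_entry"}               # grains only in the relational store

ledger_cube = {("2008", "Q1"): 1250000.0, ("2008", "Q2"): 1410000.0}
journal_detail = [
    {"period": ("2008", "Q1"), "journal_id": "J-1001", "amount": 750000.0},
    {"period": ("2008", "Q1"), "journal_id": "J-1002", "amount": 500000.0},
]

def query(grain, period):
    """Route the request based on the grain the user asked for."""
    if grain in SUMMARY_GRAINS:
        return ledger_cube[period]                                     # MOLAP-style answer
    if grain in DETAIL_GRAINS:
        return [r for r in journal_detail if r["period"] == period]    # ROLAP drill-through
    raise ValueError(f"unknown grain: {grain}")

print(query("quarter", ("2008", "Q1")))        # summarized ledger balance
print(query("journal_entry", ("2008", "Q1")))  # underlying journal entries
```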
This is the world of HOLAP and it is here to stay. The technology improvement that is making this possible is the relatively new wonder-kid, XMLA (XML for Analysis). More about XMLA in my subsequent posts.
As an example of HOLAP architecture, you can take a look at this link to understand the integration of Essbase cubes (MOLAP at its best) with OBIEE (Siebel Analytics – ROLAP platform) to provide a common semantic model for end-user analytics.
Information Nugget: If you are interested in Oracle Business Intelligence, please do stop by at http://www.rittmanmead.com/blog/ to read the blog posts there. The articles are very informative and thoroughly practical.
Thanks for reading. Please do share your thoughts.
Read More Hybrid OLAP

Monday 2 June 2008

Data Integration Challenge – Parent-Child Record Sets, Child Updates

There are certain special sets of records, like a Loan and its Guarantor details in a banking system, where each Loan record can have one or more Guarantor records. In a similar way, for a services-based industry, Contracts and their Contract Components exist. These sets can be called parent-child records, where for one parent record like a Loan we might have zero to many child records like Guarantors.
During data modeling we would have one table for the parent-level record and its attributes, and a separate table for the child records and their attributes.
As part of the data load process, we have seen situations where a complete refresh (delete & insert) of the child records is required whenever there is a change in certain attributes of the parent record. This requirement can be implemented in different ways; here we look at one of the better ways to get it accomplished.
The following steps would be involved in the ETL process (a minimal sketch of this flow follows the discussion below):
  1. Read the parent-child record set
  2. Determine if there is a change in the incoming parent record
  3. If a change has occurred, issue a delete for the particular set of child records
  4. Write the corresponding incoming new child records into a flat file
  5. Once steps 1 to 4 are completed for all parent records, have another ETL flow bulk load the records from the flat file into the child table
We don’t issue an insert of the new incoming child records right after the delete because the deleted records would not yet have been committed and the insert could lock the table. We could issue a commit after every delete and then follow it with the insert, but a commit after each delete would be costly; writing the inserts to a file handles this situation neatly.
Also, the option of inserting first with a different key and then deleting the older records would be costlier in terms of locating the records that need to be deleted.
We could also have looked at the option of updating the records instead of deleting them, but then we would at times end up with dead records in the child tables: records that have been deleted in the source would still exist in the target child table. Also, updating records can disturb contiguous storage, whereas delete and insert keeps the pages intact.
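A minimal sketch of this flow, assuming hypothetical LOAN (parent) and GUARANTOR (child) tables, a change test on a single parent attribute (STATUS) and a CSV file standing in for the staged flat file; the bulk load is simulated with executemany.

```python
# A sketch only of the parent-child refresh: delete child rows for changed
# parents, stage the new child rows in a flat file, then bulk load the file.
# Table names, the change test and the CSV staging are illustrative.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS LOAN (LOAN_ID INTEGER PRIMARY KEY, STATUS TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS GUARANTOR (LOAN_ID INTEGER, GUARANTOR_ID INTEGER, NAME TEXT)")

def refresh_children(incoming, child_file="guarantor_stage.csv"):
    """incoming: list of dicts, each holding one parent record and its child records."""
    # Steps 1-4: detect changed parents, delete their old children, stage the new ones.
    with open(child_file, "w", newline="") as fh:
        writer = csv.writer(fh)
        for rec in incoming:
            parent = rec["loan"]
            current = conn.execute(
                "SELECT STATUS FROM LOAN WHERE LOAN_ID = ?", (parent["LOAN_ID"],)
            ).fetchone()
            if current and current[0] != parent["STATUS"]:   # a tracked parent attribute changed
                conn.execute("DELETE FROM GUARANTOR WHERE LOAN_ID = ?", (parent["LOAN_ID"],))
                for child in rec["guarantors"]:
                    writer.writerow([parent["LOAN_ID"], child["GUARANTOR_ID"], child["NAME"]])
    conn.commit()

    # Step 5: a separate flow bulk loads the staged flat file into the child table.
    with open(child_file, newline="") as fh:
        rows = list(csv.reader(fh))
    conn.executemany(
        "INSERT INTO GUARANTOR (LOAN_ID, GUARANTOR_ID, NAME) VALUES (?, ?, ?)", rows
    )
    conn.commit()
```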
Read More about  Data Integration


Thursday 15 May 2008

Data Integration Challenge – Storing Timestamps

Storing timestamps along with a record, indicating its new arrival or a change in its value, is a must in a data warehouse. We always take it for granted, adding timestamp fields to table structures, and tend to miss that the amount of storage space a timestamp field occupies is huge: in many databases like SQL Server and Oracle the storage occupied by a timestamp is almost double that of an integer data type, and if we have two such fields, one insert timestamp and one update timestamp, the required storage doubles again. There are many instances where we could avoid using timestamps, especially when they are used primarily for determining the incremental records or stored just for audit purposes.
How do we effectively manage the data storage and still leverage the benefit of a timestamp field?
One way of managing the storage of timestamp fields is by introducing a process id field and a process table. Following are the steps involved in applying this method to the table structures as well as to the ETL process.
Data Structure
  1. Consider a table named PAYMENT with two fields of timestamp data type, INSERT_TIMESTAMP and UPDATE_TIMESTAMP, used for capturing the changes for every record present in the table
  2. Create a table named PROCESS_TABLE with columns PROCESS_NAME Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
  3. Now drop the fields of the TIMESTAMP data type from the table PAYMENT
  4. Create two fields of integer data type in the table PAYMENT, INSERT_PROCESS_ID and UPDATE_PROCESS_ID
  5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID would be logically linked to the table PROCESS_TABLE through its field PROCESS_ID
ETL Process
  1. Let us consider an ETL process called ‘Payment Process’ that loads data into the table PAYMENT
  2. Now create a pre-process that runs before the ‘payment process’; in the pre-process, build the logic by which a record with values like (‘payment process’, sequence number, current timestamp) is inserted into the PROCESS_TABLE. The PROCESS_ID in the PROCESS_TABLE could be generated by a database sequence
  3. Pass the current_process_id from the pre-process step to the ‘payment process’ ETL process
  4. In the ‘payment process’, if a record is to be inserted into the PAYMENT table then the current_process_id value is set in both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID; if a record is being updated in the PAYMENT table then the current_process_id value is set only in the column UPDATE_PROCESS_ID
  5. So now the timestamp values for the records inserted or updated in the table PAYMENT can be picked from the PROCESS_TABLE by joining the PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT table (a sketch of the structures and this flow follows the benefits below)
Benefits
  • The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space when compared to the timestamp fields
  • Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are Index friendly
  • It is easier to handle these process id fields when picking the records for determining incremental changes or for any audit reporting
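A minimal sketch of the whole approach, using SQLite syntax for illustration; in Oracle or SQL Server the data types and the sequence mechanism would differ, and the MAX()+1 lookup merely stands in for a real database sequence.

```python
# A sketch only of the process-id approach. Table and column names follow the
# post; the MAX()+1 "sequence" and the sample rows are illustrative.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")

# PAYMENT keeps two small integer process-id fields instead of two timestamps.
conn.execute("""
    CREATE TABLE PAYMENT (
        PAYMENT_ID        INTEGER,
        AMOUNT            REAL,
        INSERT_PROCESS_ID INTEGER,   -- set once, when the record is first inserted
        UPDATE_PROCESS_ID INTEGER    -- reset every time the record changes
    )
""")

# PROCESS_TABLE holds one row per ETL run and carries the actual timestamp.
conn.execute("""
    CREATE TABLE PROCESS_TABLE (
        PROCESS_NAME      CHAR(25),
        PROCESS_ID        INTEGER PRIMARY KEY,   -- stands in for a database sequence
        PROCESS_TIMESTAMP TIMESTAMP
    )
""")

def start_process(name):
    """Pre-process: register this ETL run and return its process id."""
    next_id = conn.execute(
        "SELECT COALESCE(MAX(PROCESS_ID), 0) + 1 FROM PROCESS_TABLE").fetchone()[0]
    conn.execute("INSERT INTO PROCESS_TABLE VALUES (?, ?, ?)",
                 (name, next_id, datetime.now().isoformat()))
    return next_id

process_id = start_process("payment process")

# Insert: both process-id columns get the current process id.
conn.execute("INSERT INTO PAYMENT VALUES (?, ?, ?, ?)", (101, 250.0, process_id, process_id))

# Update: only UPDATE_PROCESS_ID gets the current process id.
conn.execute("UPDATE PAYMENT SET AMOUNT = ?, UPDATE_PROCESS_ID = ? WHERE PAYMENT_ID = ?",
             (275.0, process_id, 101))

# Timestamps are recovered by joining back to PROCESS_TABLE, e.g. for audits
# or for picking up incremental changes.
print(conn.execute("""
    SELECT p.PAYMENT_ID,
           ins.PROCESS_TIMESTAMP AS inserted_at,
           upd.PROCESS_TIMESTAMP AS updated_at
    FROM PAYMENT p
    JOIN PROCESS_TABLE ins ON ins.PROCESS_ID = p.INSERT_PROCESS_ID
    JOIN PROCESS_TABLE upd ON upd.PROCESS_ID = p.UPDATE_PROCESS_ID
""").fetchall())
```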
Read More about Data Integration

Wednesday 9 April 2008

Data Integration Challenge – Initial Load – II

The other challenges during the initial load are:
System Performance
Catching Up
System Performance is always a challenge during the initial load, especially when many years of history data are to be loaded; the huge data volume drives up system resource usage in a way that would not happen during a regular incremental load. Some of the ways of handling the system performance issue during the initial load are (a sketch of the chunked-load approach follows this list):
  • Group the data loads by filters like year or customer and load the related data in chunks. We could load the data for the months Jan, Feb, Mar in turn, or load the customers region-wise, NY followed by NJ, etc. Such grouping of records eliminates the data surge and also provides a better way to perform data validation
  • As the data gets loaded into the warehouse, the data required for lookups against the warehouse becomes huge; we need to restrict the lookup to the records required by the incoming data. For example, if the warehouse has data for all the regions North, South, East and West and the incoming data currently has only North data, then we need an override filter so that only the data pertaining to North is read from the warehouse
  • We could plan and increase the available memory of the ETL server for a temporary period
  • Avoid sequential inserts; write the data to a file, sort it and bulk load
  • Determine and plan for the additional disk space required for the initial load files that are extracted and provided by the source systems
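A minimal sketch of the chunked-load idea, with placeholder functions standing in for the real extract, transform, bulk load and validation steps; the month list is illustrative.

```python
# A sketch only: load history in chunks (here, month by month) instead of in
# one surge. extract_from_source, transform, bulk_load and validate_chunk are
# placeholders for the real ETL and reconciliation steps.

MONTHS = ["2007-01", "2007-02", "2007-03"]  # illustrative grouping filter

def extract_from_source(month):
    """Placeholder: pull only the source rows for one month."""
    return [{"month": month, "amount": 100.0}]

def transform(rows):
    return rows  # placeholder for the real transformation logic

def bulk_load(rows):
    print(f"bulk loading {len(rows)} rows")  # placeholder for a bulk loader

def validate_chunk(month, rows):
    print(f"{month}: {len(rows)} rows, total {sum(r['amount'] for r in rows)}")

for month in MONTHS:
    chunk = transform(extract_from_source(month))
    bulk_load(chunk)              # smaller, predictable loads instead of one surge
    validate_chunk(month, chunk)  # validation is easier per chunk
```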
Catching Up is an interesting problem wherein the warehouse is not able to cope and deliver data as current (or 1 day old) as the source system. This problem is more often due to ETL performance issues: even before the initial data is successfully loaded and verified in the DW, an additional set of new records would have come from the source, which the DW is not able to catch up with.
Though at times code tuning, running things in parallel or a hardware upgrade (usually a costly option at a later stage) could resolve such problems, in certain situations the problem can run into an unsolvable state where the complete ETL architecture has to be relooked at.
One other way to manage such situations is to have the daily process of loading the current data proceed independently and, in parallel, have a separate set of processes bring in the history data on a regular basis; in certain scenarios we might also need to build a process that runs to sync up the current data with the old data, especially any aggregate data designed in the data model.
Read More about  

Tuesday 18 March 2008

Data Integration Challenge – Initial Load – I


In a data warehouse, all tables usually go through two phases of the data load process: the initial load and the incremental load. ‘History Load’ or ‘Initial Seeding/Load’ involves a one-time load of the past years of source transaction system data into the data management system. The process of adding only the new records (updates or inserts) to the data warehouse tables, either daily or on a predefined frequency, is called ‘Incremental Load‘. Also, certain tables that are small in size and largely independent, and which receive full data (current data + history data) as input, are loaded by means of a ‘Full Refresh‘; this involves a complete delete and reload of the data.

Code tables in particular would usually undergo a one-time initial load and may not require a regular incremental load; incremental load is common for fact tables. The initial load of a data warehouse system is quite a challenge in terms of getting it completed successfully within a planned timeframe. Some of the surprises or challenges faced in completing the history load are:
  1. Handling invalid records
  2. Data Reconciliation
  3. System performance
  4. Catching up
Handling Invalid Records:
The occurrence of invalid records becomes much more prominent as we process history data that was collected into the source system long ago and might not fit the current business rules. Records from a source system can become invalid in the data warehouse for multiple reasons, like an invalid domain value for a column, a null value for a non-nullable field, or aggregate data not matching the detail data. The ways of handling this problem effectively are (a sketch of the divert-to-error-tables approach follows this list):
  • Determine the years of data to be loaded into the data warehouse very early and ensure that data profiling is performed on sample data for all the years that have to be loaded. This ensures that most of the data validation rules are identified up front and built into the ETL process. In certain situations we may have to build separate data validation and transformation logic based on the year and the data
  • Especially in situations like re-platforming or migrating an existing data warehouse to a new platform, even before running the data through the regular ETL process we might need to load the old data into a data validation (staging) area where the data is analyzed and cleaned, and only then load it into the data warehouse through the regular ETL process
  • Design the ETL process to divert all the key values of the invalid records to a separate set of tables. At some sites the customer just needs to be made aware of the error records and is fine if those records do not get aligned into the current warehouse, but at other times the invalid records are corrected and reloaded
  • For certain scenarios like aggregate data not matching detail data, though we would usually derive the aggregate from the detail data, at times we might also generate detail data to match the aggregate data
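A minimal sketch of the divert-to-error-tables approach, with illustrative validation rules (a non-nullable customer id and a fixed region domain) and an in-memory list standing in for the error tables.

```python
# A sketch only: validate incoming history records and divert the keys of
# invalid ones to an error store instead of loading them. The rules, field
# names and the in-memory "error table" are illustrative.

VALID_REGIONS = {"NORTH", "SOUTH", "EAST", "WEST"}

def validate(record):
    """Return a list of reasons the record is invalid (empty list = valid)."""
    reasons = []
    if record.get("customer_id") is None:
        reasons.append("null value for non-nullable field customer_id")
    if record.get("region") not in VALID_REGIONS:
        reasons.append(f"invalid domain value for region: {record.get('region')}")
    return reasons

incoming = [
    {"customer_id": 1, "region": "NORTH", "amount": 100.0},
    {"customer_id": None, "region": "NORTH", "amount": 50.0},
    {"customer_id": 3, "region": "NORTHEAST", "amount": 75.0},
]

clean_rows, error_rows = [], []
for rec in incoming:
    reasons = validate(rec)
    if reasons:
        # Divert only the key values plus the reasons, for later review or reload.
        error_rows.append({"customer_id": rec["customer_id"], "reasons": reasons})
    else:
        clean_rows.append(rec)

print(len(clean_rows), "rows loaded;", len(error_rows), "rows diverted")
```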
Data Reconciliation:
Once the initial load from the source system into the data warehouse has been completed, we have to validate that the data has been moved in correctly.
  • Having a means of loading records in groups, separated by year or by any logical grouping like customer or product, gives better control over data validation. In general the validations performed are counts and sums tied to certain business-specific validation rules, e.g. all customers from region ‘A’ belonging to division ‘1’ in the source should be classified under division ‘3’ in the current warehouse
  • All the validations that need to be performed after the initial load for each data group have to be prepared and verified with the business team; many a time the data is validated by the business through ad hoc queries, though the same can be verified by an automated ETL process built by the data warehouse team (a sketch follows below)
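A minimal sketch of an automated count-and-sum reconciliation per data group; the group key (load year), the in-memory row sets and the exact-match comparison are all illustrative.

```python
# A sketch only: reconcile counts and sums between source and warehouse,
# group by group (here the group is the load year). Field names and the
# in-memory row sets are illustrative.
from collections import defaultdict

def summarize(rows, group_field="year", measure="amount"):
    """Count and sum the measure per group."""
    summary = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for r in rows:
        g = summary[r[group_field]]
        g["count"] += 1
        g["sum"] += r[measure]
    return summary

source_rows = [{"year": 2007, "amount": 100.0}, {"year": 2007, "amount": 50.0}]
warehouse_rows = [{"year": 2007, "amount": 100.0}, {"year": 2007, "amount": 50.0}]

src, dwh = summarize(source_rows), summarize(warehouse_rows)
for year in sorted(set(src) | set(dwh)):
    match = src.get(year) == dwh.get(year)
    print(year, "OK" if match else f"MISMATCH source={src.get(year)} warehouse={dwh.get(year)}")
```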
We shall discuss the other challenges further in Part II.
Read More About: Data Integration