
Friday 3 October 2008

Data Integration Challenge – Storing Timestamps

Storing timestamps along with a record, indicating its new arrival or a change in its value, is a must in a data warehouse. We tend to take this for granted, adding timestamp fields to table structures while missing how much storage space a timestamp field occupies: in many databases such as SQL Server and Oracle, a timestamp takes roughly double the storage of an integer data type, and if we keep two fields, one for the insert timestamp and another for the update timestamp, the storage required doubles again. There are many instances where we could avoid storing timestamps, especially when they are used primarily for determining incremental records or are kept just for audit purposes.

How do we manage the data storage effectively and still get the benefit of a timestamp field?
One way of managing the storage of timestamp fields is to introduce a process id field and a process table. Following are the steps involved in applying this method, both in the table structures and as part of the ETL process.
Data Structure
  1. Consider a table named PAYMENT with two fields of timestamp data type, INSERT_TIMESTAMP and UPDATE_TIMESTAMP, used for capturing when every record in the table was inserted or changed
  2. Create a table named PROCESS_TABLE with columns PROCESS_NAME Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
  3. Now drop the fields of the TIMESTAMP data type from table PAYMENT
  4. Create two fields of integer data type in the table PAYMENT, named INSERT_PROCESS_ID and UPDATE_PROCESS_ID
  5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID would be logically linked to the table PROCESS_TABLE through its field PROCESS_ID (a DDL sketch follows this list)
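As a rough illustration, the revised structures could look like the following; the PAYMENT_ID column and the exact data types are only indicative, so adjust them to your database and table layout:

  -- PROCESS_TABLE holds one row per ETL run; the timestamp is stored here only once
  CREATE TABLE PROCESS_TABLE (
      PROCESS_ID        INTEGER   NOT NULL PRIMARY KEY,
      PROCESS_NAME      CHAR(25)  NOT NULL,
      PROCESS_TIMESTAMP TIMESTAMP NOT NULL
  );

  -- PAYMENT carries two small integer columns instead of two timestamp columns
  CREATE TABLE PAYMENT (
      PAYMENT_ID        INTEGER NOT NULL PRIMARY KEY,
      -- ... other business columns ...
      INSERT_PROCESS_ID INTEGER NOT NULL,  -- run that created the record
      UPDATE_PROCESS_ID INTEGER NOT NULL   -- run that last updated the record
  );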
ETL Process
  1. Let us consider an ETL process called ‘Payment Process’ that loads data into the table PAYMENT
  2. Now create a pre-process which would run before the ‘payment process’; in the pre-process, build the logic by which a record with values like (‘payment process’, sequence number, current timestamp) is inserted into the PROCESS_TABLE table. The PROCESS_ID in the PROCESS_TABLE table could be generated by a database sequence.
  3. Pass the currently generated PROCESS_ID of PROCESS_TABLE as ‘current_process_id’ from the pre-process step to the ‘payment process’ ETL process
  4. In the ‘payment process’, if a record is to be inserted into the PAYMENT table then the current_process_id value is set on both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID; if a record is getting updated in the PAYMENT table then the current_process_id value is set only on the column UPDATE_PROCESS_ID
  5. So now the timestamp values for the records inserted or updated in the table PAYMENT can be picked from PROCESS_TABLE by joining its PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT table (see the SQL sketch after this list)
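A minimal sketch of the pre-process step and the timestamp lookup, assuming an Oracle-style sequence named PROCESS_SEQ; the sequence name and syntax are only illustrative and would differ on other databases:

  -- Pre-process: register the current run and note its PROCESS_ID
  INSERT INTO PROCESS_TABLE (PROCESS_ID, PROCESS_NAME, PROCESS_TIMESTAMP)
  VALUES (PROCESS_SEQ.NEXTVAL, 'payment process', CURRENT_TIMESTAMP);

  -- Whenever the timestamps are needed, recover them through a join
  SELECT p.*,
         pi.PROCESS_TIMESTAMP AS INSERT_TIMESTAMP,
         pu.PROCESS_TIMESTAMP AS UPDATE_TIMESTAMP
  FROM   PAYMENT       p
  JOIN   PROCESS_TABLE pi ON pi.PROCESS_ID = p.INSERT_PROCESS_ID
  JOIN   PROCESS_TABLE pu ON pu.PROCESS_ID = p.UPDATE_PROCESS_ID;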
Benefits
  • The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space when compared to the timestamp fields
  • Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are Index friendly
  • It is easier to handle these process id fields when picking records for determining incremental changes or for any audit reporting (an example incremental-extraction query follows this list)
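For example, incremental extraction becomes a simple integer comparison; here :LAST_EXTRACTED_ID stands for whatever bookmark value the downstream process stored on its previous run (the name is illustrative):

  -- Pick up only the records touched since the last extract
  SELECT *
  FROM   PAYMENT
  WHERE  UPDATE_PROCESS_ID > :LAST_EXTRACTED_ID;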
Read More About Data Integration Challenge

Friday 18 July 2008

Data Integration Challenge – Error Handling

Detecting and handling the errors encountered in the process of data transformation is one of the key design aspects in building a robust data integration platform. When an error occurs, how do we capture it and use it for effective analysis? Following are the best practices related to error handling.
  1. Differentiate the error handling process into the Generic checks (null, datatype, data format) and the Specific ones, such as rules related to the business process. This differentiation enables building reusable error handling code
  2. Do not stop validations when the record fails one of them; continue with the other validations on the incoming data. If we have 5 validations to be done on a record, we need to design the process so that the incoming record is taken through all the validations; this ensures that we capture all the errors in a record in one go
  3. Have a table Error_Info; this is the repository of all the error messages. The fields would be ErrorCode, ErrorType and ErrorMessage. ErrorType would carry the values ‘warning’ or ‘error’, ErrorMessage would have a detailed description of the error, and ErrorCode is a numeric value used in place of the description
  4. In general each validation should have an error message, so we could also see the table Error_Info as a repository of all error validations performed in the system. In case of business rules that involve multiple fields, the field ErrorMessage in the table Error_Info can carry the details of the business rule applied along with the field names; we can also create an additional field Error_Category to group the error messages
  5. Have a table Error_Details; this stores the errors captured. The fields of this table would be KeyValue, FieldName, FieldValue and ErrorCode. KeyValue would hold the value of the primary key of the record which has an error, FieldName would store the name of the field which has an error, FieldValue holds the value of the field which has failed, and ErrorCode details the error through a link to the table Error_Info (a DDL sketch of these tables follows this list)
  6. Write each error captured as a separate record in the table Error_Details, i.e., if a record fails two conditions, such as a NULL check on the field ‘CustomerId’ and a data format check on the field ‘Date’, then ensure we write two records, one for the NULL failure and one for the data format failure
  7. To retain the whole incoming record, have a table Source_Data with the same structure as the incoming data. Have a field FLAG in Source_Data; a value of ‘1’ would say that the record has passed all the validations and ‘0’ would say that it has failed one or more validations
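A possible shape for these two tables, keeping the column names used above; the data types and lengths are indicative only:

  -- Repository of all error messages / validations
  CREATE TABLE Error_Info (
      ErrorCode      INTEGER      NOT NULL PRIMARY KEY,
      ErrorType      VARCHAR(10)  NOT NULL,  -- 'warning' or 'error'
      ErrorMessage   VARCHAR(255) NOT NULL,  -- detailed description of the error
      Error_Category VARCHAR(50)             -- optional grouping of messages
  );

  -- One row per error captured on an incoming record
  CREATE TABLE Error_Details (
      KeyValue   VARCHAR(50)  NOT NULL,  -- primary key value of the failing record
      FieldName  VARCHAR(50)  NOT NULL,  -- field that failed the validation
      FieldValue VARCHAR(255),           -- offending value, kept as text
      ErrorCode  INTEGER      NOT NULL   -- links to Error_Info.ErrorCode
  );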
In summary, the whole process is to read the incoming record, validate the data, assign the error code for any validation failure and pipe the errors captured to the Error_Details table, and once all validations are completed assign the FLAG value (1 - valid record, 0 - invalid record) and insert that record into the Source_Data table. Having the data structures as suggested above enables efficient analysis of the errors captured by the business and IT teams.
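For instance, if one incoming payment record fails both the NULL check on CustomerId and the date-format check on Date, the run would land rows roughly like these; the key values, error codes and Source_Data columns are made up for illustration, since Source_Data simply mirrors whatever the incoming layout is:

  -- Two error rows for the same record, one per failed validation
  INSERT INTO Error_Details (KeyValue, FieldName, FieldValue, ErrorCode)
  VALUES ('PAY-1001', 'CustomerId', NULL, 101);

  INSERT INTO Error_Details (KeyValue, FieldName, FieldValue, ErrorCode)
  VALUES ('PAY-1001', 'Date', '31-13-2008', 102);

  -- The whole incoming record is still retained in Source_Data, flagged invalid
  INSERT INTO Source_Data (CustomerId, PaymentDate, Amount, FLAG)
  VALUES (NULL, '31-13-2008', 250.00, 0);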
Read More About  Data Integration Challenge