Business Intelligence & Analytics: Data Integration Challenge – Understanding Lookup Process

In Part II we discussed when to use and when not to use each type of lookup process: the Direct Query lookup, the Join based lookup and the Cache file based lookup. Now let us look at the points to consider for better performance of each of these lookup types.

In the case of the Direct Query lookup, the following points are to be considered:

Index on the lookup condition columns
Selecting only the required columns
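
As a rough illustration of these two points, here is a minimal sketch using Python's built-in sqlite3 module. The table customer_dim, its columns and the index name are hypothetical and only stand in for a real lookup table; the same idea applies to any database.

```python
import sqlite3

# Hypothetical lookup table, created in memory purely for illustration.
direct_conn = sqlite3.connect(":memory:")
direct_conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_code TEXT, customer_key INTEGER, name TEXT, address TEXT)"
)
direct_conn.executemany(
    "INSERT INTO customer_dim VALUES (?, ?, ?, ?)",
    [("C001", 1, "Acme", "1 Main St"), ("C002", 2, "Globex", "2 Side St")],
)

# Point 1: index on the lookup condition column.
direct_conn.execute(
    "CREATE INDEX idx_customer_code ON customer_dim (customer_code)"
)

LOOKUP_SQL = "SELECT customer_key FROM customer_dim WHERE customer_code = ?"


def direct_lookup(code):
    """Point 2: fetch only the required column, not SELECT *."""
    row = direct_conn.execute(LOOKUP_SQL, (code,)).fetchone()
    return row[0] if row else None


found_key = direct_lookup("C002")
missing_key = direct_lookup("C999")

# Ask SQLite how it plans to run the lookup; the last column of each
# plan row is a text description that names the index when it is used.
plan = direct_conn.execute("EXPLAIN QUERY PLAN " + LOOKUP_SQL, ("C002",)).fetchall()
uses_index = any("idx_customer_code" in row[-1] for row in plan)
```

Each incoming record triggers one such query, so the index and the narrow column list matter for every single row processed.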

In the case of the Join based lookup, the following points are to be considered:

Index on the columns that are used as part of Join conditions
Selecting only the required columns
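
A comparable sketch for the join based lookup, again with sqlite3 and hypothetical table names (stg_orders as the source, customer_dim as the lookup table). The lookup is resolved for all source rows in one SQL statement, with an index on the join column and only the needed columns selected.

```python
import sqlite3

# Hypothetical source and lookup tables, for illustration only.
join_conn = sqlite3.connect(":memory:")
join_conn.executescript(
    """
    CREATE TABLE stg_orders (order_id INTEGER, customer_code TEXT, amount REAL);
    CREATE TABLE customer_dim (customer_code TEXT, customer_key INTEGER, name TEXT);

    -- Index on the column used in the join condition.
    CREATE INDEX idx_dim_code ON customer_dim (customer_code);

    INSERT INTO stg_orders VALUES (10, 'C001', 25.0), (11, 'C999', 40.0);
    INSERT INTO customer_dim VALUES ('C001', 1, 'Acme'), ('C002', 2, 'Globex');
    """
)

# Select only the required columns; LEFT JOIN keeps source rows
# that have no match in the lookup table (customer_key is NULL).
joined_rows = join_conn.execute(
    """
    SELECT o.order_id, d.customer_key
    FROM stg_orders AS o
    LEFT JOIN customer_dim AS d ON d.customer_code = o.customer_code
    ORDER BY o.order_id
    """
).fetchall()
```

The LEFT JOIN is a choice made for this sketch so that unmatched source rows remain visible; an inner join would drop them.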

In the case of Cache file based lookup, let us first try to understand the process of how these files are built and queried.

The key aspects of a lookup process are:

SQL that pulls the data from the lookup table
Cache memory/files that hold the data
Lookup conditions that query the cache memory/file
Output columns that are returned from the cache files

Cache file build process:

Depending on the product (Informatica or Datastage), when a lookup process is designed we define the ‘lookup conditions’ or ‘key fields’, along with the list of fields to be returned by the lookup query. Based on these definitions, the required data is pulled from the lookup table and the cache file is populated with it. The cache file structure is optimized for data retrieval on the assumption that the cache file will be queried on a certain set of columns, the ‘lookup conditions’ or ‘key fields’.

In the case of Informatica, the cache consists of a separate index file and data file: the index file holds the fields that are part of the ‘lookup condition’ and the data file holds the fields that are to be returned. Datastage cache files are called Hash files, which are optimized based on the ‘key fields’.
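
To make the split between index and data concrete, here is a simplified in-memory sketch. It is not how Informatica or Datastage actually lay out their files; the function build_cache, the dictionary standing in for the index file and the list standing in for the data file are all assumptions made for illustration.

```python
def build_cache(rows, key_fields, return_fields):
    """Build a toy lookup cache from rows pulled out of the lookup table.

    index: maps the lookup condition values to a position (stands in
           for the index file).
    data:  holds only the columns to be returned (stands in for the
           data file).
    """
    index = {}
    data = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in index:
            # Assumption for this sketch: keep the first row per key.
            continue
        index[key] = len(data)
        data.append(tuple(row[f] for f in return_fields))
    return index, data


# Hypothetical rows, as they might come back from the lookup SQL.
lookup_rows = [
    {"customer_code": "C001", "region": "EU", "customer_key": 1, "name": "Acme"},
    {"customer_code": "C002", "region": "US", "customer_key": 2, "name": "Globex"},
]

cache_index, cache_data = build_cache(
    lookup_rows,
    key_fields=["customer_code", "region"],
    return_fields=["customer_key"],
)
```

Note that the name column never reaches the cache because it is neither a key field nor a return field, which keeps the cache small.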

Cache file query process:

Irrespective of the product of choice, the following steps are involved internally when a lookup process is invoked.

Process:

Get the inputs for the lookup query: the lookup condition and the columns to be returned
Load the cache file into memory
Search for the record(s) matching the lookup condition values; in the case of Informatica this search happens on the ‘index file’
Pull the required columns for the matching record(s) and return them; in the case of Informatica, the result of the ‘index file’ search is used to locate and retrieve the data from the ‘data file’
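
The steps above can be sketched roughly as follows. This is only a toy model: a sorted list searched with binary search stands in for the index file, a plain list stands in for the data file, and all names and values are hypothetical.

```python
from bisect import bisect_left

# Toy cache already "loaded into memory".
# index_keys is sorted so it can be searched quickly, much like a search tree.
index_keys = ["C001", "C002", "C005"]   # lookup condition values
index_positions = [0, 1, 2]             # where each key's data lives
data_rows = [(1, "Acme"), (2, "Globex"), (5, "Initech")]  # return columns


def cache_lookup(key):
    """Search the index for the key, then fetch the row from the data."""
    i = bisect_left(index_keys, key)
    if i == len(index_keys) or index_keys[i] != key:
        return None  # no record matches the lookup condition
    return data_rows[index_positions[i]]


hit = cache_lookup("C002")
miss_inside = cache_lookup("C003")
miss_beyond = cache_lookup("C009")
```

In a real tool the index and data may not fit in memory, which is where the disk hits and page swapping mentioned next come from.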

In the search process, depending on memory availability, there could be many disk hits and a lot of page swapping.

So in terms of performance tuning we could look at two levels:

How to optimize the cache file building process
How to optimize the cache file query process

The following table lists the points to be considered for better performance of a cache file based lookup:

Category	Points to consider
Optimize cache file building process	• While retrieving the records to build the cache file, sort the records by the lookup condition. This sorting speeds up the index (file) building process, because the search tree of the index file is built faster, with less node realignment
	• Select only the required fields, thereby reducing the cache file size
	• Reuse the same cache file across multiple requirements that have the same or slightly varied lookup conditions
Optimize cache file query process	• Sort the records that come from the source to query the cache file by the lookup condition columns; this reduces page swapping and disk hits. If successive input records arrive in continuous sorted order, the required index data is more likely to already be in memory, so disk swapping is reduced
	• Having a dedicated separate disk ensures reserved space for the lookup cache files and also improves response when writing to and reading from the disk
	• Avoid querying a recurring lookup condition repeatedly by sorting the incoming records by the lookup condition
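
The last point in the table can be illustrated with a small sketch: once the incoming records are sorted by the lookup condition, identical keys sit next to each other, so the previous result can be reused instead of probing the cache again. The function name, the record layout and the dictionary used as the cache are hypothetical.

```python
def lookup_sorted_stream(records, key_field, cache):
    """Resolve lookups for records sorted by key, probing once per distinct key."""
    probes = 0
    last_key = object()   # sentinel that never equals a real key
    last_result = None
    resolved = []
    for rec in sorted(records, key=lambda r: r[key_field]):
        key = rec[key_field]
        if key != last_key:
            # New key value: this is the only place the cache is queried.
            last_result = cache.get(key)
            last_key = key
            probes += 1
        resolved.append((rec["order_id"], last_result))
    return resolved, probes


source_records = [
    {"order_id": 1, "customer_code": "C002"},
    {"order_id": 2, "customer_code": "C001"},
    {"order_id": 3, "customer_code": "C002"},
    {"order_id": 4, "customer_code": "C001"},
]
toy_cache = {"C001": 1, "C002": 2}

resolved, probes = lookup_sorted_stream(source_records, "customer_code", toy_cache)
```

Here four source records need only two cache probes; without the sort, the alternating keys would have forced four.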

Friday, 2 November 2007

Data Integration Challenge – Understanding Lookup Process – III
