Tuesday 18 September 2007

Data Integration Challenge – Understanding Lookup Process –II


Most leading products, such as Informatica and DataStage, support all three types of lookup process in their architecture. The following table lists when to use and when not to use each type of lookup process.
Direct Query (Uncached Lookup in Informatica)
When to use:
  • The lookup process is invoked only once or a very few times
  • The ETL server and the database are co-located or well connected
When not to use:
  • A large volume of source records is read and a lookup query is executed for every incoming record, which can be costly in terms of network load, query parsing, data parsing and disk hits
  • The same set of records is queried again and again
Join Query (Joiner Transformation or a join on the Source Qualifier in Informatica)
When to use:
  • The lookup process returns multiple records and all of them are required for further processing
  • Both the source and the lookup table are on the same database
When not to use:
  • The source record performs a lookup only when some other condition is TRUE, i.e., not all the records read from the source do a lookup
  • The source and lookup table columns used in the lookup condition are not indexed
  • The database memory is fully utilized and outer joins are badly executed by the database
Cached Query (Cached Lookup in Informatica or hash files in DataStage)
When to use:
  • The lookup process is executed many times
  • The looked-up table holds a large volume of data
  • The same set of records from the lookup table is used by multiple jobs
When not to use:
  • Disk space is a constraint
  • Multiple records from the lookup are required for processing
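
The three lookup styles in the table can be sketched side by side. This is only an illustration and not how Informatica or DataStage implement them: SQLite stands in for the lookup database, and the table `customer_dim`, the sample rows and the three function names are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical lookup table; an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (cust_id INTEGER PRIMARY KEY, region TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO customer_dim VALUES (?, ?, ?)",
    [(1, "EMEA", "x"), (2, "APAC", "y")],
)

# Hypothetical source records: (order_id, cust_id). cust_id 9 has no match.
source_rows = [(101, 1), (102, 2), (103, 1), (104, 9)]


def direct_query_lookup(rows):
    """Direct (uncached) lookup: one query per incoming record."""
    out = []
    for order_id, cust_id in rows:
        hit = conn.execute(
            "SELECT region FROM customer_dim WHERE cust_id = ?", (cust_id,)
        ).fetchone()
        out.append((order_id, hit[0] if hit else None))
    return out


def join_query_lookup(rows):
    """Join lookup: source and lookup table sit in the same database."""
    conn.execute("CREATE TEMP TABLE src (order_id INTEGER, cust_id INTEGER)")
    conn.executemany("INSERT INTO src VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT s.order_id, d.region FROM src s "
        "LEFT JOIN customer_dim d ON d.cust_id = s.cust_id "
        "ORDER BY s.order_id"
    ).fetchall()
    conn.execute("DROP TABLE src")
    return result


def cached_lookup(rows):
    """Cached lookup: read only the needed columns once, then probe in memory."""
    cache = dict(conn.execute("SELECT cust_id, region FROM customer_dim"))
    return [(order_id, cache.get(cust_id)) for order_id, cust_id in rows]
```

All three functions return the same pairs for `source_rows`; they differ in how many round trips reach the database and in where the matching work is done.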
Advantages of Cached Lookup:
The advantages of using cache-file-based lookups are:
  • The cache file holds only the fields needed by the lookup process, so querying it returns results faster than querying the lookup table, which may have many more fields
  • The data structure of the cache file is designed so that the ETL server can query it directly, without an additional layer such as SQL
Though user manuals generally say that cache files are best suited to low-volume lookups, in practical scenarios I have seen cache files deliver more value in terms of performance when the number of lookup records is huge.
Dynamic Cache: Both Informatica and the hash files of DataStage offer the concept of a dynamic cache, in which you can insert, update or delete records in the cache file. Updating the cache file is useful when we want to keep the cache file and the lookup table in sync.
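
The idea of a dynamic cache can be pictured with a small in-memory sketch. The class `DynamicLookupCache` and its method names are hypothetical and do not correspond to any Informatica or DataStage API; the point is only that each incoming record updates the cache as it is processed, so later records see the change.

```python
class DynamicLookupCache:
    """Hypothetical in-memory cache kept in step with the target table."""

    def __init__(self, rows):
        # rows: iterable of (key, value) pairs read once from the lookup table
        self._cache = dict(rows)

    def get(self, key):
        return self._cache.get(key)

    def upsert(self, key, value):
        """Update the cache and report what the target table needs."""
        if key not in self._cache:
            self._cache[key] = value
            return "insert"
        if self._cache[key] != value:
            self._cache[key] = value
            return "update"
        return "unchanged"

    def delete(self, key):
        """Remove a key; return True if it was present."""
        if key in self._cache:
            del self._cache[key]
            return True
        return False
```

A job would apply the returned action ("insert", "update" or "unchanged") to the lookup table itself, which is what keeps the two in sync.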
Handling Multiple Return Records: Handling multiple records returned by a lookup process is still a challenge and, to my knowledge, is not implemented in any of the leading products. Probably in release 9, Informatica’s lookup will have a parameter for defining the number of records to return as an array, as in its Normalizer transformation.
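
To make the requirement concrete, here is a minimal sketch of a lookup that returns every matching record. It is an illustration only, not a feature of any product: the function names, the `max_matches` parameter and the sample data are assumptions.

```python
from collections import defaultdict


def build_multi_cache(lookup_rows):
    """Group lookup values by key so one key can hold many matches."""
    cache = defaultdict(list)
    for key, value in lookup_rows:
        cache[key].append(value)
    return dict(cache)


def lookup_all(cache, source_rows, max_matches=None):
    """Yield one output row per match; unmatched records yield None."""
    for record_id, key in source_rows:
        matches = cache.get(key, [])
        if max_matches is not None:
            matches = matches[:max_matches]
        if not matches:
            yield (record_id, None)
        for match in matches:
            yield (record_id, match)
```

With `max_matches` left at `None`, a source record that matches two lookup rows produces two output rows, which is the same result a join would give.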

In Part III we shall look at some of the things to consider for better performance when using the lookup process.

Monday 10 September 2007

Business Intelligence Utopia – Enabler 3: Data Governance


The “Power of Ten” introduced earlier in this forum is a list of pre-requisites to deliver the real promise of BI. We have already seen the first two – Changes to OLTP systems and Real time Data Integration.
The third enabler in the list is ‘Data Governance’. With increasing volumes of data coupled with regulatory compliance issues, the topic of Data Governance is very much in vogue, to the extent that anybody can look intelligent (beware!) by coining new terms like Data Clarity, Data Clairvoyance etc.
Data Governance at a very fundamental level is all about understanding the data generated by the business, managing the quantity and quality of that data and leveraging it to make sound business decisions for the future. In my view, the steps needed in a practical data governance program are:
1) Organizational entity, headed by a Chief Data Officer (CDO), whose task is to formulate and implement decisions related to Data Management across multiple dimensions, viz. Business Operations, Regulatory compliance etc.
2) Comprehensive understanding of the data ‘value chain’ – From the source of origination to its consumption. It is important to understand that the origination and / or consumption can also be outside the organizational boundaries.
3) Understand the types of data within the enterprise by following a ‘divide-and-conquer’ strategy. One of my previous posts on this blog illustrates one way of dividing data into ‘mutually exclusive collectively exhaustive’ (MECE) categories.
4) Profile data on a regular basis to statistically measure its quality.
5) Set up a Business Intelligence infrastructure that effectively harnesses data assets for making decisions that affect (positively, of course!) the short, medium & long-term nature of business.
6) Continuous improvement program to ensure that data is optimally leveraged across all aspects of business. A data governance maturity model, like the one illustrated here, can be envisaged for your organization.
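
Step 4 above, profiling data to measure its quality statistically, can be sketched in a few lines. The two metrics chosen here (completeness and distinct count per column) and the `profile` function are assumptions picked for illustration; a real program would track whichever measures matter to the business.

```python
def profile(rows, columns):
    """Return completeness and distinct-value count for each column.

    rows: list of dicts, one per record; None and "" count as missing.
    """
    total = len(rows)
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "completeness": len(present) / total if total else 0.0,
            "distinct": len(set(present)),
        }
    return report
```

Running the same profile on a schedule and comparing the numbers over time gives a simple trend of data quality.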
‘Competing on Analytics’, a classic Harvard Business Review article by Thomas Davenport, illustrates the power of fact-based business decisioning. For businesses to realize that power, it is important to recognize that good data, and not just ‘any’ data, is a source of competitive advantage.
Data Governance is fundamental to making organizations better, which is why it figures as number 3 in my list of ten enablers for Business Intelligence Utopia. Informative articles on Data Governance are present at the following link.