Pages

Ads 468x60px

Labels

Friday 2 November 2007

Data Integration Challenge – Understanding Lookup Process – III


In Part II we discussed ‘when to use’ and ‘when not to use’ the particular type of lookup process, the Direct Query lookup, Join based lookup and the Cache file based lookup. Now we shall see what are the points to be considered for better performance of these ‘lookup’ types.
In the case of Direct Query the following points are to be considered

  • Index on the lookup condition columns
  • Selecting only the required columns
In the case of Join based lookup, the following points are to be considered

  • Index on the columns that are used as part of Join conditions
  • Selecting only the required columns
In the case of Cache file based lookup, let us first try to understand the process of how these files are built and queried.
The key aspects of a Lookup Process are the

  • SQL that pulls the data from lookup table
  • Cache memory/files that holds the data
  • Lookup Conditions that query the cache memory/file
  • Output Columns that are returned back from the cache files
Cache file build process:
Based on the product Informatica or Datastage when a lookup process is being designed we would define the ‘lookup conditions’ or the ‘key fields’ and also define a list of fields that would need to be returned on lookup query. Based on these definitions the required data is pulled from lookup table and the cache file is populated with the data. The cache file structure is optimized for data retrieval assuming that the cache file would be queried based certain set of columns called ‘lookup conditions’ or ‘key fields’.

In the case of Informatica, the cache file is of separate index and data file, the index file has the fields that are part of the ‘lookup condition’ and the data file has the fields that are to be returned. Datastage cache files are called Hash files which are optimized based on the ‘key fields’.
Cache file query process:

Irrespective of the product of choice following would be the steps involved internally when a lookup process is invoked.

Process:
  1. Get the Inputs for Lookup Query, Lookup Condition and Columns to be returned
  2. Load the cache file to memory

  3. Search the record(s) matching the Lookup condition values , in case of Informatica this search happens on the ‘index file’

  4. Pull the required columns matching the condition and return, in case of Informatica with the result from ‘index file’ search, the data from the ‘data file’ is located and retrieved

In the search process, based on the memory availability there could be many disk hits and page swapping.
So in terms performance tuning we could look at two levels

  1. how to optimize the cache file building process

  2. how to optimize cache file query process

The following table lists the points to be considered for the better performance of a cache file based lookup
Category
Points to consider
Optimize Cache file building process
 While retrieving the records to build the cache file, sort the records by the lookup condition, this sorting would speed up the index (file) building process. This is because the search tree of the Index file would be built faster with lesser node realignment
 Select only the required fields there by reducing the cache file size
 Reusing the same cache file for multiple requirements for same or slightly varied lookup conditions
Optimize Cache file query process
 Sort the records that come from source to query the cache file by the lookup condition columns, this ensures less page swapping and page hits. If the subsequent input source records come in a continuous sorted order then the hits of the required index data in the memory is high and the disk swapping is reduced
 Having a dedicated separate disk ensures a reserved space for the lookup cache files and also improves response of writing to the disk and reading from the disk
 Avoid querying recurring lookup condition, by sorting the incoming records by the lookup condition
You might want to read these awesome related posts Data Integration Challenge

Monday 15 October 2007

Business Intelligence Utopia – Enabler 5: Extensible Data Models


Enabler 5 in my list for Business Intelligence Utopia are the ubiquitous, hard-working “Data Models”. Data Model is the heart of any software system and at a fundamental level provides placeholders for data elements to reside.
Business Intelligence systems with all its paraphernalia – Data Warehouses, Marts, Analytical & Mining systems etc. typically deals with the largest volume of data in any enterprise and hence data models are highly venerated in the Data Warehousing world.
At a high level, a good Data Warehouse data model has the following goals: (Corollary – If you are looking for a data modeler look for the following traits)
1) Understand the business domain of the organization
2) Understand at a granular level the data generated by the business processes
3) Realize that business data is an ever-changing commodity – So the placeholder provided by the data model should be relevant not only for the present but also for the future
4) Can be described at a conceptual and logical level to all relevant stakeholders
5) Should allow for non-complicated conversion to the physical world of databases or data repositories that is manipulated by software systems
Extensible Data models deal with all the 5 points mentioned above and more specifically has future-proofing as one of its stated goals. Such extensible models are also “consumption agnostic”, i.e. – it provides for comparable levels of performance irrespective of the way data is being consumed.
It is important for Business Intelligence practitioners to understand the goals of their data models before embarking to use specific techniques for implementation. Entity-Relationship & Dimensional modeling (http://www.rkimball.com) has been the lingua-franca of BI data modelers operating at the conceptual and logical levels. Newer techniques like Data Vault (http://www.danlinstedt.com/) also provides some interesting thoughts in building better logical models for Data Warehouses.
At the physical implementation level, relational databases still form the backbone of the BI infrastructure, supplemented by multi-dimensional data stores. Even in the relational world, traditionally dominated by row-major relational vendors like Oracle, SQL Server etc. there are column-major relational databases of the likes of Sybase IQ with claims of being built ground-up for data warehousing.
In this article on column major databases – http://www.databasecolumn.com/2007/09/one-size-fits-all.html, there is reference to a new DW specific database architecture called Vertica. It makes for a fascinating read – http://www.vertica.com/datawarehousing. The physical layer is also seeing a lot of action with the entry of data warehousing appliance vendors like Netezza, Datallegro etc. (http://www.dmreview.com/article_sub.cfm?articleId=1009168).
The intent of this post can be summed up as:
a) Understand the goals of building data models for your enterprise – Make it extensible and future proof
b) Know the current techniques that help envisage and build data models
c) Be on the look-out for new developments in the data modeling and database world – There is lot of interesting action happening in this area right now.
Extensible data models combined with the right technique for implementing them, lists as Enabler 5 in the “Power of Ten” for implementing Business Intelligence Utopia .





Wednesday 3 October 2007

Business Intelligence Utopia – Enabler 4: Service Oriented Architecture


Service Oriented Architecture (SOA) and its closest identifiable alter-ego “Web Services” is another example of hyped-up, much maligned technology buzzword that takes at least 2 or 3 slides in any “bleeding-edge” technology presentation. Having said that, whatever I have investigated on Service Oriented Architectural concepts till now, is enough to warrant its listing as enabler no. 4 for Business Intelligence Utopia.
There are many powerful ways through which SOA can add significant value to the BI environment. The kind of BI, performance management and data integration artifacts that can be developed and published as web services include: Queries, Reports,  OLAP slice services (MDX queries), Scoring and predictive models, Alerts, Scorecards, Budgets, Plans, BAM agents, Decisions (i.e., automated decision services), Data integration workflows, Federated queries and much more. You can get more information at the link: http://www.b-eye-network.co.uk/view-articles/4729
But the idea that fascinates me with respect to Business Intelligence on SOA, is the concept of “Analytical Smorgasbord”. Imagine a scenario where the business user can assemble their own analytical components from a mélange of available ones, resulting in complete customization of information for the user to take his/her decisions. Each of these available analytical components is self-contained and performs a particular piece of BI functionality. These components are ‘Web-Services’ and the SOA in such an enterprise is all about –
a) How are these components created?
b) How do the components interact?
c) How is the information published and consumed, in a secure manner?
The concept of “Analytical Smorgasbord” truly empowers the business users and is a powerful way to enable, what Gartner terms, as “Information Democracy” in the enterprise. It is important to note that the concept of analytical aggregation changes the Data Warehousing paradigm in a profound way – From “Pulling data” to “Seeking data”. In more simplistic terms, the end-user analytics should go and fetch data wherever it is rather than expecting all data to be consolidated into one data repository (typically a data warehouse or data mart). More on this in future posts, under the topic of “Guided Analytics”.
The true intent of this post is to encourage the BI community to start looking at SOA from the end-user analytical standpoint, so that web-services does not remain a mere technology toy but really helps in “Putting the business back in BI” – http://www.tdwi.org/Publications/display.aspx?id=7913
I have intentionally left out the technology details related to SOA. You can find wonderful resources on the web like this one: http://www.dmreview.com/portals/portal.cfm?topicId=1035908 It is becoming increasingly important for BI practitioners to acquire/develop knowledge on Web technologies, XML, SOAP, UDDI, etc. as different domains are converging at a rapid pace..
Enabler 4 in the “Power of Ten” is more precisely defined as – Service Oriented Architecture enabling the creation of BI “Analytical Smorgasbord”.

You might want to read these awesome related posts Business Intelligence Utopia


Tuesday 18 September 2007

Data Integration Challenge – Understanding Lookup Process –II


Most of the leading products like Informatica, DataStage support all the three ways of lookup process in their product architecture. The following table lists ‘when to use’ and ‘when not to use’ the particular type of lookup process.
LookupWhen To UseWhen Not To Use
Direct Query (Uncached lookup in Informatica)
  • When the lookup process is to be invoked only once or a very few times
  • The ETL server and the database are co-located or well connected
  • Reading in large volume of source records and executing lookup queries for every incoming record can be costly in terms network load, query parsing, data parsing and disk hits
  • Querying the same set of records again and again
Join Query (Joiner Transformation or a Join on the Source Qualifier in Informatica)
  • When multiple records are returned by the lookup process and all the returned records are required for further processing
  • Both the source and lookup table are on the same database
  • When the source record performs a lookup based on some other ‘TRUE’ condition i.e., not all the records that are read from source do a lookup
  • When the source and lookup table columns are not indexed by the ‘lookup condition’
  • When the database memory is fully utilized and the Outer Joins are badly executed by the database
Cached Query (Cached Lookup in Informatica or Hash files in Datastage)
  • Many times the lookup process being executed
  • Presence of Large volume of data in the looked up table
  • Set of records from the lookup table used by multiple jobs
  • Disk space is a constraint
  • Multiple records from lookup required for processing
Advantage Cache Lookup:
The advantages of using cache file based lookups are that
  • Fields that are present in the cache file is only that is needed by the lookup process so when querying the cache file the return would be faster as compared to the lookup table that might have more fields present
  • The data structure of the cache file would be designed in such that the query from the ETL server is easily understood without any additional layer like SQL
Though in general it is said in user manuals that usage of cache files is best suited for low volume of lookup but in practical scenarios I have seen cache files are more valuable in terms of performance when the lookup records are huge.
Dynamic Cache: We have the concept of Dynamic Cache in Informatica and as well in Hash files of Datastage where you can Insert/Update or delete records from these cache file. The feature of  updating the cache files is useful when we want to keep the cache file and the lookup table in sync.
Handling Multiple Return Records: Handling the return of multiple records by a lookup process is still a challenge not implemented in any of the leading products – limited to my knowledge. Probably in release 9 Informatica’s lookup can have a parameter for defining the number of records to return as an array like in its Normalizer transformer.

In Part III we shall see some of the things to be considered for better performance when using the lookup process
You might want to read these awesome related posts Data Integration Challenge

Monday 10 September 2007

Business Intelligence Utopia – Enabler 3: Data Governance


The “Power of Ten” introduced earlier in this forum is a list of pre-requisites to deliver the real promise of BI. We have already seen the first two – Changes to OLTP systems and Real time Data Integration.
The third enabler in the list is ‘Data Governance’. With increasing volumes of data coupled with regulatory compliance issues, the topic of Data Governance is very much in vogue, to the extent that anybody can look intelligent (beware!) by coining new terms like Data Clarity, Data Clairvoyance etc.
Data Governance at a very fundamental level is all about understanding the data generated by business, managing the quantity / quality of data and leveraging it to make sound business decisions for the future. From my view, the steps needed in a practical data governance program are:
1) Organizational entity, headed by a Chief Data Officer (CDO), whose task is to formulate and implement decisions related to Data Management across multiple dimensions, viz. Business Operations, Regulatory compliance etc.
2) Comprehensive understanding of the data ‘value chain’ – From the source of origination to its consumption. It is important to understand that the origination and / or consumption can also be outside the organizational boundaries.
3) Understand the types of data within the enterprise by following a ‘divide-and-conquer’ strategy. One of my previous posts on this blog illustrate one way of dividing data into ‘mutually exclusive collectively exhaustive’ (MECE) categories.
4) Profile data on a regular basis to statistically measure its quality.
5) Set-up a Business Intelligence infrastructure that effectively harnesses data assets for making decisions that affects (positively, of course!) the short, medium & long-term nature of business.
6) Continuous improvement program to ensure that data is optimally leveraged across all aspects of business. A data governance maturity model like the one illustrated here ,  can be envisaged for your organization.
Competing on Analytics’ – A classic Harvard Business Review article by Thomas Davenport illustrates the power of fact-based business decisioning. For businesses to realize that power, it is important to realize that good data is a source of competitive advantage and not ‘any’ data.
Data Governance is fundamental to making organizations better and that is the reason that it figures as number 3 in my list of ten enablers for Business Intelligence Utopia. . Informative articles on Data Governance are present at the following link.