Pages

Ads 468x60px

Labels

Thursday, 15 May 2008

Data Integration Challenge – Storing Timestamps

Storing timestamps along with a record indicating its new arrival or a change in its value is a must in a data warehouse. We always take it for granted, adding timestamp fields to table structures and tending to miss that the amount of storage space a timestamp field can occupy is huge, the storage occupied by timestamp is almost double against a integer data type in many databases like SQL Server, Oracle and if we have two fields one as insert timestamp and other field as update timestamp then the storage spaced required gets doubled. There are many instances where we could avoid using timestamps especially when the timestamps are being used for primarily for determining the incremental records or being stored just for audit purpose.
How to effectively manage the data storage and also leverage the benefit of a timestamp field?
One way of managing the storage of timestamp field is by introducing a process id field and a process table. Following are the steps involved in applying this method in table structures and as well as part of the ETL process.Data Structure
  1. Consider a table name PAYMENT with two fields with timestamp data type like INSERT_TIMESTAMP and UPDATE_TIEMSTAMP used for capturing the changes for every present in the table
  2. Create a table named PROCESS_TABLE with columns PROCESS_NAME Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
  3. Now drop the fields of the TIMESTAMP data type from table PAYMENT
  4. Create two fields of integer data type in the table PAYMENT like INSERT_PROCESS_ID and UPDATE_PROCESS_ID
  5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID would be logically linked with the table PROCESS_NAME and its field PROCESS_ID
ETL Process
  1. Let us consider an ETL process called ‘Payment Process’ that loads data into the table PAYMENT
  2. Now create a pre-process which would run before the ‘payment process’, in the pre-process build the logic by which a record is inserted with the values like (‘payment process’, SEQUNCE Number, current timestamp) into the PAYMENT table. The PROCESS_ID in the payment table could be defined as a database sequence function.
  3. Pass the current_prcoess_id from pre-process step to the ‘payment process’ ETL process
  4. In the ‘payment process’ if a record is to inserted into the PAYMENT table then the current_prcoess_id value is set to both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID else if a record is getting updated in the PAYMENT table then the current_process_id value is set to only the column UPDATE_PROCESS_ID
  5. So now the timestamp values for the records inserted or updated in the table PAYMENT can be picked from the PROCESS_TABLE by joining by the PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT tableBenefits
  6.  
  • The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space when compared to the timestamp fields
  • Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are Index friendly
  • Its easier to handle these process id fields in terms picking the records for determining the incremental changes or for any audit reporting.
Read More about Data Integration

Monday, 28 April 2008

Let’s talk EPM – Part 1

Welcome to the world of Enterprise Performance Management (EPM), considered the Holy Grail of Business Intelligence. EPM and its various manifestations creatively named as Business Performance Management (BPM), Corporate Performance Management (CPM) etc. is a set of processes that help organizations optimize their business performance.
Does it sound good? – Ofcourse, Yes! Show me an organization that does not want to optimize!!
Does it sound practical? – Not really! Don’t know where to start!!
EPM means many things to many people – Optimization of business performance can mean optimization at the business processes level (local optima), can also mean optimization at the organizational level (global optima) and can also have many flavors in between.
With many BI vendors jumping into the EPM bandwagon, the problem is that EPM is immediately equated to the solutions provided by tools like Business Objects, Cognos, Hyperion etc. That view, in my opinion, is far removed from the truth.
In this series of posts, I would like to share some thoughts on making EPM a practical reality in organizations. To start with, let me enumerate the components of an EPM strategy:
  1. Business Process Maps – Understand the business process
  2. Metrics Identification – Get hold of the metrics
  3. Metrics Profiling – Understand the metrics in depth
  4. Metrics Maps – Understand the cause and effect relationships between metrics
  5. Metrics Visualization – Implementation of Metric Maps on BI Tools
  6. Watch and Improve – Monitor Metrics and Improve business process as required
A keen observer will immediately realize that implementing EPM has lot more of pen & paper work (substitute your favorite analysis tool here!) before technology can come into the picture. Also, in my opinion, there is no silver bullet – No single metric map can fit companies across industries or even within same industry. EPM framework for an organization has to evolve in phases based on company’s growth, its corporate vision, and the important numbers at different stages etc. or in other words ‘EPM is very personal to an organization’.
EPM, for a BI practitioner, represents a convergence of many things –
  • Domain Understanding
  • Quantitative Play
  • BI Tool capability
  • Closed-Loop BI Architecture
  • Knowledge of proven methodologies like Six Sigma, Balanced Scorecard etc.
will try and explain some of the interesting aspects of an EPM strategy like Metrics Profiling, Metrics Maps etc. in the next few posts. Meanwhile, you can take a look at resources like this one (http://www.dmreview.com/issues/20050501/1026062-1.html) to understand the ‘big picture’ with respect to Enterprise Performance Management.
Thanks for reading!

Tuesday, 15 April 2008

Using Analytic Hierarchy Process (AHP) for BI Tool Evaluation

Enterprise wide BI architecture utilizes a multitude of tools within its landscape, each serving a specific functionality – Extract, Transform and Load (ETL), Data Cleansing, Metadata Management, Databases (both relational and multidimensional), Reporting and Analytics (OLAP), Data Mining etc. For example, just taking the OLAP area alone, there are more than 40 different products that can potentially solve a customer problem. You can imagine the number of combinations possible when all the tool options are combined across the overall landscape. This establishes the fact that one of the most challenging and vexing problems in Business Intelligence domain is Tool Evaluation.
Tool Evaluation and selection has become strategic to the implementation of enterprise wide Business Intelligence. Traditionally, tool selection involved comparing the technical features of the tools, looking at demos by product vendors, reading up industry reports, get word-of-mouth referrals and then taking a final decision. In my humble opinion – that is not sufficient any more.
Technical features, though important, cannot be the definitive criteria for selecting a particular tool. More crucial than technical features is what I term as the “Business Fitment Index”. The selected tool should fit with the characteristics of the business process prevalent in the organization and should take into account the requirements of different classes of users. The concept of Business Fitment can be classified as a Multi Criteria Decision Making (MCDM) problem and one of the powerful tools in this category is the Analytic Hierarchy Process (AHP).
AHP is a systematic procedure that helps to:
  1. Represent the elements of any problem, breaking it down into smaller constituents
  2. Assign weightages to each constituent by following a pairwise comparison technique
  3. Leverage expert judgment and intuitive feel into a coherent framework for problem solving
Though AHP can be used in many situations, Hexaware’s BI practice has perfected the art of leveraging its power in the realm of “BI Tools Evaluation”. There are 3 steps to calculating the Business Fitment Index using AHP.
Step 1 – Pair-wise comparison of business parameters by customer stakeholders is done in this step. The parameters can be things like – Real Time Data Integration, Data Volumes, Data Quality, Business Rules Flexibility etc.
Step 2 – Relative ranking of Business Parameters based on the AHP (Analytic Hierarchy Process) technique
Step 3 – Each of the short-listed tools are evaluated against the business parameters and a final rating is arrived at taking into account the organization readiness factors
Bottom-line is that the technical features of the tools have to be taken in conjunction with the fitment level of tool to the characteristics of the business. That alone would ensure the success of the tool for enterprise wide BI initiatives.
AHP is a simple yet powerful way of arriving at a decision by consensus. There are wide ranging applications of AHP in BI and this is a great area for practitioners to get interested. If you have some thoughts on other applications of AHP in the BI world, please do share it with us. Thanks for reading!

Wednesday, 9 April 2008

Data Integration Challenge – Initial Load – II

The other challenges during Initial load are the
System Performance
Catching Up
System Performance is always a challenge during initial load especially when many years of history data are to be loaded; there is an increase in the usage of system resources because of huge data load which wouldn’t happen during regular incremental load. Some of the ways of handling the system performance issue during initial load are
Group the data loads by filters like years or customer and load the related data in chunks. We could load data for the month Jan, Feb, Mar or load the customers region wise from NY, followed by NJ etc. Such grouping of records for loading eliminates data surge and as well provides better way to perform data validation
As the data gets loaded into the warehouse the data required for lookup from the warehouse becomes huge, we need to decide the required lookup records based on the incoming data. For example if the data in the warehouse has data related to all the regions North, South, East, West and the incoming data currently has only North data then we need to have a override filter and access only the data pertaining to North  from the warehouse
We could plan and increase the available memory for ETL server for a temporary period
Avoid sequential inserts, write data to a file, sort and bulk load
Determine and plan for more disk space requirement for initial load files that are extracted and provided by the source systems is an interesting problem where in the warehouse is not able to cope up and able to deliver latest data (or 1 day old) as in the source system. This problem would be more often due to the ETL performance issues, where even before the initial data is successfully loaded and verified in the DW, additional set of new records would have come from the source which the DW is not able to catch up.
Though at times some kind of code tuning, running things in parallel or hardware upgrade (usually a costly one at a later stage) could resolve such problems, in certain situations this problem could run into an unsolvable state where in the complete ETL architecture has to be re-looked.
One other way to manage such situations is to have the daily process of loading current data to proceed independently and in parallel through a separate of processes bring in the history data on a regular basis, in certain scenarios we might need to build a process that would run and sync up the current with the old data especially the aggregate data if any designed in the data model.
Read More about  

Monday, 24 March 2008

Metadata 101 – For BI Practitioners

For as long as I can remember, the definition given for Metadata is “Data about Data”. We have all said this in interviews, heard it from candidates, seen it on presentations, and (almost) always nodded our heads in agreement.
In the transaction processing world, where “data-in” is the paradigm, the definition is precise. The databases store the business data in the relational format and the system tables / catalogs describe the structure of that data – the columns, type, size, etc. This data about the structure of business data is “Metadata”.
In the Business Intelligence world, that definition of metadata is incomplete. A more precise definition of metadata has two components:
Metadata in BI = “Data about Data” + “Information about Information”
The first component “Data about data” is “Technical Metadata” and is similar to the metadata in the OLTP world. Having said that, the technical metadata in BI is arguably more complex, as it not only encompasses the databases but needs to cover the ETL and Reporting tools as well. Each of the tools in the overall BI landscape has its own metadata and this data has to be looked at in a comprehensive fashion to understand data lineage etc.
Even among BI tools, there are different categories – Tools that expose its metadata completely, tools that gives an handle to its metadata thro’ pre-defined APIs and tools that do not allow any access to the metadata. Given the industry direction and the evolution of Common Warehouse Metamodel (CWM) compliance standards, it is only a matter of time before the tool architecture is designed to expose the technical metadata. CWM is a fascinating topic of its own and you can get a feel for it by visiting this website: http://www.omg.org/technology/cwm/
To me, as a BI practitioner, the second piece of the metadata puzzle is more interesting. “Information about information” aspect of metadata is “Business Metadata” and understanding it is crucial to implementing the BI vision in any enterprise.
As an analytical information consumer, there are 2 important requirements:
  1. Need direction to access the required analytical content
    Example:
    • Where can I get Sales by Product for different locations over the last 2 years?
    • Am interested in Customer related Analytics. Where do I access it?
  2. Once the content is retrieved, need guidance on how to make sense of it
    Example:
    • Report shows Forecasted Sales for next quarter in the chart. How is this value calculated?
    • Does the total inventory value displayed in the report include the Raw material inventory or does it exclude it?
Business metadata when properly organized should provide direction to both the points mentioned above.
Metadata management in BI deals with integration of technical and business data in a way that is useful for the organization. The challenge of metadata management becomes even more daunting when one considers both structured and unstructured data. Having said that, it is important for BI practitioners to understand the true nature of BI metadata and provide implementable solutions in their specific organizational context.
In my future posts, I would discuss this fascinating area of Metadata management, with its manifestation as “Technical and Business Metadata” in both structured and unstructured data domains.
Read More about  Metadata 101