Monday 15 October 2007

Business Intelligence Utopia – Enabler 5: Extensible Data Models


Enabler 5 in my list for Business Intelligence Utopia is the ubiquitous, hard-working “Data Model”. The data model is the heart of any software system and, at a fundamental level, provides placeholders for data elements to reside.
Business Intelligence systems, with all their paraphernalia – data warehouses, marts, analytical & mining systems etc. – typically deal with the largest volumes of data in any enterprise, and hence data models are highly venerated in the data warehousing world.
At a high level, a good Data Warehouse data model has the following goals: (Corollary – if you are looking for a data modeler, look for the following traits)
1) Understand the business domain of the organization
2) Understand at a granular level the data generated by the business processes
3) Realize that business data is an ever-changing commodity – So the placeholder provided by the data model should be relevant not only for the present but also for the future
4) Can be described at a conceptual and logical level to all relevant stakeholders
5) Should allow for non-complicated conversion to the physical world of databases or data repositories that are manipulated by software systems
Extensible data models address all five points mentioned above and, more specifically, have future-proofing as one of their stated goals. Such extensible models are also “consumption agnostic”, i.e. they provide comparable levels of performance irrespective of the way data is consumed.
It is important for Business Intelligence practitioners to understand the goals of their data models before embarking on specific techniques for implementation. Entity-Relationship and Dimensional modeling (http://www.rkimball.com) have been the lingua franca of BI data modelers operating at the conceptual and logical levels. Newer techniques like Data Vault (http://www.danlinstedt.com/) also provide some interesting ideas for building better logical models for Data Warehouses.
At the physical implementation level, relational databases still form the backbone of the BI infrastructure, supplemented by multi-dimensional data stores. Even in the relational world, traditionally dominated by row-major relational vendors like Oracle, SQL Server etc., there are column-major relational databases like Sybase IQ, with claims of being built from the ground up for data warehousing.
This article on column-major databases – http://www.databasecolumn.com/2007/09/one-size-fits-all.html – refers to a new DW-specific database architecture called Vertica. It makes for a fascinating read – http://www.vertica.com/datawarehousing. The physical layer is also seeing a lot of action with the entry of data warehousing appliance vendors like Netezza, Datallegro etc. (http://www.dmreview.com/article_sub.cfm?articleId=1009168).
The intent of this post can be summed up as:
a) Understand the goals of building data models for your enterprise – make them extensible and future-proof
b) Know the current techniques that help envisage and build data models
c) Be on the look-out for new developments in the data modeling and database world – there is a lot of interesting action happening in this area right now.
Extensible data models, combined with the right technique for implementing them, list as Enabler 5 in the “Power of Ten” for implementing Business Intelligence Utopia.





Wednesday 3 October 2007

Business Intelligence Utopia – Enabler 4: Service Oriented Architecture


Service Oriented Architecture (SOA) and its closest identifiable alter ego, “Web Services”, is another example of a hyped-up, much-maligned technology buzzword that takes up at least 2 or 3 slides in any “bleeding-edge” technology presentation. Having said that, whatever I have investigated about Service Oriented Architecture concepts so far is enough to warrant its listing as enabler no. 4 for Business Intelligence Utopia.
There are many powerful ways in which SOA can add significant value to the BI environment. The kinds of BI, performance management and data integration artifacts that can be developed and published as web services include: queries, reports, OLAP slice services (MDX queries), scoring and predictive models, alerts, scorecards, budgets, plans, BAM agents, decisions (i.e., automated decision services), data integration workflows, federated queries and much more. You can get more information at this link: http://www.b-eye-network.co.uk/view-articles/4729
But the idea that fascinates me with respect to Business Intelligence on SOA is the concept of the “Analytical Smorgasbord”. Imagine a scenario where business users can assemble their own analytical components from a mélange of available ones, resulting in completely customized information on which to base their decisions. Each of these available analytical components is self-contained and performs a particular piece of BI functionality. These components are ‘Web Services’, and the SOA in such an enterprise is all about –
a) How are these components created?
b) How do the components interact?
c) How is the information published and consumed, in a secure manner?
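To make the idea concrete, here is a minimal sketch of how one such self-contained analytical component might be exposed as a web service. This is an illustration only, not any product's API: the Flask framework, the endpoint name, the warehouse.db file and the sales_fact query are all assumptions made for the example.

```python
# Hypothetical sketch of one analytical component exposed as a service.
# Flask, the endpoint path, the database file and the query are illustrative assumptions.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/services/revenue-by-region")
def revenue_by_region():
    """A self-contained BI component: accepts a period, returns aggregated figures."""
    period = request.args.get("period", "2007-Q3")   # parameter supplied by the consumer
    con = sqlite3.connect("warehouse.db")            # stands in for the real data repository
    rows = con.execute(
        "SELECT region, SUM(revenue) FROM sales_fact WHERE period = ? GROUP BY region",
        (period,),
    ).fetchall()
    con.close()
    # The consuming front end sees only this contract, not where or how the data is stored.
    return jsonify({"period": period, "revenue_by_region": dict(rows)})

if __name__ == "__main__":
    app.run(port=8080)
```

A smorgasbord-style front end would simply assemble several such endpoints side by side, which answers questions (a) to (c) at the level of contracts rather than implementations.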
The concept of the “Analytical Smorgasbord” truly empowers business users and is a powerful way to enable what Gartner terms “Information Democracy” in the enterprise. It is important to note that the concept of analytical aggregation changes the data warehousing paradigm in a profound way – from “pulling data” to “seeking data”. In simpler terms, end-user analytics should go and fetch data wherever it is, rather than expecting all data to be consolidated into one data repository (typically a data warehouse or data mart). More on this in future posts, under the topic of “Guided Analytics”.
The true intent of this post is to encourage the BI community to start looking at SOA from the end-user analytical standpoint, so that web services do not remain a mere technology toy but really help in “Putting the business back in BI” – http://www.tdwi.org/Publications/display.aspx?id=7913
I have intentionally left out the technology details related to SOA. You can find wonderful resources on the web, like this one: http://www.dmreview.com/portals/portal.cfm?topicId=1035908 It is becoming increasingly important for BI practitioners to acquire and develop knowledge of web technologies, XML, SOAP, UDDI, etc. as different domains converge at a rapid pace.
Enabler 4 in the “Power of Ten” is more precisely defined as – Service Oriented Architecture enabling the creation of the BI “Analytical Smorgasbord”.



Tuesday 18 September 2007

Data Integration Challenge – Understanding Lookup Process –II


Most of the leading products, like Informatica and DataStage, support all three types of lookup process in their product architecture. The following lists, for each type of lookup, when to use and when not to use that particular process.
Direct Query (uncached lookup in Informatica)
  When to use:
  • The lookup process is invoked only once or a very few times
  • The ETL server and the database are co-located or well connected
  When not to use:
  • A large volume of source records is being read, and executing a lookup query for every incoming record would be costly in terms of network load, query parsing, data parsing and disk hits
  • The same set of records is queried again and again
Join Query (Joiner transformation or a join in the Source Qualifier in Informatica)
  When to use:
  • Multiple records are returned by the lookup process and all the returned records are required for further processing
  • Both the source and the lookup table are on the same database
  When not to use:
  • The source record performs a lookup only on some other ‘TRUE’ condition, i.e., not all the records read from the source do a lookup
  • The source and lookup table columns are not indexed on the ‘lookup condition’
  • The database memory is fully utilized and outer joins are executed badly by the database
Cached Query (cached lookup in Informatica or hash files in DataStage)
  When to use:
  • The lookup process is executed many times
  • A large volume of data is present in the lookup table
  • The same set of records from the lookup table is used by multiple jobs
  When not to use:
  • Disk space is a constraint
  • Multiple records from the lookup are required for processing
Advantages of Cached Lookups:
The advantages of using cache-file-based lookups are that
  • Only the fields needed by the lookup process are present in the cache file, so querying the cache file returns results faster than querying the lookup table, which might have many more fields
  • The data structure of the cache file is designed in such a way that queries from the ETL server are resolved directly, without any additional layer like SQL
Though user manuals generally say that cache files are best suited for low-volume lookups, in practical scenarios I have seen cache files deliver more value in terms of performance when the lookup record counts are huge.
Dynamic Cache: We have the concept of a dynamic cache in Informatica, and likewise in DataStage hash files, where you can insert, update or delete records in the cache file. The ability to update the cache file is useful when we want to keep the cache file and the lookup table in sync.
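As a rough, tool-neutral illustration of the dynamic cache idea (not Informatica's or DataStage's actual implementation), the sketch below keeps an in-memory cache and the lookup table in sync; the table and column names are assumptions.

```python
# Minimal sketch of a dynamic (updatable) cached lookup; customer_dim, customer_no and
# customer_key are illustrative names, and customer_key is assumed to be an auto-generated key.
import sqlite3

con = sqlite3.connect("warehouse.db")

# Build the cache once, holding only the two columns the lookup needs.
cache = dict(con.execute("SELECT customer_no, customer_key FROM customer_dim"))

def lookup_customer_key(customer_no):
    """Return the surrogate key; on a miss, insert a new dimension row and refresh the cache."""
    if customer_no in cache:                      # hit: answered from memory, no SQL round trip
        return cache[customer_no]
    cur = con.execute(                            # miss: write through to the lookup table...
        "INSERT INTO customer_dim (customer_no) VALUES (?)", (customer_no,))
    cache[customer_no] = cur.lastrowid            # ...and keep the cache in sync (the 'dynamic' part)
    return cache[customer_no]

for customer_no in ["C1001", "C2002", "C1001"]:   # stand-in for the incoming source rows
    print(customer_no, "->", lookup_customer_key(customer_no))
con.commit()
```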
Handling Multiple Return Records: To the best of my knowledge, handling the return of multiple records from a lookup process is still a challenge not implemented in any of the leading products. Perhaps in release 9, Informatica’s lookup could take a parameter defining the number of records to return as an array, as in its Normalizer transformation.

In Part III we shall see some of the things to be considered for better performance when using the lookup process.

Monday 10 September 2007

Business Intelligence Utopia – Enabler 3: Data Governance


The “Power of Ten” introduced earlier in this forum is a list of pre-requisites to deliver the real promise of BI. We have already seen the first two – Changes to OLTP systems and Real time Data Integration.
The third enabler in the list is ‘Data Governance’. With increasing volumes of data coupled with regulatory compliance issues, the topic of Data Governance is very much in vogue, to the extent that anybody can look intelligent (beware!) by coining new terms like Data Clarity, Data Clairvoyance etc.
Data Governance, at a very fundamental level, is all about understanding the data generated by the business, managing the quantity and quality of that data, and leveraging it to make sound business decisions for the future. In my view, the steps needed in a practical data governance program are:
1) An organizational entity, headed by a Chief Data Officer (CDO), whose task is to formulate and implement decisions related to Data Management across multiple dimensions, viz. business operations, regulatory compliance etc.
2) Comprehensive understanding of the data ‘value chain’ – From the source of origination to its consumption. It is important to understand that the origination and / or consumption can also be outside the organizational boundaries.
3) Understand the types of data within the enterprise by following a ‘divide-and-conquer’ strategy. One of my previous posts on this blog illustrates one way of dividing data into ‘mutually exclusive, collectively exhaustive’ (MECE) categories.
4) Profile data on a regular basis to statistically measure its quality (a short profiling sketch follows this list).
5) Set up a Business Intelligence infrastructure that effectively harnesses data assets for making decisions that affect (positively, of course!) the short, medium & long-term nature of the business.
6) A continuous improvement program to ensure that data is optimally leveraged across all aspects of the business. A data governance maturity model like the one illustrated here can be envisaged for your organization.
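As a small illustration of step 4, here is a hedged sketch of the kind of basic profiling metrics that could be computed on a schedule; the table name, column list and the sqlite connection are assumptions made purely for the example.

```python
# Illustrative profiling sketch for step 4; table and column names are assumptions.
import sqlite3

con = sqlite3.connect("warehouse.db")
table, columns = "customer_dim", ["customer_no", "email", "country_code"]

for col in columns:
    total, nulls, distinct = con.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END), COUNT(DISTINCT {col}) "
        f"FROM {table}").fetchone()
    null_pct = 100.0 * (nulls or 0) / total if total else 0.0
    # Publishing these figures per run is what turns 'data quality' into a measurable trend.
    print(f"{table}.{col}: rows={total}, null%={null_pct:.1f}, distinct={distinct}")
```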
‘Competing on Analytics’ – a classic Harvard Business Review article by Thomas Davenport – illustrates the power of fact-based business decision-making. For businesses to realize that power, it is important to recognize that good data, not just ‘any’ data, is the source of competitive advantage.
Data Governance is fundamental to making organizations better, and that is the reason it figures as number 3 in my list of ten enablers for Business Intelligence Utopia. Informative articles on Data Governance are available at the following link.

Wednesday 29 August 2007

Data Integration Challenge – Understanding Lookup Process–I


One of the basic ETL steps that we use in most ETL jobs is the ‘lookup’. We shall discuss what a lookup process is, when to use it, how it works, and some points to be considered while using it.
What is lookup process?
While reading records from a source system and loading them into a target table, if we query another table or file (called the ‘lookup table’ or ‘lookup file’) to retrieve additional data, that query is called a ‘lookup process’. The lookup table or file can reside on the target or the source system. Usually we pass one or more column values that have been read from the source system to the lookup process in order to filter and get the required data.
How ETL products implement lookup process?
There are three ways in which ETL products perform the ‘lookup process’:
  • Direct Query: Run the required query against the lookup table or file whenever the ‘lookup process’ is invoked
  • Join Query: Run a query joining the source and the lookup table/file before starting to read the records from the source
  • Cached Query: Run a query to cache the data from the lookup table/file local to the ETL server as a cache file. When the data flows from the source, run the required query against the cache file whenever the ‘lookup process’ is invoked
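To make the three options concrete, here is a minimal, tool-neutral sketch of each approach; the sqlite connection and the table and column names are assumptions for illustration only.

```python
# Tool-neutral sketch of the three lookup styles; table and column names are illustrative.
import sqlite3

con = sqlite3.connect("staging.db")

def direct_query_lookup(account_no):
    # 1. Direct query: hit the lookup table every time the lookup process is invoked.
    return con.execute(
        "SELECT account_key FROM account_dim WHERE account_no = ?", (account_no,)).fetchone()

def join_query_read():
    # 2. Join query: join the source and the lookup table up front, before row-by-row processing.
    return con.execute(
        "SELECT s.*, d.account_key FROM source_txn s "
        "JOIN account_dim d ON s.account_no = d.account_no")

# 3. Cached query: pull the lookup data once into a local cache, then resolve from memory.
account_cache = dict(con.execute("SELECT account_no, account_key FROM account_dim"))

def cached_lookup(account_no):
    return account_cache.get(account_no)
```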
Most of the leading products, like Informatica and DataStage, support all three approaches in their product architecture. We shall see the pros and cons of these approaches and how they work in Part II.
Read more about Data Integration Challenge

Thursday 16 August 2007

Business Intelligence Utopia – Enabler 2: Real Time Data Integration


Business Intelligence practitioners tend to have a lot of respect and reverence for transaction processing systems (OLTP), for without them the world of analytical apps simply does not exist. That explains my previous post introducing the first enabler for BI Utopia – the evolution of OLTP systems to support Operational BI.
In this post, I introduce the second enabler in the “Power of Ten” – Real Time Data Integration
Data Integration, in the BI sense, is all about extracting data from multiple source systems, transforming it using business rules, and loading it into data repositories built to facilitate analysis, reporting, etc.
Given that the raw data has to be converted to a different form more amenable for analysis & decision-making, there are 2 basic questions to be answered:
  1. From a business standpoint, how fast should the ‘data-information’ conversion happen?
  2. From a technology standpoint, how fast can the ‘data-information’ conversion happen?

Traditionally, with BI being used more for strategic decision-making, a batch mode of data integration with a periodicity of a day or longer was acceptable. But increasingly, businesses demand that the conversion happen much faster, and technology has to support it. This leads to the concept of “Real Time BI”, or more correctly, “Right-Time Data Integration”.
Since the answer to the first question “How Fast” is fast becoming “as fast as possible”, the focus has shifted to the technology side. One area where I foresee a lot of activity, from a Data Warehouse architectural standpoint, is in the close interaction of messaging tools like IBM Websphere MQ etc. with data integration tools. At this point in time, though the technology is available, there aren’t too many places where messaging is embedded into the BI architectural landscape.
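As a purely illustrative sketch of what that embedding could look like (not any MQ product's actual API), the snippet below drains a queue of transaction messages and applies them to the warehouse in small micro-batches; the in-memory queue, message fields and table name are all assumptions.

```python
# Illustrative only: a message-driven micro-batch loader. The in-memory queue stands in
# for a real messaging product, and the message fields and table names are assumptions.
import queue
import sqlite3

incoming = queue.Queue()                 # placeholder for the real message transport
warehouse = sqlite3.connect("warehouse.db")

def consume_forever(batch_size=100):
    batch = []
    while True:
        msg = incoming.get()             # blocks until a transaction message arrives
        batch.append((msg["account_no"], msg["amount"], msg["event_time"]))
        if len(batch) >= batch_size:     # micro-batch: latency measured in seconds, not a day
            warehouse.executemany(
                "INSERT INTO txn_fact (account_no, amount, event_time) VALUES (?, ?, ?)", batch)
            warehouse.commit()
            batch.clear()
```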
The bottom line is that there is significant value to be gained by ensuring that raw business data is transformed into information by the BI infrastructure as fast as possible – the limits being prescribed by business imperatives. The best explanation I have come across of the value of information latency is the article by Richard Hackathorn.
Active Data Warehousing is another topic closely related to Real Time Data Integration, and you can get some perspective on it through the blog on Decision Management by James Taylor.

Wednesday 1 August 2007

Data Integration Challenge – Identifying Changes from a Table Using a Scratch Table


In scenarios where a table in the staging area or in the data warehouse needs to be queried to find the changed records (inserted or updated), we can use the Scratch table design. A Scratch table is a temporary table designed to hold the changes happening against a table; once the changes have been picked up by the required application or process, the Scratch table can be cleaned off.

The process to capture the changes and the clean-up would be designed as part of the ETL process. The scenarios in which to use this concept and the steps to use the Scratch table are discussed below:
Steps to use Scratch table
  1. Create a Scratch table ‘S’ structured to hold the Primary Key column values from the table ‘T’ that needs to be monitored for changes

  2. In the ETL process that loads table ‘T’, add logic such that while inserting or updating a record in table ‘T’ we also insert the Primary Key column values of that record into the Scratch table ‘S’

  3. If required, while inserting the record into the Scratch table ‘S’, have a flag column that says ‘Insert’ or ‘Update’

  4. Any process that needs to find the changes would join the Scratch table ‘S’ and the table ‘T’ to pull the changed records; if it just needs the keys, it can directly access ‘S’
  5. Once the changes have been pulled and processed, have a process that cleans up the Scratch table

  6. We can also pin the Scratch table ‘S’ so that it is always loaded in memory for higher performance
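A minimal sketch of steps 1 to 5, using sqlite as a stand-in database; the table names (orders and orders_scratch) and columns are assumptions made for the example.

```python
# Sketch of the Scratch-table pattern (steps 1-5 above); table and column names are assumptions.
import sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, status TEXT)")
con.execute("CREATE TABLE IF NOT EXISTS orders_scratch (order_id INTEGER, change_flag TEXT)")  # step 1

def load_order(order_id, status):
    # Steps 2 and 3: every insert/update on 'T' also records the key (and a flag) in 'S'.
    if con.execute("SELECT 1 FROM orders WHERE order_id = ?", (order_id,)).fetchone():
        con.execute("UPDATE orders SET status = ? WHERE order_id = ?", (status, order_id))
        flag = "Update"
    else:
        con.execute("INSERT INTO orders (order_id, status) VALUES (?, ?)", (order_id, status))
        flag = "Insert"
    con.execute("INSERT INTO orders_scratch (order_id, change_flag) VALUES (?, ?)", (order_id, flag))

load_order(1, "OPEN"); load_order(2, "OPEN"); load_order(1, "SHIPPED")

# Step 4: downstream processes join 'S' to 'T' to pick up only the changed rows.
changes = con.execute(
    "SELECT o.order_id, o.status, s.change_flag FROM orders_scratch s "
    "JOIN orders o ON o.order_id = s.order_id").fetchall()
print(changes)

con.execute("DELETE FROM orders_scratch")   # step 5: clean up once the changes are consumed
con.commit()
```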

When to use Scratch table
  1. When we have a persistent staging area, using a Scratch table is an ideal way to move the changes to the data warehouse

  2. When the base table ‘T’ is really huge and only a few changes happen

  3. When the changes (or the Primary Key values) in table ‘T’ are required by multiple processes

  4. When the changes in table ‘T’ are to be joined with other tables; the Scratch table ‘S’ can then be used as the driving table in joins with other tables, which gives better performance since the Scratch table is thinner, with fewer records

Alternate option: Have a flag or a timestamp column in the table ‘T’ and an index on it. An index on a timestamp is costly, and a bitmap index on the flag column may be seen as an option; but the cost of updating the column during updates, huge volumes, and scenarios of joining with other tables make this a disadvantage – I have seen the Scratch table to be the better option. Let me know the other options you have used to handle such situations…
To add more variety to your thoughts, you can read more at Data Integration Challenge

Friday 13 July 2007

Data Integration Challenge – Capturing Changes


When we receive data from source systems, the data file will not carry a flag indicating whether the record provided is new or changed. We need to build a process to determine the changes and then push them to the target table.

There are two steps to it
  1. Pull the incremental data from the source file or table

  2. Process the pulled incremental data and determine the impact of it on the target table as Insert or Update or Delete

Step 1: Pull the incremental data from the source file or table
If the source system has audit columns, like a date, then we can find the new records; otherwise we will not be able to isolate the new records and will have to consider the complete data.
For a source system’s file or table that has audit columns, we would follow the steps below
  1. While reading the source records for a day (session), find the maximum value of the date (audit field) and store it in a persistent variable or a temporary table
  2. Use this persisted value as a filter in the next day’s run to pull the incremental data from the source table
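A hedged sketch of this watermark technique follows; the control table, source table and column names are assumptions made for illustration, and the audit column is assumed to be the last field of each row.

```python
# Sketch of the audit-column watermark technique described above; all names are illustrative.
import sqlite3

src = sqlite3.connect("source.db")
ctl = sqlite3.connect("etl_control.db")
ctl.execute("CREATE TABLE IF NOT EXISTS watermarks (table_name TEXT PRIMARY KEY, max_audit_date TEXT)")

def pull_incremental(table="loan_payment"):
    row = ctl.execute("SELECT max_audit_date FROM watermarks WHERE table_name = ?", (table,)).fetchone()
    last_run = row[0] if row else "1900-01-01"          # first run: take the complete data
    rows = src.execute(
        f"SELECT * FROM {table} WHERE audit_date > ? ORDER BY audit_date", (last_run,)).fetchall()
    if rows:
        new_max = rows[-1][-1]                          # max audit date (last column of the last row)
        ctl.execute("INSERT OR REPLACE INTO watermarks VALUES (?, ?)", (table, new_max))
        ctl.commit()
    return rows
```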

Step 2: Determine the impact of the record on the target table as Insert/Update/Delete
Following are the scenarios that we would face and the suggested approaches
  1. Data file has only incremental data from Step 1, or the source itself provides only incremental data

    • do a lookup on the target table and determine whether it’s a new record or an existing record
    • if an existing record then compare the required fields to determine whether it’s an updated record
    • have a process to find the aged records in the target table and do a clean up for ‘deletes’

  2. Data file has the full, complete data because no audit columns are present

    • The data is of higher volume (see the comparison sketch after this list)

      • have a back up of the previously received file
      • perform a comparison of the current file and the prior file; create a ‘change file’ by determining the inserts, updates and deletes. Ensure both the ‘current’ and ‘prior’ files are sorted by key fields
      • have a process that reads the ‘change file’ and loads the data into the target table
      • based on the ‘change file’ volume, we could decide whether to do a ‘truncate & load’
    • The data is of lower volume

      • do a lookup on the target table and determine whether it’s a new record or an existing record
      • if an existing record then compare the required fields to determine whether it’s an updated record
      • have a process to find the aged records in the target table and do a clean up or delete
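For the higher-volume case above, here is a minimal sketch of comparing the ‘prior’ and ‘current’ files to build a change file; the pipe delimiter, the key sitting in the first field and the file names are assumptions. For simplicity it loads the prior file into memory rather than doing a sorted merge, which a production version over really large files would prefer.

```python
# Sketch of the full-file comparison for the higher-volume case; the layout (pipe-delimited,
# key in the first field) is an assumption. Records in the change file are tagged I/U/D.
import csv

def build_change_file(prior_path, current_path, change_path):
    prior = {row[0]: row for row in csv.reader(open(prior_path), delimiter="|")}
    seen = set()
    with open(change_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        for row in csv.reader(open(current_path), delimiter="|"):
            seen.add(row[0])
            if row[0] not in prior:
                writer.writerow(["I"] + row)            # insert: key only in the current file
            elif prior[row[0]] != row:
                writer.writerow(["U"] + row)            # update: key present but fields changed
        for key, row in prior.items():
            if key not in seen:
                writer.writerow(["D"] + row)            # delete: key missing from the current file

build_change_file("loan_prior.dat", "loan_current.dat", "loan_changes.dat")
```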


Friday 6 July 2007

Business Intelligence Utopia – Dream to Reality: Key Enablers


In the last post, I discussed my view of BI Utopia in which information is available to all stakeholders at the right time, in the right format enabling them to make actionable decisions at both strategic & operational levels. Having said that, the BI street is not paved with gold.

I consider the following key enablers as pre-requisites to achieve true ‘Information Democracy’ in an enterprise. The “Power of Ten” includes:
  1. Proliferation of agile, modular & robust transaction processing systems.

  2. Real Time Data Integration Components

  3. Strong Data Governance structure

  4. Service Oriented Architecture

  5. Extensible Business centric Data Models

  6. Flexible business rules repositories surrounded by clean metadata/reference data environments

  7. Ability to integrate unstructured information into the BI architectural landscape

  8. Guided context-sensitive, user-oriented analytics

  9. Analytical models powered by Simulations

  10. Closed loop Business Intelligence Utopia

Each of these units comprising the “Power of Ten” is a fascinating topic on its own. We will drill-down and focus on some of the salient features of each of these areas in the coming weeks.

Friday 29 June 2007

Data Integration Challenge – Facts Arrive Earlier than Dimension


Fact transactions that come in earlier than the dimension (master) records are not bad data; such fact records need to be handled in our ETL process as a special case. Situations of facts arriving before dimensions occur quite commonly, as in the case of a customer opening a bank account and their transactions starting to flow into the data warehouse immediately.
But the customer id creation process in the Customer Reconciliation System can get delayed, and hence the customer data would reach the data warehouse only after a few days.
How we handle this scenario differs based on the business process being addressed; there could be two different requirements
  • Make the fact available and report under “In Process” category; commonly followed in financial reporting systems to enable reconciliation
  • Make the fact available only when the dimension is present; commonly followed in status reporting systems
Requirement 1: Make the fact available and report under “In Process” category
For this requirement follow the below steps
  1. Insert into the dimension table a record that represents a default or ‘In Process’ status; in the banking example, the Customer dimension would have a ‘default record’ inserted representing the fact that the customer details have not yet arrived

  2. In the ETL process, while populating the Fact table, for the transactions that do not have a corresponding entry in the Dimension table, assign a default Dimension key and insert the row into the Fact. In the same process, insert the dimension lookup values into a ‘temporary’ or ‘error’ table

  3. Build an ETL process that checks for new records inserted into the Dimension table, queries the temporary table, identifies the fact records for which the dimension key has to be updated, and updates the respective facts’ dimension keys
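A minimal sketch of these three steps, using sqlite and illustrative table names (customer_dim, payment_fact, late_dim_queue); the default dimension key of -1 is an assumption.

```python
# Sketch of requirement 1 (steps 1-3 above); table names and the -1 default key are assumptions.
import sqlite3

dw = sqlite3.connect("warehouse.db")

# Step 1: seed the dimension with a default 'In Process' member.
dw.execute("INSERT OR IGNORE INTO customer_dim (customer_key, customer_no, status) "
           "VALUES (-1, 'UNKNOWN', 'In Process')")

def load_fact(customer_no, amount):
    # Step 2: use the default key when the dimension row has not arrived, and park
    # the unresolved lookup value in a temporary/error table for later resolution.
    row = dw.execute("SELECT customer_key FROM customer_dim WHERE customer_no = ?",
                     (customer_no,)).fetchone()
    key = row[0] if row else -1
    dw.execute("INSERT INTO payment_fact (customer_key, customer_no, amount) VALUES (?, ?, ?)",
               (key, customer_no, amount))
    if key == -1:
        dw.execute("INSERT INTO late_dim_queue (customer_no) VALUES (?)", (customer_no,))

def resolve_late_dimensions():
    # Step 3: once the real dimension rows arrive, re-point the parked facts and clear the queue.
    dw.execute("""
        UPDATE payment_fact
           SET customer_key = (SELECT d.customer_key FROM customer_dim d
                               WHERE d.customer_no = payment_fact.customer_no)
         WHERE customer_key = -1
           AND customer_no IN (SELECT customer_no FROM customer_dim WHERE customer_key <> -1)""")
    dw.execute("DELETE FROM late_dim_queue WHERE customer_no IN "
               "(SELECT customer_no FROM customer_dim WHERE customer_key <> -1)")
    dw.commit()
```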

Requirement 2: Make the fact available only when the dimension is present
For this requirement follow the below steps
  1. Build an ETL process that populates the fact into a staging table

  2. Build an ETL process that pushes only the records that have a dimension value to the data warehouse tables

  3. At the end of the ETL process, delete all the processed records from the staging table, making the remaining unprocessed records available to be pulled next time

You can read more about Data Integration Challenge

Monday 25 June 2007

Business Intelligence: Gazing at the Crystal Ball


Circa 2015 – 8 years from now
The CEO of a multinational organization enters the corner office overlooking the busy city down below. On flicking a switch near the seat, the wall in front is illuminated with a colorful dashboard – what is known in CEO circles then as the Rainbow Chart.
The Rainbow Chart is the CEO’s lifeline, as it gives a snapshot of the current business position (the left portion) and also figures/colors that serve as a premonition of the company’s future (the right portion).
The current state/left portion of the dashboard, on closer examination, reveals 4 sub-parts. On the extreme left is the Balance Sheet of the business and next to it is the Income statement. The Income statement has more colors that are changing dynamically as compared to the Balance sheet. Each line item has links to it, using which the CEO can drill down further to specific geographies, business units and even further to individual operating units. The third part has the cash flow details (the colors are changing far more rapidly here) and the fourth one gives the details on inventory, raw materials position and other operational details.
The business future state/right portion of the dashboard has a lot of numbers that can be categorized into two. The first category is specific to the business – Sales in pipeline, Revenue & Cost projections, Top 5 initiatives, Strategy Maps etc. and the second category are the macroeconomic indicators across the world. At the bottom of the dashboard is a stock ticker (what else?) with the company’s stock prices shown in bold.
All these numbers & colors change in real-time and the CEO can drill up/down/across/through all the line items. Similar such dashboards are present across the organization and each one covers details that are relevant for the person’s level and position in the company.
This in essence is the real promise of BI.
Whether it happens in 2015 or earlier (hopefully not later!) is open to speculation, but the next few posts from my side will zero in on some of the pre-requisites for such a scenario – the Business Intelligence Utopia!

Business Intelligence @ Crossroads


Business Intelligence (BI) is well & truly at the crossroads and so are BI practitioners like me. On one hand there is tremendous improvement in BI tools & techniques almost on a daily basis but on the other hand there is still a big expectation gap among business users on Business Intelligence’s usage/value to drive core business decisions.

This ensures that every Business Intelligence (BI) practitioner develops a ’split’ personality – a la Jekyll and Hyde, getting fascinated by the awesome power of databases, smart techniques in data integration tools etc. and the very next moment getting into trouble with a business user on why ‘that’ particular metric cannot be captured in an analytical report.
For the BI technologists, there is never going to be a dull moment in the near future. With all the big product vendors like Microsoft, Oracle, SAP etc. throwing their might behind BI and with all the specialty BI product vendors showing no signs of slowing down, just get ready to join the big swinging party.
For the business users, there is still the promise of BI that is very enticing – ‘Data to Information to Knowledge to Actions that drive business decisions’. But they are not giving their verdict as of now. Operational folks are really not getting anything out of BI right now (wait for BI 2.0), and the strategic thinkers are not completely satisfied with what they get to see.
The techno-functional managers – the split-personality types – are the ones in the middle, trying to grapple with increasing complexity on the technology side and the ever-increasing clamor for insights from the business side.
Take sides right away – there is more coming from this space on the fascinating world of Business Intelligence.

Thursday 14 June 2007

DI Challenge – Handling Files of Different Formats with the Same Subject Content


In a Data Integration environment that has multiple OLTP systems for the same business functionality, one scenario that occurs quite commonly is that of these systems providing files of different formats with the same subject content.
Different OLTP systems with the same functionality may arise in organizations, as in the case of a bank having its core banking systems running on different products due to an acquisition or merger, or in the simple case of the same application having multiple instances with country-specific customizations.
For example, data about the same subject, like ‘loan payment details’, would be received on a monthly basis from different OLTP systems in different layouts and formats. These files might arrive at different frequencies and may be incremental or full files.
Files having the same subject content always reach the same set of target tables in the data warehouse.
How do we handle such scenarios effectively and build a scalable Data Integration process?
The following steps help in handling such situations effectively
• Since all the files provide data related to one common subject, prepare a Universal Set of fields that represents that subject. For example, for the loan payment subject we would have a set of fields identified as a Universal Set representing details about the guarantors, borrower, loan account etc. This universal field list is called the Common Standard Layout (CSL)
• Define the CSL fields with a Business Domain specialist, and define certain fields in the CSL as mandatory or NOT NULL fields, which all source files should provide
• Build a set of ETL processes that process the data based on the CSL layout and populate the target tables. The CSL layout could be a table or a flat file; if the CSL is a table, define the fields as character. All validations that are common to that subject are built in this layer.
• Build an individual ETL process for each file, which maps the source file’s fields to the CSL structure. All file-specific validations are built in this layer (a short sketch follows).
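An illustrative sketch of the last two steps: one small mapping function per source layout feeding a single CSL-level loader. The CSL field list and the two source layouts are assumptions made purely for the example.

```python
# Sketch of the CSL approach: one mapping layer per source file, one common loader.
# The CSL field list and the two source layouts are illustrative assumptions.
import csv

CSL_FIELDS = ["loan_account_no", "borrower_id", "payment_amount", "payment_date"]

def map_system_a(row):
    # Source A: positional, pipe-delimited layout -> CSL dictionary.
    return {"loan_account_no": row[0], "borrower_id": row[2],
            "payment_amount": row[5], "payment_date": row[1]}

def map_system_b(record):
    # Source B: named CSV columns -> CSL dictionary.
    return {"loan_account_no": record["ACCT"], "borrower_id": record["CUST_ID"],
            "payment_amount": record["AMT"], "payment_date": record["PAY_DT"]}

def load_csl(csl_rows):
    # Common layer: mandatory-field checks and target loading are written once, against the CSL.
    for row in csl_rows:
        assert all(row.get(f) not in (None, "") for f in CSL_FIELDS), f"mandatory field missing: {row}"
        print("load into target:", [row[f] for f in CSL_FIELDS])  # stand-in for the real DB load

# Adding a new source later only requires a new map_* function; the common layer is untouched.
load_csl(map_system_a(r) for r in csv.reader(open("system_a.dat"), delimiter="|"))
load_csl(map_system_b(r) for r in csv.DictReader(open("system_b.csv")))
```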
Benefits of this approach
• Conversion of all source file formats to the CSL ensures that all the common rules are developed as reusable components
• Adding a new file that provides the same subject content is easier; we just need to build a process to map the new file to the CSL structure
Read more about: Data Integration Challenge