
Thursday 14 June 2007

DI Challenge – Handling Files of Different Formats with the Same Subject Content


In a Data Integration environment with multiple OLTP systems serving the same business functionality, one scenario that occurs quite commonly is that of these systems providing files of different formats with the same subject content.
Different OLTP systems with the same functionality may arise, for example, when a bank runs its core banking on different products due to an acquisition or merger, or simply when the same application runs as multiple instances with country-specific customizations.
For example, data about the same subject, such as 'loan payment details', might be received on a monthly basis from different OLTP systems in different layouts and formats. These files might arrive at different frequencies and may be incremental or full files.
Files with the same subject content always load into the same set of target tables in the data warehouse.
How do we handle such scenarios effectively and build a scalable Data Integration process?
The following steps help in handling such situations effectively:
• Since all the files provide data related to one common subject, prepare a universal set of fields that represents that subject. For example, for any loan payment subject we would identify a universal set of fields representing details about the guarantors, the borrower, the loan account etc. This universal field list is called the Common Standard Layout (CSL)
• Define the CSL fields with a business domain specialist, and mark certain CSL fields as mandatory (NOT NULL) fields that every source file must provide
• Build a set of ETL processes that process data based on the CSL layout and populate the target tables. The CSL layout could be a table or a flat file; if the CSL is a table, define the fields as character. All validations common to the subject are built in this layer
• Build an individual ETL process for each file that maps the source file's fields to the CSL structure. All file-specific validations are built in this layer (a minimal sketch follows this list)
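As a minimal sketch of this two-layer design (assuming Python and made-up field names for the loan payment subject), each source file gets its own mapper into the CSL, while validation of the mandatory CSL fields lives in the common layer:

```python
import csv

# Hypothetical CSL mandatory fields for the 'loan payment' subject.
CSL_MANDATORY = ["loan_account_id", "payment_date", "payment_amount"]

def map_system_a(row):
    """File-specific layer: maps System A's layout onto the CSL."""
    return {
        "loan_account_id": row["ACCT_NO"],
        "borrower_name": row["CUST_NAME"],
        "guarantor_name": row.get("GUARANTOR", ""),
        "payment_date": row["PAY_DT"],
        "payment_amount": row["AMT"],
    }

def validate_csl(record):
    """Common layer: validations shared by every file of this subject."""
    missing = [f for f in CSL_MANDATORY if not record.get(f)]
    if missing:
        raise ValueError(f"Mandatory CSL fields missing: {missing}")
    return record

def load_file(path, mapper):
    """Map one source file to the CSL and run the common validations."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            yield validate_csl(mapper(row))
```

Adding a new source system then means writing only another mapper like map_system_a; validate_csl and the downstream load remain untouched.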
Benefits of this approach
• Converting all source file formats to the CSL ensures that all the common rules are developed as reusable components
• Adding a new file with the same subject content is easier: we only need to build a process that maps the new file to the CSL structure
Read more about: Data Integration Challenge

Monday 11 June 2007

First Step in Knowing your Data – ‘Profile It’


The Chief Data Officer (CDO), the protagonist introduced earlier on this blog, has the unenviable task of understanding the data within the organization's boundaries. Having categorized the data into 6 MECE sets (read the post dated May 30 on this blog), the data reconnaissance team starts its mission with the first step – 'Profiling'.
Data Profiling, at the most fundamental level, involves understanding:
1) How is the data defined?
2) What is the range of values that the data element can take?
3) How is the data element related to others?
4) What is the frequency of occurrence of certain values, and so on.
A slightly more sophisticated definition of Data Profiling would include analysis of data elements in terms of the following (a minimal profiling sketch follows the list):
  • Basic statistics, frequencies, ranges and outliers
  • Numeric range analysis
  • Identify duplicates within name-and-address and non-name-and-address information
  • Identify multiple spellings of the same content
  • Identify and validate redundant data and primary/foreign key relationships across data sources
  • Validate data specific business rules within a single record or across sources
  • Discover and validate data patterns and formats
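As an illustration, here is a minimal profiling sketch in Python (column values are passed in as raw strings; the sample data is made up) covering null counts, frequencies, numeric ranges and pattern discovery:

```python
from collections import Counter

def profile_column(values):
    """Compute basic profile statistics for one column of raw (string) data."""
    non_null = [v for v in values if v not in ("", None)]
    freq = Counter(non_null)
    profile = {
        "count": len(values),
        "nulls": len(values) - len(non_null),          # missing values
        "distinct": len(freq),
        "top_values": freq.most_common(5),             # frequency analysis
        "patterns": Counter(                           # pattern discovery, e.g. 'AA-99'
            "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in v)
            for v in non_null
        ).most_common(3),
    }
    numeric = [float(v) for v in non_null
               if v.replace(".", "", 1).lstrip("-").isdigit()]
    if numeric:                                        # numeric range analysis
        profile["min"], profile["max"] = min(numeric), max(numeric)
    return profile

print(profile_column(["10", "25", "", "25", "AB-12"]))
```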
Armed with statistical information about the critical data present in enterprise-wide systems, the CDO's team can devise specific strategies to improve the quality of data, and hence the quality of information and business decision-making.

Read more about: Data Profiling


Friday 1 June 2007

What is Data Integration or ETL?


ETL represents the three basic steps, illustrated with a toy sketch after the list:
  1. Extraction of data from a source system

  2. Transformation of the extracted data and

  3. Loading the transformed data into a target environment
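A toy end-to-end pipeline in Python makes the three steps concrete (the file name, field names and SQLite target here are all made-up assumptions):

```python
import csv
import sqlite3

def extract(path):
    """Extraction: read rows from a source flat file."""
    with open(path, newline="") as fh:
        yield from csv.DictReader(fh)

def transform(rows):
    """Transformation: clean and reshape the extracted data."""
    for row in rows:
        yield (row["customer_id"].strip(), float(row["amount"]))

def load(rows, conn):
    """Loading: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("payments.csv")), conn)
```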

In general, 'ETL' represented more of a batch process that gathered data from flat files or relational structures. When ETL systems started supporting data from wider sources such as XML, industry-standard formats like SWIFT, unstructured data, and real-time feeds like message queues, 'ETL' evolved into 'Data Integration'. That is why all ETL product vendors are now called Data Integrators.
Now let us see how Data Integration or ETL has evolved over time. There are three ways of performing DI:
  • Write Code
  • Generate Code
  • Configure Engine
Write Code: Write a piece of code in a programming language, compile and execute
Generate Code: Use a Graphical User Interface to input the requirements of data movement, generate the code in a programming language, compile and execute
Configure Engine: Use a Graphical User Interface to input the requirements and save the inputs (metadata) in a data store (repository). A generic, pre-compiled engine then interprets the metadata from the repository and executes it.
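A minimal sketch of the 'Configure Engine' idea in Python (the metadata format and field names are invented for illustration; real products keep this metadata in a repository):

```python
import csv

# Metadata describing one data movement: which file to read and how
# each target field maps to a source field, with an optional transform.
metadata = {
    "source": "payments.csv",
    "mappings": [
        {"target": "customer_id", "source": "CUST_ID"},
        {"target": "amount", "source": "AMT", "transform": float},
    ],
}

def run_engine(meta):
    """Generic pre-built engine: interprets the metadata at run time;
    no job-specific code is generated or compiled."""
    with open(meta["source"], newline="") as fh:
        for row in csv.DictReader(fh):
            yield {m["target"]: m.get("transform", str)(row[m["source"]])
                   for m in meta["mappings"]}

for record in run_engine(metadata):
    print(record)
```

Changing what the job does means editing the metadata, not recompiling code, which is what keeps the engine and its metadata in sync.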
Pros and cons of each approach:

Write Code
Pros:
  • Easy to get started for smaller tasks
  • Complex data handling requirements can be met
Cons:
  • Labor-intensive development; error-prone and time-consuming
  • Large effort in maintenance of the code

Generate Code
Pros:
  • Developer friendly for designing the requirements
  • Metadata of the requirements is captured
Cons:
  • Large effort in maintenance of the code
  • Metadata and the deployed code can get out of sync

Configure Engine
Pros:
  • Developer friendly for designing the requirements
  • Metadata of the requirements is captured
  • Easier code maintenance
  • Flexibility to access any type of data source
  • Scalable for huge data volumes; supports architectures like SMP, MPP, NUMA-Q, GRID etc.
Cons:
  • Certain data handling requirements might require adding hand-written code
  • Needs a dedicated environment, servers and an initial configuration process

Read more about: Data Integration

Wednesday 30 May 2007

New Face on the Corporate Board – The CDO


People who read the last post on this blog “What is a Data Warehouse” would probably accept my view that for an organization to get better at anything worthwhile, “data” is everything. If you accept this notion, I propose the immediate creation of a new ‘C’ level organizational position – Chief Data Officer (CDO).

To me, the CDO is a more important position than the more glamorous CIO (Chief Information Officer). After all, the input to any strategic information is raw data, and many organizations don't have a comprehensive focus on the data present within their boundaries. It is important to realize that 'Good data is a source of competitive advantage, not just any data'.
Let us for a moment assume that there is an organization with the CDO structure in place. The next question is – How should the CDO go about doing the job, given the massive amount of data generated by organizations? – Answer: Divide & Conquer!
The 6 mutually exclusive, collectively exhaustive (MECE) types of organizational data are given below:
Type 1) Transaction Structure Data – Business processes are a series of never-ending transactions. Each of these transactions has a context, and that context is defined by this category of data. Examples are: Products, Customers, Departments etc.
Type 2) Transaction Activity Data – These are the transactions themselves. Ex: Purchase Order data, Sales Invoice data etc.
Type 3) Enterprise Structure Data – These data elements are unique to each organization and the inter-relationships between data elements are important. Ex: Chart of Accounts, Org Structure, Bill of materials, etc.
Type 4) Reference Data – A set of codes, typically name-value pairs, that drive business rules (a small example follows this list). Ex: Region Codes, Customer Types etc.
Type 5) Metadata – Data that defines other data thus making the collection a self-defining entity
Type 6) Audit Data – With so much focus on regulatory compliance, this is the data that tracks all the operations within a data store
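As a small illustration of Type 4, reference data often boils down to name-value lookups that drive a business rule; the codes below are made up:

```python
# Hypothetical reference data: name-value pairs driving a business rule.
REGION_CODES = {"N": "North", "S": "South", "E": "East", "W": "West"}
CUSTOMER_TYPES = {"R": "Retail", "C": "Corporate"}

def describe_customer(region_code, type_code):
    """Resolve codes through reference data lookups."""
    return (f"{CUSTOMER_TYPES[type_code]} customer "
            f"in the {REGION_CODES[region_code]} region")

print(describe_customer("N", "C"))  # -> Corporate customer in the North region
```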
Types 1, 3 & 4 together are defined as Master Data, whose management is the subject of numerous BI articles and white papers.
Our CDO would do well to understand all these 6 types of data in the organization and have some specific strategies to improve their quality. This & many other data management strategies will be the focus of this blog – Please do keep reading.

Thursday 24 May 2007

What is a Data Warehouse (DW)?


To define the term Data Warehouse (DW), especially for software developers who are new to the industry, I have tried asking a few simple questions before getting to the classic definition in the words of Bill Inmon. Some of the questions that lead to defining a Data Warehouse are:

Q: What is Data?
A: 'Data' is a collection of facts captured as events happen.
E.g., the content present in a survey sheet is 'Data'.

Q: What is information?
A: The details derived by processing the 'Data' are called Information.
E.g., the details arrived at from the survey data, like the total, average etc., are called Information (a short example follows).
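A trivial worked example (the survey numbers are made up): the raw responses are 'Data'; the derived total and average are 'Information':

```python
# 'Data': raw facts captured on a survey sheet.
survey_ratings = [4, 5, 3, 4, 5]

# 'Information': details derived by processing the data.
total = sum(survey_ratings)
average = total / len(survey_ratings)
print(total, average)  # -> 21 4.2
```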

Q: What is a system that collects ‘Data‘ called?
A: A computer system that collects 'Data' is usually called an OLTP (Online Transaction Processing) system. Such a system is designed to collect data rapidly.
E.g., survey data captured on a laptop through a software application, an ATM machine, or a core banking system handling deposit/debit transactions.

Q: How is ‘Information’ derived from ‘Data’?
A: The 'Data' is pulled out of the OLTP system, moved to a separate data store/system, and then processed to derive Information. A computer system that acts as the platform for processing 'Data' to derive 'Information' is called a Data Warehouse.

The 'Information' gathered from the DW system helps an organization gain more knowledge about its business. This knowledge in turn supports decision making; hence the DW system is part of the "Decision Support System".

Q: What are the key characteristics of a Data Warehouse?
A: A DW is designed to
1. store large quantities of data across the years
2. push 'Data' out of its storage to the information processing engine faster

Q: Why is a Data Warehouse required?
A: The OLTP system is usually used by many people to collect (push) data from the outside world into its storage, whereas the DW system is usually used by few people to pull data out of its storage. The volume of data lying inside a DW system is much higher than that in an OLTP system. Since the purpose of each system is different, designing separate OLTP and DW systems to cater to their unique requirements became imperative.

But this segregation between OLTP and DW happened gradually. In the initial years, DW-related activities were done on the OLTP systems themselves, and this still happens until an organization or department feels the need for a dedicated DW system.

The need for a DW system is felt due to issues related to
1. Performance
2. Maintenance
3. Data Integration
Read more about: Data Warehouse


-Pandian C M