Data Warehouse ETL Transformation
Designing the transformation process: standardizing, integrating, cleansing, augmenting and aggregating the data, and creating the data sets for loading into the repository.

Data Warehouse Transformation Design - Concept

NOTE - While this page is part of the Data Warehouse section, the concept and process can be equally applied in the areas of Master Data Management and Operational Data Store. Both MDM and ODS involve the use of Extraction, Transformation and Loading; the extent and intensity of transformation vary depending upon the purpose.

Once the data is extracted, a myriad of grooming actions are performed to make it presentable for loading into the data warehouse presentation server. In my view, data transformation is 40% of the project in terms of work; it is the most complex and biggest challenge in the data warehouse. If digging out iron ore is akin to Extraction, Transformation is equivalent to the end-to-end process of the steel mill, churning out stainless steel billets with the right mix of iron, nickel and chromium, and each of its molecules aligned in the desired crystalline form. Loading and the end-user applications are equivalent to the steel fabricating units.

The above example may not be the best one. While stainless steel production is well defined given a standard input of iron ore, data inputs vary infinitely in content and quality, and the requirements of the users keep changing.

Following are the types of ETL transformations performed on the data:

Integration

This step links the data from myriad sources into a well-linked and integrated universe. Transformation for data integration is a subject in itself, and volumes can be written on this topic alone. Here are some key elements of integration:

Creating common keys:

Different systems have different keys representing the same entity. As an example, the core production system could have an agent code different from the one in the sales system. The same applies to vendor, customer and other codes. Reconciling them is a tedious process, but worth doing. Ideally, this change should first be made in the production systems themselves or through Master Data Management.

Creating surrogate keys

Creating common keys is linked to production keys. As mentioned in the dimensional model, one should use surrogate keys in the loaded data and star schemas. However, one can keep the production key as one of the attributes in the dimension table. Creating surrogate keys is tedious but straightforward. At a high level (a minimal sketch follows the steps below):

  • Pick the dimensions against which you have to create the keys.
  • Against each dimension, list out the maximum possible instances.
  • Assign a surrogate key range of numbers (or numbers + characters) for each dimension.
  • Map the dimension production keys to the surrogate key range, with the dimension rows in sorted, chronological or some other stable order.
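
As a hedged illustration of the steps above, here is a minimal Python sketch of surrogate key assignment. The production key values, the sort order and the starting range are assumptions made for the example, not values from this text.

    # Minimal sketch: assign surrogate keys to production keys of an Agent dimension.
    production_agent_keys = ["SAL-0042", "SAL-0007", "CRM-0193"]   # assumed keys from source systems

    def assign_surrogate_keys(production_keys, start=1):
        """Map each production key to a sequential surrogate key."""
        mapping = {}
        next_key = start
        for prod_key in sorted(production_keys):   # sorted, chronological or any stable order
            mapping[prod_key] = next_key
            next_key += 1
        return mapping

    agent_key_map = assign_surrogate_keys(production_agent_keys, start=1)
    # e.g. {"CRM-0193": 1, "SAL-0007": 2, "SAL-0042": 3}
    # The production key is kept as an attribute in the dimension row alongside the surrogate key.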

Standardizing the descriptions and textual attributes:

This is another of those transformations which would serve a great operational purpose if it were done in the source system itself. A product, a plan, a channel and a host of other entities carry different descriptions in different systems. The transformation process should apply the same textual and descriptive attributes across all masters and codes.

One may be in for many shocks when one actually starts querying the extracted data for congruence of keys and descriptions. In a way, integration-related transformation surfaces a lot of data quality fixes needed in the source systems themselves.

Translation and standardization across organizational standards & structures

  • Normalizing different accounting conventions.
  • Different operating entities having different year-ends, period-closing or reporting cut-offs.
  • The departments could be organized differently (customer service may be part of operations in one unit and a 'customer relations unit' in the other).
  • Currency conversion (typically between local and head-office currency).
  • Transfers, transactions and charge-outs may have to be re-organized.

The work for most of these is done in the analysis phase, where the 'as is' and 'to be' values for the above areas are defined. This standardization and normalization is more of a business decision and policy challenge than a technical issue.
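
To make one such normalization concrete, here is a small, hedged Python sketch of currency conversion to a head-office currency. The rates, entity names and amounts are illustrative assumptions, not values from this text.

    # Illustrative sketch: convert local-currency amounts to an assumed head-office currency (USD).
    rates_to_usd = {"INR": 0.012, "EUR": 1.08, "USD": 1.0}   # assumed conversion rates

    transactions = [
        {"entity": "India Ops", "amount": 250000, "currency": "INR"},
        {"entity": "EU Ops",    "amount": 1800,   "currency": "EUR"},
    ]

    for txn in transactions:
        # Each transaction keeps the local amount and gains the standardized head-office amount.
        txn["amount_usd"] = round(txn["amount"] * rates_to_usd[txn["currency"]], 2)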

Transformation for common dimension attributes

  • Classifying a product under a common product group. (For example, one unit could place the scanner under an 'Office Automation' product group while another places it under a 'Paperless Office' segment.)
  • Classifying customer groups under the same customer segments. (One has to decide whether 'ABC Bank' belongs to the 'Banking Segment', the 'Corporate Segment', the 'Financial Services Segment', or more than one of these.)

Data Quality Checks and associated Transformations

Data Quality

Data quality is typically the first step in the transformation process. One generally does not find many issues with data quality where the transaction directly impacts product delivery, such as credit card transactions or renewal fee data. However, all other data develops quality issues. Refer to Data Cleansing and customer data challenges.

Data Relevance

Data placed for analysis should be relevant and stripped of any elements not required for analysis or not included in the dimensional model. Part of this objective is achieved in the Extraction process, where only the relevant fields are picked up. Nonetheless, the more complex elimination/filtering is done after extraction, while the transformation is still working on detailed data. This includes:

  • Removal of unnecessary attributes/fields: if you are analyzing your data by states and PIN codes, you may not need to include 'street name'.
  • Removal of whole entities: entities that play no part in the analysis can be dropped altogether.
  • Truncating or eliminating codes: for example, your customer code may be a combination of segment code + year + running serial number. As you are not going down to individual-level details, you may eliminate the running serial number (a small sketch follows this list).
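
As a hedged example of the last point, the sketch below strips the running serial number from a composite customer code. The code layout (segment + year + serial) is assumed purely for illustration.

    # Assumed customer code layout: <segment>-<year>-<running serial>, e.g. "RET-2009-004512".
    def truncate_customer_code(code: str) -> str:
        """Drop the running serial number, keeping only segment and year."""
        segment, year, _serial = code.split("-")
        return f"{segment}-{year}"

    print(truncate_customer_code("RET-2009-004512"))   # -> "RET-2009"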

De-duping, merging and data cleansing

Refer to Customer Data Correction to understand various customer data correction techniques.

If you have addresses from various sources in different fields, you may go for merging (for example, H.No. + Street No. into 'Address 1'). There are other cases of merging as well, where multiple fields together form the lowest-level grain. For example, product code + product model may be two separate fields that together make up the primary key of the source product table. Though you may choose otherwise, it is often better to merge the two and keep them as a single 'source system key reference' attribute in the dimension table that carries the surrogate key.

There is another type of merging where a many-to-one transformation happens, when the variety in the source table is more than what is required in the target tables. For example, the source system could record the profession of the customer as Student, Apprentice, Trainee or Research Assistant, which you can merge as 'Student'.
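
The two kinds of merging above can be sketched as follows; the field names and the profession mapping are illustrative assumptions rather than prescriptions.

    # 1) Field merging: combine house number and street into a single 'Address 1' field.
    record = {"h_no": "12-B", "street_no": "4th Cross"}
    record["address_1"] = f"{record['h_no']}, {record['street_no']}"

    # 2) Many-to-one merging: collapse several source professions into one target value.
    profession_map = {
        "Student": "Student",
        "Apprentice": "Student",
        "Trainee": "Student",
        "Research Assistant": "Student",
    }
    source_profession = "Apprentice"
    target_profession = profession_map.get(source_profession, source_profession)  # default: keep as-is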

Data Augmentation and Enrichment

Refer to Customer Augmentation and Enrichment to understand the various transformations possible.

Data Type conversion

There are many fields in production data systems which are not used in processing logic and therefore do not reflect their true nature. For example, you may find the agent's date of birth, or the number of credit cards the customer already holds, stored as character fields. Some data type conversion also happens in the Extraction process due to database model differences between the source and target systems.
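
A minimal sketch of such conversions follows, assuming the date-of-birth and card-count fields arrive as character strings in an assumed format.

    from datetime import datetime

    # Assumed source values stored as character fields.
    agent_dob_text = "1978-03-21"
    existing_cards_text = "03"

    agent_dob = datetime.strptime(agent_dob_text, "%Y-%m-%d").date()  # character -> date
    existing_cards = int(existing_cards_text)                         # character -> integer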

Data Transformation into the Data Sets for Presentation Servers

De-normalization

Dimensional modeling is standard in its approach, so the varied designs of production data models must change to meet the data warehouse model. This results in both de-normalizing and normalizing. De-normalizing happens more often than normalizing, as a dimensional model is inherently heavily skewed towards being de-normalized. For example, you may have city, zone, region, state and country as separate tables in a production data model, whereas they will be de-normalized in the dimensional model into a single location dimension with the lowest-level grain of, say, city.
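
A hedged sketch of de-normalizing separate city, state and country lookup tables into a single location dimension; the table contents and column names are assumed for the example.

    import pandas as pd

    # Assumed normalized production lookups.
    city = pd.DataFrame({"city_id": [1], "city": ["Pune"], "state_id": [10]})
    state = pd.DataFrame({"state_id": [10], "state": ["Maharashtra"], "country_id": [91]})
    country = pd.DataFrame({"country_id": [91], "country": ["India"]})

    # De-normalize into one location dimension with City as the lowest-level grain.
    dim_location = (
        city.merge(state, on="state_id")
            .merge(country, on="country_id")
            [["city_id", "city", "state", "country"]]
    )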

Normalization

Though normalization instances are far fewer than de-normalization, they are not rare. The reason is that, in the real world, production databases are not normalized to copybook expectations: highly normalized production tables lead to response-time issues due to too many joins, and the technical design and programming teams find it easier to handle fewer tables.

For example, a policy table could have 'sign date', 'application received date', 'underwritten date', 'issuance date', 'dispatch date' and 'policy acknowledgement date' all as individual fields in the same row of the 'application flow' table. In the dimensional model, however, they are separate dimensions, meaning they will carry individual dimension keys and have separate dimension tables.
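
A hedged sketch of turning those in-row date columns into separate dimension keys against a shared date dimension; the lookup values and field names are assumptions.

    # Assumed date dimension: maps calendar dates to surrogate keys.
    date_dim = {"2010-01-05": 101, "2010-01-12": 102, "2010-01-20": 103}

    policy_row = {
        "policy_no": "P-9921",
        "sign_date": "2010-01-05",
        "issuance_date": "2010-01-12",
        "dispatch_date": "2010-01-20",
    }

    # Replace each date column with its dimension key (one date dimension role per column).
    fact_row = {
        "policy_no": policy_row["policy_no"],
        "sign_date_key": date_dim[policy_row["sign_date"]],
        "issuance_date_key": date_dim[policy_row["issuance_date"]],
        "dispatch_date_key": date_dim[policy_row["dispatch_date"]],
    }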

Create Derived Attributes

Not all dimension attributes are present in the production systems, yet they are important for analysis. Period and date attributes are good examples: you may have the date field in the source system, but you have to create the following through transformation rules:

  • Week of the year
  • Month of the Year
  • Quarter of the year
  • Rolling quarter of the year

Similarly, Location could have the city, state and country, but not the region and zone attributes.
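
A minimal sketch of deriving the period attributes listed above from a source date field; the 'rolling quarter' definition shown (quarters counted from an assumed base year) is only one possible interpretation.

    from datetime import date

    txn_date = date(2010, 8, 17)   # assumed source date field

    week_of_year = txn_date.isocalendar()[1]
    month_of_year = txn_date.month
    quarter_of_year = (txn_date.month - 1) // 3 + 1
    # Assumed definition: quarters counted continuously from an assumed base year (2000).
    rolling_quarter = (txn_date.year - 2000) * 4 + quarter_of_year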

Calculation, Derivation, Allocation

This is an important component, especially for populating derived facts. You can refer to the 'Facts + Derived Facts matrix'. The derivation happens as per the derivation rules. There is always a question of whether a derivation should be dynamic or pre-stored. The rule of thumb is: if it is complex and used often, pre-store it.
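
As a hedged illustration of pre-storing a derived fact, the margin calculation below is an assumed derivation rule, not one taken from this text.

    # Assumed derivation rule: margin = revenue - cost, pre-stored because it is used often.
    fact_rows = [
        {"product_key": 1, "revenue": 1200.0, "cost": 950.0},
        {"product_key": 2, "revenue": 800.0,  "cost": 610.0},
    ]
    for row in fact_rows:
        row["margin"] = row["revenue"] - row["cost"]   # stored alongside the base facts at load time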

Aggregation

Aggregation is mainly for the purpose of creating aggregated schemas in the data warehouse. Referring to the dimensional model matrix of data marts, fact grains and the derived fact tables, you will be able to set up the routines for the aggregation needed.
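
A hedged sketch of building an aggregated fact set from a detailed one; the grain (monthly by product group) and the column names are assumptions for the example.

    import pandas as pd

    # Assumed detailed fact table.
    detail = pd.DataFrame({
        "product_group": ["Office Automation", "Office Automation", "Paperless Office"],
        "month": ["2010-01", "2010-01", "2010-01"],
        "sales_amount": [1200.0, 450.0, 300.0],
    })

    # Aggregate to the product-group / month grain for the aggregated schema.
    agg_fact = (
        detail.groupby(["product_group", "month"], as_index=False)["sales_amount"].sum()
    )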

Data Warehouse Transformation Design - Process

Given the types of transformations required, there are a hundred different ways to do the same thing, and there are no fixed rules and very few guidelines for this purpose. The transformation step proceeds as follows:

Data Transformation Specifics - Pre Data Set Creation

Create the headings of all the data groups (prior to the transformation, so that they can be converted into the dimensional model format).

Against each heading, create:

  • The list of the transformations required.
  • The transformation script code.
  • The input fields and tables impacted.
  • The transformation activity done.
  • The output fields, tables and their specifications.

The above is applicable to Data Integration, Data Quality, Customer Data Augmentation and Enrichment, and special situations.
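
One hedged way to capture such a specification per data group is a simple record structure; the fields mirror the checklist above and the example values are assumptions.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TransformationSpec:
        """One entry per data group, mirroring the checklist above."""
        data_group: str
        transformations_required: List[str]
        script_reference: str
        input_fields: List[str]
        activity_done: str
        output_fields: List[str]

    # Assumed example entry.
    agent_spec = TransformationSpec(
        data_group="Agent dimension",
        transformations_required=["common key creation", "surrogate key assignment"],
        script_reference="xform_agent_dim.py",
        input_fields=["sales.agent_code", "core.agent_id"],
        activity_done="Mapped production agent codes to surrogate keys",
        output_fields=["dim_agent.agent_key", "dim_agent.source_agent_code"],
    )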

Creation of Final Data Sets for Loading into the Presentation Server

Once the data is ready from the quality, integration and augmentation point of view, it can undergo the transformations needed to create the data sets to be loaded into the presentation/loading servers.

Data Quality and completeness checks post completion of Data Transformation

Refer to 'Data Warehouse Data Quality Assurance' for various quality checks.




