Data Cleansing and Correction
So far, in data profiling and data monitoring, we have covered the question 'what is the state of the data?'. Moving forward, DQ tools should also help fix the data in order to improve its quality.
De-duping: This includes the ability to:
- Merge duplicate or near-duplicate records to get the best of multiple records. This means that you can pick up the name and address from one record and the PIN code from another. The merging can be driven automatically by business rules, or it can be done manually. Another aspect of merging is numeric data, such as financial transaction values.
- Select the best record: This is the opposite of merging. The tool should be able to select the most suitable record and delete the others.
- Select the best record and fix it: After selecting the best record, the system should be able to apply the corrections described in the following points.
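The merge and select-the-best options above can be sketched in a few lines. This is only an illustration under assumed rules: the field names (name, address, pin_code), the "first non-blank value wins" merge rule and the "most populated record wins" selection rule are hypothetical, not the behavior of any particular DQ tool.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def merge_duplicates(records):
    """Merge duplicates: for each field keep the first non-blank value seen."""
    merged = {}
    for record in records:
        for field, value in record.items():
            # Only overwrite a field that is still missing in the merged record.
            if is_blank(merged.get(field)):
                merged[field] = value
    return merged


def select_best(records):
    """Select the record with the most populated fields (assumed rule)."""
    return max(records, key=lambda r: sum(not is_blank(v) for v in r.values()))


duplicates = [
    {"name": "John Smith", "address": "12 Main St", "pin_code": ""},
    {"name": "J. Smith", "address": "", "pin_code": "560001"},
]
golden_record = merge_duplicates(duplicates)
```

Here golden_record takes the name and address from the first record and the PIN code from the second. A real tool would usually let business rules decide which source wins for each field.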
Standardizing
The tool should be able to refer to a standards database or use business rules to standardize the target data. This standardization can take the following forms:
- Generic standardization of names, locations and PIN codes
- Context-specific standardization, such as standardizing customer IDs, product IDs, product names, etc.
Spelling corrections
This is fairly simple and works like the spell check of any productivity tool. However, given that it is an enterprise tool, it should have more robust capabilities. It should also be able to fix spellings in batch mode.
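A batch-mode spelling correction can be sketched as a fuzzy match of each value against a reference vocabulary. The sketch below uses difflib from the Python standard library; the vocabulary, the sample values and the 0.8 similarity cutoff are assumptions chosen for illustration, and an enterprise tool would likely use a more sophisticated matcher.

```python
import difflib


def correct_spelling(values, vocabulary, cutoff=0.8):
    """Replace each value with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for value in values:
        # get_close_matches returns the best matches first; ask for only one.
        matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
        # Leave the value untouched when nothing in the vocabulary is similar.
        corrected.append(matches[0] if matches else value)
    return corrected


# Hypothetical reference vocabulary and a batch of raw values.
cities = ["Bangalore", "Mumbai", "Chennai"]
batch = ["Bangalore", "Banglore", "Mumbay", "Xyz"]
fixed = correct_spelling(batch, cities)
```

Values with no sufficiently close match (such as "Xyz" here) are left as they are, so they can be routed to manual review rather than silently changed.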
Standard databases available for locations, names, PIN codes, geo-codes, etc.
This is linked to data quality rules and cleansing capabilities. For the purpose of standardization, a DI will have databases of standard names, locations and addresses. For example, it will have a mapping which says that 'NY', 'N.Y.' and 'Newyork' will be standardized to 'New York'.
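The 'New York' mapping can be illustrated with a simple lookup table. This is a minimal sketch: the dictionary stands in for a standards database, and the normalization step (trimming, lower-casing, collapsing whitespace) is an assumption about how variants would be matched.

```python
# Hypothetical stand-in for a standards database of location names.
STANDARD_LOCATIONS = {
    "ny": "New York",
    "n.y.": "New York",
    "newyork": "New York",
    "new york": "New York",
}


def standardize_location(raw):
    """Map a raw location to its standard form; return it unchanged if unknown."""
    # Normalize case and whitespace before looking the value up.
    key = " ".join(raw.strip().lower().split())
    return STANDARD_LOCATIONS.get(key, raw)
```

For example, standardize_location("N.Y.") and standardize_location(" Newyork ") both return 'New York', while an unknown value such as 'Boston' passes through unchanged.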
Data Cleansing with localizations for a wide range of languages and locations
This applies to names, locations, addresses, etc.
Data Augmentation and Enrichment Capabilities
Most augmented and enriched data is not used for production processing, as it is not expected to be accurate. Augmentation and enrichment are generally used for analytics and data mining.
- Ability to fill in missing data using extrapolation: Extrapolation is used to update customer data based on heuristics. An example is extrapolating a customer's current salary from their salary five years ago.
- Ability to share house-holding information across various components of a BI platform: House-holding is a technique by which you tag multiple customer records to a common group. An example is customers belonging to the same family or the same association.
- Applying clustering, averages and means: The tool should be able to apply different aggregation functions to estimate the value of blank fields.
- Using the most probable value: The tool should be able to deduce the probable value based on statistical analysis or heuristics. For example, if a customer has an income above USD 25,000 and is above 30 years of age, the field 'whether taken mortgage' will have 'yes' as its most probable value.
- Ability to derive data: The tool should be able to derive age from date of birth, and state from city.
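Several of the capabilities above reduce to small calculations, sketched below. The function names, the compound-growth model for salary and the use of the mean and the mode as estimators are assumptions made for illustration; a real tool would expose its own rules and statistics.

```python
from datetime import date
from statistics import mean, mode


def extrapolate_salary(past_salary, years, annual_growth):
    """Extrapolate a salary forward, assuming a constant yearly growth rate."""
    return past_salary * (1 + annual_growth) ** years


def fill_with_mean(values):
    """Replace missing (None) entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from
    average = mean(known)
    return [average if v is None else v for v in values]


def most_probable(values):
    """Return the most frequent known value, or None if all are missing."""
    known = [v for v in values if v is not None]
    return mode(known) if known else None


def derive_age(date_of_birth, today):
    """Derive age in completed years from the date of birth."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years
```

As a usage example, derive_age(date(1990, 6, 15), date(2024, 6, 14)) gives 33, and fill_with_mean([10, None, 20]) replaces the blank with 15. Because such values are estimates, they are better suited to analytics than to production processing, as noted above.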