Data Standardization for the abbreviations and titles
Convert every Rob, Robbie, and Bob to Robert. Convert every Av, Avn, etc. to Avenue.
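The mapping described above can be sketched as a simple dictionary lookup. This is a minimal illustration, not a complete tool: the tables `NAME_VARIANTS` and `STREET_VARIANTS` and the function `standardize_tokens` are hypothetical names, and a real project would need much larger lookup tables.

```python
import re

# Hypothetical lookup tables (assumption): variant -> canonical form.
NAME_VARIANTS = {"rob": "Robert", "robbie": "Robert", "bob": "Robert"}
STREET_VARIANTS = {"av": "Avenue", "avn": "Avenue"}


def standardize_tokens(text, variants):
    """Replace whole words found in `variants`, ignoring case.

    A period directly after a replaced abbreviation ("Av.") is dropped.
    Words that are not in the table are left untouched.
    """
    def replace(match):
        canonical = variants.get(match.group(1).lower())
        return canonical if canonical is not None else match.group(0)

    return re.sub(r"\b([A-Za-z]+)\b\.?", replace, text)


print(standardize_tokens("Bob Smith", NAME_VARIANTS))
print(standardize_tokens("12 Park Av.", STREET_VARIANTS))
```

Matching whole words only keeps 'Rob' from being rewritten inside a name that is already 'Robert'.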
Standardizing the patterns
Once possible matches are found with high confidence, one can standardize their pattern in the parsed data. For example, two records that are a possible match but have different patterns can be placed in the same pattern.
Correcting the spellings
Based on n-gram matching and other techniques, every 'Fenmark' can be changed to 'Denmark'.
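One way to picture n-gram based correction is to compare the character bigrams of a word with those of each entry in a reference list and accept the closest entry if it is similar enough. The sketch below uses the Dice coefficient as the similarity measure. That choice, the 0.7 threshold, the function names, and the sample vocabulary are all assumptions made for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word`, padded with spaces."""
    padded = f" {word.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-grams: 2*|A & B| / (|A| + |B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def correct_spelling(word, vocabulary, threshold=0.7):
    """Return the closest vocabulary entry, or `word` if nothing is close enough."""
    if not vocabulary:
        return word
    best = max(vocabulary, key=lambda candidate: dice_similarity(word, candidate))
    return best if dice_similarity(word, best) >= threshold else word


# Hypothetical reference list of valid country names.
COUNTRIES = ["Denmark", "Germany", "Finland"]
print(correct_spelling("Fenmark", COUNTRIES))
```

The threshold matters: set too low, it will 'correct' valid but unlisted values into the wrong entry, so it is worth tuning against known samples.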
Data correction by standardizing all the locations, cities, ZIP codes…
Convert all NY, NewYork, and NY State to 'New York State'.
De-Duplicate the records for data cleansing and correction
Data cleansing starts once one knows the type and extent of the data quality issues. Before coming to data cleansing, it is worthwhile to refer to Reasons for Bad Data Quality, Data Quality Assurance, and Data Mapping & Assessment.
Removing the record
Simply pick the record that is in the best shape and remove the others.
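What counts as the record 'in the best shape' is a judgment call. The sketch below assumes, purely for illustration, that the best record is the one with the most non-empty fields. The function names and sample records are hypothetical.

```python
def completeness(record):
    """Count the fields that hold a real value (not None or empty string)."""
    return sum(1 for value in record.values() if value not in (None, ""))


def keep_best(duplicates):
    """Return the most complete record; ties go to the earliest record."""
    return max(duplicates, key=completeness)


# Hypothetical duplicate records for one customer.
duplicates = [
    {"name": "Robert Smith", "zip": "", "phone": None},
    {"name": "Robert Smith", "zip": "10001", "phone": "555-0100"},
]
print(keep_best(duplicates))
```

A real scoring rule might also weigh how recent or how trusted each source is.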
Record merging
This is done when different records for the same customer each have their own strengths, so one picks the best of all worlds. For example, one record has the right name and address, while the other has the right ZIP, telephone, and fax. We merge the two records so that all the elements are filled in.
Non Name & Address Data
De-duping records is much harder when the data goes beyond names and addresses, such as income group or profession. There are very few automated ways to do this. Mostly it comes down to judgment, or to some level of modeling through clustering or association rules.
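Returning to record merging: a field-by-field merge can be sketched as below. It assumes the records are passed in order of trust and that the first non-empty value for each field wins. That is a simplification, since deciding which value is actually 'right' usually needs extra rules or a human. The function name and sample data are hypothetical.

```python
def merge_records(records):
    """Merge duplicates field by field; the first non-empty value wins."""
    merged = {}
    for record in records:
        for field, value in record.items():
            current_is_empty = merged.get(field) in (None, "")
            if current_is_empty and value not in (None, ""):
                merged[field] = value
            else:
                # Keep the field present even if no record has a value for it.
                merged.setdefault(field, value)
    return merged


# Hypothetical records: one has the name and address, the other the rest.
first = {"name": "Robert Smith", "address": "12 Park Avenue", "zip": "", "phone": None}
second = {"name": "R Smith", "address": "", "zip": "10001",
          "phone": "555-0100", "fax": "555-0101"}
print(merge_records([first, second]))
```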
TIP- Do not over-configure your automated data correction (if you are using an automated tool to manage this). The recommended approach is a 'balanced' configuration, in which you specify the following:
- Conditions in which records should be de-duped or corrected without asking.
- Conditions in which records should be flagged for correction (along with the recommended corrections), but the system should ask for permission before correcting.
- Conditions in which records should not be corrected automatically and should be left for manual correction.
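The three conditions above can be pictured as a routing rule on a match-confidence score. This is only a sketch: the idea of a single score between 0 and 1, the threshold values, and the names `Action` and `route` are assumptions, not features of any particular tool.

```python
from enum import Enum


class Action(Enum):
    AUTO_CORRECT = "correct without asking"
    ASK_FIRST = "propose a correction and wait for approval"
    MANUAL = "leave for manual correction"


def route(match_confidence, auto_threshold=0.95, review_threshold=0.75):
    """Choose how a candidate correction is handled, given its confidence."""
    if not 0.0 <= match_confidence <= 1.0:
        raise ValueError("match_confidence must be between 0 and 1")
    if match_confidence >= auto_threshold:
        return Action.AUTO_CORRECT
    if match_confidence >= review_threshold:
        return Action.ASK_FIRST
    return Action.MANUAL


for score in (0.98, 0.80, 0.40):
    print(score, route(score).value)
```

Keeping the configuration to two thresholds is in the spirit of the tip: it stays simple enough to reason about and to test.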
A typical experience with automated tools is that they look like fairly intelligent systems for basic correction rules. However, when one tries to make complex configurations, these systems are not able to handle them well.
TIP- If you are going for automated correction, test your configurations with some reference samples before you apply them to a larger data set.
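A small harness for this kind of check might look as follows. It runs a correction rule over hand-labelled samples and reports accuracy and failures. The rule `standardize_state`, its lookup table, and the samples are hypothetical stand-ins for whatever configuration is under test.

```python
def evaluate_rules(correct, reference_samples):
    """Return (accuracy, failures) for a correction function.

    `reference_samples` is a list of (raw_value, expected_value) pairs.
    Each failure is a (raw_value, expected_value, actual_value) tuple.
    """
    if not reference_samples:
        raise ValueError("reference_samples must not be empty")
    failures = []
    for raw, expected in reference_samples:
        actual = correct(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    accuracy = 1 - len(failures) / len(reference_samples)
    return accuracy, failures


# Hypothetical rule under test.
STATE_VARIANTS = {"ny": "New York State", "newyork": "New York State",
                  "ny state": "New York State"}


def standardize_state(value):
    return STATE_VARIANTS.get(value.strip().lower(), value)


# Hand-labelled reference samples (assumption).
REFERENCE_SAMPLES = [
    ("NY", "New York State"),
    ("NewYork", "New York State"),
    ("NY State", "New York State"),
    ("N.Y.", "New York State"),
]

accuracy, failures = evaluate_rules(standardize_state, REFERENCE_SAMPLES)
print(accuracy, failures)
```

Here the run exposes a variant ('N.Y.') that the rule does not yet cover, which is the kind of gap worth finding before touching the full data set.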