This page is an extract from BIDS KDD Methodology authored by Kamlesh Mhashilkar-Head, Execution-MiH Services of Tata Consultancy Services
Data mining applications rely on databases to supply the raw data for input. Problems in those databases and their data (for example volatility, incompleteness, noise, and volume) are compounded by the time the data reaches the data mining task. Other problems arise from the adequacy and relevance of the information stored.
Limited Information
A database is often designed for purposes other than data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data causes problems: if attributes essential to knowledge about the application domain are missing from the data, it may be impossible to discover significant knowledge about that domain. For example, one cannot diagnose malaria from a patient database if that database does not contain the patient’s red blood cell count.
Noise and missing values
Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes that rely on subjective judgments or on measurement can give rise to errors, so that some examples may even be misclassified. Errors in either the values of attributes or the class information are known as noise. Where possible, it is desirable to eliminate noise from the classification information, as it affects the overall accuracy of the generated rules.
Missing data can be treated by discovery systems in a number of ways, such as:
• simply disregard missing values
• omit the corresponding records
• infer missing values from known values
• treat missing data as a special value to be included additionally in the attribute domain
• average over the missing values using Bayesian techniques.
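Three of the strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular discovery system: the toy records, the field names, the helper functions, and the choice of the mean as the inference rule are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def omit_incomplete(rows):
    """Omit the records that contain any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, field):
    """Infer a missing value from known values (here, their mean)."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def mark_missing(rows, field, marker="MISSING"):
    """Treat missing data as a special value added to the attribute domain."""
    return [{**r, field: marker if r[field] is None else r[field]} for r in rows]

complete = omit_incomplete(records)      # keeps only fully populated records
imputed = impute_mean(records, "age")    # fills the missing age with the mean
marked = mark_missing(records, "income") # replaces the missing income with a marker
```

Each helper returns a new list and leaves the input untouched, so the strategies can be compared side by side on the same data. Omitting records is the simplest option but discards information, while inferring values keeps every record at the cost of introducing estimated data.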
Noisy data, in the sense of being imprecise, is characteristic of all data collection and typically fits a regular statistical distribution such as the Gaussian, whereas wrong values are data entry errors. Statistical methods can treat problems of noisy data and separate different types of noise.
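As one possible illustration of separating ordinary imprecision from wrong values, the sketch below flags values that lie far from the median, measured in units of the median absolute deviation (a modified z-score). The technique, the function name, the threshold of 3.5, and the sample readings are assumptions chosen for this example; the source does not prescribe a specific statistical method.

```python
from statistics import median

def flag_suspect_values(values, threshold=3.5):
    """Return one boolean per value: True if it looks like a wrong entry.

    Uses the modified z-score 0.6745 * (x - median) / MAD, which is less
    distorted by extreme values than a mean and standard deviation would be.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Hypothetical readings: small measurement imprecision plus one value
# that looks like a misplaced decimal point.
readings = [36.6, 36.8, 36.5, 37.0, 36.7, 366.0, 36.9]
flags = flag_suspect_values(readings)
suspects = [v for v, bad in zip(readings, flags) if bad]
```

The small variations between readings stay below the threshold and are treated as ordinary noise, while the entry that is an order of magnitude larger is flagged for review.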
Uncertainty
Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.
Size, updates, and irrelevant fields
Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified or removed. The problem with this from the data mining perspective is how to ensure that the rules are up to date and consistent with the most current information. The learning system also has to be time-sensitive, as some data values vary over time and the discovery system is affected by the 'timeliness' of the data.
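One simple way to respect the timeliness of data is to restrict the mining input to records updated within a recent window. The sketch below assumes that each record carries a last-updated date; the field name `updated`, the function name, and the 60-day window are hypothetical choices for illustration only.

```python
from datetime import date, timedelta

def recent_records(rows, as_of, max_age_days):
    """Keep only records updated within max_age_days before as_of."""
    cutoff = as_of - timedelta(days=max_age_days)
    return [r for r in rows if r["updated"] >= cutoff]

# Hypothetical records with a last-updated date.
rows = [
    {"id": 1, "updated": date(2023, 1, 10)},
    {"id": 2, "updated": date(2023, 6, 1)},
    {"id": 3, "updated": date(2023, 6, 20)},
]

fresh = recent_records(rows, as_of=date(2023, 6, 30), max_age_days=60)
fresh_ids = [r["id"] for r in fresh]
```

Passing the reference date explicitly, rather than reading the system clock inside the function, keeps the filter reproducible when rules are regenerated later.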
Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery. For example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.
Note: BIDS Solutions encompass the proprietary solutions from TCS covering the Business Intelligence and Data Warehousing landscape.