Building Making It Happen
Establishing Making-it-Happen as ‘Formal & Measurable’ Business Discipline

KDD- Data Mining Issues & Challenges

The key issues in KDD-Data Mining are limited information, noisy and missing data, uncertainty, and large, fast-changing databases.

This page is an extract from the BIDS KDD Methodology authored by Kamlesh Mhashilkar, Head, Execution-MiH Services, Tata Consultancy Services.

Data Mining applications rely on databases to supply the raw data for input. Problems in the databases and their data (e.g. volatility, incompleteness, noise, and volume) are compounded by the time the data reaches the Data Mining task. Other problems arise from the adequacy and relevance of the information stored.

Limited Information

A database is often designed for purposes other than data mining, and sometimes the properties or attributes that would simplify the learning task are not present and cannot be requested from the real world. Inconclusive data causes problems: if attributes essential to knowledge about the application domain are missing from the data, it may be impossible to discover significant knowledge about that domain. For example, malaria cannot be diagnosed from a patient database if that database does not contain the patient’s red blood cell count.

Noise and missing values

Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes that rely on subjective judgments or on measurement can give rise to errors, to the point that some examples may even be misclassified. Errors in either the attribute values or the class information are known as noise. Where possible, it is desirable to eliminate noise from the classification information, because it affects the overall accuracy of the generated rules.

Missing data can be treated by discovery systems in a number of ways, such as:

• simply disregard missing values
• omit the corresponding records
• infer missing values from known values
• treat missing data as a special value to be included additionally in the attribute domain
• average over the missing values using Bayesian techniques
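
Three of the treatments listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the use of None as the missing-value marker, and the choice of the mean as the inferred value are all hypothetical and do not come from the BIDS methodology.

```python
from statistics import mean

MISSING = None  # assumed marker for a missing value

# Hypothetical records, for illustration only.
records = [
    {"age": 34, "income": 52000},
    {"age": MISSING, "income": 61000},
    {"age": 45, "income": MISSING},
    {"age": 29, "income": 48000},
]

def omit_incomplete(rows):
    """Omit every record that has at least one missing value."""
    return [r for r in rows if all(v is not MISSING for v in r.values())]

def impute_mean(rows, field):
    """Infer missing values of `field` from the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not MISSING]
    fill = mean(known)
    return [{**r, field: fill if r[field] is MISSING else r[field]} for r in rows]

def as_special_value(rows, field, label="UNKNOWN"):
    """Treat missing data as an extra value in the attribute's domain."""
    return [{**r, field: label if r[field] is MISSING else r[field]} for r in rows]

complete = omit_incomplete(records)
age_filled = impute_mean(records, "age")
income_labelled = as_special_value(records, "income")
```

Omitting records is the simplest option but discards the known values in those records; inferring a value keeps the record at the cost of introducing an estimate that later steps may mistake for an observation.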

Noisy data, in the sense of imprecise data, is characteristic of all data collection, and the imprecision typically fits a regular statistical distribution such as the Gaussian. Wrong values, by contrast, are data entry errors. Statistical methods can treat the problems of noisy data and separate the different types of noise.
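
One possible way to separate the two kinds of noise is to flag values that lie far outside the spread of the rest, using the median and the median absolute deviation so that the wrong values do not distort the estimate of the spread. The sketch below is an assumption of this edit, not a method named in the source; the readings, the function name, the 0.6745 scaling constant and the 3.5 cut-off are illustrative conventions.

```python
from statistics import median

def flag_entry_errors(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    Small, roughly Gaussian measurement noise stays below the threshold;
    gross errors such as a misplaced decimal point tend to exceed it.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to compare against
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

# Hypothetical body-temperature readings; 366.0 looks like a typing error.
readings = [36.6, 36.8, 36.5, 36.7, 366.0, 36.9, 36.6]
suspect = flag_entry_errors(readings)
```

Flagged values are only candidates for review: a genuine but unusual observation would be flagged in the same way as a typing mistake.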

Uncertainty

Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.

Size, updates, and irrelevant fields

Databases tend to be large and dynamic: their contents change constantly as information is added, modified or removed. From the data mining perspective, the problem is how to ensure that the rules are up to date and consistent with the most current information. The learning system also has to be time-sensitive, because some data values vary over time and the discovery system is affected by the 'timeliness' of the data.
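
One simple way to keep a rule consistent with current data is to evaluate it only over the most recent records. The following is a minimal sketch of that idea, assuming a fixed-size sliding window; the class name, the window size and the example rule are hypothetical, and real discovery systems may use quite different update strategies.

```python
from collections import deque

class WindowedRule:
    """Confidence of a rule 'antecedent -> consequent' over recent records."""

    def __init__(self, antecedent, consequent, window=1000):
        self.antecedent = antecedent          # predicate on a record
        self.consequent = consequent          # predicate on a record
        self.recent = deque(maxlen=window)    # oldest records drop out

    def observe(self, record):
        """Add a new record; the oldest one is discarded when the window is full."""
        self.recent.append(record)

    def confidence(self):
        """Share of recent records matching the antecedent that also match the consequent."""
        matched = [r for r in self.recent if self.antecedent(r)]
        if not matched:
            return None
        return sum(1 for r in matched if self.consequent(r)) / len(matched)

# Hypothetical rule: customers in the north region buy the product.
rule = WindowedRule(lambda r: r["region"] == "north",
                    lambda r: r["bought"], window=4)

for _ in range(4):
    rule.observe({"region": "north", "bought": True})
before = rule.confidence()

for _ in range(2):
    rule.observe({"region": "north", "bought": False})
after = rule.confidence()
```

In this example the confidence falls from 1.0 to 0.5 as newer records displace older ones, which is the kind of drift a time-sensitive learning system needs to notice.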

Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery. For example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.
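
Field relevance can be estimated from the data itself. The sketch below uses information gain, a standard measure from decision-tree learning, to compare how much each field tells us about a target attribute. The choice of measure, the sample records and the field names are assumptions made for illustration and are not taken from the source methodology.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, field, target):
    """Reduction in entropy of `target` after splitting the rows on `field`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical sales records: purchases depend on post code, not shoe size.
sales = [
    {"post_code": "AB1", "shoe_size": 8, "bought": "yes"},
    {"post_code": "AB1", "shoe_size": 9, "bought": "yes"},
    {"post_code": "CD2", "shoe_size": 8, "bought": "no"},
    {"post_code": "CD2", "shoe_size": 9, "bought": "no"},
]

gains = {f: information_gain(sales, f, "bought")
         for f in ("post_code", "shoe_size")}
```

Here the post code field receives a gain of one bit and the shoe size field a gain of zero, matching the intuition that only the former is relevant. Information gain tends to favour fields with many distinct values, so near-unique identifiers should be excluded before ranking.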

Note: BIDS Solutions encompass the proprietary solutions from TCS covering the Business Intelligence and Data Warehousing landscape.

 
