Knowledge Discovery in Databases Process

Knowledge Discovery in Databases (KDD) comprises four key stages in an iterative flow: Business Case Definition, Data Preparation, Data Mining and Evaluation. Data Mining has no value on a stand-alone basis. Its success depends on how well the problem is defined and on the diligence applied in data preparation.

This page is an extract from the BIDS KDD Methodology authored by Kamlesh Mhashilkar, Head, Execution-MiH Services of Tata Consultancy Services.

The following diagram presents the high-level processes in KDD.

[Figure: High-level KDD process]

KDD begins with Business Case Definition and proceeds through the Data Preparation, Data Mining and Evaluation processes in cyclic order. The processes are, however, highly iterative. Issues or configuration settings encountered in Data Preparation may require revisiting and fine-tuning the Business Case Definition. Findings or non-interpretable results from the Data Mining process may send the work back to Data Preparation or to Business Case Definition. The same applies to the Evaluation process.

The Knowledge Base is simply the database in which the business case model, data, metadata, data preparation rules, data mining algorithms, results and evaluation information are kept. It acts as a common pool of information and knowledge, which facilitates the iterations and improves the quality of the model for better results.

The following summary lists the activities involved in each of the KDD processes.

Business Case Definition      

  • Business Goals, Objectives, Critical Success Factors
  • High level business cases / issues
  • Gap analysis with respect to the current business processes and IT systems
  • Framework for the complete Data Mining process

Data Preparation

  • Data (as well as Metadata) Quality Analysis
  • Data Mining input parameter specification
  • Data selection and preparation

Data Mining         

  • Data Management
  • Data Mining Model Build
  • Output construction in form of Visualization and Interfaces

Evaluation 

  • Utilization of data mining output in business processes
  • Collection of data from the business processes after data mining
  • Assessment / interpretation of Data Mining output

Business Case Definition for KDD

Case Study 1:

Once upon a time there was a king who wanted to conduct mass marriages for the people of his community. More than 50,000 people (assume an equal distribution of male and female) would benefit from this mass marriage scheme. The king entrusted his Minister of Internal Affairs with the job of identifying potential pairs from the population of 50,000 (25,000 pairs).

Case Study 2:

A manufacturing company had enjoyed good sales revenue for a number of years. In one year it observed that its sales were falling quarter on quarter. The Managing Director of the company wanted to know how to improve sales and at least restore the earlier sales trend.

These case studies are typical examples of business problem definition. Business problems are always open-ended. They need to be framed so as to limit the number of variables that must be analyzed to solve them. The Business Case Definition phase concentrates on framing the open-ended business problem, either by choosing a method for resolution or by aligning a set of variables to it.

Case Study 1 can be framed using the ‘Horoscope Matching’ method for identifying compatible pairs. Hence the Business Case can be framed as “Identification of possible matches from a population of 50,000 using Horoscope Matching.”

Case Study 2 can be framed by aligning variables. Are sales low because of reduced production, low availability at stores, or some other factor? Preliminary analysis of the business data may show that production is meeting its targets, but that goods are not reaching the stores on time. Hence the Business Case can be framed as “Analysis of the availability of goods at stores, which is causing the drop in sales.”

Data Preparation as a prerequisite for Data Mining

Data preparation is the key to the KDD process. A prerequisite for any data mining algorithm is a set of selected, cleansed and transformed data. It is estimated that more than 50% of the work in KDD goes into getting the data to the point where data mining tools can actually start running. The business must understand what preparation is necessary for a data mining analysis.

The following figure displays the typical activities in Data Preparation.

[Figure: Activities in Data Preparation]

Metadata is the most important component of a Business Intelligence system. KDD also relies heavily on the availability and quality (consistency, accuracy and sufficiency) of metadata. To ensure that appropriate data elements are used from each source, the source system metadata needs to be analyzed from a quality perspective. The business rules for integration and transformation also need to be captured as process metadata to facilitate Data Preparation. With the help of the metadata, “all of the possibly required” input data elements are selected from the various source systems.

After the data elements have been selected from the various sources, the quality of the data must be analyzed. For this, data samples are collected from the sources and data profiling is performed to understand the Physical Data Quality issues. The Logical and Unmanaged / Unstructured Data Quality issues need to be captured from the business users. The outcome of the Data Quality Analysis primarily aids in building the repository of integration, transformation and cleansing rules (i.e. the Process Metadata).

Input Data Interfaces then need to be developed using the selected data elements and the process metadata. These interfaces are used to fetch data from the source systems in the required format. The data acquired from the sources serves three prime purposes during the Data Mining process: training the data mining model, testing the finalized model and applying the model to the complete data.
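As an illustration of the data profiling step, the following is a minimal sketch in Python, assuming the sampled source data is available as a list of records. The function names (`profile_column`, `profile_rows`), the column names and the sample values are all hypothetical and are not part of the BIDS methodology. The sketch only counts missing and distinct values per column, which is one simple way to surface physical data quality issues.

```python
from collections import Counter


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def profile_column(values):
    """Summarize one column of sampled source data."""
    total = len(values)
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    missing = total - len(present)
    return {
        "rows": total,
        "missing": missing,
        "missing_ratio": missing / total if total else 0.0,
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


def profile_rows(rows):
    """Profile every column found in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample records, invented purely for illustration.
sample = [
    {"store_id": "S1", "units_sold": 10},
    {"store_id": "S2", "units_sold": None},
    {"store_id": "", "units_sold": 10},
    {"store_id": "S1", "units_sold": 7},
]
report = profile_rows(sample)
```

Columns with a high `missing_ratio` or an unexpected number of distinct values are candidates for a cleansing or transformation rule in the process metadata.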

Data Mining

Data Mining is the heart of the KDD process. It aims at the “identification of valid, novel, potentially useful and interpretable patterns and relationships in data”.

  • Novel: Not yet known (to KDD system)
  • Potentially Useful: Should lead to potentially useful actions (improved revenue, lower costs, increased profit, improved business processes etc.)
  • Pattern: Facts about trends in data sets
  • Relation: Expression describing dependencies between data and/or patterns
  • Interpretable: Provide knowledge that is understandable to users, or that leads to a better understanding of the data set.

For Case Study 1, it was decided that ‘Horoscope Matching’ is the criterion for identifying compatible pairs. What is the complexity of the problem? Can Data Mining help the minister save his life? Data Mining raises many issues. The following are a few that are apparent in Case Study 1.

  • What is the process / algorithm the minister will use to identify 25,000 compatible pairs from the population?
  • What is the criterion for finalizing a match? Is it the highest degree of match among the available candidates, or the first suitable match that satisfies the minimum points required?
  • What effort is required from the minister and his team to solve this problem?
  • What is the guarantee that 25,000 compatible pairs exist in a population of 50,000? If they do not exist, how will the minister prove that they indeed do not exist?

Definitely, Data Mining can manage this business issue. One solution is negative association (a process of elimination), in which candidates that cannot be associated with a new candidate (because the alignment factor is well below the threshold) are discarded. But how much time will it take to process the horoscopes of these candidates and find the matches? If it takes too long, other business issues may surface, e.g. finding the next suitable horoscopic time (i.e. Muhurat) for the pairs to get married. Hence it is also important to address and solve the business issue as soon as possible, before the business scenario changes. This is possible through better business understanding and the selection of a data mining technique and algorithm appropriate to the situation.
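To give a feel for the complexity, 25,000 candidates on each side means up to 25,000 × 25,000 = 625,000,000 candidate pairs to score. The sketch below is one possible way to approach the questions raised above; it is not the method of the source methodology. It first applies the elimination step (dropping pairs below a threshold) and then uses a standard graph technique, maximum bipartite matching with augmenting paths (Kuhn's algorithm), to find the largest set of compatible pairs. If the maximum matching is smaller than the number of candidates on one side, that shows a complete pairing does not exist at that threshold. The score table, the threshold value and the function names are assumptions made for illustration.

```python
def eligible_pairs(scores, threshold):
    """Elimination step: keep only pairs scoring at or above the threshold."""
    return [[j for j, s in enumerate(row) if s >= threshold] for row in scores]


def max_matching(compatible, n_right):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).

    compatible[i] lists the right-side candidates acceptable to left-side
    candidate i. Returns (size, match_of_right), where match_of_right[j]
    is the left candidate paired with j, or -1 if j is unpaired.
    """
    match_of_right = [-1] * n_right

    def try_assign(i, visited):
        for j in compatible[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can move elsewhere.
            if match_of_right[j] == -1 or try_assign(match_of_right[j], visited):
                match_of_right[j] = i
                return True
        return False

    size = sum(1 for i in range(len(compatible)) if try_assign(i, set()))
    return size, match_of_right


# Hypothetical compatibility scores (rows: one group, columns: the other).
scores = [
    [30, 12, 25],
    [28, 10, 11],
    [27, 9, 8],
]
THRESHOLD = 18  # assumed minimum score for an acceptable match
compatible = eligible_pairs(scores, THRESHOLD)
size, match_of_right = max_matching(compatible, n_right=3)
all_paired = size == len(scores)
```

In this toy table two candidates are compatible with only the same single partner, so no complete pairing exists at the assumed threshold. The recursive form is fine for a small example; a population of 25,000 per side would call for an iterative implementation or a faster algorithm such as Hopcroft-Karp.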

Data Mining primarily involves the following steps.

  • Management of the input data in the form of sets for model training, model testing and model deployment, as explained below.
    • Training the Data Mining Model: During the development of the Data Mining Model for the business case, it is necessary to have a set of data that is different from the final test data set. This data set is used to iteratively fine-tune and test the model components by changing the various parameters used in the model. The training data set might be further broken into multiple data sets to cater to the iterations required in the modeling.
    • Testing the finalized model.
    • Applying the model on complete data to get the data mining results.
  • Building the Data Mining Model as per the mode required i.e. Discovery Mode or Verification Mode.
  • Constructing interfaces and / or front end applications for visualization, which aid in interpreting the Data Mining results.
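The data management step in the list above can be sketched as follows. This is a minimal illustration in Python, assuming the prepared input is a simple list of records; the function names, the 70/30 split and the number of folds are hypothetical choices, not values prescribed by the methodology.

```python
import random


def split_dataset(records, train_percent=70, seed=42):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


def make_folds(train_set, k=3):
    """Break the training set into k parts for iterative fine-tuning."""
    return [train_set[i::k] for i in range(k)]


records = list(range(100))  # stand-in for the prepared input records
train_set, test_set = split_dataset(records)
folds = make_folds(train_set, k=3)
```

The test set is held back until the model is finalized, while the folds support repeated tuning runs on the training data. The complete data set is used only once the tested model is applied.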

Evaluation

The Data Mining results, and hence the model, must be evaluated before the model is deployed in the business process. Evaluation also builds acceptance of the model and its results in the business community.

This process involves applying the Data Mining Model to the business case using the complete data sets. The results of the model then need to be analyzed, either against existing business case information or against survey / market study information gathered after the model has been deployed into the business process (based on the business case). For example, the association between Beer and Diaper products was highlighted by a Data Mining Model and was later interpreted using a market study, which yielded a surprising explanation for it.

The methods and criteria for model assessment, interpretation and evaluation also depend on the type of model used in Data Mining, e.g. a coincidence matrix for classification models and a mean error rate for regression models.

The time taken to evaluate a model depends on the Business Case, the complexity of the model / algorithm and the method used for interpretation.
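An association such as the one between Beer and Diapers is commonly assessed with three measures: support, confidence and lift. The following minimal sketch computes them for a single rule. The baskets are invented for illustration and do not come from any real study; the function name `rule_metrics` is likewise hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    if n == 0:
        return 0.0, 0.0, 0.0
    with_a = sum(1 for t in transactions if antecedent in t)
    with_c = sum(1 for t in transactions if consequent in t)
    with_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = with_both / n                                # share of baskets with both items
    confidence = with_both / with_a if with_a else 0.0     # P(consequent | antecedent)
    lift = confidence / (with_c / n) if with_c else 0.0    # > 1 suggests a positive association
    return support, confidence, lift


# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]
support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
```

A lift above 1 only indicates that the two items occur together more often than chance would suggest. Explaining why still requires the kind of business interpretation described above.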
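The two assessment methods mentioned above can be sketched as follows. A coincidence matrix (also called a confusion matrix) counts how often each actual class was predicted as each class. For regression, the sketch uses mean absolute error, which is one common error measure and is an assumption here, since the source does not say which error measure it intends. All labels and figures are hypothetical.

```python
from collections import Counter


def coincidence_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; pairs with equal labels are correct."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of records whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical classification output.
actual_labels = ["match", "match", "no_match", "no_match", "match"]
predicted_labels = ["match", "no_match", "no_match", "match", "match"]
matrix = coincidence_matrix(actual_labels, predicted_labels)
share_correct = accuracy(actual_labels, predicted_labels)

# Hypothetical regression output (e.g. quarterly sales).
actual_sales = [100, 120, 90]
predicted_sales = [110, 115, 90]
mae = mean_absolute_error(actual_sales, predicted_sales)
```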

Note: BIDS Solutions encompass the proprietary solutions from TCS covering the Business Intelligence and Data Warehousing landscape.

 
