Data Mining Architecture
The technological objective of the KDD process is to design an architecture for Data Mining. Beyond the architecture itself, the design is also intended to address process-related issues. It is assumed that implementing Data Mining technology is a processing-, memory- and data-intensive task, rather than one that requires continuous interaction with the database.
It is also assumed that Data Preparation (Data Extraction, Transformation, Cleansing and Loading) is outside the scope of the Data Mining architecture. To preserve the accuracy of the data mining results, Data Preparation must be completed before the Data Mining process begins, as explained in the earlier topic.
The following diagram depicts a generic 3-tier architecture for Data Mining.
The first tier is the Database tier, where data and metadata are prepared and stored. The second tier is the Data Mining Application, where the algorithms process the data and store the results in the database. The third tier is the Front End, which provides the parameter settings for the Data Mining Application and the visualization of the results in an interpretable form.
The Database tier need not be hosted on an RDBMS. It can be a mixture of an RDBMS and a file system, or a file system only. For example, the data from source systems may be staged on a file system and then loaded into an RDBMS. The Database tier consists of various layers. The data in these layers interfaces with multiple systems, depending on the activities in which it participates. The following diagram represents the layers in the Database tier.
The Metadata layer is the common and most frequently used layer. It contains information about sources, transformation and cleansing rules, and the Data Mining Results. It forms the backbone for the data in the entire Data Mining architecture.
The Data layer comprises the Staging Area, the Prepared / Processed Data and the Data Mining Results.
The Staging Area temporarily holds the data sourced from various source systems. The data can be held in any form, e.g. flat files or tables in an RDBMS. During the Data Preparation process, this data is transformed, cleansed, consolidated and loaded into a structured schema. The prepared data is used as Input Data for Data Mining. The base data may undergo summarization or derivation, depending on the business case, before it is presented to the Data Mining Application.
The Data Mining output can be captured in the Data Mining Results layer so that it is available to users for visualization and analysis.
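The layers described above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table names (`metadata`, `staging_sales`, `prepared_sales`, `mining_results`), the cleansing rules and the threshold-based "mining" step are all hypothetical and stand in for whatever a real deployment would use.

```python
import sqlite3


def build_database_tier() -> sqlite3.Connection:
    """Create one hypothetical table per layer of the Database tier."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE metadata (layer TEXT, note TEXT);
        CREATE TABLE staging_sales (customer TEXT, amount TEXT);
        CREATE TABLE prepared_sales (customer TEXT, total_amount REAL);
        CREATE TABLE mining_results (customer TEXT, segment TEXT);
        """
    )
    return conn


def load_staging(conn, rows):
    """Staging Area: hold the raw source rows as they arrive (all text)."""
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", rows)
    conn.execute(
        "INSERT INTO metadata VALUES ('staging', ?)",
        (f"{len(rows)} raw rows loaded",),
    )


def prepare(conn):
    """Data Preparation: cleanse, consolidate and load into prepared_sales."""
    totals = {}
    query = "SELECT customer, amount FROM staging_sales"
    for customer, amount in conn.execute(query).fetchall():
        try:
            value = float(amount)
        except (TypeError, ValueError):
            continue  # assumed cleansing rule: drop non-numeric amounts
        key = customer.strip().upper()  # assumed rule: normalize names
        totals[key] = totals.get(key, 0.0) + value
    conn.executemany(
        "INSERT INTO prepared_sales VALUES (?, ?)", sorted(totals.items())
    )
    conn.execute(
        "INSERT INTO metadata VALUES ('prepared', ?)",
        ("names trimmed and upper-cased; amounts summed per customer",),
    )


def store_results(conn, threshold=100.0):
    """Stand-in for a mining task: label customers and persist the labels."""
    rows = conn.execute(
        "SELECT customer, total_amount FROM prepared_sales"
    ).fetchall()
    results = [(c, "high" if t >= threshold else "low") for c, t in rows]
    conn.executemany("INSERT INTO mining_results VALUES (?, ?)", results)
```

Calling `load_staging`, `prepare` and `store_results` in that order mirrors the flow from Staging Area to Prepared Data to Data Mining Results, with the `metadata` table recording what happened at each stage.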
Data Mining Application
The Data Mining Application has two primary components, as shown in the figure.
- Data Manager
- Data Mining Tools / Algorithms
The Data Manager, as the name suggests, manages the data in the Database tier and controls the data flow for data mining purposes. It provides the following functionality.
- Manage Data Sets: The input data must be partitioned for the Model Building, Final Testing and Deployment tasks. The Data Manager helps divide the data into multiple sets so that they can be used during the various stages of the Data Mining task. The same applies to the results of the Data Mining task, which might be used for further processing.
- Input Data Flow: The data needs to be extracted from the database in the format required by the Data Mining task. The data flow must also be controlled according to the task's requirements, i.e. row by row or bulk load. The Data Mining task may also require data in a specific format (such as itemized data for Associations). A few transformation routines will be necessary to convert the data from the Database tier into the required format. Alternatively, the data can be transformed in the database itself.
- Output Data Flow: The results generated by the Data Mining task need to be managed and delivered to target systems (the Front End or other systems such as CRM) in the required data format and according to the data flow specifications.
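The Data Manager functions listed above can be illustrated with three small helpers. This is a minimal sketch, not a prescribed design: the function names, the 60/20/20 split fractions and the `(transaction_id, item)` row layout are assumptions made for the example.

```python
import random
from itertools import islice


def split_data_sets(rows, build_fraction=0.6, test_fraction=0.2, seed=0):
    """Manage Data Sets: divide rows into build, test and deploy sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_build = round(len(shuffled) * build_fraction)
    n_test = round(len(shuffled) * test_fraction)
    return {
        "build": shuffled[:n_build],
        "test": shuffled[n_build:n_build + n_test],
        "deploy": shuffled[n_build + n_test:],
    }


def to_itemized(rows):
    """Input Data Flow: turn (transaction_id, item) rows into item sets,
    the itemized shape an association task typically expects."""
    baskets = {}
    for transaction_id, item in rows:
        baskets.setdefault(transaction_id, set()).add(item)
    return baskets


def iter_batches(rows, batch_size):
    """Control the flow: batch_size=1 feeds row by row, larger values
    approximate a bulk load."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, `to_itemized([(1, "bread"), (1, "milk"), (2, "bread")])` groups the rows into one set of items per transaction, and `iter_batches(rows, 1)` hands the same rows to a task one at a time.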
The Data Manager needs to be portable across both the databases from which data is extracted and the Data Mining tools that consume it.
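One way to approach this portability is to hide each storage back end behind a common interface, so that the mining side never depends on a particular database. The sketch below assumes a hypothetical `DataSource` protocol with a single `read_rows` method; the class and function names are illustrative only.

```python
import csv
import io
import sqlite3
from typing import Iterator, Protocol


class DataSource(Protocol):
    """Hypothetical interface that every back end must satisfy."""

    def read_rows(self) -> Iterator[tuple]: ...


class SqliteSource:
    """Rows come from an RDBMS query."""

    def __init__(self, conn: sqlite3.Connection, query: str):
        self.conn = conn
        self.query = query

    def read_rows(self) -> Iterator[tuple]:
        yield from self.conn.execute(self.query)


class CsvSource:
    """Rows come from delimited text, e.g. a staged flat file."""

    def __init__(self, text: str):
        self.text = text

    def read_rows(self) -> Iterator[tuple]:
        for record in csv.reader(io.StringIO(self.text)):
            yield tuple(record)


def distinct_first_column(source: DataSource) -> list:
    """A consumer that works with any DataSource, whatever the back end."""
    return sorted({str(row[0]) for row in source.read_rows()})
```

Note that the CSV source yields strings while the SQLite source yields typed values, so a real Data Manager would also need to normalize types before handing rows to a tool.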
Data Mining Tools / Algorithms
This is the heart of the complete architecture. The Data Mining Tool contains different tasks. The prime function of each task is to analyze the data and generate results. Various techniques / algorithms can be used, depending on the business case. These are described in the discussion of data mining techniques.
Numerous tools are available in the market, e.g. SAS, SPSS, Teradata Miner and IBM Intelligent Miner. These tools merely facilitate the application of algorithms to the input data. The most important task, which is always aligned to the specific business case, is setting the parameters for the algorithms and
The Front End is the user interface layer. It has the following prime functions.
- Input Parameter Settings
- Data Mining Results / Visualization
Administration screens for the ETL and Data Mining tasks are usually provided as part of the products / tools. They are used to administer the following primary tasks:
- Data flow processes (e.g. Extracts, Loads)
- Data Mining routines
- Error reporting and correction
- User security settings
Input Parameter Settings
Iterations are inevitable while building the Data Mining Model. They are needed to fine-tune the model by changing the various parameters involved. To execute a Data Mining task, the user provides the relevant input parameters, observes their effect on the results and, if needed, changes the parameters based on an interpretation and understanding of those results. The Front End provides this facility.
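The iterate-and-observe cycle can be illustrated with a toy task. The sketch below uses a simple frequent-item count with a single `min_support` parameter as a stand-in for a real mining algorithm; the function names and the sample baskets are assumptions made for the example.

```python
from collections import Counter


def frequent_items(baskets, min_support):
    """Toy mining task: keep items that appear in at least min_support
    (a fraction between 0 and 1) of the baskets."""
    counts = Counter(item for basket in baskets for item in set(basket))
    total = len(baskets)
    return {
        item: count / total
        for item, count in counts.items()
        if count / total >= min_support
    }


def try_parameters(baskets, candidate_supports):
    """Run the task once per candidate value so the results can be
    compared side by side before choosing a setting."""
    return {
        support: sorted(frequent_items(baskets, support))
        for support in candidate_supports
    }


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
for support, items in try_parameters(baskets, [0.5, 0.75, 1.0]).items():
    print(f"min_support={support}: {items}")
```

With these baskets, lowering `min_support` admits more items and raising it to 1.0 leaves none, which is the kind of effect a user would inspect before settling on a value.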
Data Mining Results
The results of the data mining task need to be formatted, converted into a form the user can understand and reported to the user. The Front End caters to the predefined formats of the output files generated by the respective Data Mining technique. The user has the flexibility to view and analyze the results of Data Mining. A reporting utility displays the reports, charts and smart reports (e.g. Clusters, Trees and Networks).
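As a minimal illustration of converting raw output into a readable report, the sketch below renders cluster assignments as a text summary with a simple bar per cluster. The input shape, a list of `(record, cluster)` pairs, and the function name are assumptions; a real Front End would read the output format of the specific tool in use.

```python
def format_cluster_report(assignments):
    """Group (record, cluster) pairs by cluster and render one line per
    cluster: label, member count, share of all records, bar, members."""
    clusters = {}
    for record, cluster in assignments:
        clusters.setdefault(cluster, []).append(record)

    total = len(assignments)
    lines = []
    for cluster in sorted(clusters):
        members = sorted(clusters[cluster])
        share = 100 * len(members) / total
        bar = "#" * len(members)  # one mark per member as a crude chart
        lines.append(
            f"{cluster:<8} {len(members):>3} ({share:5.1f}%) {bar}  "
            f"{', '.join(members)}"
        )
    return "\n".join(lines)


print(format_cluster_report(
    [("ALICE", "high"), ("BOB", "low"), ("CAROL", "high")]
))
```

The same grouping step could feed a charting library instead of text; the text form is used here only to keep the example dependency-free.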