This page is an extract from BIDS KDD Methodology, authored by Kamlesh Mhashilkar, Head, Execution-MiH Services, Tata Consultancy Services.
The KDD process is aligned to the BIDS™ Methodology, ensuring a robust framework for delivering KDD programs. This chapter details the tasks to be executed during the various modules of the BIDS™ methodology for a KDD program.
Definition Phase
This phase focuses on analyzing the readiness of the organization and developing the data mining application framework. Defining the framework involves analyzing the current state of the business and its decision support processes and developing a blueprint for the data mining application.
Strategic Vision
During the definition of the data mining application, identify and define the strategic vision of the organization. This will form the high-level goal of the organization. In the process of defining the strategic vision, identify the organization's current decision support processes and their benefits. Identify the organization's awareness of data mining and its process. Define how data mining will fulfill the strategic vision of the organization.
Business Cases/Problems
After defining the strategic vision of the organization, capture the ‘AS IS’ Business and IT processes (including the organizational structures, the business and IT architecture, operational and analytical processes, decision support programs and interfaces to the external environment). The business analysts involved in this process will also analyze the state of the organization.
The ‘TO BE’ processes will be based on the Company Vision and will be arrived at by gathering and consolidating the high-level business cases/problems across the organizational functions, units and departments.
The ‘AS IS’ processes will then be compared to the ‘TO BE’ processes. The comparison will help in defining the business cases/problems with respect to the processes, architectures and structures currently in place and the high-level requirements identified for this initiative.
Examples of business cases/problems for data mining include:
- Increasing business unit and overall profitability
- Understanding customer desires and needs
- Identifying profitable customers and acquiring new ones
- Retaining customers and increasing loyalty
- Increasing ROI and reducing costs on promotions
- Cross-selling and up-selling
- Detecting fraud, waste and abuse
- Determining credit risks
- Increasing Web site profitability
- Increasing store traffic and optimizing layouts for increased sales
- Monitoring business performance
Tools and Technologies
Technology and tools evaluation and recommendation will also be taken up at this stage. A functional Proof of Concept (POC) development may optionally be performed to validate the technology recommendation.
A Data Mining roadmap will be formulated. It will identify opportunities to provide results as a series of deliverables achieved within 90-day cycles and will prioritize the subsequent phases of the overall initiative. The roadmap will clearly state the high-level work plan for the complete data mining project cycle.
Specifications
The Specifications phase focuses on capturing detailed requirements for the data mining application for each business case. This phase expects that the framework for the Data Mining Application has been clearly defined, including:
- High-level business cases/problems
- Gap analysis with respect to the current business processes and IT systems
- Framework for the complete Data Mining Application processes
The typical steps during the Specifications phase are shown in the following flow chart.
Requirements Analysis
The functional and business requirements will be captured for each business case/problem identified during the Framework Definition phase. A Business Requirements Specifications (BRS) document will be prepared for each business case, and the BRS will be categorized by business case/problem.
- Visualization: The results of the data mining task need to be formatted, converted to a user-understandable form and reported to the user. The visualization specifications defined in this phase must provide for user-defined formats of the output files generated by the respective data mining techniques. The user requirement specifications will capture the flexibility needed to view and analyze the results of data mining, including displaying reports, charts and smart reports.
- Exception Handling: Any error that occurs in the data mining process will be reported.
Data Mining Techniques
Based on the business case/problem, suitable data mining techniques will be identified. The data mining tasks include the following:
- Associations
- Segmentation (clustering)
- Classification
- Rule discovery
- Regression
- Deviation Detection
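As an illustration of one of the tasks listed above, the following is a minimal sketch of deviation detection using z-scores. The function name, the threshold of 2.0 and the sample amounts are assumptions made for this example only; the methodology does not prescribe a particular algorithm.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction amounts; the last one is unusually large.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 500]
outliers = find_deviations(amounts)
print(outliers)
```

A z-score test is only one simple option and is sensitive to the very outliers it looks for; a production system would choose the technique per business case.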
Data Analysis
- Data Quality: Data quality is critical for the data mining application, which depends on clean, well-maintained data. Hence, in the absence of a data warehouse, some pre-processing of data will be needed before deploying data mining. Data quality affects the data preparation required for the data mining application. If the data is not extracted from a data warehouse, a data quality assessment will be conducted to identify the data quality issues.
- Metadata Assessment: Metadata is important for data mining. Data preparation for the data mining tasks will be carried out based on the metadata information. A metadata assessment will be conducted to measure the completeness of the metadata information. If the metadata information is not available, a data dictionary will be prepared with high-level metadata information.
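The two assessments above could be sketched roughly as follows. This is an illustrative example only: the measures chosen (share of missing values per field, share of fields with a description), the function names and the sample records are all assumptions, not part of the BIDS methodology.

```python
def missing_ratio(records, fields):
    """Fraction of records in which each field is missing or empty.

    Assumes `records` is a non-empty list of dictionaries.
    """
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) in (None, "")) / total
        for f in fields
    }

def metadata_completeness(fields, data_dictionary):
    """Share of fields that have a non-empty description."""
    described = [f for f in fields if data_dictionary.get(f)]
    return len(described) / len(fields)

# Hypothetical source records and data dictionary.
records = [
    {"customer_id": 1, "age": 34, "city": "Pune"},
    {"customer_id": 2, "age": None, "city": "Mumbai"},
    {"customer_id": 3, "age": 51, "city": ""},
    {"customer_id": 4, "age": None, "city": "Delhi"},
]
fields = ["customer_id", "age", "city"]
data_dictionary = {"customer_id": "Unique customer key", "age": "Age in years"}

quality = missing_ratio(records, fields)
completeness = metadata_completeness(fields, data_dictionary)
print(quality, completeness)
```

Fields with a high missing ratio or no description are candidates for extra data preparation or for a new data dictionary entry.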
The data mining application requires several templates to conduct data analysis. The specifications for these templates will be captured in the Specifications phase.
- Data Dictionary/Template: At the outset of the Data Mining task, it is essential to identify the appropriate data elements that need to be analyzed. A Data Dictionary/template T must be made available to the user as a prerequisite for selecting data elements from the available data sources. The Data Dictionary/template T contains information about the available data elements in the respective databases/views.
The Meta Data M of the selected data elements is captured from the Meta Data in the data server and is made available to the user.
- Data Preparation: Data preparation is the key to any data mining application. A prerequisite for any data mining technique is a set of selected, clean and transformed data. It is estimated that more than 50% of the work in data mining lies in preparing data to the point where data mining algorithms/techniques can actually start running. The line of business must understand what preparations are necessary for a data mining analysis.
Based on the BRS, the data elements required for each business case/problem will be identified. The data preparation requirements will be identified based on the categorizations and classifications of data elements for each business case/problem.
- Specification: A specification file S is required, detailing how to transform data as per the user requirement. The specification file will contain the selected data elements, the Data Mining Task and the user categorization/itemization of each data element. This specification information will be sent to the Data Mining Application.
- Data Mining Input Parameters: Based on the data mining task to be executed, the required input parameters for the task should be defined in the Data Mining Input Parameters Specifications.
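To make the idea of a specification file S more concrete, here is a minimal sketch that stores the items named above as JSON. The class name, field names and JSON layout are hypothetical; the source does not define a file format for S.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class MiningSpecification:
    """Hypothetical in-memory form of specification file S."""
    data_elements: list                                    # selected data elements
    task: str                                              # data mining task
    categorization: dict = field(default_factory=dict)     # per-element categories
    input_parameters: dict = field(default_factory=dict)   # task input parameters

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)

spec = MiningSpecification(
    data_elements=["age", "monthly_spend"],
    task="segmentation",
    categorization={"age": ["<30", "30-50", ">50"]},
    input_parameters={"clusters": 3},
)
spec_text = spec.to_json()  # text that could be written to a file

# Round trip: the application side could rebuild the specification.
restored = MiningSpecification(**json.loads(spec_text))
print(spec_text)
```

Keeping the input parameters in the same structure is a design choice for this sketch; they could equally live in a separate Data Mining Input Parameters document.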
Design
The Design phase addresses the technical design of the data mining application. In order to start this phase, it is essential that the detailed business and technical requirements are complete and approved and that the technology platform is finalized.
Application design focuses on developing detailed System Design Specifications (SDS), comprising the following:
Data Preparation Design
- Data Manager Layer: Once the specification S is available, the mining task can be performed. The data manager layer will be designed to extract the data from the database and make it available to the Data Mining Task in the format specified by the Data Mining Task Input Parameter Specification.
- Manage Data: Depending on the business case, the data may need to be extracted from the database in the format and manner that the Data Mining Task requires. In such a scenario, a separate database may optionally be created and the data loaded into it for the Data Mining Application.
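A data manager layer of this kind might look like the sketch below, which reads selected columns from a database and hands them over as CSV text. The use of SQLite, the CSV output format, and the table and column names are assumptions chosen to keep the example self-contained.

```python
import csv
import io
import sqlite3

def extract_for_task(conn, table, columns):
    """Read the selected columns and return them as CSV text with a header."""
    # Identifiers cannot be bound as SQL parameters, so they are quoted here;
    # a real system should validate them against the data dictionary first.
    col_list = ", ".join(f'"{c}"' for c in columns)
    query = f'SELECT {col_list} FROM "{table}" ORDER BY rowid'
    rows = conn.execute(query).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical source table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 34, "Pune"), (2, 45, "Mumbai")],
)
extract = extract_for_task(conn, "customers", ["id", "age"])
print(extract)
```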
Business Case Model Design
Based on the business case, a process model will be designed. The process will have the following steps:
- Manage Data Process: In this process, the design of the data extraction will be defined based on the required format.
- Data Mining Process: In this process, the design will focus on performing the data mining tasks identified for the business case.
- Visualization Process: In this process, the primary focus will be on designing the generation of reports and, if required, alerts based on user-defined patterns.
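The three process steps above can be pictured as a small pipeline. The sketch below is purely illustrative: the function names, the spend-band counting used as a stand-in mining task, and the alert rule are all assumptions.

```python
from collections import Counter

def manage_data(raw_rows):
    """Manage Data step: keep only rows with a usable numeric spend."""
    return [r for r in raw_rows if isinstance(r.get("spend"), (int, float))]

def mine(rows, high_spend=1000):
    """Data Mining step: placeholder task that counts rows per spend band."""
    return Counter(
        "high" if r["spend"] >= high_spend else "regular" for r in rows
    )

def visualize(counts, alert_pattern=None):
    """Visualization step: text report plus an alert for a user-defined pattern."""
    lines = [f"{band}: {n}" for band, n in sorted(counts.items())]
    if alert_pattern and alert_pattern(counts):
        lines.append("ALERT: user-defined pattern matched")
    return "\n".join(lines)

# Hypothetical input rows; one has no usable spend value.
raw = [{"spend": 1500}, {"spend": 200}, {"spend": None}, {"spend": 1200}]
report = visualize(
    mine(manage_data(raw)),
    alert_pattern=lambda c: c["high"] >= 2,  # user-defined alert condition
)
print(report)
```

Keeping each step as a separate function mirrors the separation of processes in the design, so that a step can be redesigned without touching the others.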
Development/refinement of the application prototype is an optional activity within the design phase. This prototype will involve development of components, which will be reused during the development activities. The development of prototype helps in gaining appreciation from business users in functional, technical and reusability perspectives.
In preparation for the build phase, standards for various development components (coding, testing etc.) will be prepared and reviewed.
Build
The ‘Build’ phase addresses the development and testing phase of the KDD Program.
This module will concentrate on development, unit testing and system testing of the Data Mining application. In addition to activities related to coding and unit testing, quality assurance activities such as code review and test plan review will be included.
At the end of the build, the results of the data mining application will be validated. Based on the validation, the data mining application may return to the Specifications phase to revise the business and data understanding, and the data mining model will subsequently be changed if required. The build process is iterative in nature and will go through many iterations before deployment.
Preparing, reviewing and executing the Unit Test Plan (UTP) and System Integration Test Plan (SITP) will be activities in this module. The UTP will comprise the test cases for each individual component of the application. During unit testing, it is necessary to have the complete system integration view so that errors are minimized during the SITP. This is done by highlighting in the UTP the components and test cases that will play a role in the SITP.
The SITP will comprise the system test cases. The test cases can also be based on comparison with existing reporting / DSS systems and on reconciliation requirements with respect to the source systems. The SITP will form the basis for system testing.
User Acceptance Testing (UAT) criteria and plan will also be prepared in this module. Wherever applicable, stress testing (this includes volume testing and performance testing) will be addressed during the System Testing and UAT.
User-related documentation, such as the Operations Guide and the User Manual, will be prepared during this module.
A training plan addressing details such as the training modules, contents and schedules will be prepared. The plan will also state the prerequisites for each of the sessions. Training material, including case studies, will be a further set of deliverables from this module.
The site rollout plan will be prepared to state the activities involved during the deployment of the data mining application. It will also highlight the risks involved and precautions to be taken during rollout activities.
To summarize, the key deliverable from this module will be a system-tested application that is ready for user acceptance testing.
Deployment Phase
The ‘Deployment’ phase focuses on the effective implementation of the Data Mining Application in order to provide easy information access to the end users. In order to start this phase, it is essential that the Data Mining Application has undergone System Testing. The data mining application will move from the development environment to the test (UAT) environment and then to the production environment.
The test environment will be set up and configured for UAT. The test cases in the UAT plan, prepared in the Build phase along with the SITP, will be used during UAT. The UAT will comprise three main tests.
Volume Testing
This will consist of Volume Testing and Performance Testing. Volume testing will be carried out to test the scalability and durability of the data mining application, with respect to the base (historical) data volume and the growth in data volume.
Stress testing will be carried out in the testing environment provided the environment is scalable to the production environment.
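A volume test of this kind can be sketched as a small harness that runs the same task at the base data volume and at projected growth volumes and records the elapsed time. The harness, the stand-in task (sorting) and the chosen volumes are assumptions for illustration; real tests would run the actual data mining application against representative data.

```python
import time

def run_volume_test(task, make_data, volumes):
    """Run `task` on data of each size and return (volume, seconds) pairs."""
    results = []
    for n in volumes:
        data = make_data(n)  # data generation is excluded from the timing
        start = time.perf_counter()
        task(data)
        results.append((n, time.perf_counter() - start))
    return results

# Hypothetical base volume followed by two growth projections.
timings = run_volume_test(
    task=sorted,
    make_data=lambda n: list(range(n, 0, -1)),
    volumes=[1_000, 10_000, 100_000],
)
for volume, seconds in timings:
    print(f"{volume} rows: {seconds:.4f} s")
```

Comparing how the elapsed time grows against the growth in volume gives a first indication of scalability; the absolute numbers depend on the environment.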
End User Acceptance Testing
Business users will carry out the report and navigational testing. They will primarily test the data that will be displayed in the presentation layer.
Users will also carry out performance checks and tuning to ensure the expected performance, robustness and scalability over a period of time.
Once the users accept the developed data mining application, the training programs and Site Rollout activities will begin in parallel. The users who conduct the acceptance testing might need to be trained prior to the acceptance testing. The training program will be customized as per the user requirements. It can be tool-specific training, data mining application-specific training or a mix of both. Various training sessions will be carried out for different types of users, e.g. Administrators, Operational Process Monitoring Staff, Application / Component Developers, Forecasting Reports and separate Statistical Tools.
While rolling out the data mining application to the production environment, the environment setup will be completed first to ensure that the application can be effectively deployed. The environment setup includes the hardware setup, server configurations, operating systems, and software installation and configuration. Utmost care needs to be taken if the production environment is already live with other applications. The impact of this application on the production system needs to be monitored and controlled.
Once the production environment setup is complete, the application software and code will be deployed and made operational.
Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data and as such can work well alongside data mining. Data mining allows the analyst to focus on certain patterns and trends and explore them in depth using visualization. On its own, data visualization can be overwhelmed by the volume of data in a database, but in conjunction with data mining it can help with exploration.
Note: BIDS Solutions encompass the proprietary solutions from TCS covering the Business Intelligence and Data Warehousing landscape.