Overview
Information is at the heart of business operations, and decision-makers have long recognized that stored data can yield valuable insight into the business. Database management systems gave access to the stored data, but access alone delivers only a small part of what the data can offer. Traditional online transaction processing (OLTP) systems are good at putting data into databases quickly, safely and efficiently, but they are not good at delivering meaningful analysis in return. Analyzing the data goes beyond what is explicitly stored and derives new knowledge about the business. This is where Knowledge Discovery in Databases (KDD) benefits an enterprise. KDD involves processes such as Business Case Definition, Data Preparation, Data Mining and Evaluation.
The term data mining has been stretched to cover almost any form of data analysis and is often used interchangeably with KDD. Strictly speaking, however, data mining is just one step in the KDD process: the step that focuses on analyzing data with minimal user intervention. Some of the numerous definitions of data mining are:
- “Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.” Marcel Holsheimer and Arno Siebes (1994).
- The analogy with the mining process is described as follows: “Data mining refers to ‘using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful’.” Clementine User Guide, a data mining toolkit from SPSS.
- “Data Mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.” William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus.
In essence, data mining is concerned with analyzing data and using tools and techniques to find patterns and regularities in data sets. The computer system, not the user, is responsible for finding the patterns by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places as the system mines deep into the data and extracts patterns that were either not previously discernible or so obvious that no one had noticed them. Data mining is not a matter of running simple queries to validate known facts; the objective is to find patterns and rules automatically, with minimal user input.
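The difference between a query that validates a fact and a search that discovers patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions are made-up toy data, the function names `count_containing` and `find_rules` are hypothetical, and simple pairwise association rules with support and confidence thresholds stand in for the much broader family of mining techniques listed in the definitions above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (made-up data, for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def count_containing(transactions, items):
    """Query-style fact check: the user already knows which items to ask about."""
    return sum(1 for t in transactions if items <= t)


def find_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Mining-style search: examine every item pair and keep the rules
    'lhs -> rhs' whose support and confidence meet the thresholds."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    # The query answers only the question that was asked.
    print(count_containing(TRANSACTIONS, {"bread", "butter"}))
    # The search reports whichever pairs happen to satisfy the thresholds.
    for lhs, rhs, support, confidence in find_rules(TRANSACTIONS):
        print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In the query, the user supplies the hypothesis; in the search, the only user input is the pair of thresholds, and the system decides which relationships are worth reporting.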
In the evolution from business data to business information to business knowledge, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user’s point of view, the four steps, listed in the table below, were revolutionary because they allowed new business questions to be answered accurately and quickly.
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| --- | --- | --- | --- | --- |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | RDBMS, SQL, ODBC | Oracle, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing (1990s) | "What were unit sales in New England last March? Drill down to Boston." | Relational Data Warehouse, OLAP, MDDB | NCR, Business Objects, COGNOS, Hyperion | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (2000s) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, Very Large Databases | SAS, SPSS, IBM, Oracle, NCR | Prospective, proactive information as well as knowledge delivery |
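The shift from retrospective to prospective questions can also be illustrated with a short Python sketch. It rests on loud assumptions: the monthly figures are made-up numbers, the function names are hypothetical, and an ordinary least-squares trend line is used as the simplest possible stand-in for the "advanced algorithms" named in the table. It addresses only the "what is likely to happen" part of the question, not the "why".

```python
# Hypothetical monthly unit sales for one city (made-up numbers).
MONTHLY_SALES = [100, 110, 120, 130, 140, 150]


def total_sales(sales):
    """Retrospective question: what was sold in total over the period?"""
    return sum(sales)


def fit_linear_trend(sales):
    """Fit a least-squares line through the points (month_index, sales)."""
    n = len(sales)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sales) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_next(sales):
    """Prospective question: extrapolate the fitted line one month ahead."""
    slope, intercept = fit_linear_trend(sales)
    return intercept + slope * len(sales)


if __name__ == "__main__":
    print("Total so far:", total_sales(MONTHLY_SALES))
    print("Trend forecast for next month:", forecast_next(MONTHLY_SALES))
```

The first function only summarizes what has already been recorded, which is what the earlier rows of the table deliver; the second produces an estimate about a month that is not yet in the database.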