Data mining is a branch of computer science that involves sorting through large amounts of electronic data to find patterns, establish relationships, and discover knowledge. It is a step in the process known as knowledge discovery in databases (KDD). Computer programmers who work in data mining seek to design software (programs) that can analyze large databases (collections of data) automatically.
Data mining uses techniques developed in artificial intelligence (AI), statistics, and other branches of computer science and mathematics. AI is the effort to design systems that process information in a manner similar to the way a person thinks. AI includes the study and application of methods for finding rules and patterns in data. Statistics involves constructing mathematical models of data and determining correlations. Correlation is a term used to describe how closely two or more variable quantities change together. In a positive correlation, the quantities increase or decrease together. In a negative correlation, one quantity increases as the other decreases. Data-mining specialists strive to apply these techniques to the analysis of billions of bits of computer data.
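The idea of correlation can be illustrated with a short program. The following Python sketch computes the Pearson correlation coefficient, one common statistical measure of correlation, which ranges from -1 to 1. The function name and the temperature and sales figures are invented for illustration and do not come from any real data set.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unscaled covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unscaled variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: daily temperature and ice cream sales.
temperature = [18, 21, 24, 27, 30]
ice_cream_sales = [40, 50, 65, 80, 95]
r = pearson_correlation(temperature, ice_cream_sales)
print(round(r, 3))
```

A value of r close to 1 indicates that the two quantities tend to rise together. A value close to -1 indicates that one tends to fall as the other rises, and a value near 0 indicates little linear relationship.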
Data-mining techniques are commonly used to analyze consumer purchases and predict future buying behavior. For example, researchers mine large supermarket databases to discover what combinations of items often appear in the same shopping basket. In other words, they seek to determine what groups of items consumers tend to purchase together. Supermarkets can then use this “market basket analysis” to try to increase their sales by putting linked items next to each other on store shelves.
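A very small version of market basket analysis can be sketched in Python. The program below counts how often each pair of items appears in the same basket and keeps the pairs found in at least a chosen fraction of baskets. The function name, the threshold, and the sample baskets are assumptions made for illustration. Practical systems use more efficient algorithms on far larger databases.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur in at least min_support (a fraction) of baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) ignores repeated items and gives each pair one fixed order.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

In this made-up data, bread and butter appear together in three of the four baskets, so that pair is the only one that reaches the 50 percent threshold.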
Data mining is also applied to scientific data. For example, experiments in molecular biology generate large amounts of information on how hundreds of genes behave under various conditions. Researchers mine this data to try to discover the function of individual genes.
A number of World Wide Web search engines use data-mining techniques. For example, the Google search engine analyzes data on the billions of links (interactive connections) between Web pages to estimate which pages are the most popular, treating a link to a page as a sign of that page's importance. This discovered knowledge is used to rank search results.
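Link analysis of this kind can be illustrated with a simplified Python sketch in the spirit of the PageRank method. It is not Google's actual system. The function name, the damping value of 0.85, the number of iterations, and the four-page example web are all assumptions made for illustration. Each page repeatedly shares its score among the pages it links to, so pages with many incoming links from well-ranked pages end up with higher scores.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages from a dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    # Start with every page equally ranked.
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score regardless of its links.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores

# Hypothetical miniature web of four pages.
links = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home"],
    "blog": ["home", "news"],
}
scores = rank_pages(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this example, the page named "home" receives links from every other page and so earns the highest score. The page named "blog" has no incoming links and earns the lowest.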
During the 1970’s and 1980’s, increasing amounts of commercial, industrial, and scientific data were being stored in computer databases. In the 1990’s, programmers began to develop data-mining software as a means of discovering interesting patterns in these large databases.