Information retrieval is a branch of computer and information science that develops systems to help people find the information they need. Information retrieval has become especially important since the 1990’s, when the Internet began to greatly increase the amount of information available to computer users.
The Internet is a worldwide system of interconnected computers. It includes millions of websites with information on nearly any topic. Many of these websites include multimedia data (sounds, pictures, and moving images) as well as text. The amount of text and multimedia data available on the Internet far exceeds what manual organization systems can manage. Search engines, such as Google and Yahoo!, are a popular type of information retrieval system.
Information retrieval systems are concerned with the cataloging and retrieval of relatively unstructured information. In contrast, database management systems deal with the storage and retrieval of information organized in highly structured databases.
How information retrieval works.
The main functions of an information retrieval system are (1) indexing, (2) formulating queries, and (3) ranking. Indexing is the creation of an organized guide to the contents of a file, document, or group of documents. In the case of an Internet search engine, indexing involves extracting representative words and phrases from the contents of web pages. The extracted words and phrases are stored in an index structure designed to make searching millions of pages extremely fast. Web indexing relies on web crawling, also called spidering, a process of discovering new pages by following links (interactive connections).
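One common form of index structure is an inverted index, which lists, for each word, the pages that contain it. The following is a minimal sketch in Python, not a description of how any particular search engine works. The page names, the sample text, and the function names `tokenize`, `build_index`, and `lookup` are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def lookup(index, word):
    """Return the sorted names of the pages that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical pages, invented for this example.
pages = {
    "page_a": "Libraries catalog books and journals.",
    "page_b": "Search engines index web pages.",
    "page_c": "A catalog is an index to a collection.",
}
index = build_index(pages)
print(lookup(index, "catalog"))
print(lookup(index, "index"))
```

Because the index is built once in advance, answering a query for a word requires only a single lookup rather than a scan of every page.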
Formulating a query means describing the information that is wanted. Most queries in search engines are just one or more words. Modern search engine programs can often tell the context of the words in a query and offer users options to refine their searches.
In the ranking function, the search engine uses the index structure to compare queries and web page contents. A retrieval algorithm (mathematical process) calculates a score for each web page that has matching words and phrases. The score is based on such information as the number and frequency of matching words, the number of links pointing to a page, and where matches occur in a page. Such factors usually help determine how relevant the page is to the query. Search engines display a list of links to pages in order of their scores.
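The scoring idea described above can be sketched in Python. This is an illustrative assumption, not the formula used by any real search engine: the weights `title_weight` and `link_weight`, the page records, and the function names `score_page` and `rank` are all hypothetical. The sketch counts matching words in the body, gives extra weight to matches in the title, and adds a bonus for links pointing to the page.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_page(query, page, title_weight=3.0, link_weight=0.5):
    """Combine body matches, title matches, and inbound links into one score."""
    terms = set(tokenize(query))
    body_hits = sum(1 for w in tokenize(page["body"]) if w in terms)
    title_hits = sum(1 for w in tokenize(page["title"]) if w in terms)
    if body_hits == 0 and title_hits == 0:
        return 0.0  # a page with no matching words is not ranked at all
    return body_hits + title_weight * title_hits + link_weight * page["inbound_links"]


def rank(query, pages):
    """Return the URLs of matching pages, highest score first."""
    scored = [(score_page(query, p), p["url"]) for p in pages]
    matching = [(s, u) for s, u in scored if s > 0]
    return [u for s, u in sorted(matching, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical pages, invented for this example.
pages = [
    {"url": "a.example", "title": "Dewey Decimal System",
     "body": "The Dewey system organizes library books.", "inbound_links": 4},
    {"url": "b.example", "title": "Library hours",
     "body": "The library opens at nine.", "inbound_links": 1},
    {"url": "c.example", "title": "Cooking",
     "body": "Recipes for soup.", "inbound_links": 10},
]
print(rank("dewey library", pages))
```

In this sketch the third page is left out of the results even though many links point to it, because it contains none of the query words.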
Research.
Researchers in information retrieval constantly seek ways to improve the ability of the systems to identify relevant information. One area of research involves the development of question-answering systems. The goal of question answering is to provide specific answers to questions typed in by users. In addition to traditional ranking techniques, a question-answering approach must involve natural language processing, which helps computers respond to questions phrased in ordinary language. A key part of question answering is information extraction—that is, automatically finding particular types of information, such as names, addresses, and dates, in the text.
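A very simple form of information extraction can be sketched with pattern matching. The Python example below is a rough illustration only: it assumes dates written in the form "March 5, 2021" and treats e-mail addresses as the kind of address to find. The patterns, the sample sentence, and the function name `extract` are assumptions, and practical systems rely on far more robust natural language processing.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Dates such as "March 5, 2021" (an assumed format for this sketch).
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

# A deliberately simple e-mail pattern; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract(text):
    """Return the dates and e-mail addresses found in the text."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "emails": EMAIL_PATTERN.findall(text),
    }


# Hypothetical sentence, invented for this example.
text = ("The meeting moved from March 5, 2021 to April 12, 2021. "
        "Write to desk@library.example for details.")
print(extract(text))
```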
Information extraction is an example of text data mining. Researchers are working to improve a number of data-mining techniques used in information retrieval. These techniques include automatic clustering (grouping) of related documents, categorizing documents into directories that people have created, and analyzing relationships and patterns in scientific literature.
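Automatic clustering can be illustrated with a short Python sketch. It assumes one simple approach among many: each document joins the first group whose first member shares enough words with it, as measured by Jaccard similarity (shared words divided by all distinct words in the two documents). The `threshold` value, the sample documents, and the function names are assumptions chosen for illustration.

```python
import re


def words(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words in the two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.3):
    """Group document positions; each document joins the first similar group."""
    clusters = []
    for i, doc in enumerate(docs):
        doc_words = words(doc)
        for group in clusters:
            # Compare against the first document placed in the group.
            if jaccard(doc_words, words(docs[group[0]])) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])  # no similar group found, so start a new one
    return clusters


# Hypothetical documents, invented for this example.
docs = [
    "solar panels convert sunlight",
    "solar panels store sunlight energy",
    "bread recipes use yeast",
    "yeast makes bread rise",
]
print(cluster(docs))
```

With these sample documents, the two texts about solar panels fall into one group and the two about bread into another.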
There are a number of other research areas in information retrieval. They include developing programs that summarize text content automatically and designing systems to deal with text in multiple languages.
A technique called filtering involves creating profiles of people’s interests or preferences, and comparing those profiles to information from newswires, product announcements, or other sources. A service can then alert people to information of particular interest.
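Filtering can be sketched in Python by treating a profile as a set of interest words and comparing it with each incoming item. The profile, the sample headlines, the `min_overlap` setting, and the function name `matching_items` are assumptions made for illustration, and a real service would likely use a richer measure of similarity.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_items(profile, items, min_overlap=2):
    """Return the items that share at least min_overlap words with the profile."""
    interests = {word.lower() for word in profile}
    alerts = []
    for item in items:
        shared = interests & tokenize(item)
        if len(shared) >= min_overlap:
            alerts.append((item, sorted(shared)))
    return alerts


# Hypothetical profile and headlines, invented for this example.
profile = {"telescope", "astronomy", "comet"}
items = [
    "New comet spotted by amateur astronomy club",
    "Stock markets close higher",
    "Telescope sale this weekend",
]
for item, shared in matching_items(profile, items):
    print(item, shared)
```

Raising `min_overlap` makes the alerts more selective, while lowering it to 1 would also report the telescope sale.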
History.
People began to use computers for information retrieval in the 1960’s, but manual systems for information retrieval date to ancient times. In the 200’s B.C., the Greek scholar Callimachus developed a catalog for the Alexandrian Library in Alexandria, Egypt. The catalog classified the library’s writings into the works of poets, lawmakers, philosophers, historians, orators, and miscellaneous writers. During the A.D. 1200’s, monks developed concordances for the Bible—that is, lists of the book’s principal words with references to the passages where they occurred. In 1876, the American librarian Melvil Dewey developed the widely used Dewey Decimal Classification system for organizing library materials.