Information Retrieval


Information retrieval is concerned with searching and evaluating unstructured data, much as search engines do with the World Wide Web.

Main principle

Information retrieval starts from a large collection of data in which information about a particular topic has to be found. To do this, the collection is searched, and each item found is assessed for whether it is relevant and whether it is more relevant than other items found. There is no single correct answer to these questions, and there is no perfect order in which to deliver the results. What matters is that the evaluation is useful to humans. Information retrieval is not about creating new data, but about managing existing data. The search covers large amounts of data rather than individual words.

Areas of application

Currently, the largest area of application is Internet search, in which search engines such as Google or Bing search the data of the Internet for the words entered by the user (the search request). The searcher receives a results list of webpages that contain information relevant to the search term entered. Search results that appear on the search engine results pages (SERPs) but are not relevant to the user are called false drops. Information retrieval is also used in literature search in digital libraries, in image search engines, and in spam filters.

Difficulties

Users can usually formulate only vague requests, and they often do not know exactly what they are looking for. It is also uncertain whether the information found actually matches the request, for example because a word has several meanings or because different words (synonyms) mean the same thing.

Models

There are several models for indexing the documents found; as a rule, they are not mutually exclusive. The aim is to present as many relevant documents as possible and to omit those that are not relevant.

Boolean model

Based on Boolean algebra, this model uses requests with an exact syntax that combine terms with Boolean operators such as "and", "or", and "not". This is simple and clear. The disadvantage is that partial matches and weighting of terms are not possible. The result is therefore not a ranking, because a document either matches the request or it does not.
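
As an illustration, Boolean retrieval can be sketched with Python sets. This is a minimal sketch under stated assumptions, not a description of any particular search engine: the toy corpus and the names DOCS, build_index, and term are invented for this example.

```python
# Toy corpus; the document ids and texts are invented for illustration.
DOCS = {
    1: "search engines index web pages",
    2: "digital libraries index literature",
    3: "spam filters classify email",
}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)


def term(word):
    """Return the ids of all documents containing the word (empty set if none)."""
    return INDEX.get(word.lower(), set())


# Boolean operators map directly onto set operations:
# "index AND NOT web" -> intersection with the complement
result_and_not = term("index") & (ALL_IDS - term("web"))
# "web OR email" -> union
result_or = term("web") | term("email")

print(result_and_not)
print(result_or)
```

Each result is a plain set of document ids. A document is either in the set or not, so there is no order among the matches, which is exactly the missing ranking described above.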

Vector space model

The vector space model is often used by search engines because it supports both ranking and similarity search. A document is transformed into a vector and, in this form, can be compared with other documents or with the search request. The documents can then be sorted according to the similarity of their vectors to the search request. The downside of this model is that Boolean operators cannot be used and terms cannot be excluded. The position of a document in the vector space is calculated from the terms it contains, their term frequency (TF), and their inverse document frequency (IDF).
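
The following sketch shows one way to rank documents in a vector space. It assumes a common weighting variant (raw term frequency multiplied by log(N / document frequency)) and cosine similarity as the similarity measure; neither choice is prescribed by the text above. The toy corpus and all function names are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus; the ids and texts are invented for illustration.
DOCS = {
    "d1": "search engines rank web pages",
    "d2": "digital libraries store literature",
    "d3": "web search engines crawl the web",
}


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_docs = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return {t: math.log(n_docs / count) for t, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Sort documents by decreasing similarity to the query."""
    idf = idf_table(docs)
    q_vec = tfidf_vector(query, idf)
    scores = {d: cosine(q_vec, tfidf_vector(text, idf)) for d, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


RANKING = rank("web search", DOCS)
for doc_id, score in RANKING:
    print(doc_id, round(score, 3))
```

In this toy corpus, d3 ranks above d1 because it mentions "web" twice, and d2 receives a score of zero because it shares no term with the request. Unlike the Boolean model, partial matches still receive a graded score.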

Probabilistic model

This model assigns each document a probability value that estimates whether it is a relevant result. The number of occurrences of the search terms in the document is a critical input. The result is a list sorted according to these probabilities. In its classic form the model is rarely used directly, but ranking functions derived from it, such as Okapi BM25, are widely used in practice.
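
As a rough illustration of probabilistic ranking, the sketch below uses Okapi BM25, a ranking function from the probabilistic family that the text above does not describe in detail; it is swapped in here only as a concrete example. BM25 scores are relevance weights used for sorting, not literal probabilities. The parameter values k1 = 1.5 and b = 0.75 are common defaults and an assumption here, the IDF formula is the non-negative variant, and the toy corpus and function name are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus; the ids and texts are invented for illustration.
DOCS = {
    "d1": "retrieval models rank documents",
    "d2": "probabilistic retrieval estimates relevance of documents",
    "d3": "spam filters classify email",
}

K1 = 1.5  # term-frequency saturation (assumed common default)
B = 0.75  # strength of document-length normalization (assumed common default)


def bm25_scores(query, docs, k1=K1, b=B):
    """Score every document against the query with Okapi BM25."""
    tokens = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(toks) for toks in tokens.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    scores = {}
    for d, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[d] = score
    return scores


SCORES = bm25_scores("probabilistic retrieval", DOCS)
BEST = max(SCORES, key=SCORES.get)
print(BEST, SCORES)
```

Here d2 scores highest because it contains both search terms, d1 receives a smaller score for matching only "retrieval", and d3 scores zero. Sorting the documents by these scores yields the result list.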