In information retrieval, unstructured data is evaluated in a way similar to how search engines handle the World Wide Web.
It is based on large amounts of data. Suppose information about a particular topic needs to be found: the data set must be searched, and the documents found must be assessed as to whether they contain important information and whether some information is more important than the rest. There is no single correct answer to these questions and no perfect order in which to deliver the results. The goal is to evaluate the data in a way that is useful to humans. Information retrieval is not about creating new data but about managing existing data, and it searches large collections of data, not individual words.
Currently, the largest area of application is Internet search, where search engines such as Google or Bing search the data of the Internet based on the desired words (search queries). The searcher is provided with a list of relevant web pages that contain information related to the entered search term. Search results that are not relevant to the user but still appear on the SERPs are called false drops. Information retrieval is also used when searching for literature in digital libraries, in image search engines, and in spam filters.
Users can usually enter only very vague queries, and they often do not know exactly what they are looking for. It is also uncertain whether the retrieved information is correct, for example because a word has several meanings or because different synonyms express the same concept.
There are different models for indexing and ranking the retrieved documents; as a rule, these models do not exclude one another. The aim is to present as many relevant documents as possible and to omit those that are not relevant.
In the Boolean model, based on Boolean algebra, queries with an exact syntax are posed using Boolean operators such as "and", "or", and "not". This is simple and unambiguous. The disadvantage is that partial matches and weighting of terms are not possible: a document is either relevant or not, so the result is a set rather than a ranking.
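Boolean retrieval over an inverted index can be sketched as follows; the corpus, document names, and helper functions are invented for illustration:

```python
# Toy corpus (invented for illustration).
documents = {
    "doc1": "information retrieval evaluates unstructured data",
    "doc2": "search engines index web data",
    "doc3": "retrieval models rank search results",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents containing ALL terms (the 'and' operator)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Documents containing ANY of the terms (the 'or' operator)."""
    return set.union(set(), *(index.get(t, set()) for t in terms))

def boolean_not(doc_set):
    """All documents except the given ones (the 'not' operator)."""
    return set(documents) - doc_set

print(sorted(boolean_and("retrieval", "search")))  # ['doc3']
print(sorted(boolean_or("web", "data")))           # ['doc1', 'doc2']
```

Note that each query returns an unordered set of matching documents: every document is either in the result or not, which is exactly why the Boolean model cannot rank.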
A model often used by search engines is the vector space model, since it supports both ranking and similarity search. A document is transformed into a vector and in this form can be compared with other documents or with the search query; the vectors can then be sorted according to their similarity to the query. The downside of this model is that Boolean operators cannot be used and terms cannot be excluded. Terms, term frequency, and inverse document frequency (IDF) are used in this model to calculate the position of each document in the vector space.
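The vector space model can be sketched with TF-IDF weights and cosine similarity; the corpus below is invented for illustration, and real systems add normalization and smoothing beyond this minimal version:

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
documents = {
    "doc1": "information retrieval evaluates unstructured data",
    "doc2": "search engines index web data",
    "doc3": "retrieval models rank search results",
}

# Inverse document frequency: rarer terms get higher weight.
N = len(documents)
df = Counter()
for text in documents.values():
    df.update(set(text.split()))
idf = {term: math.log(N / count) for term, count in df.items()}

def tf_idf_vector(text):
    """Represent a text as a sparse vector of tf * idf weights."""
    counts = Counter(text.split())
    return {t: counts[t] * idf.get(t, 0.0) for t in counts}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Rank all documents by similarity to the query.
query_vec = tf_idf_vector("retrieval data")
ranking = sorted(documents,
                 key=lambda d: cosine(tf_idf_vector(documents[d]), query_vec),
                 reverse=True)
print(ranking[0])  # doc1 matches both query terms
```

Because every document gets a similarity score rather than a yes/no decision, the result is a ranked list, which is what distinguishes this model from the Boolean one.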
The probabilistic model assigns each document a probability value indicating whether it is a relevant result; the number of occurrences of the search terms in the document is critical here. The result is a list sorted by these probabilities. In practice this model performs no better than the others and is rarely used on its own.
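A widely known ranking function from the probabilistic family is BM25, which the text does not name explicitly; the following is a simplified sketch over an invented toy corpus, not a production implementation:

```python
import math

# Toy corpus (invented for illustration).
documents = {
    "doc1": "information retrieval evaluates unstructured data",
    "doc2": "search engines index web data",
    "doc3": "retrieval models rank search results",
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a simplified BM25.

    The score grows with the number of occurrences of the search terms
    (term frequency), saturates via k1, and is normalized by document
    length via b.
    """
    tokenized = {d: text.split() for d, text in docs.items()}
    N = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized.values()) / N
    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query.split():
            n = sum(1 for t in tokenized.values() if term in t)
            if n == 0:
                continue  # term appears in no document
            idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
            tf = tokens.count(term)
            norm = tf + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

scores = bm25_scores("retrieval data", documents)
best = max(scores, key=scores.get)
print(best)  # doc1 contains both query terms
```

The sorted scores yield exactly the probability-ordered result list the model describes: documents matching more query terms, more often, rank higher.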