Tuesday, October 30, 2007

Full-text search nomenclature

I've been working through another reference document on full-text searching that Andrew Aksyonoff gave me: "Recommended Reading for IR Research Students". Justin Zobel is one of its editors, and IR stands for "Information Retrieval".

It is a list of important papers on Information Retrieval (IR), a field in which full-text searching is an important area of study.

Reading the abstracts, I found that a single concept is often referred to in several different ways. I list them here. Nomenclature:

* Scoring function: equal to Ranking function, or weight function?

* Term: equal to word, or keyword.

* Document: equal to data, row, record, message.

* Index =? weighted index, inverted index, reverse index, inverted list, inverted file, document list.

* Term-document matrix =? document-term frequency weights.

* Phrase =? compound terms, several words in quotes.

* Vector Space Model (VSM) = Term Vector Model

* Latent Semantic Analysis = Latent Semantic Indexing (LSI) = Vector reduction


Part of this terminology could probably be ordered in this (perhaps naive) way:

1.- Words, terms or keywords belong to a given document.

2.- A document is more or less relevant with respect to a word according to the weight/rank/score function.

3.- The weight of a given word is just one vector component, a scalar value.

4.- The weights of all the words in a document define one vector. This vector represents the document.

5.- All the documents in the database are vectors in a vector space. To compare them, they should first be normalized, so that the longest documents do not always occupy the top result positions.
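The five steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the weight function is a raw term count (real systems usually use something like TF-IDF), the normalization is division by the Euclidean length, and the names `term_weights`, `normalize` and `score` are made up for this example.

```python
import math
from collections import Counter

def term_weights(text):
    """Steps 1-4: map each term of a document to a weight (here, a raw count)."""
    return Counter(text.lower().split())

def normalize(weights):
    """Step 5: scale a sparse weight vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return {}
    return {term: w / length for term, w in weights.items()}

def score(query_vec, doc_vec):
    """Dot product of two sparse vectors (cosine similarity if both are unit length)."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

# Two toy documents: the long one repeats the query terms many times
# and also contains unrelated words.
docs = {
    "short": "full text search",
    "long": "full text search " * 10 + "and many other unrelated words here",
}
query = normalize(term_weights("full text search"))

for name, text in docs.items():
    raw = term_weights(text)
    unit = normalize(raw)
    # First number: score against the unnormalized document vector.
    # Second number: score against the normalized document vector.
    print(name, score(query, raw), score(query, unit))
```

Without normalization the long document gets the higher score simply because it repeats the terms more often. After normalization the short document, which is entirely about the query terms, comes out on top.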

This is supposed to be the so-called Vector Space Model (VSM), I guess. If so, I have to say that it is a somewhat strange vector space, because the vector components never take a negative value. All the vectors/documents reside in the same multidimensional quadrant. Could that be a problem?
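One consequence of the non-negative components can at least be checked directly: the cosine of the angle between two such vectors can never be negative, so it always falls between 0 (no shared terms) and 1 (same direction). The sketch below assumes cosine similarity as the comparison measure, which the text above does not state, and the helper `cosine` and the sample vectors are hypothetical. It does not settle whether the single quadrant is a problem in practice.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each position is the weight of one term; all weights are >= 0.
same_direction = cosine([1, 2, 0], [2, 4, 0])    # same terms, same proportions
no_shared_terms = cosine([1, 0, 0], [0, 3, 5])   # no term in common
partial_overlap = cosine([1, 1, 0], [0, 1, 1])   # one term in common

print(same_direction, no_shared_terms, partial_overlap)
```

With negative components allowed, the same function could return values down to -1 (vectors pointing in opposite directions), which never happens for term-weight vectors of this kind.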
