I've been processing another reference document dealing with full-text searching that pointed me, via Andrew Aksyonoff, to "Recommended Reading for IR Research Students", which has Justin Zobel as one of its editors (IR standing for "Information Retrieval").
It is a list of important papers about Information Retrieval, and full-text searching is an important area of study within that field.
Reading their abstracts, I have found several distinct ways to refer to the same concepts. I list them here. Nomenclature:
* Scoring function: equal to Ranking function, or weight function?
* Term: equal to word, or keyword.
* Document: equal to data, row, record, message.
* Index =? weighted index, inverted index, reverse index, inverted list, inverted file, document list (a small sketch of the inverted index idea follows this list).
* Term-document matrix =? document-term frequency weights.
* Phrase =? compound terms, several words enclosed in quotes.
* Vector Space Model (VSM) = Term Vector Model
* Latent Semantic Analysis = Latent Semantic Indexing (LSI) = Vector reduction
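To make the "inverted index" family of names concrete, here is a minimal sketch in Python. It is only my own toy illustration (the document names and the naive whitespace tokenization are made up for the example, not taken from the papers): the index simply maps each term to the set of documents that contain it.

```python
# Minimal sketch of an inverted index / inverted list / inverted file:
# a map from each term to the ids of the documents that contain it.
from collections import defaultdict

documents = {
    1: "full text search over documents",
    2: "ranking documents by term weights",
    3: "the vector space model for text",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():          # naive tokenization: split on whitespace
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["documents"]))  # -> [1, 2]
print(sorted(inverted_index["text"]))       # -> [1, 3]
```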
Part of this terminology could probably be organized in this (maybe naive) way:
1.- Words, terms or keywords belong to a given document.
2.- A document is more or less relevant with respect to a word according to the weight/rank/score function.
3.- One weight for a given word is just a vector component, a scalar value.
4.- The weights of all the words in a document define one vector. This vector represents the document.
5.- All the documents in the database are vectors in a vector space. To compare them, they should first be normalized, so that the longest documents do not always end up in the first result positions (a rough sketch of these steps follows below).
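Here is a rough sketch of those five steps in Python. It is only my own illustration under simple assumptions: plain term frequency as the weight function (real engines tend to use something like tf-idf or BM25), naive whitespace tokenization, and cosine similarity between the normalized document vectors.

```python
# Rough sketch of steps 1-5: term weights -> document vectors ->
# normalization -> comparison by cosine similarity.
import math
from collections import Counter

documents = {
    "d1": "cats and dogs and cats",
    "d2": "dogs chase cats",
    "d3": "stock markets and bonds",
}

# Steps 1-4: one weight (here: raw term frequency) per word;
# all the weights of a document form the vector that represents it.
vectors = {doc_id: Counter(text.split()) for doc_id, text in documents.items()}

def normalize(vec):
    # Step 5: divide by the Euclidean length so long documents do not dominate.
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / length for term, w in vec.items()}

def cosine(a, b):
    # Similarity between two already-normalized document vectors.
    return sum(w * b.get(term, 0.0) for term, w in a.items())

norm = {doc_id: normalize(vec) for doc_id, vec in vectors.items()}
print(cosine(norm["d1"], norm["d2"]))  # higher: shared vocabulary (cats, dogs)
print(cosine(norm["d1"], norm["d3"]))  # lower: only "and" is shared
```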
This is supposed to be the so-called Vector Space Model (VSM), I guess. If so, I have to say that this is a bit of a strange vector space, because the vector components never take negative values. All the vectors/documents reside in the same multidimensional quadrant. Could that be a problem?
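One consequence of that "same quadrant" observation, as far as I can tell: with only non-negative components, the dot product of any two document vectors is never negative, so cosine similarity is bounded to [0, 1] instead of [-1, 1]. Two documents can at most be orthogonal ("unrelated"), never "opposite". A tiny check of this, as an aside with made-up vectors:

```python
# With non-negative components, cosine similarity stays in [0, 1]:
# documents can be unrelated (0.0) but never point in opposite directions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine([1, 0, 2], [0, 3, 1]))  # some overlap -> between 0 and 1
print(cosine([1, 0, 0], [0, 5, 0]))  # no overlap   -> 0.0, never negative
```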