Monday, October 29, 2007

Building fast search engines and Zettair

I am following Andrew Aksyonoff's guidelines to continue my full-text search investigations.
First of all, I am carefully reading "Building fast search engines" by Hugh E. Williams at http://www.hughwilliams.com/t1.pdf

It is a very interesting paper, and it led me to the Zettair project ( http://www.seg.rmit.edu.au/zettair/ ) by Justin Zobel and others at RMIT University in Melbourne. Zettair is a powerful indexer and search engine designed to work with large text repositories.

Hugh E. Williams has also worked at RMIT, and his document lays out the basics of search engines very clearly. I have learned several new concepts:

* Ranked query: A query of one or more words, without quotes, whose matching documents are returned ordered by estimated relevance.

* Phrase search: A query of two or more words enclosed in quotes, matching only documents where the words appear together in that order.

* Term Frequency (TF): The number of times a given word appears in a document.

* Inverse Document Frequency (IDF): A measure of a word's rarity across the whole collection: the fewer documents contain the word, the higher its IDF.

* Ranking function: The formula that scores how well a document matches the query, typically by combining the weights of the query words in that document. It is used to find out which documents are closest to the search query.

* Okapi BM25: A very well crafted ranking function that combines TF, IDF and document length to give a good and fair word weight.

* Inverted index: I would rather call it a "weighted index"; I don't know why they call it "inverted". An inverted index presents the database from a word's point of view: for each word, it lists the documents that contain that word.

* Compressed indexes: A way to reduce I/O operations, since a smaller index means fewer disk reads. The same applies to the data (or rows).
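
To make these concepts more concrete for myself, here is a minimal Python sketch of an inverted index with term frequencies and a BM25-style ranking on top of it. It is not Zettair's code nor the exact formula from the paper: the toy documents, the whitespace tokenizer, the function names, the parameter values k1=1.2 and b=0.75 (common defaults), and the IDF variant that adds 1 inside the logarithm are all my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy collection; the documents and their ids are made up for illustration.
DOCS = {
    1: "fast search engines need fast indexes",
    2: "compressed indexes reduce disk reads",
    3: "search engines rank documents",
}


def build_inverted_index(docs):
    """Map each word to {doc_id: term frequency}; also return document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        words = text.lower().split()  # naive whitespace tokenizer (assumption)
        lengths[doc_id] = len(words)
        for word, tf in Counter(words).items():
            index[word][doc_id] = tf
    return index, lengths


def bm25_scores(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents for a ranked query using a common BM25 variant."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue  # word does not appear in any document
        df = len(postings)  # number of documents containing the word
        # IDF: rarer words weigh more; the +1 keeps the value positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: long documents are penalized slightly.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index, lengths = build_inverted_index(DOCS)
ranking = bm25_scores("fast search", index, lengths)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With this toy collection, the query "fast search" ranks document 1 first, because it contains both words and "fast" is the rarer one. Document 2 contains neither word, so it gets no score at all.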
