Wednesday, October 31, 2007

First conclusions on Text Indexing

After reading several papers on Information Retrieval and search engines, I conclude that:

* The Sphinx engine deals only with indexes, not with data. It returns only the index entries pointing to the documents that match certain criteria. It is up to the database to fetch these documents and up to the web application to show them on screen.

* There are two ways to index full text: Inverted Files and Signature Files. The former seem to be more efficient. The latter work better when matching common words, but it is very uncommon to search only for common words. So, in real life, it is better to use Inverted Files.

* Inverted Files are, for each word, lists of the documents where that word appears, together with a normalized weight so that matching can be done reliably.

* Signature Files are bitmaps that indicate whether a document contains a word or not. They hold just this one bit of information, yes or no, repeated for every indexed document.

* When a document does not contain a certain keyword, a Signature File still stores a bit for it (1 or 0; in this case, 0), whereas an Inverted File simply contains no reference to that document.

* Sphinx uses Inverted Files. In contrast, TBGsearch uses Signature Files.

* I guess the default MySQL full-text engine also uses Inverted Files, but far less efficiently than Sphinx.

* Inverted Files can be compressed heavily, because each one is an ordered sequence of integers, usually homogeneous. This property saves disk I/O read operations.

* In Sphinx, when searching for two or more words, whether as a phrase or just Boolean AND'ed, the respective Inverted Files are read from disk, uncompressed, and intersected in RAM. The results are then ordered by the search criteria (matching weight, date, or others).
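
To make the points above more concrete, here is a toy sketch in Python. It is not Sphinx's code or its on-disk format; the sample documents and all function names (build_inverted, build_signatures, delta_encode, delta_decode, intersect) are made up for illustration. It leaves out the weights, and it assumes gap (delta) encoding as one common way to exploit the fact that the document ids in a list are sorted.

```python
from collections import defaultdict

# Hypothetical sample collection: document id -> text.
docs = {
    1: "sphinx builds an inverted index",
    2: "signature files use one bit per document",
    3: "an inverted index stores sorted document ids",
    7: "sphinx intersects sorted lists of document ids",
}

def build_inverted(docs):
    """Inverted File: word -> sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)

def build_signatures(docs):
    """Signature File (simplified): word -> one bit per indexed document."""
    ids = sorted(docs)
    vocab = {w for text in docs.values() for w in text.split()}
    return {w: [1 if w in docs[i].split() else 0 for i in ids] for w in vocab}

def delta_encode(postings):
    """Replace each id by its gap to the previous id; gaps are small numbers."""
    prev, gaps = 0, []
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild the original sorted ids by summing the gaps."""
    total, postings = 0, []
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

def intersect(a, b):
    """Boolean AND of two sorted lists, walking both in a single pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

index = build_inverted(docs)
signatures = build_signatures(docs)

print(index["sphinx"])                           # [1, 7]: absent docs are not listed
print(signatures["sphinx"])                      # [1, 0, 0, 1]: one bit per document
print(delta_encode(index["document"]))           # [2, 1, 4] for ids [2, 3, 7]
print(intersect(index["sphinx"], index["ids"]))  # [7]
```

Note how the list for "sphinx" has only two entries, while its bitmap has one bit for each of the four documents. A real engine would also pack the gaps into a variable-length byte encoding and keep the weights next to each id; both are omitted here.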
