Wednesday, October 31, 2007

Reading IF in 256 KB chunks

Andrew Aksyonoff has been explaining to me the chunk-oriented read process that Sphinx uses when it requires the inverted files (IF). These are the doubts and comments I replied with:

> ... Sphinx will read it in small (256 KB) chunks ...

So, if the query is a 2-word phrase, Sphinx will have 2 windows open simultaneously, that is, 2 buffers of 256 KB at the same time. Is that right?

If there are more words in the query, there are more simultaneous 256 KB chunks, right?
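To make the "one window per keyword" idea concrete, here is a minimal sketch of a chunked reader. It is only an illustration of the question, not Sphinx's actual code: the name `iter_doc_ids` and the flat layout of fixed-width little-endian uint32 docIds are assumptions made for the example (a real inverted file is likely compressed and carries more than docIds).

```python
import io
import struct

CHUNK_BYTES = 256 * 1024  # the 256 KB window size quoted above


def iter_doc_ids(stream, chunk_bytes=CHUNK_BYTES):
    """Yield docIds from a binary stream, keeping one chunk in memory at a time.

    Hypothetical layout: consecutive little-endian uint32 docIds.
    """
    if chunk_bytes % 4 != 0:
        raise ValueError("chunk_bytes must be a multiple of 4")
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break  # end of this keyword's list
        for (doc_id,) in struct.iter_unpack("<I", chunk):
            yield doc_id


# One reader per query keyword, each holding its own buffer.
internet = iter_doc_ids(io.BytesIO(struct.pack("<3I", 2, 5, 8)))
www = iter_doc_ids(io.BytesIO(struct.pack("<3I", 5, 8, 13)))
```

Under this sketch a 2-word query keeps 2 x 256 KB = 512 KB of buffers alive, and each extra keyword adds one more buffer.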

And the 2 major CPU-consuming processes are:
1.- Performing the chunk intersections.
2.- Sorting the intersected docIds by the user's criteria.
Right?
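The two steps above can be sketched as follows. This is a rough illustration under my own assumptions, not Sphinx's implementation: both docId lists are assumed to arrive in ascending order, and `intersect_sorted` and the `weights` table are hypothetical names invented for the example.

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending docId iterables."""
    it_a, it_b = iter(a), iter(b)
    out = []
    try:
        x, y = next(it_a), next(it_b)
        while True:
            if x == y:
                out.append(x)  # docId present in both lists
                x, y = next(it_a), next(it_b)
            elif x < y:
                x = next(it_a)  # advance the list that is behind
            else:
                y = next(it_b)
    except StopIteration:
        return out  # one list is exhausted, nothing more can match


# Step 1: intersect the per-keyword docId lists.
matches = intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 9, 10])

# Step 2: sort the survivors by a user criterion (here a made-up weight).
weights = {3: 0.2, 5: 0.9, 9: 0.5}
ranked = sorted(matches, key=lambda doc_id: weights[doc_id], reverse=True)
```

Because the intersection only ever moves forward through both lists, it fits a reader that hands over the data chunk by chunk.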

And this is my final doubt: is it always necessary to read the corresponding IF indexes from beginning to end when doing the intersection?

Having Google in mind spurs this question. I mean, when searching Google for 2 common words like "Internet" and "WWW" we obtain these results:

* 2,110 million results for "Internet".
* 9,310 million results for "WWW".

When intersecting them, Sphinx would have to read and intersect 2 billion hits (docId + weight + probably other ranks) against 9 billion. Right?

Isn't that too slow? Is the time spent doing it log(N)? Is it log(2 billion), log(9 billion), log(2 billion x 9 billion), or something else?
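For reference on the cost question: a plain two-pointer merge of two sorted lists of lengths m and n takes on the order of m + n steps, so it is linear rather than logarithmic (roughly 11 billion steps for the figures above). A logarithm only appears if the longer list can be skipped through instead of read in full. The sketch below shows that variant; it is an assumption-laden illustration, not a statement about what Sphinx does, and it presumes the longer list supports random access (for example through a skip index), which a purely sequential chunked read does not give you. The name `intersect_with_skips` is hypothetical.

```python
from bisect import bisect_left


def intersect_with_skips(short, long_):
    """Intersect two ascending docId lists by binary-searching the longer one.

    Cost is about len(short) * log(len(long_)) comparisons.
    """
    out = []
    lo = 0  # never search behind the previous position
    for doc_id in short:
        lo = bisect_left(long_, doc_id, lo)
        if lo == len(long_):
            break  # the longer list is exhausted
        if long_[lo] == doc_id:
            out.append(doc_id)
    return out


common = intersect_with_skips([3, 5, 9, 11], [1, 3, 4, 5, 6, 9, 10])
```

Skipping pays off mainly when one list is much shorter than the other; for two lists of similar size the linear merge does comparable work with simpler, sequential reads.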

Thanks again!
