Monday, October 29, 2007

Andrew Aksyonoff has replied to me again. Thanks!

> Selecting 1 million random rows from disk-based database of 100
> million rows will be slow as hell, too.

In my limited experience, reading scattered rows from disk takes about 0.01 seconds per record, while reading from RAM takes about 0.00001 seconds, that is, 1000 times less than from disk (without any cache).

So selecting 1 million rows would take 10,000 seconds from disk and 10 seconds from RAM. Do you agree with that?
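The arithmetic above can be written out as a quick sanity check (the per-row latencies are my own assumed figures, not measurements of any particular system):

```python
# Back-of-the-envelope estimate: cost of fetching 1 million scattered rows.
# Latencies below are assumptions, not benchmarks.
ROWS = 1_000_000
DISK_SEEK_S = 0.01       # ~10 ms per scattered row read from disk
RAM_ACCESS_S = 0.00001   # ~10 us per row read from RAM

disk_total = ROWS * DISK_SEEK_S   # 10,000 seconds
ram_total = ROWS * RAM_ACCESS_S   # 10 seconds

print(f"disk: {disk_total:,.0f} s")
print(f"ram:  {ram_total:,.0f} s")
```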

I went to boardreader.com and searched for the keyword "internet", and I got these stats: "Results 1 - 10 of 1,367,788 for internet (0.385 seconds)".

boardreader.com is powered by Sphinx, and it is selecting more than 1 million rows in less than a second. According to my numbers above, that should take around 13 seconds.

Maybe boardreader.com is using some kind of parallel processing, or their RAM is faster than standard.

These numbers have me a bit confused...





> Eg. Sphinx takes mutual keyword positions into account and ranks
> closer phrase matches higher. Therefore ordering per-keyword document
> list by anything but document ID will in fact slow down processing
> a lot, because intersecting unsorted lists will be much slower.

I think you are giving me very valuable information here about the Sphinx internals, and I really appreciate it.

You will agree that understanding the internals of a powerful search engine like Sphinx is not easy. That is why I am trying to simplify the situation as much as possible. You talk about intersecting indexes, but I guess this is only needed for phrase searches. I am also very interested in phrase searches, but maybe it would be easier to understand single-keyword searches first.

I suppose that in a single-keyword search only one index needs to be consulted, and that index is ranked or ordered by some criterion. If that order matches the order of the results page, then there is no need to process all the matching rows to find the first 10 that will be shown on the page: the 10 best-ranked pointers in the index lead directly to the 10 data rows to extract from the database. I assume that extracting 10 scattered data rows is a very fast operation even if the database is huge or uncached.

So the question is, which criterion or order is Sphinx using in a given keyword index? What is that ranking function? Is it just a count (or its logarithm) of the number of occurrences of the keyword in the document? Is it that simple?
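To make my question concrete, here is a sketch of the scheme I have in mind. This is purely my guess at how such an index could work, not a description of Sphinx's actual internals; the structure and names are hypothetical:

```python
import math

# Hypothetical single-keyword index: for each keyword, a posting list of
# (doc_id, occurrence_count) pairs kept pre-sorted by descending count.
# This is my own guess at a scheme, NOT Sphinx's actual format.
index = {
    "internet": [(42, 17), (7, 9), (19, 9), (3, 2), (88, 1)],
}

def top_k(keyword, k=10):
    """Return the k best documents without scanning the whole list."""
    postings = index.get(keyword, [])
    # Rank could be the raw count or its logarithm; since log is
    # monotonic, both give the same ordering.
    return [(doc, math.log(1 + count)) for doc, count in postings[:k]]

print(top_k("internet", 3))
```

With the list pre-sorted by rank, answering a single-keyword query is just slicing the first k entries and then fetching those k rows.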




> No. The key tricks are 1) to read and process a lot of data in linear
> manner, friendly with modern HDDs and CPUs; and 2) to distribute
> searching across many nodes, and search on all of them in parallel.

I still don't understand how to order the indexes and the data so as to do linear reads... I don't see any way to order the data that lets us read it linearly for every possible search. If you search for "internet", the data associated with this keyword might be contiguous, but if you then search for "Russia", the corresponding data rows couldn't also be contiguous, because "technology order" is different from and incompatible with "country order".
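Here is how I understand your earlier remark about document-ID order (my own sketch, not Sphinx code): if every keyword's posting list is stored sorted by document ID, then intersecting two lists is a single linear merge pass over both, with no random jumps, which would be friendly to disks and CPUs:

```python
def intersect_sorted(a, b):
    """Linear one-pass intersection of two posting lists that are
    both sorted by document ID."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical posting lists (document IDs), sorted ascending:
internet = [2, 5, 9, 14, 21, 30]
russia   = [5, 9, 17, 30, 44]
print(intersect_sorted(internet, russia))  # [5, 9, 30]
```

This only reads each list once from front to back, which I suppose is what you mean by processing data in a linear manner; it relies on both lists sharing the same sort key (the document ID), so no per-keyword "topical" ordering is needed.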




> I've seen a paper describing a similar idea several years ago. But it
> would only work with statistical approaches, not for phrase matches,
> so I did not bother even to memorize the title :)

What is a statistical approach? Are you referring to fuzzy logic or something related?



I don't want to bother you with my questions; I guess you have a lot of work developing Sphinx. Maybe there is some document on the Internet where all these concepts are explained. I have spent many hours on Google looking for one, without success. If you know of such a valuable source, please tell me.

Thanks a lot again, Andrew.
