Friday, October 26, 2007

Andrew Aksyonoff gave me a hint

Andrew Aksyonoff is the main developer of Sphinx. I asked him about the reason for MySQL being too much slow that Sphinx: "I would like you to give me a small hint about the Sphinx index architecture. I have the thesis that MySQL is slow because its full-text indexes are not weight-ordered."

He has told me that MySQL suffers from internal fragmentation rather than wrong ordering: It needs to perform too much semi-random I/O task on huge requests.

Thanks, Andrew. As much as I understand (not so far) the index architecture of a database engine, the query leads to the indexes and these lead to the data. If the data to recover is spread, the result will take long. And that seems to be what happens in MySQL.

Indeed, I have observed that MySQL satisfies roughly the rule of "0.01 sec per result found" no matter how big or small the database is. Precisely 0.01 is a coarse measure of the time usually taken to do a I/O random read to a disk. If the data to read wouldn't be so spread or would be stored in cache, in only 0.01 seconds MySQL could retrieve a lot of registers (maybe 100 or more, instead of just one).

Therefore, my benchmarks satisfy Andrew's theory. Now let's observe a Google typical search result: Suppose we search the keyword "MySQL": 10 URLs in screen among 163 million matches. It took 0.06 seconds to deliver it.

If MySQL would have made this search operation it would have taken probably 163.000.000 * 0.01 seconds, and that is 18 days. Too much time for a query. No doubt. Meanwhile Google or Sphinx can resolve the query in less than 1 second. Where is the miracle?

This is my point: If the index could tell us directly which are the 10 first and most significant results for this search, we could avoid 163 million I/O operations. Only 10 I/O operations would have been necessary. And that is fast, just 0.10 seconds according to the rule above.

MySQL needs to perform this 163 million I/O operations to find out the grade of matching of each register found against the search keyword. Google and Sphinx do not need it, probably because their indexes contain that matching weight information inside, someway.

Hence the key factor to develop a fast search engine is an index with its pointers ordered by matching weight decreasingly. That is what I call a "weighted index". Maybe this nomenclature is stupid, I don't know. I am sorry in such a case.

No comments: