Wednesday, October 24, 2007

Caching only data file

MYD is the associated signature to MySQL MyISAM data files. We are going to cache only this file and measure query timings:

# reboot

# free
total used free shared buffers cached
Mem: 255504 50216 205288 0 6020 26988
-/+ buffers/cache: 17208 238296
Swap: 0 0 0

# cat /var/lib/mysql/full_text_investigations/*.MYD > /dev/null

# free
total used free shared buffers cached
Mem: 255504 152156 103348 0 6148 125504
-/+ buffers/cache: 20504 235000
Swap: 0 0 0

Filesystem cache has read MyISAM data (MYD) file and has increased in 100 Mb size, meanwhile data file is 97 Mb in size. Roughly enough good matching.

65 rows in set (0.19 sec)
165 rows in set (0.09 sec)
256 rows in set (0.13 sec)
27 rows in set (0.05 sec)
5 rows in set (0.03 sec)
Empty set (0.00 sec)
28 rows in set (0.04 sec)
236 rows in set (0.09 sec)
30 rows in set (0.05 sec)
75 rows in set (0.07 sec)

Much better now. This timings are good, fast responses. Therefore it seems that caching data file is essential to obtain fast replies meanwhile caching index file is not necessary. Interesting conclusions ...

We deserve an explanation, to comprehend why data must be cached meanwhile index is unnecessary. I am afraid first of all is compulsory to understand the internal mechanism of the query development in the relational database.

Here is my theory: When a MySQL full-text query asks for registers it is giving only one clue to find them: the keyword. MySQL is going to search in the index file for that keyword. It is not needed to read the full index to find one keyword. Index is ordered in such a way that finding a keyword is really fast. Imagine it is ordered alphabetically. It is easy to guess where is going to be located the keyword index with certain margin of error. So it is no needed to read the full index file, only just the chunk where it is expected to find it. Therefore finding one keyword in index file may take one random access read I/O operation in the filesystem. You may do some benchmarks and find out that a typical I/O not cached read operation use to take around 0.01 seconds (supposing we are reading reasonable chunk sizes as for example 100 Kb)

The second part to resolve the query is reading data records from the data MYD file according to the index values MySQL just have found. And that is the most I/O resource consuming part. Let explore it in depth: We choose the following case:

256 rows in set (0.13 sec)


The index value is pointing to 256 records that contain the keyword asked. Now, remember the query:

SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');

MySQL needs to find out not only the id's of the matched records but also the title and body contents. This information is kept in the MYD file. The good news are that only 1 more file MySQL needs to read to answer the query. The bad news are that this file may be very big (97 Mb in our case) and the information we need is not found sequentially, but in a spread way. MySQL have its MYD file ordered by the id field of the table to accelerate as much as possible the extraction of certain register. We cannot read in a block all the registers needed. Much on the contrary it is going to be needed to make one read I/O access per each register. So, if MySQL has to extract 256 records and each I/O read operating takes 0.01 seconds, we may expect to take 2.56 seconds to get all the requested records. And that is precisely the benchmark obtained in last experiment. By contrast, current experiment resolves the query in just only 0.13 seconds. This can be explained just remembering that we have first cached the full MYD file into RAM. So I/O operations are not needed to get the 256 registers. RAM read operations are much faster than I/O disk read process. Pretty much faster, about 3 magnitude orders (1000-fold faster).

Main conclusion therefore: "Querying data fields considerably slows down full-text searches when unable to cache MYD file".

Then what about doing only index full-text queries? It should be much faster. Well, that is precisely next experiment.

No comments: