Saturday, October 27, 2007

RAM is 1000 times faster than disk

It has taken me 4 days to convert the English Wikipedia content into an SQL dump file. The original XML file occupies 13 GB uncompressed. Once filtered and converted to SQL, it occupies 4.3 GB:

-rw-r--r-- 1 root root 13G Oct 23 06:27 enwiki-20071018-pages-articles.xml
-rw-r--r-- 1 root root 4.3G Oct 27 01:18 enwiki-20071018-pages-articles.sql
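
Roughly, the dump is just a table definition followed by bulk INSERTs, something along these lines (the column names and types beyond title and body are only a sketch; MyISAM is what produces the .MYD/.MYI files shown below):

-- no ENGINE clause: MyISAM is the default and matches the .MYD/.MYI files
CREATE TABLE articles (
  id INT UNSIGNED NOT NULL PRIMARY KEY,   -- surrogate key, assumed
  title VARCHAR(255) NOT NULL,            -- article title
  body MEDIUMTEXT NOT NULL                -- filtered wiki text
);

INSERT INTO articles (id, title, body) VALUES
  (1, 'Some title', 'Wiki text of the article ...'),
  (2, 'Another title', 'More wiki text ...');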


Having enough RAM, inserting the SQL dump into MySQL doesn't take too long: around 5 minutes. Creating the full-text index takes much longer: around 2 hours.
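
Something like the following does the job; the database name comes from the datadir path shown below, and the exact FULLTEXT definition is a sketch based on the (title, body) columns used in the queries later on:

# mysql full_text_investigations < enwiki-20071018-pages-articles.sql

mysql> ALTER TABLE articles ADD FULLTEXT (title, body);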

# ls -laSh /var/lib/mysql/full_text_investigations/
total 4.1G
-rw-rw---- 1 mysql mysql 4.1G Oct 27 18:16 articles.MYD
-rw-rw---- 1 mysql mysql 42M Oct 27 18:16 articles.MYI

These are the file sizes before the index creation; the figures below are from afterwards:

# ls -laSh /var/lib/mysql/full_text_investigations/
total 6.4G
-rw-rw---- 1 mysql mysql 4.1G Oct 27 18:31 articles.MYD
-rw-rw---- 1 mysql mysql 2.3G Oct 27 19:07 articles.MYI

We can verify once more the rule "Index size = Data size = Dump size". Here the index size is a bit smaller than the other two, but not by much.
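
By the way, you don't need filesystem access to see these figures; for a MyISAM table, the Data_length and Index_length columns of SHOW TABLE STATUS should match the .MYD and .MYI sizes:

mysql> SHOW TABLE STATUS LIKE 'articles'\G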

So we have created a database with 4 million records:

mysql> SELECT COUNT(*) FROM articles;
+----------+
| COUNT(*) |
+----------+
|  4203633 |
+----------+

Doing some maths, we find that each record is about 1 KB long on average (4.1 GB of data divided by 4.2 million rows).
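
The same figure can be cross-checked from inside MySQL, assuming title and body are the only sizeable columns (this is a full table scan, so it takes a while):

mysql> SELECT AVG(LENGTH(title) + LENGTH(body)) AS avg_record_bytes FROM articles;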

This test has been done on a server with 16 GB of RAM, so the MYD and MYI files are cached in RAM. Here are some benchmarks:

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Einstein');
+----------+
| COUNT(*) |
+----------+
|     2744 |
+----------+
1 row in set (0.02 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Linux');
+----------+
| COUNT(*) |
+----------+
|     5907 |
+----------+
1 row in set (0.05 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('America');
+----------+
| COUNT(*) |
+----------+
|   107015 |
+----------+
1 row in set (0.88 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('world');
+----------+
| COUNT(*) |
+----------+
|   324369 |
+----------+
1 row in set (2.40 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Google');
+----------+
| COUNT(*) |
+----------+
|    30881 |
+----------+
1 row in set (0.22 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('history');
+----------+
| COUNT(*) |
+----------+
|   182735 |
+----------+
1 row in set (1.47 sec)

There is a clear correlation between the number of records found and the time consumed: for example, 324,369 matches in 2.40 seconds is roughly 135,000 records per second.

MYD cached => 100,000 records per second
MYD NOT cached => 100 records per second

So, RAM is 1000 times faster than disk.
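
If you want to reproduce this, a rough way to check that the table files really fit in the page cache is to compare their total size (6.4 GB here) against the cached column reported by free:

# free -m

It is only a rough indication, but with 16 GB of RAM and 6.4 GB of table files there is plenty of room.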

1 comment:

Anonymous said...

Hi,

I am very interested in your findings and thoughts, in general.

I have a question: in http://mysql-full-text.blogspot.com/2007/10/first-conclusion-good-performance-under.html you say that you use MySQL server 3.28. Are you still using that for the tests in this post too?

What worries me is that the index file is almost half as big as the data file. Did you actually expect that? I would hope that after removing stopwords etc. the index file would be much, much smaller.