Monday, July 2, 2012

Back again

Back again. Sorry but I had lost my access to Blogger. Now it seems to work again as expected.

In the meantime, I could accomplish a great enterprise, my own search engine. Have a look at

Here it is a video explaining how it works: Search Engine & Crawlers

Monday, February 18, 2008

Improved tag clouds

Tag clouds are 2-dimensional keyword mappings. It is a concept similar to my goal of "Mapping semantic correlations".

The search results of are a good example of 2D tag clouds. Also Flickr and Technoraty engines use it.

At we find at figure 2 an improved tag cloud. This paper from Hassan-Montero, Y. and Herrero-Solana, V. titled "Improving Tag-Clouds as Visual Information Retrieval Interfaces" explains some techniques to define an order in tag clouds.

A "semantically improved tag cloud" is a simple way to express semantic relations among terms. In them, distance means semantic similarity.

Friday, February 15, 2008

Mapping semantic correlations

First of all I have to thank Shlomo Swidler comment in this blog. It seems he has found a way (a hack) to avoid MySQL read the MYD file when not needed. Nevertheless it would be even much interesting to find out why MySQL always reads this file when there must be a lot of cases in which the search can be accomplished just using MYI file. Precisely these searches are those to be more time consuming and the use of indexes would optimize them a lot.

Much time have past since my last post. I have been thinking about several subjects, but overall on Latent Semantic Indexes. One of my goals has been dismissed: I wanted to patch MySQL to accomplish fast full text searches on huge databases. But now Sun is bidding on MySQL. That changes things severily. I think MySQL code openess may be in danger.

On the other hand, Andrew Aksyonoff (with his Sphinx search engine) has possibly found the best solution ever to this problem. If MySQL/Sun wants it to acquire or not is an open question for the future and the fate.

So, from now on, I would like to center mainly on one data-mining and search engine new trend: "mapping semantic correlations".

LSI, Latent Semantic Indexing seems to be a cutting edge method in the Artificial Intlligence (AI) branch of Natural Language Processing (NLP). The main goal obtained through LSI is get some correlations among terms in a given document. Similar terms, semantically speaking, got correllated once LSI is applied. This improves search engine matching results.

Nevertheless I think, LSI and similar techniques may give us more interesting results than just a better matching. I am focusing on "Mapping semantic correlations", that is to draw them. If we get concepts and relations drawn in a plane (2D-space) it will be very easy to develop a search engine. No more indexes. Just a low dimensional spatial search. Fast and simple.

Another advantage of this kimera would be to represent the knowledge at a very tough level (sintaxis and grammar would be ignored in this process). A good mapping would place similar terms near from each other and dissimilar ones far among them.

Thursday, November 8, 2007

Yann Neuhaus and MyISAM index pointers

Yann Neuhaus has replied one of my help requesting emails:

My understanding is that you have 2 main storage strategies for MyISAM :

Variable length : tables with at least one varchar-like type (variable). Then in the MYI file we store the Offset of the row in the .MYD file

Fixed length : (only char or fixed length data types in the table) Then in the MYI file we store the row number which gives multiplied by the row length the Offset of the row in the MYD file.

Thanks for this information. I didn't know about those two kinds of pointers. Now I have a more exact idea. Thanks.

I do not really understand why multiplying the row number with the row length is faster (or better) than storing the offset.

I guess it is more economic in space to store a counter (fixed length row number) than an offset (variable length tables), because the offset is always a bigger number. Much bigger in some cases, where the record length is big. To store a big number requires more bits than a small number, so the index is smaller when using the counter and the general resource usage (buffers, I/O operations) is lower in this case.

It is also important to consider that we also have the compressed storage.

Thanks, Yann. But is the index compressed, the data or both of them?

Nevertheless, I go on my investigations, and unfortunately I still haven't found the reason for MySQL to read the MYD file when just asking for the number of results in a simple query, in which it is supposed not to be needed any information stored in the MYD file.


Thursday, November 1, 2007

Vector Reduction

* The Vector Space Method (VSM) may be useful to understand and optimize indexing techniques for full-text databases.

* One trivial optimization is to convert all words to lower case. This way the index might be reduced nearly to the 50%. There is no reason to maintain two or more indexes for the same word with different cases, if we accept that the case mode is not reporting any additional meaning to the word.

* Another proven application is the synonyms reduction, also called Latent Semantic Indexing (LSI). All synonyms of 1 word are reduced to this word. In this way, index size is reduced as long as distinct words due to this equivalence relation (that defines a class). Through the perspective of Vector Space Method, this is equivalent to a reduction in the dimension of the Vector Space. Conceptually, to reduce synonyms to one parent word or class word is like adding an external information (the thesaurus) to establish a kind of linear dependence among the synonyms and afterwards, doing a projection in the major significant axis. The projection would remove the dimensions representing peculiarities that make differences among same class synonyms, and preserve the dimensions that made these words different from other terms out of the class.

* The approach to establish one dimension in the Vector Space for each distinct word in the document is good, but could be better, I think. Certain words are related and others are not. If we associate a dimension to a word, implicitly we are agreeing that every word in the index is linearly independent to each other. And that is not true. For example, the word "text" and the word "hypertext" are not independent. Actually "hypertext" is a mixture of "text" and the word "link", conceptually. So vector "text" plus vector "link" should sum vector "hypertext". We them have 3 words, but only 2 dimensions, because hypertext don't need a new dimension to be represented. "Text" and "link" are orthogonal words, but "hypertext" is a linear combination of them. That is the way, I think, of understanding the vector reduction.

* Every word related to one another should be expressed through any kind of linear dependence. Accomplish that work in a given language would be equivalent to find the smallest vector space that preserves the meaning of the words and at the same time avoids redundant information. If we could find such extraordinary vector base, we should find that all the remaining class words or base vectors are orthogonal, that is, not related in any way.

* Not being satisfied with that chimera, I have the impression that finding this orthogonal and universal base could be done through statistic methods instead of thesaurus application. That means, the sole in-depth studying of a corpus could determine which are the most suitable words in the vector base and which to be just linear combinations of them. The fact that certain words belongs to a given document is establishing topological relations among them. The very fact that staying close each other is telling us some kind of useful information, for example, that they belong to a given context. The Inverted File (IF) index is also considering these information when storing the document pointer (docId) and the offset for each word in order to calculate the relevance of a search. That proves this topological information is related to the meaning of the words: "a door to the semantic universe".

* This way we could extract the semantic relations among words and use it to establish best orthogonal base vectors.

* I have thought of a recurrent method to apply in order to get the best reduced vector base. I have to spend time on it ...

* Next goal is to establish a solid theoretical framework for these ideas, investigate if anyone out there has developed something similar, and carry out some computer experiments trying to obtain some kind of tangible results. This personal brainstorm is not more that a roadmap to steer my following investigations.

* I have heard about Singular Value Decomposition (SVD), and I think this is the formal name to the Vector reduction process I am trying to describe in here. According to Wikipedia:

the Singular Value Decomposition implementation can be done by 2 methods: Lanczos Methods and via Neural Networks. I think there exists a third method, a recurrent approach that could be rougher but quicker.

* When forcing the Vector reduction process to its limits, we could get a final vector space of not too many dimensions (10, for example). What is the meaning of these 10 dimensions? I expect they were the Canonical Topics that give best taxonomy to the corpus used. If the corpus is the WWW, we should get a final base vector corresponding probably or roughly to the main level-1 topics of the Yahoo directory. So, if that is true (I will have to get some positive experimental results before assure it is true ...) we could obtain an elegant scheme to do Text Categorization, including their taxonomy.

Wednesday, October 31, 2007

Reading IF in 256 Kb chunks

Andrew Aksyonoff has been explaining me the chunk-oriented read process when Sphinx require the Inverted Files (IF). And these are the doubts and comments I have replied:

> ... Sphinx will read it in small (256 KB) chunks ...

So, if the query is a 2 word phrase, Sphinx will have open 2 windows simultaneously, that is, 2 buffers of 256 Kb at the same time. Is that right?

If more words in the query, more simultaneous 256 Kb chunks, right?

And the major CPU consuming process are 2:
1.- Performing the chunks intersections.
2.- Sort the intersected docId's by the user criteria.

And this is my final doubt: Is always necessary to read from the beginning to the end the corresponding IF indexes when doing the intersection?

Google in mind spurs this question. I mean, when searching in Google 2 common words like "Internet" and "WWW" we obtain this results:

* 2.110 million results for "Internet".
* 9.310 million results for "WWW".

When intersecting them Sphix would have to read and intersect 2 billion (docId+weight+probably other ranks) hits against 9 billion. Right?

Isn't that too slow? The time spent in doing it, is log(N)? Or log(2 billion) or log(9 billion) or log (2 billion x 9 billion) or any other?

Thanks again!

First conclusions on Text Indexing

After reading several papers on Information Retrieving and Search Engines I conclude that:

* Sphinx engine only deals with indexes, not with data. So it delivers solely the indexes pointing to the documents that match certain criteria. Is up to the database to fetch this documents and up to the web application to show them on screen.

* There are two ways to index full-text: Inverted Files and Signature Files. First ones seem to be more efficient. Second ones work better when matching common words, but it is very uncommon to search only common words. So, for real life, better to apply Inverted Files.

* Inverted Files are lists of documents where a word appears, storing also the normalized weight to be able to do reliable matching.

* Signature Files are bitmaps that indicates if a document contains a word or not. Just this bitwise information, yes or not, extended for all the documents indexed.

* When a document hasn't got a certain keyword inside, Signature Files contain a bit value (1 or 0. In this case, 0), meanwhile Inverted Files never contain a reference to the document in such a case.

* Sphinx uses Inverted Files. On the contrary, TBGsearch uses Signature Files.

* I guess MySQL Full-text default engine is also using Inverted Files, but less efficiently by far than Sphinx.

* Inverted Files may be compressed heavily, because it is an ordered sequence of integers, usually homogeneous. This property saves I/O disk reading operations.

* In Sphinx, when doing two or more word searches, phrasal or just Boolean AND'ed, respective Inverted Files are read from disk, uncompressed and intersected in RAM, and then, order by search criteria (matching weight, date, or others).

Tuesday, October 30, 2007

Vector treatment - Computing term weights

Looking for more information about the term-document matrix I have found an interesting and well-explained web page:

Here, the author explains how to understand the vector space and the matrix meaning when treating the problem of assigning weights to terms in documents, in order to create a weighted index.

Three factors seem to be important to establish a well-thought-out weighting function:

1.- Local weight or Term Frequency (TF).

2.- Global Weight or Inverse Document Frequency (IDF): point 5 in last post.

3.- Normalization factor: in order to use the same scale among different term weights.

Third point (I guess) is merely decorative but not functional, because the absolute weight value has nothing to do with the order of the results in the search. The order is important; the absolute value is not. So, third point could be ignored and then we had only 2 factors: TF and IDF. This is precisely the so-called tf.idf formula.

The 3 factors (according to the URL above) may be integrated in what is called the SVD algorithm. SVD stands for Singular Value Decomposition, a linear algebra factorization, but I still ignore their relation with the 3 factors that define the term's weighting function. I will appreciate any hint about this ...

Full-text search nomenclature

I've been processing another reference document dealing with Full-text searching that gave me Andrew Aksyonoff: "Recommended Reading for IR Research Students" being Justin Zobel one of its editors and IR meaning "Information Retrieval".

This is a list of important papers about Information Retrieval, and Full-text searching is an important area of study in this field (IR).

Reading their abstracts I have found there are several and distinct ways to refer to one concept. I expose them here. Nomenclature:

* Scoring function: equal to Ranking function, or weight function?

* Term: equal to word, or keyword.

* Document: equal to data, row, record, message.

* Index =? weighted index, inverted index, reverse index, inverted list, inverted file, document list.

* Term-document matrix =? document-term frequency weights.

* Phrase =? compound terms, several words among quotes.

* Vector Space Model (VSM) = Term Vector Model

* Latent Semantic Analysis = Latent Semantic Indexing (LSI) = Vector reduction

Part of this terminology could probably be ordered in this (maybe candid) way:

1.- Words, terms or keywords belongs to a given document.

2.- A document is more or less relevant respect to a word according to the weight/rank/score function.

3.- One weight for a given word is just a vector component, an scalar value.

4.- The weights of all the words in a document defines one vector. This vector represents the document.

5.- All the documents in the database are vectors in a vector space. To compare them, they should firstly be normalized to avoid that longest documents always stay at first result positions.

This is supposed to be the so-called Vector Space Method (VSM), I guess. If so, I have to say that this is a bit strange vector space because the vector components never get a negative value. All the vectors/documents resides in the same multidimensional quadrant. Could be that a problem?

Monday, October 29, 2007

Building fast search engines and Zettair

I am following Andrew Aksyonoff guidelines to go on with the full-text investigations.
First of all I am reading carefully "Building fast search engines" by Hugh E. Williams at

It is a very interesting paper and has driven me to Zettair project ( ) by Justin Zobel and others at Melbourne University. This is a powerful indexer and search engine designed to work with big text repositories.

Hugh E. Williams also works at Melbourne University, and has settled very clearly the search engine basis in his document. I have learned several new concepts:

* Ranked query: Query formed by 2 or more words.

* Phrase search: Query formed by 2 or more words among quotes.

* Term Frequency (TF): The number of appearances of a given word in a document.

* Inverse Document Frequency (IDF): A word oddness, that is, if a word is more o less frequent in a text in general.

* Ranking function: The formula that measures the weight of a word in a document. It is useful to find out which document is closer to the a keyword from the search query.

* Okapi BM25: Very well crafted formula that gives a good and fair word weight measure.

* Inverted index: It is what I should call an "weighted index". I don't know why they call it "inverted". Inverted index shows the database under a "word point of view". That is where the documents containing that word are placed.

* Compressed indexes: A way lo reduce I/O operations. The same for the data (or rows).

Andrew Aksyonoff has replied me again. Thanks!

> Selecting 1 million random rows from disk-based database of 100
> million rows will be slow as hell, too.

According to my short experience, reading spread rows in disk takes 0.01 seconds per record, and from RAM takes 0.00001 seconds, that is 1000 times less than from disk (without cache).

So selecting 1 million rows would take 10.000 seconds from disk and 10 seconds from RAM. Do you agree with it?

I have entered at and looked for the keyword "internet". And I get this stats: "Results 1 - 10 of 1,367,788 for internet (0.385 seconds)".

Website is Sphinx powered, and is selecting more than 1 Million rows in less than a second. According to my rules above, it should take around 13 seconds long.

Maybe is using parallel processing of some kind or their RAM memory is faster than standard.

These numbers are getting me a bit confused ...

> Eg. Sphinx takes mutual keyword positions into account and ranks
> closer phrase matches higher. Therefore ordering per-keyword document
> list by anything but document ID will in fact slow down processing
> a lot, because intersecting unsorted lists will me much slower.

I think you are giving me here very valuable information about the Sphinx internals and I really appreciate it.

You will agree with me if I say that understanding the engine of a powerful search engine like Sphinx is not easy. This is why I am trying to simplify as much as possible the situation. You talk me about intersecting indexes, but I guess this is only needed when doing phrase searches. I am also very interested in phrase searches but maybe it would be easier to understand firstly single keyword searches.

I suppose that in a single keyword search only 1 index should be consulted, and that index is ranked or ordered under a given criteria. If that order is equal to the page results order, then there is no need to process all the matching rows to find the first 10 that will be showed in the page. So directly the first 10 better ranked pointers in the index should lead me to the 10 data rows to extract from the database. I assume that extracting 10 spread data rows is a very fast operation even though the database is huge or not cached.

So the question is, which criteria or order is Sphinx using in a given keyword index? What is that rank function? Is it just a counter (or its logarithm) of the number of appearances of the keyword in the document? Is that simple?

> No. The key tricks are 1) to read and process a lot of data in linear
> manner, friendly with modern HDDs and CPUs; and 2) to distribute
> searching across many nodes, and search on all of them in parallel.

I don't still understand how to order the indexes and the data so as to do linear readings ... I don't realize any way to order data that let us to read it linearly in any search. If you search "internet" the data associated to this keyword might be localized, but if you then search "Russia" the corresponding data rows couldn't be contiguous because "technology order" is different and incompatible to "country order".

> I've seen a paper describing a similar idea several years ago. But it
> would only work with statistical approaches, not for phrase matches,
> so I did not bother even to memorize the title :)

What is an statistical approach? Are you referring to fuzzy logic or related?

I wouldn't like to bother you with my questions. I guess you have much work developing Sphinx. Maybe there is some kind of document in Internet where all this concepts are explained. I have spent many hours in Google looking for it unsuccessfully. If you know about such a valuable source, please, tell me.

Thanks a lot again, Andrew.

Sunday, October 28, 2007

B-tree MyISAM index structure

Interesting references still pending to investigate are:

* (section "The .MYI Key Values ")


* "Pro MySQL" by Michael Kruckenberg, Jay Pipes, Chad Russell, page 160:

After the header section, the MyISAM index blocks compose the remainder of the .MYI file. The index blocks are 1KB on-disk pages of data, representing the B-tree leaf and non-leaf nodes. A single index block contains only key values pertaining to one index in the table. The header section (detailed in the previous section) contains information about how the MyISAM storage engine should find the root node of each index by supplying the offset for the root node’s index block in the keydef elements.

Each index block contains the following:

A single 2-byte block header. The first bit of the 16 bits in the header indicates whether the block is a leaf node (0 for leaf; 1 for non-leaf). The remaining 15 bits contain the total length of bytes used in the block (non free space).

Following the header, index keys and record identifiers are laid out in a balanced organization (the B-tree format). With each key is stored the key value (of a length equal to the data type of the indexed field) and a 4-byte record pointer.

The remainder of the index block is junk bytes (filler bytes), which become used as the B-tree index “fills out” with inserts. This is the “fill factor” for MyISAM B-tree index pages, and typically represents between 65% and 80% of the data used within the index block under normal operations, to allow for split-free growth along with the insertions.

MyISAM full-text indexed have weight

As exposed at Usenet I still haven't found the cause of MySQL reading MYD files when querying only index information, such as number of results, or id fields. I consider that next queries shouldn't force MySQL to extract information from the MYD file because the index MYI content should be enough:

SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');
SELECT id FROM articles WHERE MATCH (title,body) AGAINST ('keyword');

But the fact is that on several versions of MySQL I have tested, MYD files are being reading exhaustively when performing these queries. If anybody could give me a hint on this issue, I would appreciate it.

Studying the MyISAM index structure when doing full-text matching, I have found that Jeremy D. Zawodny and Derek J. Balling hold that full-text indexes are implemented with a two-part MyISAM B-tree index. First field is the VARCHAR that stores the keyword and second one is a FLOAT that measures his weight.

I am confused. If the index is weighted, why does MySQL reads MYD file to find out the MATCHING order in a query like this:

SELECT id, MATCH (title,body) AGAINST ('keyword') AS score FROM articles ORDER BY score DESC LIMIT 0,10;

These kind of queries are slow under MySQL because MYD files are consulted, and that is the paradox. Why to consult them if it isn't required? According to Jeremy D. Zawodny and Derek J. Balling the index know the keyword and its weight. Isn't it enough?

Saturday, October 27, 2007

RAM is 1000 times faster than disk

It has me taken 4 days to convert the english Wkipedia content into a SQL dump file. The original XML file occupy 13 Gb uncompressed. Once filtered and converted to SQL it occupies 4.3 Gb:

-rw-r--r-- 1 root root 13G Oct 23 06:27 enwiki-20071018-pages-articles.xml
-rw-r--r-- 1 root root 4.3G Oct 27 01:18 enwiki-20071018-pages-articles.sql

Having enough RAM insert the SQL dump into MySQL doesn't take too long: around 5 minutes. Creating the full-text index takes much more: around 2 hours.

# ls -laSh /var/lib/mysql/full_text_investigations/
total 4.1G
-rw-rw---- 1 ;mysql mysql 4.1G Oct 27 18:16 articles.MYD
-rw-rw---- 1 ;mysql mysql 42M Oct 27 18:16 articles.MYI

This is the file sizes before the index creation, and next figures are afterwards:

# ls -laSh /var/lib/mysql/full_text_investigations/
total 6.4G
-rw-rw---- 1 ;mysql mysql 4.1G Oct 27 18:31 articles.MYD
-rw-rw---- 1 ;mysql mysql 2.3G Oct 27 19:07 articles.MYI

We may verify once more the rule "Index size = Data size = Dump size". Here index size is a lower than the rest sizes, but not too far.

So we have created a database with 4 million records:

mysql> SELECT COUNT(*) FROM articles;
| COUNT(*) |
| 4203633 |

Doing some maths we find out that 1 record is 1 Kb long on average.

This test has been done under a 16 Gb RAM server, so the MYD and the MYI files are cached in RAM. Here are some benchmarks:

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Einstein');
| COUNT(*) |
| 2744 |
1 row in set (0.02 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Linux');
| COUNT(*) |
| 5907 |
1 row in set (0.05 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('America');
| COUNT(*) |
| 107015 |
1 row in set (0.88 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('world');
| COUNT(*) |
| 324369 |
1 row in set (2.40 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Google');
| COUNT(*) |
| 30881 |
1 row in set (0.22 sec)

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('history');
| COUNT(*) |
| 182735 |
1 row in set (1.47 sec)

There is a clear correlation among registers found and time consumed:

MYD cached => 100.000 records per second
MYD NOT cached => 100 records per second

So, RAM is 1000 times faster than disk.

Friday, October 26, 2007

Andrew Aksyonoff gave me a hint

Andrew Aksyonoff is the main developer of Sphinx. I asked him about the reason for MySQL being too much slow that Sphinx: "I would like you to give me a small hint about the Sphinx index architecture. I have the thesis that MySQL is slow because its full-text indexes are not weight-ordered."

He has told me that MySQL suffers from internal fragmentation rather than wrong ordering: It needs to perform too much semi-random I/O task on huge requests.

Thanks, Andrew. As much as I understand (not so far) the index architecture of a database engine, the query leads to the indexes and these lead to the data. If the data to recover is spread, the result will take long. And that seems to be what happens in MySQL.

Indeed, I have observed that MySQL satisfies roughly the rule of "0.01 sec per result found" no matter how big or small the database is. Precisely 0.01 is a coarse measure of the time usually taken to do a I/O random read to a disk. If the data to read wouldn't be so spread or would be stored in cache, in only 0.01 seconds MySQL could retrieve a lot of registers (maybe 100 or more, instead of just one).

Therefore, my benchmarks satisfy Andrew's theory. Now let's observe a Google typical search result: Suppose we search the keyword "MySQL": 10 URLs in screen among 163 million matches. It took 0.06 seconds to deliver it.

If MySQL would have made this search operation it would have taken probably 163.000.000 * 0.01 seconds, and that is 18 days. Too much time for a query. No doubt. Meanwhile Google or Sphinx can resolve the query in less than 1 second. Where is the miracle?

This is my point: If the index could tell us directly which are the 10 first and most significant results for this search, we could avoid 163 million I/O operations. Only 10 I/O operations would have been necessary. And that is fast, just 0.10 seconds according to the rule above.

MySQL needs to perform this 163 million I/O operations to find out the grade of matching of each register found against the search keyword. Google and Sphinx do not need it, probably because their indexes contain that matching weight information inside, someway.

Hence the key factor to develop a fast search engine is an index with its pointers ordered by matching weight decreasingly. That is what I call a "weighted index". Maybe this nomenclature is stupid, I don't know. I am sorry in such a case.

Discovery of Sphinx and others

I have written several people, forums and a Usenet group about the issue of "MySQL reading MYD when not necessary", but I haven't still gathered replies enough that gives me some light to solve this problem. So I will carry on and be patient.

Meanwhile, while trying to find out something on my own about this subject, I have discovered some important projects I didn't know till now. This projects change the focus of this text-full investigations.

Remember that the main goal in these investigations is to achieve a MySQL search engine performance acceptable when using huge databases. I thought MySQL could be tweaked or patched in some way that resolves search queries in 0.1 seconds instead of 1 minute or 10 minutes. These are real benchmarks when a table holds 1 million records and the server has nor enough RAM to cache the whole data file.

I still think it can be done if we could (I don't know how) create a weighted index and could constrain the query to be fulfilled by reading uniquely the index MYI file. This way, it would be necessary to read and process the index value for every keyword of the full-text query and read only the 10 (for example) records in data MYD file that best match the search. This reading I/O process could take 0.01 seconds for each record, so the query could be resolved in 0.12 seconds taking into account that the main bottleneck is I/O reading.

My recent discoveries are the existence of Sphinx project driven by Andrew Aksyonoff, and also other fast full-text engines like Mnogosearch, Lucene/DBSight, tbgSearch, Sienna, Zettair, Sphider, TSEP, PHPDig, iSearch, RiSearch, SiteSearch, Simple Web Search, IndexServer and Xapian.

There is a good review of them at:

Wednesday, October 24, 2007

Brian Wakem has also replied

Brian Wakem has replied:

Forcing a filesystem cache of the table is probably loading the index into
memory too. In the second case you are not, so the extra time is probably
spent reading the index from disk, not the data file.

and we have replied him as soon:

But I am not loading the full table in cache, just only the corresponding MYD file as follows:

# cat /var/lib/mysql/full_text_investigations/*.MYD > /dev/null

The increment in cache size obtained through the "free" command matches with the MYD file size.

Thence, we can be sure that the query needs to read the MYD file, in order to explain the penalty in the benchmarks when not caching the file.


Axel Schwenke helped us. Thanks!

Axel Schwenke, a Support Engineer at MySQL AB has replied my question posted at comp.databases.mysql in this way:

1. an index on a MyISAM table does not refer to the PK, but to the
physical address (or row number) of the row.

2. how do you *know* MySQL is reading from the MYD file?

3. use EXPLAIN to see how your query will be executed.
If it shows "using index" then no datafile reads will be done.

This is what I have replied to Axel Schwenke:

1.- What is the PK?

1b.- The query is a count of registers, so it doesn´t matter what kind of pointer is the index using, isn't it?

2.- I know MySQL is reading the MYD file because I reboot the server and then I force a reading of MYD file in order to be cached by the filesystem. Then I measure the query reply speed. Secondly I do the same process (including reboot) but not caching MYD file. The timings are extremely diffrent in each case. In the first case query responds in 0.1 seconds. In the second case query takes up to 5.0 seconds. So I conclude: MySQL is reading MYD file. Take into account that the MYD file is 100Mb in size.

3.- The query should use the full-text index. Look at the EXPLAIN result: (key = title)

mysql> EXPLAIN SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');
| table | type | possible_keys | key | key_len | ref | rows | Extra |
| articles | fulltext | title | title | 0 | | 1 | where used |
1 row in set (0.02 sec)

Thanks a lot for your help. Any further hint will also make me very thankful.

Index full-text queries are not fast

We are going to flush the filesystem cache by rebooting the server and ask for a series of index only full-text queries. We suppose they have to be very fast (around 0.01 seconds each one) because it won't be necessary that MySQL reads MYD file. It will read only certain chunks of the MYI file. Now the queries are this kind:

SELECT id FROM articles WHERE MATCH (title,body) AGAINST ('keyword');

It is evident we are not asking for any information stored in the MYD file. Just the id that is supposed to be found in the MYI file.

65 rows in set (1.68 sec)
165 rows in set (3.08 sec)
256 rows in set (3.57 sec)
27 rows in set (0.57 sec)
5 rows in set (0.12 sec)
Empty set (0.00 sec)
28 rows in set (0.54 sec)
236 rows in set (3.36 sec)
30 rows in set (0.57 sec)
75 rows in set (1.32 sec)

Oh, oh! What is happening here? This benchmarks make a nonsense of our theory. I have no explanation. Maybe index is not actually stored in MYI file, but another kind of pointer. Let's do another test. Lets query only the sum of registers. This way id won't be asked, just the number of index references found in the MYI file:

SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');

And the results are:

1 row in set (3.09 sec)
1 row in set (5.72 sec)
1 row in set (4.04 sec)
1 row in set (0.55 sec)
1 row in set (0.13 sec)
1 row in set (0.00 sec)
1 row in set (0.62 sec)
1 row in set (4.18 sec)
1 row in set (0.58 sec)
1 row in set (1.33 sec)

Bad results again. So the problem is that the real internals in this MySQL queries are different from what we thought. I am lost. Any ideas?

Caching only data file

MYD is the associated signature to MySQL MyISAM data files. We are going to cache only this file and measure query timings:

# reboot

# free
total used free shared buffers cached
Mem: 255504 50216 205288 0 6020 26988
-/+ buffers/cache: 17208 238296
Swap: 0 0 0

# cat /var/lib/mysql/full_text_investigations/*.MYD > /dev/null

# free
total used free shared buffers cached
Mem: 255504 152156 103348 0 6148 125504
-/+ buffers/cache: 20504 235000
Swap: 0 0 0

Filesystem cache has read MyISAM data (MYD) file and has increased in 100 Mb size, meanwhile data file is 97 Mb in size. Roughly enough good matching.

65 rows in set (0.19 sec)
165 rows in set (0.09 sec)
256 rows in set (0.13 sec)
27 rows in set (0.05 sec)
5 rows in set (0.03 sec)
Empty set (0.00 sec)
28 rows in set (0.04 sec)
236 rows in set (0.09 sec)
30 rows in set (0.05 sec)
75 rows in set (0.07 sec)

Much better now. This timings are good, fast responses. Therefore it seems that caching data file is essential to obtain fast replies meanwhile caching index file is not necessary. Interesting conclusions ...

We deserve an explanation, to comprehend why data must be cached meanwhile index is unnecessary. I am afraid first of all is compulsory to understand the internal mechanism of the query development in the relational database.

Here is my theory: When a MySQL full-text query asks for registers it is giving only one clue to find them: the keyword. MySQL is going to search in the index file for that keyword. It is not needed to read the full index to find one keyword. Index is ordered in such a way that finding a keyword is really fast. Imagine it is ordered alphabetically. It is easy to guess where is going to be located the keyword index with certain margin of error. So it is no needed to read the full index file, only just the chunk where it is expected to find it. Therefore finding one keyword in index file may take one random access read I/O operation in the filesystem. You may do some benchmarks and find out that a typical I/O not cached read operation use to take around 0.01 seconds (supposing we are reading reasonable chunk sizes as for example 100 Kb)

The second part to resolve the query is reading data records from the data MYD file according to the index values MySQL just have found. And that is the most I/O resource consuming part. Let explore it in depth: We choose the following case:

256 rows in set (0.13 sec)

The index value is pointing to 256 records that contain the keyword asked. Now, remember the query:

SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');

MySQL needs to find out not only the id's of the matched records but also the title and body contents. This information is kept in the MYD file. The good news are that only 1 more file MySQL needs to read to answer the query. The bad news are that this file may be very big (97 Mb in our case) and the information we need is not found sequentially, but in a spread way. MySQL have its MYD file ordered by the id field of the table to accelerate as much as possible the extraction of certain register. We cannot read in a block all the registers needed. Much on the contrary it is going to be needed to make one read I/O access per each register. So, if MySQL has to extract 256 records and each I/O read operating takes 0.01 seconds, we may expect to take 2.56 seconds to get all the requested records. And that is precisely the benchmark obtained in last experiment. By contrast, current experiment resolves the query in just only 0.13 seconds. This can be explained just remembering that we have first cached the full MYD file into RAM. So I/O operations are not needed to get the 256 registers. RAM read operations are much faster than I/O disk read process. Pretty much faster, about 3 magnitude orders (1000-fold faster).

Main conclusion therefore: "Querying data fields considerably slows down full-text searches when unable to cache MYD file".

Then what about doing only index full-text queries? It should be much faster. Well, that is precisely next experiment.

Caching only index file

MYI extension is the signature to refer to MySQL MyISAM index file. We are going to cache only this index file and test the system performance.

# reboot

# free

total used free shared buffers cached
Mem: 255504 50212 205292 0 6040 26976
-/+ buffers/cache: 17196 238308
Swap: 0 0 0

# cat /var/lib/mysql/full_text_investigations/*.MYI > /dev/null

# free
total used free shared buffers cached
Mem: 255504 135852 119652 0 6176 109716
-/+ buffers/cache: 19960 235544
Swap: 0 0 0

Filesystem cache has read MYI file and has grown from 27 Mb to 110 Mb in size. Remember that MYI file occupies 81 Mb. Size of the cache and the file cached match. That's good. Now we query:

65 rows in set (1.51 sec)
165 rows in set (2.99 sec)
256 rows in set (3.51 sec)
27 rows in set (0.51 sec)
5 rows in set (0.08 sec)
Empty set (0.00 sec)
28 rows in set (0.51 sec)
236 rows in set (3.37 sec)
30 rows in set (0.55 sec)
75 rows in set (1.25 sec)

Oh! What the hell! Slow responses! Very similar timing to the obtained when not caching at all. Is that meaning that caching the index file is useless? The answer is YES, at least in this kind of "full-text search engine" queries.

Index file seems not to be the key factor to improve performance. Let's do next the same experiment but with the data file.

Forcing whole database to be cached

We reboot again the Linux server. This way we garantee filesystem cache is void. The folder "/var/lib/mysql/full_text_investigations/" holds the database. We are going to force the system to keep in RAM the data and index files this way:

# free
total used free shared buffers cached
Mem: 255504 50340 205164 0 6104 27048
-/+ buffers/cache: 17188 238316
Swap: 0 0 0

# cat /var/lib/mysql/full_text_investigations/* > /dev/null

# free
total used free shared buffers cached
Mem: 255504 237900 17604 0 6352 208316
-/+ buffers/cache: 23232 232272
Swap: 0 0 0

Filesystem cache has increased from 27 Mb to 208 Mb. The database files are 178 Mb in size. So, we can be sure database is cached in RAM.

We measure timings for the 10 predefined queries. Remember we have just reboot the system and force a cache reading:

65 rows in set (0.04 sec)
165 rows in set (0.02 sec)
256 rows in set (0.01 sec)
27 rows in set (0.00 sec)
5 rows in set (0.00 sec)
Empty set (0.00 sec)
28 rows in set (0.00 sec)
236 rows in set (0.02 sec)
30 rows in set (0.00 sec)
75 rows in set (0.01 sec)

Instant responses here prove that cache filling is crucial to fast database replies.

But I wonder, is data file more crucial to be cached than index file? Next experiment will show us this essential knowledge.

Rebooting to avoid cache biasing in the benchmarks

When the server is booted all the filesystem caches are clean. Therefore any data that MySQL needs to read has to be taken out directly from the hard disk. This way we are able to simulate a RAM lacking situation; and this will be very common when working with huge databases.

In the following experiments our database holds 30.000 registers (15 million words). Here, the size measures:

# ls -laSh /var/lib/mysql/full_text_investigations/
total 178M
-rw-rw---- 1 mysql mysql 97M Oct 23 21:07 articles.MYD
-rw-rw---- 1 mysql mysql 81M Oct 23 21:07 articles.MYI
-rw-rw---- 1 mysql mysql 8.4K Oct 23 16:25 articles.frm

The SQL dump file occupies 96Mb. Again the rule "Index size = Data size = Dump size" works reliably. Regarding that our server only has 256 Mb RAM we can expect that most of it will be consumed in caching data and index files. Hence, MySQL is going to respond fast, but it will reach the maximum limit of data for this situation.

Now, we reboot the server and take some measures:

# ps axv | grep sql
1609 ? S 0:00 276 566 5365 1160 0.4 /bin/sh /usr/bin/safe_mysqld
1633 ? S 0:00 767 3094 27097 5140 2.0 /usr/libexec/mysqld

# free
total used free shared buffers cached
Mem: 255504 50424 205080 0 6152 27052
-/+ buffers/cache: 17220 238284
Swap: 0 0 0

65 rows in set (1.68 sec)
165 rows in set (2.95 sec)
256 rows in set (3.65 sec)
27 rows in set (0.63 sec)
5 rows in set (0.16 sec)
Empty set (0.00 sec)
28 rows in set (0.55 sec)
236 rows in set (3.34 sec)
30 rows in set (0.54 sec)
75 rows in set (1.38 sec)

ps axv | grep sql
1609 ? S 0:00 276 566 5365 1160 0.4 /bin/sh /usr/bin/safe_mysqld
1633 ? S 0:00 767 3094 27225 5600 2.1 /usr/libexec/mysqld

# free
total used free shared buffers cached
Mem: 255504 83804 171700 0 6624 58708
-/+ buffers/cache: 18472 237032
Swap: 0 0 0

The command "ps axv | grep sql" let us know how much memory is consuming MySQL for its own. We see that before querying as after doing it, MySQL expends just 27 Mb.

Nevertheless, the cache size has increased considerably jumping from 27Mb at first to 58 Mb after querying.

We may conclude the filesystem has not cached completely the whole database, but only some chunks of data, precisely those needed to respond the ten predefined queries.

What about performance? Watch the timings. When there is no cache, we confirm the rough rule of thumb "0.01 seconds to obtain 1 register".

Now we measure times once data is cached:

65 rows in set (0.01 sec)
165 rows in set (0.01 sec)
256 rows in set (0.01 sec)
27 rows in set (0.00 sec)
5 rows in set (0.00 sec)
Empty set (0.00 sec)
28 rows in set (0.01 sec)
236 rows in set (0.01 sec)
30 rows in set (0.00 sec)
75 rows in set (0.00 sec)

Obvious benchmarks are these. When data is cached, respond times are nearly zero.

In the next experiment we will manipulate cache filesystem to force whole database to be cached.

Tuesday, October 23, 2007

Filesystem cache is important

This time, I have completed the import of 10.000 records from the Wikipedia into MySQL. This means 7 million words to index. The table and database structure is defined as in the previous examples:

DROP DATABASE IF EXISTS full_text_investigations;
CREATE DATABASE full_text_investigations;
USE full_text_investigations;

Indexing process is long in time (several minutes for 10.000 records), moreover when making full-text indexes.

The SQL dump file that defines the 10.000 Wikipedia registers of this example is 45 Mb in size. When indexed, the MyISAM files created are these:

# ls -laSh /var/lib/mysql/full_text_investigations/
total 76M
-rw-rw---- 1 mysql mysql 42M Oct 23 16:51 articles.MYD
-rw-rw---- 1 mysql mysql 34M Oct 23 16:51 articles.MYI
-rw-rw---- 1 mysql mysql 8.4K Oct 23 16:25 articles.frm

We may again verify the recent discovered rule here: "Index size = Data size = Dump size"

It is not an exact rule, but approximately seems correct until now.

I have prepared 10 predefined queries to benchmark the system. All they are of this form:

SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');

The queries are using 10 distinct keywords. Some keywords provides more results than others. MySQL drops this execution times:

37 rows in set (0.92 sec)
82 rows in set (1.31 sec)
151 rows in set (1.74 sec)
9 rows in set (0.18 sec)
1 row in set (0.01 sec)
Empty set (0.00 sec)
14 rows in set (0.17 sec)
108 rows in set (1.28 sec)
10 rows in set (0.12 sec)
38 rows in set (0.53 sec)

If we correlate the number of rows with the time elapsed we find an important rule of thumb: "0.01 seconds to obtain 1 register". Remember this rule; it's important and universal.

Besides, we can see that it's rare to find timings greater than 1 second. It may seem acceptable a time of 1 second to resolve a search engine query, but it is NOT. We cannot hold loaded a server for 1 second just to answer only 1 query. A server of this kind wouldn't be able to support a realistic load of 10.000 visits/day, for example. Think about the peaks, not only the valleys, in the server access stats.

Well, this is precisely the goal of the present investigations. To find out the real performance of MySQL full-text as a textual search engine.

If we re-query one by one the 10 previous queries, we will find an important result:

37 rows in set (0.01 sec)
82 rows in set (0.01 sec)
151 rows in set (0.01 sec)
9 rows in set (0.01 sec)
1 row in set (0.01 sec)
Empty set (0.00 sec)
14 rows in set (0.01 sec)
108 rows in set (0.01 sec)
10 rows in set (0.01 sec)
38 rows in set (0.01 sec)

0.01 seconds replies for almost every query. That is really fast, and good in order to build a search engine. But it is only the biasing effect of doing exactly the same queries for second time. This is due to the filesystem cache. If we would have configure MySQL to use its own query cache, the result would have also be so good, but this time the responsible wouldn't have been the filesystem cache but the MySQL query cache.

Filesystem cache is the key to understand the behaviour of MySQL performance under extreme circumstances like these that we are trying to investigate. When a process in the server access to a file to read it, it is kept in RAM memory (if fits, of course). This way any further reading of this file would avoid to make a physical access to the disk drive, obtaining the data directly from RAM memory. The operative system assets that the file hasn't been change in the meanwhile, between reads when it uses the cache to improve the file reading speed.

We can observe that cache size has increased (138 Mb) after invoking MySQL queries:

# free
total used free shared buffers cached
Mem: 255504 174508 80996 0 5864 138916
-/+ buffers/cache: 29728 225776
Swap: 0 0 0

So the main conclusion is this one: "Filesystem cache improves sharply the MySQL performance". And "0.01 seconds to obtain 1 register" when cache is not playing its role due to not previously read (or properly updated in RAM) files or file chunks.

In next investigation we are indexing 30.000 records and measure file sizes and timings in query responses. We will find out what are the essential files to be cached by filesystem.

First conclusion: Good performance under small corpus

My first seriuos test for full-text searches is obteined when inserting 100 records in MySQL. Each record corresponds to a Wikipedia article. Then mean value in characters long for one of this articles is 5.000. So, the whole data dump would occupy 500 Kb.

On the other hand, if we measure the index size and the data size for the MySQL raw MyISAM files, these is the result:

# ls -laSh /var/lib/mysql/full_text_investigations/
total 992K
-rw-rw---- 1 mysql mysql 505K Oct 23 13:37 articles.MYD
-rw-rw---- 1 mysql mysql 454K Oct 23 13:37 articles.MYI
-rw-rw---- 1 mysql mysql 8.4K Oct 23 13:34 articles.frm

Here we have one first rough conclusion: "Index size = Data size = Dump size". 500 kb each one in this case.

I haven´t apply any tweacking to MySQL, so my.cnf is the default file, as follows:

# cat /etc/my.cnf



Here there is a brief system description for the server:

# free
total used free shared buffers cached
Mem: 255504 213016 42488 0 55620 81948
-/+ buffers/cache: 75448 180056
Swap: 0 0 0

That means, the server has got 256 Mb RAM and only 40 Mb remains idle.

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU T5500 @ 1.66GHz
stepping : 8
cpu MHz : 1662.696
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss pni ds_cpl
bogomips : 3329.22

The CPU is a Celeron Core 2 Duo at 1.6 Ghz. server version: 3.23.58

This is the MySQL server version.

# cat /etc/*release*
Fedora Core release 1 (Yarrow)

The Linux distrbution is a FC1.

# hdparm -tT /dev/sda1

Timing buffer-cache reads: 3968 MB in 2.00 seconds = 1984.00 MB/sec
Timing buffered disk reads: 62 MB in 3.00 seconds = 20.67 MB/sec

This is a quick way to test hard disk speed, for this SATA unit. It is not slow but it isn't very fast either. Today, some SATA drives reach 70 MB/sec.

Take attention to the point that having 256 Mb RAM, the whole database is cached in RAM (only 1 Mb for index+data).

The queries are extremely fast, as expected:

mysql> SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');
| id | title | LEFT(body, 64) |
| .. | ... | ... |
5 rows in set (0.00 sec)

Can´t be quickest.

So, the sencond big conclusion should be: "Good performance under small corpus".

Here, the corpus, as said, is 100 articles 5.000 characters length each one.

Next experiment: What if corpus is incremented till 10.000 articles?

Simplest case

Once the MySQL server is up and running, I test my first full-text example:

mysql> CREATE DATABASE full_text_investigations;
Query OK, 1 row affected (0.03 sec)

mysql> USE full_text_investigations;
Database changed

Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO articles (title,body) VALUES ('MySQL Tutorial','DBMS stands for DataBase ...'), ('How To Use MySQL Well','After you went through a ...'), ('Optimizing MySQL','In this tutorial we will show ...'), ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'), ('MySQL vs. YourSQL','In the following database comparison ...'), ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');
| id | title | body |
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
2 rows in set (0.03 sec)

mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');
| id | title | body |
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
2 rows in set (0.00 sec)

Take attention about the time lapsed in the first query and that of the second one (in cache).

Friday, October 19, 2007

First steps

I would like to show some examples with detailed information in order to be as much precise as possible.

I am installing a Fedora Core 1 (FC1) Linux server with its default MySQL server. I will try to import the big database and do some simple queries so as to know exactly the number of text documents included, the number of words, and other interesting stats.

First day - Introduction

MySQL is a powerful database engine. I had used it in several of my projects. It is fast and simple. That is what I need.

Nevertheless, while using MySQL in a big LAMP project I found slow queries (in a range from 30 seconds to 5 minutes long). And that is really weird if we are talking about MySQL. First explanation that comes to my head is that I have done something wrong. MySQL cannot be so slow.

I admit the situation is a bit uncommon. I have indexed 100.000 documents with 10.000 words each one. This means the index created to manage the full-text search is around 1 Gb in size.

I think I have discovered that the query response times are good (around 1 second) when the server RAM memory is greater than the index size. But when the index size is bigger that the RAM, the response times growth reaching benchmarks of 30 seconds and more per query.

I say again, that could be only a misconfiguration or a lack of knowledge in the MySQL internals for my part. I have been testing several queries and different versions of MySQL in diverse Linux servers. All these tests are leading me in one direction: index must be full cached in RAM in order to achieve fast query results.

I don't feel confortable in that situation. 30 seconds per query is unacceptable. What if I try to index bigger databases? I imagine the internal process of looking for a word in the index and then match it against the text that contains it; I don't understand why is it necessary the index to be cached.

I accept that random access reads to the disk are needed to find the matched document: one of them to find the index and another one to find the document. But two disk reads are not taking 30 seconds. That is sure.

So I need to understand this problem in-deph. I would like to know my error, or otherwise why MySQL Full-text is slow when the index is bigger than the amount of RAM available in the server.

Here, today I begin my investigation. If you have some idea about this problem, any hint that could help me to solve the issue, please contact me. Thanks in advance.