<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2531180625802634184</id><updated>2011-08-01T02:52:30.603-07:00</updated><category term='Google and Sphinx'/><category term='nomenclature'/><category term='LSI'/><category term='Term-document matrix'/><category term='documents'/><category term='Text Indexing'/><category term='Information Retrieval'/><category term='MySQL benchmarks'/><category term='sphinx'/><category term='weighted index'/><category term='Full-text searches'/><category term='unwanted MYD reads'/><category term='I/O reading'/><category term='IF'/><category term='tf.idf formula'/><category term='MySQL server'/><category term='disk'/><category term='Filesystem cache'/><category term='RAM'/><category term='query'/><category term='Melbourne University'/><category term='I/O operations'/><category term='reading index file'/><category term='CREATE DATABASE'/><category term='simple example'/><category term='random access I/O'/><category term='tag cloud'/><category term='MyISAM index structure'/><category term='Linux FC1'/><category term='MyISAM index'/><category term='MySQL indexes'/><category term='Okapi BM25'/><category term='Improved tag clouds'/><category term='Vector Space Method'/><category term='Celeron Core 2 Duo'/><category term='Hugh E. Williams'/><category term='Btree index'/><category term='cache measurements'/><category term='table cache'/><category term='MySQL internals'/><category term='Yann Neuhaus'/><category term='MySQL cache'/><category term='Natural Language Processing'/><category term='offset'/><category term='mapping semantic correlations'/><category term='Brian Wakem'/><category term='1000 times'/><category term='MyISAM index pointers'/><category term='caching table'/><category term='caching MYI index'/><category term='Linux reboot'/><category term='speed'/><category term='NLP'/><category term='term weight'/><category term='chunks'/><category term='Latent Semantic Indexes'/><category term='mysql'/><category term='1 million registers'/><category term='slow'/><category term='counter'/><category term='Zettair'/><category term='force cache'/><category term='Bottleneck'/><category term='memory'/><category term='instant responses'/><category term='caching database'/><category term='Justin Zobel'/><category term='VSM'/><category term='FC1'/><category term='Signature Files'/><category term='Derek J. Balling'/><category term='hello world'/><category term='Inverted Files'/><category term='words'/><category term='big database'/><category term='I/O is slowing down'/><category term='B-tree index'/><category term='Linux'/><category term='reading MYD file'/><category term='million'/><category term='stats'/><category term='Axel Schwenke'/><category term='full-text search engine'/><category term='256 Kb'/><category term='Latent Semantic Indexing'/><category term='Jeremy D. Zawodny'/><category term='SVD algorithm'/><title type='text'>MySQL Internals: Full-Text investigations</title><subtitle type='html'>Trying to understand the MySQL index reading internals, to achieve best benchmarks when querying full-text.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>27</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-4176373257204831352</id><published>2008-02-18T13:07:00.000-08:00</published><updated>2008-02-18T13:19:28.801-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Improved tag clouds'/><category scheme='http://www.blogger.com/atom/ns#' term='tag cloud'/><title type='text'>Improved tag clouds</title><content type='html'>Tag clouds are 2-dimensional keyword mappings. It is a concept similar to my goal of "Mapping semantic correlations".&lt;br /&gt;&lt;br /&gt;The search results of http://www.quintura.com/ are a good example of 2D tag clouds. Also Flickr and Technoraty engines use it.&lt;br /&gt;&lt;br /&gt;At http://www.nosolousabilidad.com/hassan/improving_tagclouds.pdf we find at figure 2 an improved tag cloud. This paper from Hassan-Montero, Y. and Herrero-Solana, V. titled "Improving Tag-Clouds as Visual Information Retrieval Interfaces" explains some techniques to define an order in tag clouds.&lt;br /&gt;&lt;br /&gt;A "semantically improved tag cloud" is a simple way to express semantic relations among terms. In them, distance means semantic similarity.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-4176373257204831352?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/4176373257204831352/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=4176373257204831352' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4176373257204831352'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4176373257204831352'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2008/02/improved-tag-clouds.html' title='Improved tag clouds'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-1798532506207253999</id><published>2008-02-15T10:53:00.000-08:00</published><updated>2008-02-15T11:29:39.960-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='LSI'/><category scheme='http://www.blogger.com/atom/ns#' term='Latent Semantic Indexes'/><category scheme='http://www.blogger.com/atom/ns#' term='Natural Language Processing'/><category scheme='http://www.blogger.com/atom/ns#' term='mapping semantic correlations'/><category scheme='http://www.blogger.com/atom/ns#' term='Latent Semantic Indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='NLP'/><title type='text'>Mapping semantic correlations</title><content type='html'>First of all I have to thank Shlomo Swidler comment in this blog. It seems he has found a way (a hack) to avoid MySQL read the MYD file when not needed. Nevertheless it would be even much interesting to find out why MySQL always reads this file when there must be a lot of cases in which the search can be accomplished just using MYI file. Precisely these searches are those to be more time consuming and the use of indexes would optimize them a lot.&lt;br /&gt;&lt;br /&gt;Much time have past since my last post. I have been thinking about several subjects, but overall on Latent Semantic Indexes. One of my goals has been dismissed: I wanted to patch MySQL to accomplish fast full text searches on huge databases. But now Sun is bidding on MySQL. That changes things severily. I think MySQL code openess may be in danger.&lt;br /&gt;&lt;br /&gt;On the other hand, Andrew Aksyonoff (with his Sphinx search engine) has possibly found the best solution ever to this problem. If MySQL/Sun wants it to acquire or not is an open question for the future and the fate.&lt;br /&gt;&lt;br /&gt;So, from now on, I would like to center mainly on one data-mining and search engine new trend: "mapping semantic correlations".&lt;br /&gt;&lt;br /&gt;LSI, Latent Semantic Indexing seems to be a cutting edge method in the Artificial Intlligence (AI) branch of Natural Language Processing (NLP). The main goal obtained through LSI is get some correlations among terms in a given document. Similar terms, semantically speaking, got correllated once LSI is applied. This improves search engine matching results.&lt;br /&gt;&lt;br /&gt;Nevertheless I think, LSI and similar techniques may give us more interesting results than just a better matching. I am focusing on "Mapping semantic correlations", that is to draw them. If we get concepts and relations drawn in a plane (2D-space) it will be very easy to develop a search engine. No more indexes. Just a low dimensional spatial search. Fast and simple.&lt;br /&gt;&lt;br /&gt;Another advantage of this kimera would be to represent the knowledge at a very tough level (sintaxis and grammar would be ignored in this process). A good mapping would place similar terms near from each other and dissimilar ones far among them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-1798532506207253999?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/1798532506207253999/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=1798532506207253999' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/1798532506207253999'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/1798532506207253999'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2008/02/mapping-semantic-correlations.html' title='Mapping semantic correlations'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-6687921281806332963</id><published>2007-11-08T15:44:00.000-08:00</published><updated>2007-11-08T15:47:17.372-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Yann Neuhaus'/><category scheme='http://www.blogger.com/atom/ns#' term='counter'/><category scheme='http://www.blogger.com/atom/ns#' term='offset'/><category scheme='http://www.blogger.com/atom/ns#' term='MyISAM index pointers'/><title type='text'>Yann Neuhaus and MyISAM index pointers</title><content type='html'>&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;Yann Neuhaus has replied one of my help requesting emails:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;My understanding is that you have 2 main storage strategies for MyISAM :&lt;br /&gt;&lt;br /&gt;Variable length : tables with at least one varchar-like type (variable). Then in the MYI file we store the Offset of the row in the .MYD file&lt;br /&gt;&lt;br /&gt;Fixed length : (only char or fixed length data types in the table) Then in the MYI file we store the row number which gives multiplied by the row length the Offset of the row in the MYD file.&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;Thanks for this information. I didn't know about those two kinds of pointers. Now I have a more exact idea. Thanks.&lt;br /&gt;&lt;br /&gt; &lt;/div&gt;&lt;br /&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt; I do not really understand why multiplying the row number with the row length is faster (or better) than storing the offset.&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;I guess it is more economic in space to store a counter (fixed length row number) than an offset (variable length tables), because the offset is always a bigger number. Much bigger in some cases, where the record length is big. To store a big number requires more bits than a small number, so the index is smaller when using the counter and the general resource usage (buffers, I/O operations) is lower in this case.&lt;br /&gt;&lt;br /&gt; &lt;/div&gt;&lt;br /&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;It is also important to consider that we also have the compressed storage.&lt;/blockquote&gt;&lt;br /&gt;Thanks, Yann. But is the index compressed, the data or both of them?&lt;br /&gt;&lt;br /&gt;Nevertheless, I go on my investigations, and unfortunately I still haven't found the reason for MySQL to read the MYD file when just asking for the number of results in a simple query, in which it is supposed not to be needed any information stored in the MYD file.&lt;br /&gt;&lt;br /&gt;Thanks!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-6687921281806332963?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/6687921281806332963/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=6687921281806332963' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/6687921281806332963'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/6687921281806332963'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/11/yann-neuhaus-and-myisam-index-pointers.html' title='Yann Neuhaus and MyISAM index pointers'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-5083513869671190493</id><published>2007-11-01T10:06:00.000-07:00</published><updated>2007-11-03T04:43:48.430-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Vector Space Method'/><category scheme='http://www.blogger.com/atom/ns#' term='LSI'/><category scheme='http://www.blogger.com/atom/ns#' term='VSM'/><category scheme='http://www.blogger.com/atom/ns#' term='Latent Semantic Indexing'/><title type='text'>Vector Reduction</title><content type='html'>* The Vector Space Method (VSM) may be useful to understand and optimize indexing techniques for full-text databases.&lt;br /&gt;&lt;br /&gt;* One trivial optimization is to convert all words to lower case. This way the index might be reduced nearly to the 50%. There is no reason to maintain two or more indexes for the same word with different cases, if we accept that the case mode is not reporting any additional meaning to the word.&lt;br /&gt;&lt;br /&gt;* Another proven application is the synonyms reduction, also called Latent Semantic Indexing (LSI). All synonyms of 1 word are reduced to this word. In this way, index size is reduced as long as distinct words due to this equivalence relation (that defines a class). Through the perspective of Vector Space Method, this is equivalent to a reduction in the dimension of the Vector Space. Conceptually, to reduce synonyms to one parent word or class word is like adding an external information (the thesaurus) to establish a kind of linear dependence among the synonyms and afterwards, doing a projection in the major significant axis. The projection would remove the dimensions representing peculiarities that make differences among same class synonyms, and preserve the dimensions that made these words different from other terms out of the class.&lt;br /&gt;&lt;br /&gt;* The approach to establish one dimension in the Vector Space for each distinct word in the document is good, but could be better, I think. Certain words are related and others are not. If we associate a dimension to a word, implicitly we are agreeing that every word in the index is linearly independent to each other. And that is not true. For example, the word "text" and the word "hypertext" are not independent. Actually "hypertext" is a mixture of "text" and the word "link", conceptually. So vector "text" plus vector "link" should sum vector "hypertext". We them have 3 words, but only 2 dimensions, because hypertext don't need a new dimension to be represented. "Text" and "link" are orthogonal words, but "hypertext" is a linear combination of them. That is the way, I think, of understanding the vector reduction.&lt;br /&gt;&lt;br /&gt;* Every word related to one another should be expressed through any kind of linear dependence. Accomplish that work in a given language would be equivalent to find the smallest vector space that preserves the meaning of the words and at the same time avoids redundant information. If we could find such extraordinary vector base, we should find that all the remaining class words or base vectors are orthogonal, that is, not related in any way.&lt;br /&gt;&lt;br /&gt;* Not being satisfied with that chimera, I have the impression that finding this orthogonal and universal base could be done through statistic methods instead of thesaurus application. That means, the sole in-depth studying of a corpus could determine which are the most suitable words in the vector base and which to be just linear combinations of them. The fact that certain words belongs to a given document is establishing topological relations among them. The very fact that staying close each other is telling us some kind of useful information, for example, that they belong to a given context. The Inverted File (IF) index is also considering these information when storing the document pointer (docId) and the offset for each word in order to calculate the relevance of a search. That proves this topological information is related to the meaning of the words: "a door to the semantic universe".&lt;br /&gt;&lt;br /&gt;* This way we could extract the semantic relations among words and use it to establish best orthogonal base vectors.&lt;br /&gt;&lt;br /&gt;* I have thought of a recurrent method to apply in order to get the best reduced vector base. I have to spend time on it ...&lt;br /&gt;&lt;br /&gt;* Next goal is to establish a solid theoretical framework for these ideas, investigate if anyone out there has developed something similar, and carry out some computer experiments trying to obtain some kind of tangible results. This personal brainstorm is not more that a roadmap to steer my following investigations.&lt;br /&gt;&lt;br /&gt;* I have heard about Singular Value Decomposition (SVD), and I think this is the formal name to the Vector reduction process I am trying to describe in here. According to Wikipedia:&lt;br /&gt;&lt;br /&gt;http://en.wikipedia.org/wiki/Latent_semantic_indexing&lt;br /&gt;&lt;br /&gt;the Singular Value Decomposition implementation can be done by 2 methods: Lanczos Methods and via Neural Networks. I think there exists a third method, a recurrent approach that could be rougher but quicker.&lt;br /&gt;&lt;br /&gt;* When forcing the Vector reduction process to its limits, we could get a final vector space of not too many dimensions (10, for example). What is the meaning of these 10 dimensions? I expect they were the Canonical Topics that give best taxonomy to the corpus used. If the corpus is the WWW, we should get a final base vector corresponding probably or roughly to the main level-1 topics of the Yahoo directory. So, if that is true (I will have to get some positive experimental results before assure it is true ...) we could obtain an elegant scheme to do Text Categorization, including their taxonomy.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-5083513869671190493?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/5083513869671190493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=5083513869671190493' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/5083513869671190493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/5083513869671190493'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/11/vector-reduction.html' title='Vector Reduction'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-3480771651574786740</id><published>2007-10-31T09:27:00.000-07:00</published><updated>2007-10-31T09:41:51.134-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='chunks'/><category scheme='http://www.blogger.com/atom/ns#' term='Inverted Files'/><category scheme='http://www.blogger.com/atom/ns#' term='256 Kb'/><category scheme='http://www.blogger.com/atom/ns#' term='IF'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>Reading IF in 256 Kb chunks</title><content type='html'>Andrew Aksyonoff has been explaining me the chunk-oriented read process when Sphinx require the Inverted Files (IF). And these are the doubts and comments I have replied:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);"&gt;&gt; ... Sphinx will read it in small (256 KB) chunks ...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So, if the query is a 2 word phrase, Sphinx will have open 2 windows simultaneously, that is, 2 buffers of 256 Kb at the same time. Is that right?&lt;br /&gt;&lt;br /&gt;If more words in the query, more simultaneous 256 Kb &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;chunks&lt;/span&gt;, right?&lt;br /&gt;&lt;br /&gt;And the major CPU consuming process are 2:&lt;br /&gt;1.- Performing the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;chunks&lt;/span&gt; intersections.&lt;br /&gt;2.- Sort the intersected &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;docId's&lt;/span&gt; by the user criteria.&lt;br /&gt;Right?&lt;br /&gt;&lt;br /&gt;And this is my final doubt: Is always &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;necessary&lt;/span&gt; to read from the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_4"&gt;beginning&lt;/span&gt; to the end the corresponding IF indexes when doing the intersection?&lt;br /&gt;&lt;br /&gt;Google in mind spurs this question. I mean, when searching in Google 2 common words like "Internet" and "WWW" we obtain this results:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);"&gt;* 2.110 million results for "Internet".&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);"&gt;* 9.310 million results for "WWW".&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When intersecting them &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Sphix&lt;/span&gt; would have to read and intersect 2 billion (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;docId&lt;/span&gt;+weight+probably other ranks) hits against 9 billion. Right?&lt;br /&gt;&lt;br /&gt;Isn't that too slow? The time spent in doing it, is log(N)? Or log(2 billion) or log(9 billion) or log (2 billion x 9 billion) or any other?&lt;br /&gt;&lt;br /&gt;Thanks again!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-3480771651574786740?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/3480771651574786740/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=3480771651574786740' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3480771651574786740'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3480771651574786740'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/reading-if-in-256-kb-chunks.html' title='Reading IF in 256 Kb chunks'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-4523954771937240434</id><published>2007-10-31T05:33:00.000-07:00</published><updated>2007-10-31T06:16:20.636-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Inverted Files'/><category scheme='http://www.blogger.com/atom/ns#' term='Text Indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='Signature Files'/><title type='text'>First conclusions on Text Indexing</title><content type='html'>After reading several papers on Information Retrieving and Search Engines I conclude that:&lt;br /&gt;&lt;br /&gt;* Sphinx engine only deals with indexes, not with data. So it delivers solely the indexes pointing to the documents that match certain criteria. Is up to the database to fetch this documents and up to the web application to show them on screen.&lt;br /&gt;&lt;br /&gt;* There are two ways to index full-text: Inverted Files and Signature Files. First ones seem to be more efficient. Second ones work better when matching common words, but it is very uncommon to search only common words. So, for real life, better to apply Inverted Files.&lt;br /&gt;&lt;br /&gt;* Inverted Files are lists of documents where a word appears, storing also the normalized weight to be able to do reliable matching.&lt;br /&gt;&lt;br /&gt;* Signature Files are bitmaps that indicates if a document contains a word or not. Just this bitwise information, yes or not, extended for all the documents indexed.&lt;br /&gt;&lt;br /&gt;* When a document hasn't got a certain keyword inside, Signature Files contain a bit value (1 or 0. In this case, 0), meanwhile Inverted Files never contain a reference to the document in such a case.&lt;br /&gt;&lt;br /&gt;* Sphinx uses Inverted Files. On the contrary, TBGsearch uses Signature Files.&lt;br /&gt;&lt;br /&gt;* I guess MySQL Full-text default engine is also using Inverted Files, but less efficiently by far than Sphinx.&lt;br /&gt;&lt;br /&gt;* Inverted Files may be compressed heavily, because it is an ordered sequence of integers, usually homogeneous. This property saves I/O disk reading operations.&lt;br /&gt;&lt;br /&gt;* In Sphinx, when doing two or more word searches, phrasal or just Boolean AND'ed, respective Inverted Files are read from disk, uncompressed and intersected in RAM, and then, order by search criteria (matching weight, date, or others).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-4523954771937240434?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/4523954771937240434/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=4523954771937240434' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4523954771937240434'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4523954771937240434'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/first-conclusions-on-text-indexing.html' title='First conclusions on Text Indexing'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-6517021733874440483</id><published>2007-10-30T08:54:00.000-07:00</published><updated>2007-10-30T09:26:58.453-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tf.idf formula'/><category scheme='http://www.blogger.com/atom/ns#' term='Term-document matrix'/><category scheme='http://www.blogger.com/atom/ns#' term='SVD algorithm'/><category scheme='http://www.blogger.com/atom/ns#' term='term weight'/><title type='text'>Vector treatment - Computing term weights</title><content type='html'>Looking for more information about the term-document matrix I have found an interesting and well-explained &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;web page&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.hirank.com/semantic-indexing-project/lsi/tdm.htm"&gt;http://www.hirank.com/semantic-indexing-project/lsi/tdm.htm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here, the author explains how to understand the vector space and the matrix meaning when treating the problem of assigning weights to terms in documents, in order to create a weighted index.&lt;br /&gt;&lt;br /&gt;Three factors seem to be important to establish a well-thought-out weighting function:&lt;br /&gt;&lt;br /&gt;1.- Local weight or Term Frequency (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;TF&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;2.- Global Weight or Inverse Document Frequency (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;IDF&lt;/span&gt;): point 5 in last post.&lt;br /&gt;&lt;br /&gt;3.- Normalization factor: in order to use the same scale among different term weights.&lt;br /&gt;&lt;br /&gt;Third point (I guess) is merely decorative but not functional, because the absolute weight value has nothing to do with the order of the results in the search. The order is important; the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;absolute&lt;/span&gt; value is not. So, third point could be ignored and then we had only 2 factors: &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;TF&lt;/span&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;IDF&lt;/span&gt;. This is precisely the so-called &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;tf&lt;/span&gt;.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;idf&lt;/span&gt; formula.&lt;br /&gt;&lt;br /&gt;The 3 factors (according to the URL above) may be integrated in what is called the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;SVD&lt;/span&gt; algorithm. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;SVD&lt;/span&gt; stands for Singular Value Decomposition, a linear algebra factorization, but I still ignore their relation with the 3 factors that define the term's weighting function.  I will appreciate any hint about this ...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-6517021733874440483?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/6517021733874440483/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=6517021733874440483' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/6517021733874440483'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/6517021733874440483'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/vector-treatment-computing-term-weights.html' title='Vector treatment - Computing term weights'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-6355402582010967747</id><published>2007-10-30T06:05:00.000-07:00</published><updated>2007-11-03T04:29:48.824-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nomenclature'/><category scheme='http://www.blogger.com/atom/ns#' term='Justin Zobel'/><category scheme='http://www.blogger.com/atom/ns#' term='Information Retrieval'/><title type='text'>Full-text search nomenclature</title><content type='html'>I've been processing another reference document dealing with Full-text searching that gave me Andrew Aksyonoff: "Recommended Reading for IR Research Students" being Justin Zobel one of its editors and IR meaning "Information Retrieval".&lt;br /&gt;&lt;br /&gt;This is a list of important papers about Information Retrieval, and Full-text searching is an important area of study in this field (IR).&lt;br /&gt;&lt;br /&gt;Reading their abstracts I have found there are several and distinct ways to refer to one concept. I expose them here. Nomenclature:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;* Scoring function: equal to Ranking function, or weight function?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;* Term: equal to word, or keyword.&lt;br /&gt;&lt;br /&gt;* Document: equal to data, row, record, message.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;* &lt;/span&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;Index =? weighted index, inverted index, reverse index, inverted list, inverted file, document list.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;* Term-document matrix =? document-term frequency weights.&lt;br /&gt;&lt;br /&gt;* Phrase =? compound terms, several words among quotes.&lt;br /&gt;&lt;br /&gt;* Vector Space Model (VSM) = Term Vector Model&lt;br /&gt;&lt;br /&gt;* Latent Semantic Analysis = Latent Semantic Indexing (LSI) = Vector reduction&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Part of this terminology could probably be ordered in this (maybe candid) way:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;1.- Words, terms or keywords belongs to a given document.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;2.- A document is more or less relevant respect to a word according to the weight/rank/score function.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;3.- One weight for a given word is just a vector component, an scalar value.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;4.- The weights of all the words in a document defines one vector. This vector represents the document.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; color: rgb(102, 102, 102);"&gt;5.- All the documents in the database are vectors in a vector space. To compare them, they should firstly be normalized to avoid that longest documents always stay at first result positions.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is supposed to be the so-called Vector Space Method (VSM), I guess. If so, I have to say that this is a bit strange vector space because the vector components never get a negative value. All the vectors/documents resides in the same multidimensional quadrant. Could be that a problem?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-6355402582010967747?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/6355402582010967747/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=6355402582010967747' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/6355402582010967747'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/6355402582010967747'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/full-text-search-nomenclature.html' title='Full-text search nomenclature'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-8525093273755351435</id><published>2007-10-29T17:07:00.000-07:00</published><updated>2007-10-29T17:39:49.278-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Okapi BM25'/><category scheme='http://www.blogger.com/atom/ns#' term='Zettair'/><category scheme='http://www.blogger.com/atom/ns#' term='Hugh E. Williams'/><category scheme='http://www.blogger.com/atom/ns#' term='Justin Zobel'/><category scheme='http://www.blogger.com/atom/ns#' term='Melbourne University'/><title type='text'>Building fast search engines and Zettair</title><content type='html'>I am following Andrew Aksyonoff guidelines to go on with the full-text investigations.&lt;br /&gt;First of all I am reading carefully "Building fast search engines" by Hugh E. Williams at http://www.hughwilliams.com/t1.pdf&lt;br /&gt;&lt;br /&gt;It is a very interesting paper and has driven me to Zettair project ( http://www.seg.rmit.edu.au/zettair/ ) by Justin Zobel and others at Melbourne University. This is a powerful indexer and search engine designed to work with big text repositories.&lt;br /&gt;&lt;br /&gt;Hugh E. Williams also works at Melbourne University, and has settled very clearly the search engine basis in his document. I have learned several new concepts:&lt;br /&gt;&lt;br /&gt;* Ranked query: Query formed by 2 or more words.&lt;br /&gt;&lt;br /&gt;* Phrase search: Query formed by 2 or more words among quotes.&lt;br /&gt;&lt;br /&gt;* Term Frequency (TF): The number of appearances of a given word in a document.&lt;br /&gt;&lt;br /&gt;* Inverse Document Frequency (IDF): A word oddness, that is, if a word is more o less frequent in a text in general.&lt;br /&gt;&lt;br /&gt;* Ranking function: The formula that measures the weight of a word in a document. It is useful to find out which document is closer to the a keyword from the search query.&lt;br /&gt;&lt;br /&gt;* Okapi BM25: Very well crafted formula that gives a good and fair word weight measure.&lt;br /&gt;&lt;br /&gt;* Inverted index: It is what I should call an "weighted index". I don't know why they call it "inverted". Inverted index shows the database under a "word point of view". That is where the documents containing that word are placed.&lt;br /&gt;&lt;br /&gt;* Compressed indexes: A way lo reduce I/O operations. The same for the data (or rows).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-8525093273755351435?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/8525093273755351435/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=8525093273755351435' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/8525093273755351435'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/8525093273755351435'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/building-fast-search-engines-and.html' title='Building fast search engines and Zettair'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-3671563579462955298</id><published>2007-10-29T06:20:00.000-07:00</published><updated>2007-10-29T06:23:24.317-07:00</updated><title type='text'>Andrew Aksyonoff has replied me again. Thanks!</title><content type='html'>&lt;span style="font-weight: bold;"&gt;&gt; Selecting 1 million random rows from disk-based database of 100&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; million rows will be slow as hell, too.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;According to my short experience, reading spread rows in disk takes 0.01 seconds per record, and from RAM takes 0.00001 seconds, that is 1000 times less than from disk (without cache).&lt;br /&gt;&lt;br /&gt;So selecting 1 million rows would take 10.000 seconds from disk and 10 seconds from RAM. Do you agree with it?&lt;br /&gt;&lt;br /&gt;I have entered at boardreader.com and looked for the keyword "internet". And I get this stats: "Results 1 - 10 of 1,367,788 for internet (0.385 seconds)".&lt;br /&gt;&lt;br /&gt;Website boardreader.com is Sphinx powered, and is selecting more than 1 Million rows in less than a second. According to my rules above, it should take around 13 seconds long.&lt;br /&gt;&lt;br /&gt;Maybe boardreader.com is using parallel processing of some kind or their RAM memory is faster than standard.&lt;br /&gt;&lt;br /&gt;These numbers are getting me a bit confused ...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; Eg. Sphinx takes mutual keyword positions into account and ranks&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; closer phrase matches higher. Therefore ordering per-keyword document&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; list by anything but document ID will in fact slow down processing&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; a lot, because intersecting unsorted lists will me much slower.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I think you are giving me here very valuable information about the Sphinx internals and I really appreciate it.&lt;br /&gt;&lt;br /&gt;You will agree with me if I say that understanding the engine of a powerful search engine like Sphinx is not easy. This is why I am trying to simplify as much as possible the situation. You talk me about intersecting indexes, but I guess this is only needed when doing phrase searches. I am also very interested in phrase searches but maybe it would be easier to understand firstly single keyword searches.&lt;br /&gt;&lt;br /&gt;I suppose that in a single keyword search only 1 index should be consulted, and that index is ranked or ordered under a given criteria. If that order is equal to the page results order, then there is no need to process all the matching rows to find the first 10 that will be showed in the page. So directly the first 10 better ranked pointers in the index should lead me to the 10 data rows to extract from the database. I assume that extracting 10 spread data rows is a very fast operation even though the database is huge or not cached.&lt;br /&gt;&lt;br /&gt;So the question is, which criteria or order is Sphinx using in a given keyword index? What is that rank function? Is it just a counter (or its logarithm) of the number of appearances of the keyword in the document? Is that simple?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; No. The key tricks are 1) to read and process a lot of data in linear&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; manner, friendly with modern HDDs and CPUs; and 2) to distribute&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; searching across many nodes, and search on all of them in parallel.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I don't still understand how to order the indexes and the data so as to do linear readings ... I don't realize any way to order data that let us to read it linearly in any search. If you search "internet" the data associated to this keyword might be localized, but if you then search "Russia" the corresponding data rows couldn't be contiguous because "technology order" is different and incompatible to "country order".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; I've seen a paper describing a similar idea several years ago. But it&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; would only work with statistical approaches, not for phrase matches,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&gt; so I did not bother even to memorize the title :)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;What is an statistical approach? Are you referring to fuzzy logic or related?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I wouldn't like to bother you with my questions. I guess you have much work developing Sphinx. Maybe there is some kind of document in Internet where all this concepts are explained. I have spent many hours in Google looking for it unsuccessfully. If you know about such a valuable source, please, tell me.&lt;br /&gt;&lt;br /&gt;Thanks a lot again, Andrew.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-3671563579462955298?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/3671563579462955298/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=3671563579462955298' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3671563579462955298'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3671563579462955298'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/andrew-aksyonoff-has-replied-me-again.html' title='Andrew Aksyonoff has replied me again. Thanks!'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-847979328805348421</id><published>2007-10-28T08:46:00.000-07:00</published><updated>2007-10-28T09:08:41.006-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL indexes'/><category scheme='http://www.blogger.com/atom/ns#' term='Btree index'/><category scheme='http://www.blogger.com/atom/ns#' term='MyISAM index structure'/><title type='text'>B-tree MyISAM index structure</title><content type='html'>Interesting references still pending to investigate are:&lt;br /&gt;&lt;br /&gt;* http://forge.mysql.com/wiki/MySQL_Internals_MyISAM (section "The .MYI Key Values ")&lt;br /&gt;&lt;br /&gt;* http://lists.mysql.com/internals/18131&lt;br /&gt;&lt;br /&gt;* "Pro MySQL" by Michael Kruckenberg, Jay Pipes, Chad Russell, page 160:&lt;br /&gt;&lt;br /&gt;After the header section, the MyISAM index blocks compose the remainder of the .MYI file. The index blocks are 1KB on-disk pages of data, representing the B-tree leaf and non-leaf nodes. A single index block contains only key values pertaining to one index in the table. The header section (detailed in the previous section) contains information about how the MyISAM storage engine should find the root node of each index by supplying the offset for the root node’s index block in the keydef elements.&lt;br /&gt;&lt;br /&gt;Each index block contains the following:&lt;br /&gt;&lt;br /&gt;A single 2-byte block header. The first bit of the 16 bits in the header indicates whether the block is a leaf node (0 for leaf; 1 for non-leaf). The remaining 15 bits contain the total length of bytes used in the block (non free space).&lt;br /&gt;&lt;br /&gt;Following the header, index keys and record identifiers are laid out in a balanced organization (the B-tree format). With each key is stored the key value (of a length equal to the data type of the indexed field) and a 4-byte record pointer.&lt;br /&gt;&lt;br /&gt;The remainder of the index block is junk bytes (filler bytes), which become used as the B-tree index “fills out” with inserts. This is the “fill factor” for MyISAM B-tree index pages, and typically represents between 65% and 80% of the data used within the index block under normal operations, to allow for split-free growth along with the insertions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-847979328805348421?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/847979328805348421/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=847979328805348421' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/847979328805348421'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/847979328805348421'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/b-tree-myisam-index-structure.html' title='B-tree MyISAM index structure'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-3467771891625516129</id><published>2007-10-28T08:06:00.000-07:00</published><updated>2007-10-28T08:57:27.978-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Derek J. Balling'/><category scheme='http://www.blogger.com/atom/ns#' term='B-tree index'/><category scheme='http://www.blogger.com/atom/ns#' term='Jeremy D. Zawodny'/><title type='text'>MyISAM full-text indexed have weight</title><content type='html'>As exposed at &lt;a href="http://groups.google.com/group/comp.databases.mysql/browse_thread/thread/3fe8a2a2c4571730/411629f5eb22c7fc"&gt;&lt;span style="text-decoration: underline;"&gt;Usenet&lt;/span&gt;&lt;/a&gt; I still haven't found the cause of MySQL reading &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;MYD&lt;/span&gt; files when querying only index information, such as number of results, or id fields. I consider that next queries shouldn't force MySQL to extract information from the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;MYD&lt;/span&gt; file because the index &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;MYI&lt;/span&gt; content should be enough:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT id FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;But the fact is that on several versions of MySQL I have tested, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;MYD&lt;/span&gt; files are being reading exhaustively when performing these queries. If anybody could give me a hint on this issue, I would appreciate it.&lt;br /&gt;&lt;br /&gt;Studying the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;MyISAM&lt;/span&gt; index structure when doing full-text matching, I have found that Jeremy D. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Zawodny&lt;/span&gt; and Derek J. Balling hold that full-text indexes are implemented with a two-part &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;MyISAM&lt;/span&gt; B-tree index. First  field is the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;VARCHAR&lt;/span&gt; that stores the keyword and second one is a FLOAT that measures his weight.&lt;br /&gt;&lt;br /&gt;I am confused. If the index is weighted, why does MySQL reads &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;MYD&lt;/span&gt; file to find out the MATCHING order in a query like this:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;SELECT id, MATCH (title,body) AGAINST ('keyword') AS score FROM articles ORDER BY score DESC LIMIT 0,10;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;These kind of queries are slow under MySQL because &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;MYD&lt;/span&gt; files are consulted, and that is the paradox. Why to consult them if it isn't required? According to Jeremy D. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;Zawodny&lt;/span&gt; and Derek J. Balling the index know the keyword and its weight. Isn't it enough?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-3467771891625516129?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/3467771891625516129/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=3467771891625516129' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3467771891625516129'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3467771891625516129'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/myisam-full-text-indexed-have-weight.html' title='MyISAM full-text indexed have weight'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-1418697341134816709</id><published>2007-10-27T12:20:00.000-07:00</published><updated>2007-10-27T12:36:34.474-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='1000 times'/><category scheme='http://www.blogger.com/atom/ns#' term='I/O operations'/><category scheme='http://www.blogger.com/atom/ns#' term='disk'/><category scheme='http://www.blogger.com/atom/ns#' term='RAM'/><title type='text'>RAM is 1000 times faster than disk</title><content type='html'>It has me taken 4 days to convert the english Wkipedia content into a SQL dump file. The original XML file occupy 13 Gb uncompressed. Once filtered and converted to SQL it occupies 4.3 Gb:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;-rw-r--r--  1 root root  13G Oct 23 06:27 enwiki-20071018-pages-articles.xml&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;-rw-r--r--  1 root root 4.3G Oct 27 01:18 enwiki-20071018-pages-articles.sql&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Having enough RAM insert the SQL dump into MySQL doesn't take too long: around 5 minutes. Creating the full-text index takes much more: around 2 hours.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;# ls -laSh /var/lib/mysql/full_text_investigations/&lt;br /&gt;total 4.1G&lt;br /&gt;-rw-rw----  1 ;mysql mysql 4.1G Oct 27 18:16 articles.MYD&lt;br /&gt;-rw-rw----  1 ;mysql mysql  42M Oct 27 18:16 articles.MYI&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;This is the file sizes before the index creation, and next figures are afterwards:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;# ls -laSh /var/lib/mysql/full_text_investigations/&lt;br /&gt;total 6.4G&lt;br /&gt;-rw-rw----  1 ;mysql mysql 4.1G Oct 27 18:31 articles.MYD&lt;br /&gt;-rw-rw----  1 ;mysql mysql 2.3G Oct 27 19:07 articles.MYI&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;We may verify once more the rule &lt;span style="font-weight: bold;"&gt;"Index size = Data size = Dump size"&lt;/span&gt;. Here index size is a lower than the rest sizes, but not too far.&lt;br /&gt;&lt;br /&gt;So we have created a database with 4 million records:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;mysql&gt; SELECT COUNT(*) FROM articles;&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|  4203633 |&lt;br /&gt;+----------+&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Doing some maths we find out that 1 record is 1 Kb long on average.&lt;br /&gt;&lt;br /&gt;This test has been done under a 16 Gb RAM server, so the MYD and the MYI files are cached in RAM. Here are some benchmarks:&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;&lt;br /&gt;mysql&gt; SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Einstein');&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|     2744 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (0.02 sec)&lt;br /&gt;&lt;br /&gt;mysql&gt; SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Linux');&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|     5907 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (0.05 sec)&lt;br /&gt;&lt;br /&gt;mysql&gt; SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('America');&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|   107015 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (0.88 sec)&lt;br /&gt;&lt;br /&gt;mysql&gt; SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('world');&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|   324369 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (2.40 sec)&lt;br /&gt;&lt;br /&gt;mysql&gt; SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('Google');&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|    30881 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (0.22 sec)&lt;br /&gt;&lt;br /&gt;mysql&gt; SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('history');&lt;br /&gt;+----------+&lt;br /&gt;| COUNT(*) |&lt;br /&gt;+----------+&lt;br /&gt;|   182735 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (1.47 sec)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;There is a clear correlation among registers found and time consumed:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;MYD cached     =&gt; 100.000 records per second&lt;br /&gt;MYD NOT cached =&gt;     100 records per second&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;So, RAM is 1000 times faster than disk.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-1418697341134816709?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/1418697341134816709/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=1418697341134816709' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/1418697341134816709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/1418697341134816709'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/ram-is-1000-times-faster-than-disk.html' title='RAM is 1000 times faster than disk'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-4766053190630204422</id><published>2007-10-26T16:14:00.002-07:00</published><updated>2007-10-26T17:14:39.519-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Google and Sphinx'/><category scheme='http://www.blogger.com/atom/ns#' term='weighted index'/><title type='text'>Andrew Aksyonoff gave me a hint</title><content type='html'>Andrew Aksyonoff is the main developer of Sphinx. I asked him about the reason for MySQL being too much slow that Sphinx: "I would like you to give me a small hint about the Sphinx index architecture. I have the thesis that MySQL is slow because its full-text indexes are not weight-ordered."&lt;br /&gt;&lt;br /&gt;He has told me that MySQL suffers from internal fragmentation rather than wrong ordering: It needs to perform too much semi-random I/O task on huge requests.&lt;br /&gt;&lt;br /&gt;Thanks, Andrew. As much as I understand (not so far) the index architecture of a database engine, the query leads to the indexes and these lead to the data. If the data to recover is spread, the result will take long. And that seems to be what happens in MySQL.&lt;br /&gt;&lt;br /&gt;Indeed, I have observed that MySQL satisfies roughly the rule of "0.01 sec per result found" no matter how big or small the database is. Precisely 0.01 is a coarse measure of the time usually taken to do a I/O random read to a disk. If the data to read wouldn't be so spread or would be stored in cache, in only 0.01 seconds MySQL could retrieve a lot of registers (maybe 100 or more, instead of just one).&lt;br /&gt;&lt;br /&gt;Therefore, my benchmarks satisfy Andrew's theory. Now let's observe a Google typical search result: Suppose we search the keyword "MySQL": 10 URLs in screen among 163 million matches. It took 0.06 seconds to deliver it.&lt;br /&gt;&lt;br /&gt;If MySQL would have made this search operation it would have taken probably 163.000.000 * 0.01 seconds, and that is 18 days. Too much time for a query. No doubt. Meanwhile Google or Sphinx can resolve the query in less than 1 second. Where is the miracle?&lt;br /&gt;&lt;br /&gt;This is my point: If the index could tell us directly which are the 10 first and most significant results for this search, we could avoid 163 million I/O operations. Only 10 I/O operations would have been necessary. And that is fast, just 0.10 seconds according to the rule above.&lt;br /&gt;&lt;br /&gt;MySQL needs to perform this 163 million I/O operations to find out the grade of matching of each register found against the search keyword. Google and Sphinx do not need it, probably because their indexes contain that matching weight information inside, someway.&lt;br /&gt;&lt;br /&gt;Hence the key factor to develop a fast search engine is an index with its pointers ordered by matching weight decreasingly. That is what I call a "weighted index". Maybe this nomenclature is stupid, I don't know. I am sorry in such a case.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-4766053190630204422?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/4766053190630204422/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=4766053190630204422' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4766053190630204422'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4766053190630204422'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/andrew-aksyonoff-gave-me-hint.html' title='Andrew Aksyonoff gave me a hint'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-313902647979879876</id><published>2007-10-26T12:46:00.000-07:00</published><updated>2007-10-31T10:36:37.607-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='I/O reading'/><category scheme='http://www.blogger.com/atom/ns#' term='1 million registers'/><category scheme='http://www.blogger.com/atom/ns#' term='Bottleneck'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>Discovery of Sphinx and others</title><content type='html'>I have written several people, forums and a Usenet group about the issue of "MySQL reading MYD when not necessary", but I haven't still gathered replies enough that gives me some light to solve this problem. So I will carry on and be patient.&lt;br /&gt;&lt;br /&gt;Meanwhile, while trying to find out something on my own about this subject, I have discovered some important projects I didn't know till now. This projects change the focus of this text-full investigations.&lt;br /&gt;&lt;br /&gt;Remember that the main goal in these investigations is to achieve a MySQL search engine performance acceptable when using huge databases. I thought MySQL could be tweaked or patched in some way that resolves search queries in 0.1 seconds instead of 1 minute or 10 minutes. These are real benchmarks when a table holds 1 million records and the server has nor enough RAM to cache the whole data file.&lt;br /&gt;&lt;br /&gt;I still think it can be done if we could (I don't know how) create a weighted index and could constrain the query to be fulfilled by reading uniquely the index MYI file. This way, it would be necessary to read and process the index value for every keyword of the full-text query and read only the 10 (for example) records in data MYD file that best match the search. This reading I/O process could take 0.01 seconds for each record, so the query could be resolved in 0.12 seconds taking into account that the main bottleneck is I/O reading.&lt;br /&gt;&lt;br /&gt;My recent discoveries are the existence of Sphinx project driven by Andrew Aksyonoff, and also other fast full-text engines like Mnogosearch, Lucene/DBSight, tbgSearch, Sienna, Zettair, Sphider, TSEP, PHPDig, iSearch, RiSearch, SiteSearch, Simple Web Search, IndexServer and Xapian.&lt;br /&gt;&lt;br /&gt;There is a good review of them at:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.onlamp.com/pub/a/php/2006/02/16/search-engine-showdown.html"&gt;http://www.onlamp.com/pub/a/php/2006/02/16/search-engine-showdown.html&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-313902647979879876?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/313902647979879876/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=313902647979879876' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/313902647979879876'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/313902647979879876'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/discovery-of-sphinx.html' title='Discovery of Sphinx and others'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-3944029841080885479</id><published>2007-10-24T15:07:00.000-07:00</published><updated>2007-10-24T15:10:42.483-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='caching table'/><category scheme='http://www.blogger.com/atom/ns#' term='Brian Wakem'/><category scheme='http://www.blogger.com/atom/ns#' term='table cache'/><title type='text'>Brian Wakem has also replied</title><content type='html'>Brian Wakem has replied:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; color: rgb(102, 102, 102);font-size:85%;" &gt;Forcing a filesystem cache of the table is probably loading the index into&lt;br /&gt;memory too. In the second case you are not, so the extra time is probably&lt;br /&gt;spent reading the index from disk, not the data file.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;and we have replied him as soon:&lt;br /&gt;&lt;br /&gt;But I am not loading the full table in cache, just only the corresponding MYD file as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; color: rgb(102, 102, 102);font-size:85%;" &gt;# cat /var/lib/mysql/full_text_investigations/*.MYD &gt; /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;dev&lt;/span&gt;/null&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The increment in cache size obtained through the "free" command matches with the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;MYD&lt;/span&gt; file size.&lt;br /&gt;&lt;br /&gt;Thence, we can be sure that the query needs to read the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;MYD&lt;/span&gt; file, in order to explain the penalty in the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;benchmarks&lt;/span&gt; when not caching the file.&lt;br /&gt;&lt;br /&gt;Thanks.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-3944029841080885479?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/3944029841080885479/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=3944029841080885479' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3944029841080885479'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3944029841080885479'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/brian-wakem-has-also-replied.html' title='Brian Wakem has also replied'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-361344425393888958</id><published>2007-10-24T12:46:00.000-07:00</published><updated>2007-10-24T12:47:46.405-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Axel Schwenke'/><category scheme='http://www.blogger.com/atom/ns#' term='reading MYD file'/><title type='text'>Axel Schwenke helped us. Thanks!</title><content type='html'>Axel Schwenke, a Support Engineer at MySQL AB has replied my question posted at comp.databases.mysql in this way:&lt;br /&gt;&lt;br /&gt;1. an index on a MyISAM table does not refer to the PK, but to the&lt;br /&gt;   physical address (or row number) of the row.&lt;br /&gt;&lt;br /&gt;2. how do you *know* MySQL is reading from the MYD file?&lt;br /&gt;&lt;br /&gt;3. use EXPLAIN to see how your query will be executed.&lt;br /&gt;   If it shows "using index" then no datafile reads will be done.&lt;br /&gt;&lt;br /&gt;This is what I have replied to Axel Schwenke:&lt;br /&gt;&lt;br /&gt;1.- What is the PK?&lt;br /&gt;&lt;br /&gt;1b.- The query is a count of registers, so it doesn´t matter what kind of pointer is the index using, isn't it?&lt;br /&gt;&lt;br /&gt;2.- I know MySQL is reading the MYD file because I reboot the server and then I force a reading of MYD file in order to be cached by the filesystem. Then I measure the query reply speed. Secondly I do the same process (including reboot) but not caching MYD file. The timings are extremely diffrent in each case. In the first case query responds in 0.1 seconds. In the second case query takes up to 5.0 seconds. So I conclude: MySQL is reading MYD file. Take into account that the MYD file is 100Mb in size.&lt;br /&gt;&lt;br /&gt;3.- The query should use the full-text index. Look at the EXPLAIN result: (key = title)&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;mysql&gt; EXPLAIN SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;+----------+----------+---------------+-------+---------+------+------+------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;| table    | type     | possible_keys | key   | key_len | ref  | rows | Extra      |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;+----------+----------+---------------+-------+---------+------+------+------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;| articles | fulltext | title         | title |       0 |      |    1 | where used |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;+----------+----------+---------------+-------+---------+------+------+------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;1 row in set (0.02 sec)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Thanks a lot for your help. Any further hint will also make me very thankful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-361344425393888958?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/361344425393888958/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=361344425393888958' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/361344425393888958'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/361344425393888958'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/axel-schwenke-helped-us-thanks.html' title='Axel Schwenke helped us. Thanks!'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-66434490766086421</id><published>2007-10-24T05:13:00.000-07:00</published><updated>2007-10-24T12:49:24.844-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unwanted MYD reads'/><category scheme='http://www.blogger.com/atom/ns#' term='I/O is slowing down'/><title type='text'>Index full-text queries are not fast</title><content type='html'>We are going to flush the filesystem cache by rebooting the server and ask for a series of index only full-text queries. We suppose they have to be very fast (around 0.01 seconds each one) because it won't be necessary that MySQL reads MYD file. It will read only certain chunks of the MYI file. Now the queries are this kind:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;SELECT id FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It is evident we are not asking for any information stored in the MYD file. Just the id that is supposed to be found in the MYI file.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;65 rows in set (1.68 sec)&lt;br /&gt;165 rows in set (3.08 sec)&lt;br /&gt;256 rows in set (3.57 sec)&lt;br /&gt;27 rows in set (0.57 sec)&lt;br /&gt;5 rows in set (0.12 sec)&lt;br /&gt;Empty set (0.00 sec)&lt;br /&gt;28 rows in set (0.54 sec)&lt;br /&gt;236 rows in set (3.36 sec)&lt;br /&gt;30 rows in set (0.57 sec)&lt;br /&gt;75 rows in set (1.32 sec)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Oh, oh! What is happening here? This benchmarks make a nonsense of our theory. I have no explanation. Maybe index is not actually stored in MYI file, but another kind of pointer. Let's do another test. Lets query only the sum of registers. This way id won't be asked, just the number of index references found in the MYI file:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;SELECT COUNT(*) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;And the results are:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;1 row in set (3.09 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (5.72 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (4.04 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (0.55 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (0.13 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (0.62 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (4.18 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (0.58 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1 row in set (1.33 sec)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Bad results again. So the problem is that the real internals in this MySQL queries are different from what we thought. I am lost. Any ideas?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-66434490766086421?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/66434490766086421/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=66434490766086421' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/66434490766086421'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/66434490766086421'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/index-full-text-queries-are-not-fast.html' title='Index full-text queries are not fast'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-4534983509242107731</id><published>2007-10-24T04:19:00.001-07:00</published><updated>2007-10-24T05:11:56.326-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random access I/O'/><category scheme='http://www.blogger.com/atom/ns#' term='reading index file'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL internals'/><title type='text'>Caching only data file</title><content type='html'>MYD is the associated signature to MySQL MyISAM data files. We are going to cache only this file and measure query timings:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;# reboot&lt;br /&gt;&lt;br /&gt;# free&lt;br /&gt;           total       used       free     shared    buffers     cached&lt;br /&gt;Mem:        255504      50216     205288          0       6020      26988&lt;br /&gt;-/+ buffers/cache:      17208     238296&lt;br /&gt;Swap:            0          0          0&lt;br /&gt;&lt;br /&gt;# cat /var/lib/mysql/full_text_investigations/*.MYD &gt; /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;dev&lt;/span&gt;/null&lt;br /&gt;&lt;br /&gt;# free&lt;br /&gt;           total       used       free     shared    buffers     cached&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Mem&lt;/span&gt;:        255504     152156     103348          0       6148     125504&lt;br /&gt;-/+ buffers/cache:      20504     235000&lt;br /&gt;Swap:            0          0          0&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Filesystem&lt;/span&gt; cache has read &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;MyISAM&lt;/span&gt; data (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;MYD&lt;/span&gt;) file and has increased in 100 Mb size, meanwhile data file is 97 Mb in size. Roughly enough good matching.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;65 rows in set (0.19 sec)&lt;br /&gt;165 rows in set (0.09 sec)&lt;br /&gt;256 rows in set (0.13 sec)&lt;br /&gt;27 rows in set (0.05 sec)&lt;br /&gt;5 rows in set (0.03 sec)&lt;br /&gt;Empty set (0.00 sec)&lt;br /&gt;28 rows in set (0.04 sec)&lt;br /&gt;236 rows in set (0.09 sec)&lt;br /&gt;30 rows in set (0.05 sec)&lt;br /&gt;75 rows in set (0.07 sec)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;Much better now. This timings are good, fast responses. Therefore it seems that caching data file is essential to obtain fast replies meanwhile caching index file is not necessary. Interesting conclusions ...&lt;br /&gt;&lt;br /&gt;We deserve an explanation, to comprehend why data must be cached meanwhile index is unnecessary. I am afraid first of all is compulsory to understand the internal mechanism of the query development in the relational database.&lt;br /&gt;&lt;br /&gt;Here is my theory: When a MySQL full-text query asks for registers it is giving only one clue to find them: the keyword. MySQL is going to search in the index file for that keyword. It is not needed to read the full index to find one keyword. Index is ordered in such a way that finding a keyword is really fast. Imagine it is ordered alphabetically. It is easy to guess where is going to be located the keyword index with certain margin of error. So it is no needed to read the full index file, only just the chunk where it is expected to find it. Therefore finding one keyword in index file may take one random access read I/O operation in the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;filesystem&lt;/span&gt;. You may do some benchmarks and find out that a typical I/O not cached read operation use to take around 0.01 seconds (supposing we are reading reasonable chunk sizes as for example 100 Kb)&lt;br /&gt;&lt;br /&gt;The second part to resolve the query is reading data records from the data &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;MYD&lt;/span&gt; file according to the index values MySQL just have found. And that is the most I/O resource consuming part. Let explore it in depth: We choose the following case:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;256 rows in set (0.13 sec)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;The index value is pointing to 256 records that contain the keyword asked. Now, remember the query:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;MySQL needs to find out not only the id's of the matched records but also the title and body contents. This information is kept in the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;MYD&lt;/span&gt; file. The good news are that only 1 more file MySQL needs to read to answer the query. The bad news are that this file may be very big (97 Mb in our case) and the information we need is not found sequentially, but in a spread way. MySQL have its &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;MYD&lt;/span&gt; file ordered by the id field of the table to accelerate as much as possible the extraction of certain register. We cannot read in a block all the registers needed. Much on the contrary it is going to be needed to make one read I/O access per each register. So, if MySQL has to extract 256 records and each I/O read &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_9"&gt;operating&lt;/span&gt; takes 0.01 seconds, we may expect to take 2.56 &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_10"&gt;seconds&lt;/span&gt; to get all the requested records. And that is precisely the benchmark obtained in last experiment. By contrast, current experiment resolves the query in just only 0.13 seconds. This can be explained just &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_11"&gt;remembering&lt;/span&gt; that we have first cached the full &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;MYD&lt;/span&gt; file into RAM. So I/O operations are not needed to get the 256 registers. RAM read operations are much faster than I/O disk read process. Pretty much faster, about 3 magnitude orders (1000-fold faster).&lt;br /&gt;&lt;br /&gt;Main conclusion therefore: "Querying data fields considerably slows down full-text searches when unable to cache &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;MYD&lt;/span&gt; file".&lt;br /&gt;&lt;br /&gt;Then what about doing only index full-text queries? It should be much faster. Well, that is precisely next experiment.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-4534983509242107731?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/4534983509242107731/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=4534983509242107731' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4534983509242107731'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4534983509242107731'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/caching-only-data-file.html' title='Caching only data file'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-1369178680229346145</id><published>2007-10-24T04:07:00.000-07:00</published><updated>2007-10-24T04:18:20.330-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MyISAM index'/><category scheme='http://www.blogger.com/atom/ns#' term='caching MYI index'/><category scheme='http://www.blogger.com/atom/ns#' term='full-text search engine'/><title type='text'>Caching only index file</title><content type='html'>MYI extension is the signature to refer to MySQL MyISAM index file. We are going to cache only this index file and test the system performance.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family: courier new;"&gt;# reboot&lt;br /&gt;&lt;br /&gt;# free&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;             total       used       free     shared    buffers     cached&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Mem:        255504      50212     205292          0       6040      26976&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;-/+ buffers/cache:      17196     238308&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Swap:            0          0          0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;# cat /var/lib/mysql/full_text_investigations/*.MYI &gt; /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;dev&lt;/span&gt;/null&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;# free&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;             total       used       free     shared    buffers     cached&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Mem&lt;/span&gt;:        255504     135852     119652          0       6176     109716&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;-/+ buffers/cache:      19960     235544&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Swap:            0          0          0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Filesystem&lt;/span&gt; cache has read &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;MYI&lt;/span&gt; file and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;has grown&lt;/span&gt; from 27 Mb to 110 Mb in size. Remember that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;MYI&lt;/span&gt; file occupies 81 Mb. Size of the cache and the file cached match. That's good. Now we query:&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;65 rows in set (1.51 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;165 rows in set (2.99 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;256 rows in set (3.51 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;27 rows in set (0.51 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;5 rows in set (0.08 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Empty set (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;28 rows in set (0.51 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;236 rows in set (3.37 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;30 rows in set (0.55 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;75 rows in set (1.25 sec)&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Oh! What the hell! Slow responses! Very similar timing to the obtained when not caching at all. Is that meaning that caching the index file is useless? The answer is YES, at least in this kind of "full-text search engine" queries.&lt;br /&gt;&lt;br /&gt;Index file seems not to be the key factor to improve performance. Let's do next the same experiment but with the data file.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-1369178680229346145?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/1369178680229346145/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=1369178680229346145' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/1369178680229346145'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/1369178680229346145'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/caching-only-index-file.html' title='Caching only index file'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-2441592527602556739</id><published>2007-10-24T03:46:00.000-07:00</published><updated>2007-10-24T04:01:35.821-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='caching database'/><category scheme='http://www.blogger.com/atom/ns#' term='instant responses'/><category scheme='http://www.blogger.com/atom/ns#' term='force cache'/><title type='text'>Forcing whole database to be cached</title><content type='html'>We reboot again the Linux server. This way we garantee filesystem cache is void. The folder "/var/lib/mysql/full_text_investigations/" holds the database. We are going to force the system to keep in RAM the data and index files this way:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;# free&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;             total       used       free     shared    buffers     cached&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Mem:        255504      50340     205164          0       6104      27048&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-/+ buffers/cache:      17188     238316&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Swap:            0          0          0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# cat /var/lib/mysql/full_text_investigations/* &gt; /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;dev&lt;/span&gt;/null&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# free&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;             total       used       free     shared    buffers     cached&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Mem&lt;/span&gt;:        255504     237900      17604          0       6352     208316&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-/+ buffers/cache:      23232     232272&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Swap:            0          0          0&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Filesystem&lt;/span&gt; cache has increased from 27 Mb to 208 Mb. The database files are 178 Mb in size. So, we can be sure database is cached in RAM.&lt;br /&gt;&lt;br /&gt;We measure timings for the 10 predefined queries. Remember we have just reboot the system and force a cache reading:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;65 rows in set (0.04 sec)&lt;br /&gt;165 rows in set (0.02 sec)&lt;br /&gt;256 rows in set (0.01 sec)&lt;br /&gt;27 rows in set (0.00 sec)&lt;br /&gt;5 rows in set (0.00 sec)&lt;br /&gt;Empty set (0.00 sec)&lt;br /&gt;28 rows in set (0.00 sec)&lt;br /&gt;236 rows in set (0.02 sec)&lt;br /&gt;30 rows in set (0.00 sec)&lt;br /&gt;75 rows in set (0.01 sec)&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Instant responses here prove that cache filling is crucial to fast database replies.&lt;br /&gt;&lt;br /&gt;But I wonder, is data file more crucial to be cached than index file? Next experiment will show us this essential knowledge.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-2441592527602556739?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/2441592527602556739/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=2441592527602556739' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/2441592527602556739'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/2441592527602556739'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/we-reboot-again-linux-server.html' title='Forcing whole database to be cached'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-5913048109936920152</id><published>2007-10-24T03:12:00.001-07:00</published><updated>2007-10-24T04:03:35.849-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Linux reboot'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL benchmarks'/><category scheme='http://www.blogger.com/atom/ns#' term='cache measurements'/><title type='text'>Rebooting to avoid cache biasing in the benchmarks</title><content type='html'>When the server is booted all the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;filesystem&lt;/span&gt; caches are clean. Therefore any data that MySQL needs to read has to be taken out directly from the hard disk. This way we are able to simulate a RAM lacking situation; and this will be very common when working with huge databases.&lt;br /&gt;&lt;br /&gt;In the following experiments our database holds 30.000 registers (15 million words). Here, the size measures:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;# ls -&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;laSh&lt;/span&gt; /var/lib/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;mysql&lt;/span&gt;/full_text_investigations/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;total 178M&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;rw&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;rw&lt;/span&gt;----    1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;mysql&lt;/span&gt;    &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;mysql&lt;/span&gt;         97M Oct 23 21:07 articles.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;MYD&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;rw&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;rw&lt;/span&gt;----    1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;mysql&lt;/span&gt;    &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;mysql&lt;/span&gt;         81M Oct 23 21:07 articles.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;MYI&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;rw&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;rw&lt;/span&gt;----    1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;mysql&lt;/span&gt;    &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;mysql&lt;/span&gt;        8.4K Oct 23 16:25 articles.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;frm&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;SQL&lt;/span&gt; dump file occupies 96Mb. Again the rule &lt;span style="font-weight: bold;"&gt;"Index size = Data size = Dump size" &lt;/span&gt;works reliably. Regarding that our server only has 256 Mb RAM we can expect that most of it will be consumed in caching data and index files. Hence, MySQL is going to respond fast, but it will reach the maximum limit of data for this situation.&lt;br /&gt;&lt;br /&gt;Now, we reboot the server and take some measures:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;# &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;ps&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;axv&lt;/span&gt; | grep &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;sql&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; 1609 ?        S      0:00    276   566  5365 1160  0.4 /bin/sh /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;usr&lt;/span&gt;/bin/safe_&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;mysqld&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; 1633 ?        S      0:00    767  3094 27097 5140  2.0 /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;usr&lt;/span&gt;/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;libexec&lt;/span&gt;/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_26"&gt;mysqld&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# free&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;             total       used       free     shared    buffers     cached&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_27"&gt;Mem&lt;/span&gt;:        255504      50424     205080          0       6152      27052&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-/+ buffers/cache:      17220     238284&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Swap:            0          0          0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;65 rows in set (1.68 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;165 rows in set (2.95 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;256 rows in set (3.65 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;27 rows in set (0.63 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;5 rows in set (0.16 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Empty set (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;28 rows in set (0.55 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;236 rows in set (3.34 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;30 rows in set (0.54 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;75 rows in set (1.38 sec)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_28"&gt;ps&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_29"&gt;axv&lt;/span&gt; | grep &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_30"&gt;sql&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; 1609 ?        S      0:00    276   566  5365 1160  0.4 /bin/sh /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_31"&gt;usr&lt;/span&gt;/bin/safe_&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_32"&gt;mysqld&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; 1633 ?        S      0:00    767  3094 27225 5600  2.1 /&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_33"&gt;usr&lt;/span&gt;/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_34"&gt;libexec&lt;/span&gt;/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_35"&gt;mysqld&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# free&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;             total       used       free     shared    buffers     cached&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_36"&gt;Mem&lt;/span&gt;:        255504      83804     171700          0       6624      58708&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-/+ buffers/cache:      18472     237032&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Swap:            0          0          0&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The command "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_37"&gt;ps&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_38"&gt;axv&lt;/span&gt; | grep &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_39"&gt;sql&lt;/span&gt;" let us know how much memory is consuming MySQL for its own. We see that before &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_40"&gt;querying&lt;/span&gt; as after doing it, MySQL expend&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_41"&gt;s&lt;/span&gt; just 27 Mb.&lt;br /&gt;&lt;br /&gt;Nevertheless, the cache size has increased considerably jumping from 27Mb at first to 58 Mb after &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_42"&gt;querying&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;We may conclude the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_43"&gt;filesystem&lt;/span&gt; has not cached completely the whole database, but only some chunks of data, precisely those needed to respond the ten predefined queries.&lt;br /&gt;&lt;br /&gt;What about performance? Watch the timings. When there is no cache, we confirm the rough rule of thumb &lt;span style="font-weight: bold;"&gt;"0.01 seconds to obtain 1 register"&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Now we measure times once data is cached:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:arial;"&gt;65 rows in set (0.01 sec)&lt;br /&gt;165 rows in set (0.01 sec)&lt;br /&gt;256 rows in set (0.01 sec)&lt;br /&gt;27 rows in set (0.00 sec)&lt;br /&gt;5 rows in set (0.00 sec)&lt;br /&gt;Empty set (0.00 sec)&lt;br /&gt;28 rows in set (0.01 sec)&lt;br /&gt;236 rows in set (0.01 sec)&lt;br /&gt;30 rows in set (0.00 sec)&lt;br /&gt;75 rows in set (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Obvious benchmarks are these. When data is cached, respond times are nearly zero.&lt;br /&gt;&lt;br /&gt;In the next experiment we will manipulate cache &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_44"&gt;filesystem&lt;/span&gt; to force whole database to be cached.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-5913048109936920152?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/5913048109936920152/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=5913048109936920152' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/5913048109936920152'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/5913048109936920152'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/rebooting-to-avoid-cache-biasing-in.html' title='Rebooting to avoid cache biasing in the benchmarks'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-4223224378928651894</id><published>2007-10-23T17:31:00.000-07:00</published><updated>2007-10-23T18:32:45.227-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL cache'/><category scheme='http://www.blogger.com/atom/ns#' term='Linux'/><category scheme='http://www.blogger.com/atom/ns#' term='Filesystem cache'/><category scheme='http://www.blogger.com/atom/ns#' term='memory'/><category scheme='http://www.blogger.com/atom/ns#' term='RAM'/><title type='text'>Filesystem cache is important</title><content type='html'>This time, I have completed the import of 10.000 records from the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Wikipedia&lt;/span&gt; into MySQL. This means 7 million words to index. The table and database structure is defined as in the previous examples:&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;br /&gt;DROP DATABASE IF EXISTS full_text_investigations;&lt;br /&gt;CREATE DATABASE full_text_investigations;&lt;br /&gt;USE full_text_investigations;&lt;br /&gt;CREATE TABLE articles ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;VARCHAR&lt;/span&gt;(200), body TEXT, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;FULLTEXT&lt;/span&gt; (title,body) );&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Indexing process is long in time (several minutes for 10.000 records), moreover when making full-text indexes.&lt;br /&gt;&lt;br /&gt;The &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;SQL&lt;/span&gt; dump file that defines the 10.000 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Wikipedia&lt;/span&gt; registers of this example is 45 Mb in size. When indexed, the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;MyISAM&lt;/span&gt; files created are these:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;# ls -&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;laSh&lt;/span&gt; /var/lib/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;mysql&lt;/span&gt;/full_text_investigations/&lt;br /&gt;total 76M&lt;br /&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;rw&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;rw&lt;/span&gt;----    1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;mysql&lt;/span&gt;    &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;mysql&lt;/span&gt;         42M Oct 23 16:51 articles.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;MYD&lt;/span&gt;&lt;br /&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;rw&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;rw&lt;/span&gt;----    1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;mysql&lt;/span&gt;    &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;mysql&lt;/span&gt;         34M Oct 23 16:51 articles.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;MYI&lt;/span&gt;&lt;br /&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;rw&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;rw&lt;/span&gt;----    1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;mysql&lt;/span&gt;    &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;mysql&lt;/span&gt;        8.4K Oct 23 16:25 articles.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;frm&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;We may again verify the recent discovered rule here: &lt;span style="font-weight: bold;"&gt;"Index size = Data size = Dump size"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It is not an exact rule, but &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_23"&gt;approximately&lt;/span&gt; seems correct until now.&lt;br /&gt;&lt;br /&gt;I have prepared 10 predefined queries to benchmark the system. All they are of this form:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;The queries are using 10 distinct keywords. Some keywords provides more results than others. MySQL drops this execution times:&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;37 rows in set (0.92 sec)&lt;br /&gt;82 rows in set (1.31 sec)&lt;br /&gt;151 rows in set (1.74 sec)&lt;br /&gt;9 rows in set (0.18 sec)&lt;br /&gt;1 row in set (0.01 sec)&lt;br /&gt;Empty set (0.00 sec)&lt;br /&gt;14 rows in set (0.17 sec)&lt;br /&gt;108 rows in set (1.28 sec)&lt;br /&gt;10 rows in set (0.12 sec)&lt;br /&gt;38 rows in set (0.53 sec)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;If we correlate the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_24"&gt;number&lt;/span&gt; of rows with the time elapsed we find an important rule of thumb: &lt;span style="font-weight: bold;"&gt;"0.01 seconds to obtain 1 register"&lt;/span&gt;. Remember this rule; it's important and universal.&lt;br /&gt;&lt;br /&gt;Besides, we can see that it's rare to find timings greater than 1 second. It may seem acceptable a time of 1 second to resolve a search engine query, but it is &lt;span style="font-weight: bold;"&gt;NOT&lt;/span&gt;. We cannot hold loaded a server for 1 second just to answer only 1 query. A server of this kind &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_25"&gt;would&lt;/span&gt;n't be able to &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_26"&gt;support&lt;/span&gt; a realistic load of 10.000 visits/day, for example. Think about the peaks, not only the valleys, in the server access stats.&lt;br /&gt;&lt;br /&gt;Well, this is precisely the goal of the present investigations. To find out the real performance of MySQL full-text as a textual search engine.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;If we &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_27"&gt;re-query&lt;/span&gt; one by one the 10 previous queries, we will find an important result:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;37 rows in set (0.01 sec)&lt;br /&gt;82 rows in set (0.01 sec)&lt;br /&gt;151 rows in set (0.01 sec)&lt;br /&gt;9 rows in set (0.01 sec)&lt;br /&gt;1 row in set (0.01 sec)&lt;br /&gt;Empty set (0.00 sec)&lt;br /&gt;14 rows in set (0.01 sec)&lt;br /&gt;108 rows in set (0.01 sec)&lt;br /&gt;10 rows in set (0.01 sec)&lt;br /&gt;38 rows in set (0.01 sec)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;0.01 seconds replies for almost every query. That is really fast, and good in order to build a search engine. But it is only the biasing effect of doing exactly the same queries for second time. This is due to the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_28"&gt;filesystem&lt;/span&gt; cache. If we would have configure MySQL to use its own query cache, the result would have also be so good, but this time the responsible wouldn't have &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_29"&gt;been the&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_30"&gt;filesystem&lt;/span&gt; cache but the MySQL query cache.&lt;br /&gt;&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_31"&gt;Filesystem&lt;/span&gt; cache is the key to &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_32"&gt;understand&lt;/span&gt; the behaviour of MySQL performance under extreme &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_33"&gt;circumstances&lt;/span&gt; like these that we are trying to investigate. When a process in the server access to a file to read it, it is kept in RAM memory (if fits, of course). This way any further reading of this file would avoid to make a physical &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_34"&gt;access&lt;/span&gt; to the disk drive, obtaining the data directly from RAM memory. The operative system assets that the file &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_35"&gt;hasn&lt;/span&gt;'t been change in the meanwhile, between reads when it uses the cache to improve the file reading speed.&lt;br /&gt;&lt;br /&gt;We can observe that cache size has increased (138 Mb) after invoking MySQL queries:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;# free&lt;br /&gt;         total       used       free     shared    buffers     cached&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_36"&gt;Mem&lt;/span&gt;:        255504     174508      80996          0       5864     138916&lt;br /&gt;-/+ buffers/cache:      29728     225776&lt;br /&gt;Swap:            0          0          0&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So the main conclusion is this one: &lt;span style="font-weight: bold;"&gt;"&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_37"&gt;Filesystem&lt;/span&gt; cache improves sharply the MySQL performance"&lt;/span&gt;. And &lt;span style="font-weight: bold;"&gt;"0.01 seconds to obtain 1 register"&lt;/span&gt; when cache is not playing its role due to not previously &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_38"&gt;read&lt;/span&gt; (or properly updated in RAM) files or file chunks.&lt;br /&gt;&lt;br /&gt;In next investigation we are indexing 30.000 records and measure file sizes and timings in query responses. We will find out what are the essential files to be cached by &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_39"&gt;filesystem&lt;/span&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-4223224378928651894?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/4223224378928651894/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=4223224378928651894' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4223224378928651894'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/4223224378928651894'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/filesystem-cache-is-important.html' title='Filesystem cache is important'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-784763534326086282</id><published>2007-10-23T12:05:00.000-07:00</published><updated>2007-10-23T12:41:21.938-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Linux FC1'/><category scheme='http://www.blogger.com/atom/ns#' term='Full-text searches'/><category scheme='http://www.blogger.com/atom/ns#' term='Celeron Core 2 Duo'/><title type='text'>First conclusion: Good performance under small corpus</title><content type='html'>My first seriuos test for full-text searches is obteined when inserting 100 records in MySQL. Each record corresponds to a Wikipedia article. Then mean value in characters long for one of this articles is 5.000. So, the whole data dump would occupy 500 Kb.&lt;br /&gt;&lt;br /&gt;On the other hand, if we measure the index size and the data size for the MySQL raw MyISAM files, these is the result:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;# ls -laSh /var/lib/mysql/full_text_investigations/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;total 992K&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-rw-rw----    1 mysql    mysql        505K Oct 23 13:37 articles.MYD&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-rw-rw----    1 mysql    mysql        454K Oct 23 13:37 articles.MYI&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-rw-rw----    1 mysql    mysql        8.4K Oct 23 13:34 articles.frm&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Here we have one first rough conclusion: &lt;span style="font-weight: bold;"&gt;"Index size = Data size = Dump size"&lt;/span&gt;. 500 kb each one in this case.&lt;br /&gt;&lt;br /&gt;I haven´t apply any tweacking to MySQL, so my.cnf is the default file, as follows:&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;&lt;br /&gt;# cat /etc/my.cnf&lt;br /&gt;[mysqld]&lt;br /&gt;datadir=/var/lib/mysql&lt;br /&gt;socket=/var/lib/mysql/mysql.sock&lt;br /&gt;&lt;br /&gt;[mysql.server]&lt;br /&gt;user=mysql&lt;br /&gt;basedir=/var/lib&lt;br /&gt;&lt;br /&gt;[safe_mysqld]&lt;br /&gt;err-log=/var/log/mysqld.log&lt;br /&gt;pid-file=/var/run/mysqld/mysqld.pid&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Here there is a brief system description for the server:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;# free&lt;br /&gt;           total       used       free     shared    buffers     cached&lt;br /&gt;Mem:        255504     213016      42488          0      55620      81948&lt;br /&gt;-/+ buffers/cache:      75448     180056&lt;br /&gt;Swap:            0          0          0&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;That means, the server has got 256 Mb RAM and only 40 Mb remains idle.&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;&lt;br /&gt;# cat /proc/cpuinfo&lt;br /&gt;processor       : 0&lt;br /&gt;vendor_id       : GenuineIntel&lt;br /&gt;cpu family      : 6&lt;br /&gt;model           : 15&lt;br /&gt;model name      : Intel(R) Core(TM)2 CPU         T5500  @ 1.66GHz&lt;br /&gt;stepping        : 8&lt;br /&gt;cpu MHz         : 1662.696&lt;br /&gt;cache size      : 64 KB&lt;br /&gt;fdiv_bug        : no&lt;br /&gt;hlt_bug         : no&lt;br /&gt;f00f_bug        : no&lt;br /&gt;coma_bug        : no&lt;br /&gt;fpu             : yes&lt;br /&gt;fpu_exception   : yes&lt;br /&gt;cpuid level     : 10&lt;br /&gt;wp              : yes&lt;br /&gt;flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss pni ds_cpl&lt;br /&gt;bogomips        : 3329.22&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;The CPU is a Celeron Core 2 Duo at 1.6 Ghz.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;...to server version: 3.23.58&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;This is the MySQL server version.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;# cat /etc/*release*&lt;br /&gt;Fedora Core release 1 (Yarrow)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;&lt;br /&gt;&lt;/span&gt;The Linux distrbution is a FC1.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;# hdparm -tT /dev/sda1&lt;br /&gt;&lt;br /&gt;/dev/sda1:&lt;br /&gt;Timing buffer-cache reads:   3968 MB in  2.00 seconds = 1984.00 MB/sec&lt;br /&gt;Timing buffered disk reads:   62 MB in  3.00 seconds =  20.67 MB/sec&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This is a quick way to test hard disk speed, for this SATA unit. It is not slow but it isn't very fast either. Today, some SATA drives reach 70 MB/sec.&lt;br /&gt;&lt;br /&gt;Take attention to the point that having 256 Mb RAM, the whole database is cached in RAM (only 1 Mb for index+data).&lt;br /&gt;&lt;br /&gt;The queries are extremely fast, as expected:&lt;br /&gt;&lt;br /&gt;&lt;pre style="color: rgb(102, 102, 102);font-family:courier new;font-size:85%;"  &gt;mysql&gt; SELECT id, title, LEFT(body, 64) FROM articles WHERE MATCH (title,body) AGAINST ('keyword');&lt;br /&gt;+----+---------------------------------+------------------+&lt;br /&gt;| id | title                           |   LEFT(body, 64) |&lt;br /&gt;+----+---------------------------------+------------------+&lt;br /&gt;| .. | ...                             | ...              |&lt;br /&gt;+----+---------------------------------+------------------+&lt;br /&gt;5 rows in set (0.00 sec)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Can´t be quickest.&lt;br /&gt;&lt;br /&gt;So, the sencond big conclusion should be: &lt;span style="font-weight: bold;"&gt;"Good performance under small corpus"&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Here, the corpus, as said, is 100 articles 5.000 characters length each one.&lt;br /&gt;&lt;br /&gt;Next experiment: What if corpus is incremented till 10.000 articles?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-784763534326086282?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/784763534326086282/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=784763534326086282' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/784763534326086282'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/784763534326086282'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/first-conclusion-good-performance-under.html' title='First conclusion: Good performance under small corpus'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-3507877281161051354</id><published>2007-10-23T02:08:00.000-07:00</published><updated>2007-10-23T12:04:41.069-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CREATE DATABASE'/><category scheme='http://www.blogger.com/atom/ns#' term='hello world'/><category scheme='http://www.blogger.com/atom/ns#' term='simple example'/><title type='text'>Simplest case</title><content type='html'>Once the MySQL server is up and running, I test my first full-text example:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);font-size:85%;" &gt;&lt;span style="font-family:courier new;"&gt;mysql&gt; CREATE DATABASE full_text_investigations;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Query OK, 1 row affected (0.03 sec)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&gt; USE full_text_investigations;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Database changed&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;mysql&gt; CREATE TABLE articles ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), body TEXT, FULLTEXT (title,body) );&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Query OK, 0 rows affected (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&gt; INSERT INTO articles (title,body) VALUES ('MySQL Tutorial','DBMS stands for DataBase ...'), ('How To Use MySQL Well','After you went through a ...'), ('Optimizing MySQL','In this tutorial we will show ...'), ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'), ('MySQL vs. YourSQL','In the following database comparison ...'), ('MySQL Security','When configured properly, MySQL ...');&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Query OK, 6 rows affected (0.00 sec)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Records: 6  Duplicates: 0  Warnings: 0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&gt; SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;+----+-------------------+------------------------------------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;| id | title             | body                                     |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;+----+-------------------+------------------------------------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;|  5 | MySQL vs. YourSQL | In the following database comparison ... |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;+----+-------------------+------------------------------------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;2 rows in set (0.03 sec)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&gt; SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;+----+-------------------+------------------------------------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;| id | title             | body                                     |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;+----+-------------------+------------------------------------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;|  5 | MySQL vs. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;YourSQL&lt;/span&gt; | In the following database comparison ... |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;|  1 | MySQL Tutorial    | DBMS stands for &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;DataBase&lt;/span&gt; ...             |&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;+----+-------------------+------------------------------------------+&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;2 rows in set (0.00 sec)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Take attention about the time lapsed in the first query and that of the second one (in cache).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-3507877281161051354?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/3507877281161051354/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=3507877281161051354' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3507877281161051354'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/3507877281161051354'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/simplest-case.html' title='Simplest case'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-206043023646549096</id><published>2007-10-19T16:49:00.000-07:00</published><updated>2007-10-19T16:57:14.602-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL server'/><category scheme='http://www.blogger.com/atom/ns#' term='big database'/><category scheme='http://www.blogger.com/atom/ns#' term='FC1'/><title type='text'>First steps</title><content type='html'>I would like to show some examples with detailed information in order to be as much precise as possible.&lt;br /&gt;&lt;br /&gt;I am installing a Fedora Core 1 (FC1) Linux server with its default &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;MySQL&lt;/span&gt; server. I will try to import the big database and do some simple queries so as to know exactly the number of text documents included, the number of words, and other interesting stats.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-206043023646549096?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/206043023646549096/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=206043023646549096' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/206043023646549096'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/206043023646549096'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/first-steps.html' title='First steps'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531180625802634184.post-7777597984927214390</id><published>2007-10-19T09:19:00.000-07:00</published><updated>2007-10-31T08:49:06.084-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='documents'/><category scheme='http://www.blogger.com/atom/ns#' term='speed'/><category scheme='http://www.blogger.com/atom/ns#' term='query'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='words'/><category scheme='http://www.blogger.com/atom/ns#' term='slow'/><category scheme='http://www.blogger.com/atom/ns#' term='million'/><title type='text'>First day - Introduction</title><content type='html'>MySQL is a powerful database engine. I had used it in several of my projects. It is fast and simple. That is what I need.&lt;br /&gt;&lt;br /&gt;Nevertheless, while using MySQL in a big LAMP project I found slow queries (in a range from 30 seconds to 5 minutes long). And that is really weird if we are talking about MySQL. First explanation that comes to my head is that I have done something wrong. MySQL cannot be so slow.&lt;br /&gt;&lt;br /&gt;I admit the situation is a bit uncommon. I have indexed 100.000 documents with 10.000 words each one. This means the index created to manage the full-text search is around 1 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Gb&lt;/span&gt; in size.&lt;br /&gt;&lt;br /&gt;I think I have discovered that the query response times are good (around 1 second) when the server RAM memory is greater than the index size. But when the index size is bigger that the RAM, the response times growth reaching benchmarks of 30 seconds and more per query.&lt;br /&gt;&lt;br /&gt;I say again, that could be only a misconfiguration or a lack of knowledge in the MySQL internals for my part. I have been testing several queries and different versions of MySQL in diverse Linux servers. All these tests are leading me in one direction: index must be full cached in RAM in order to achieve fast query results.&lt;br /&gt;&lt;br /&gt;I don't feel confortable in that situation. 30 seconds per query is unacceptable. What if I try to index bigger databases? I imagine the internal process of looking for a word in the index and then match it against the text that contains it; I don't understand why is it necessary the index to be cached.&lt;br /&gt;&lt;br /&gt;I accept that random access reads to the disk are needed to find the matched document: one of them to find the index and another one to find the document. But two disk reads are not taking 30 seconds. That is sure.&lt;br /&gt;&lt;br /&gt;So I need to understand this problem in-deph. I would like to know my error, or otherwise why MySQL Full-text is slow when the index is bigger than the amount of RAM available in the server.&lt;br /&gt;&lt;br /&gt;Here, today I begin my investigation. If you have some idea about this problem, any hint that could help me to solve the issue, please contact me. Thanks in advance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531180625802634184-7777597984927214390?l=mysql-full-text.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysql-full-text.blogspot.com/feeds/7777597984927214390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2531180625802634184&amp;postID=7777597984927214390' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/7777597984927214390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531180625802634184/posts/default/7777597984927214390'/><link rel='alternate' type='text/html' href='http://mysql-full-text.blogspot.com/2007/10/first-day-introduction.html' title='First day - Introduction'/><author><name>BankHacker</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
