Friday, February 15, 2008

Mapping semantic correlations

First of all, I have to thank Shlomo Swidler for his comment on this blog. It seems he has found a way (a hack) to keep MySQL from reading the MYD file when it is not needed. Nevertheless, it would be even more interesting to find out why MySQL always reads this file, when there must be many cases in which the search can be accomplished using just the MYI file. These are precisely the most time-consuming searches, and answering them from the index alone would speed them up a lot.

Much time has passed since my last post. I have been thinking about several subjects, but above all about Latent Semantic Indexing. One of my goals has been dropped: I wanted to patch MySQL to accomplish fast full-text searches on huge databases. But now Sun is bidding for MySQL. That changes things severely. I think the openness of MySQL's code may be in danger.

On the other hand, Andrew Aksyonoff (with his Sphinx search engine) has possibly found the best solution so far to this problem. Whether MySQL/Sun will want to acquire it or not is an open question for the future.

So, from now on, I would like to focus mainly on one new trend in data mining and search engines: "mapping semantic correlations".

LSI (Latent Semantic Indexing) seems to be a cutting-edge method in Natural Language Processing (NLP), a branch of Artificial Intelligence (AI). The main result obtained through LSI is a set of correlations among the terms of a given set of documents. Terms that are similar, semantically speaking, become correlated once LSI is applied. This improves search engine matching results.
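To make the idea concrete, here is a minimal sketch of LSI on a toy corpus, using only the Python standard library. Everything in it is an assumption made for illustration: the four tiny documents, the function names, and the use of power iteration with deflation on the term-by-term matrix instead of a proper SVD routine. The point is only that "car" and "automobile", which never appear in the same document, end up correlated because both appear next to "engine".

```python
import math
import random

# Toy corpus (made up for this example).
DOCS = ["car engine", "automobile engine", "banana fruit", "apple fruit"]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def term_document_matrix(docs):
    """One row per term, one column per document, raw counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = [[float(d.split().count(t)) for d in docs] for t in vocab]
    return vocab, rows

def top_eigenpairs(matrix, k, iterations=200, seed=0):
    """Top-k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iterations):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in m])
        pairs.append((lam, v))
        # Remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi_term_vectors(docs, k=2):
    """Return (vocab, raw rows, k-dimensional LSI vector per term)."""
    vocab, a = term_document_matrix(docs)
    # A * A^T: its eigenvectors are the left singular vectors of A.
    gram = [[dot(r1, r2) for r2 in a] for r1 in a]
    pairs = top_eigenpairs(gram, k)
    vectors = {
        t: [math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
        for i, t in enumerate(vocab)
    }
    return vocab, a, vectors

if __name__ == "__main__":
    vocab, a, vecs = lsi_term_vectors(DOCS, k=2)
    raw = dict(zip(vocab, a))
    print("raw cosine:", cosine(raw["car"], raw["automobile"]))
    print("LSI cosine:", cosine(vecs["car"], vecs["automobile"]))
```

With k=2 each term already gets two coordinates, so the same vectors could be plotted directly on a plane. A real system would use a tested SVD implementation and term weighting; this is just the bare mechanism.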

Nevertheless, I think LSI and similar techniques may give us more interesting results than just better matching. I am focusing on "mapping semantic correlations", that is, on drawing them. If we get concepts and relations drawn on a plane (2D space), it will be very easy to develop a search engine. No more indexes. Just a low-dimensional spatial search. Fast and simple.
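As a rough sketch of what such a spatial search could look like, the code below buckets terms into a uniform grid and answers "what is near this term?" by looking only at the neighbouring cells. The coordinates are hypothetical, placed by hand so that related terms sit together; the class name `GridMap`, the cell size, and the radius are also my own assumptions. The grid is just one simple way to do the lookup, not a claim about the best one.

```python
import math
from collections import defaultdict

# Hypothetical 2D positions; in practice they would come from the mapping.
POSITIONS = {
    "car": (0.10, 0.20),
    "automobile": (0.12, 0.22),
    "engine": (0.20, 0.15),
    "banana": (0.80, 0.85),
    "apple": (0.82, 0.80),
    "fruit": (0.75, 0.90),
}

class GridMap:
    """Terms placed on a plane and bucketed into square cells."""

    def __init__(self, positions, cell=0.25):
        self.cell = cell
        self.positions = dict(positions)
        self.cells = defaultdict(list)
        for term, (x, y) in self.positions.items():
            self.cells[self._key(x, y)].append(term)

    def _key(self, x, y):
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def within(self, x, y, radius):
        """Terms at most `radius` away from (x, y), nearest first."""
        cx0, cy0 = self._key(x - radius, y - radius)
        cx1, cy1 = self._key(x + radius, y + radius)
        hits = []
        # Only the cells overlapping the query square are visited.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for term in self.cells.get((cx, cy), ()):
                    tx, ty = self.positions[term]
                    d = math.hypot(tx - x, ty - y)
                    if d <= radius:
                        hits.append((d, term))
        return [term for _, term in sorted(hits)]

    def related(self, term, radius=0.2):
        """Terms drawn close to `term` on the map, excluding itself."""
        x, y = self.positions[term]
        return [t for t in self.within(x, y, radius) if t != term]

if __name__ == "__main__":
    print(GridMap(POSITIONS).related("car"))
```

The cost of a query depends on how many terms fall in the few cells around the query point, not on the size of the whole vocabulary, which is where the "fast and simple" would come from.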

Another advantage of this chimera would be to represent knowledge at a very rough level (syntax and grammar would be ignored in this process). A good mapping would place similar terms near each other and dissimilar ones far apart.
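One way to put a number on "good mapping" is to compare the distances drawn on the map with the dissimilarities we wanted to represent. The sketch below computes a normalized squared error in the spirit of the stress measure used in multidimensional scaling; that choice, the function name `layout_stress`, and the three-term example with its target values are all assumptions of mine, only meant to illustrate the criterion.

```python
import math
from itertools import combinations

def layout_stress(positions, dissimilarity):
    """Normalized squared error between drawn and target distances.

    0.0 means every pair is drawn exactly as far apart as requested;
    larger values mean a worse map.
    """
    num = 0.0
    den = 0.0
    for a, b in combinations(sorted(positions), 2):
        target = dissimilarity[frozenset((a, b))]
        (ax, ay), (bx, by) = positions[a], positions[b]
        drawn = math.hypot(ax - bx, ay - by)
        num += (drawn - target) ** 2
        den += target ** 2
    return num / den if den else 0.0

# Hypothetical target dissimilarities: small means semantically close.
TARGET = {
    frozenset(("car", "automobile")): 0.1,
    frozenset(("car", "banana")): 1.0,
    frozenset(("automobile", "banana")): 1.0,
}

# Similar terms close together, the unrelated one far away.
GOOD = {"car": (0.0, 0.0), "automobile": (0.1, 0.0), "banana": (0.05, 1.0)}
# "banana" drawn next to "car", "automobile" pushed away.
BAD = {"car": (0.0, 0.0), "automobile": (1.0, 0.0), "banana": (0.1, 0.0)}

if __name__ == "__main__":
    print("good layout:", layout_stress(GOOD, TARGET))
    print("bad layout: ", layout_stress(BAD, TARGET))
```

A mapping procedure could then be judged, or even driven, by how low it keeps this score.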
