Thursday, November 1, 2007

Vector Reduction

* The Vector Space Model (VSM) may be useful for understanding and optimizing indexing techniques for full-text databases.

* One trivial optimization is to convert all words to lower case. This way the index might be reduced by nearly 50%. There is no reason to maintain two or more index entries for the same word under different casings, if we accept that case carries no additional meaning for the word.
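
A minimal sketch of the idea, assuming a toy Python inverted index (the `build_index` helper and the sample documents are invented for illustration):

```python
from collections import defaultdict

def build_index(docs, lowercase=True):
    """Toy inverted index: term -> set of docIds."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            term = word.lower() if lowercase else word
            index[term].add(doc_id)
    return index

docs = ["Text retrieval", "text TEXT Retrieval"]
print(len(build_index(docs, lowercase=False)))  # 5 distinct entries
print(len(build_index(docs, lowercase=True)))   # 2 distinct entries
```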

* Another proven application is synonym reduction, closely related to Latent Semantic Indexing (LSI). All synonyms of a word are reduced to that word, so the index shrinks by as many distinct words as this equivalence relation (which defines a class) collapses. From the perspective of the Vector Space Model, this is equivalent to a reduction in the dimension of the vector space. Conceptually, reducing synonyms to one parent word or class word is like adding external information (the thesaurus) to establish a kind of linear dependence among the synonyms and then projecting onto the most significant axes. The projection removes the dimensions representing the peculiarities that distinguish synonyms within the same class, and preserves the dimensions that distinguish these words from terms outside the class.
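
A small sketch of this thesaurus-based reduction, assuming a hand-made synonym map (the `THESAURUS` table and the class words are invented for illustration):

```python
# Illustrative thesaurus: every synonym maps to a single class word.
THESAURUS = {
    "auto": "car",
    "automobile": "car",
    "motorcar": "car",
}

def normalize(term):
    """Collapse a term to its class word, if it has one."""
    term = term.lower()
    return THESAURUS.get(term, term)

terms = ["car", "Automobile", "auto", "bicycle"]
print({normalize(t) for t in terms})  # {'car', 'bicycle'}: four terms, two classes
```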

* The approach of assigning one dimension of the vector space to each distinct word in the document is good, but I think it could be better. Certain words are related and others are not. If we associate a dimension with each word, we are implicitly assuming that every word in the index is linearly independent of every other. And that is not true. For example, the word "text" and the word "hypertext" are not independent. Conceptually, "hypertext" is a mixture of "text" and "link", so vector "text" plus vector "link" should sum to vector "hypertext". We then have three words but only two dimensions, because "hypertext" does not need a new dimension to be represented. "Text" and "link" are orthogonal words, but "hypertext" is a linear combination of them. That, I think, is the way to understand vector reduction.
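
A toy numpy sketch of this rank argument, assuming the conceptual encoding above ("text" and "link" as orthogonal axes, "hypertext" as their sum; the numbers are illustrative only):

```python
import numpy as np

# Assumed encoding: "text" and "link" are orthogonal axes,
# and "hypertext" is modeled as their sum.
text = np.array([1.0, 0.0])
link = np.array([0.0, 1.0])
hypertext = text + link

# Three words, but the matrix of their vectors only has rank 2:
print(np.linalg.matrix_rank(np.stack([text, link, hypertext])))  # 2
```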

* Every relation among words should be expressed through some kind of linear dependence. Accomplishing that for a given language would be equivalent to finding the smallest vector space that preserves the meaning of the words while avoiding redundant information. If we could find such an extraordinary basis, we would find that all the remaining class words, or basis vectors, are orthogonal, that is, not related in any way.

* Not being satisfied with that chimera, I have the impression that this orthogonal and universal basis could be found through statistical methods instead of applying a thesaurus. That is, an in-depth study of a corpus alone could determine which words are the most suitable for the vector basis and which are just linear combinations of them. The fact that certain words belong to a given document establishes topological relations among them. The very fact that they stay close to each other tells us something useful, for example, that they belong to a given context. The Inverted File (IF) index already takes this information into account when it stores the document pointer (docId) and the offset of each word in order to calculate the relevance of a search. That suggests this topological information is related to the meaning of the words: "a door to the semantic universe".
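
A minimal sketch of that statistical signal, assuming a tiny invented corpus where co-occurrence inside a document stands in for the "topological" relation between terms:

```python
import numpy as np

# Tiny invented corpus; sharing a document is the co-occurrence signal.
docs = ["text link hypertext", "text retrieval", "link graph", "hypertext link"]
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Term-document matrix A: A[i, j] = count of term i in document j.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

# Term-term co-occurrence: C[i, k] = shared documents (weighted by counts).
C = A @ A.T
print(vocab)
print(C.astype(int))
```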

* This way we could extract the semantic relations among words and use them to establish the best orthogonal basis vectors.

* I have thought of a recurrent method to apply in order to obtain the best reduced vector basis. I have to spend time on it ...

* The next goal is to establish a solid theoretical framework for these ideas, investigate whether anyone out there has developed something similar, and carry out some computer experiments to try to obtain tangible results. This personal brainstorm is no more than a roadmap to steer my subsequent investigations.

* I have heard about Singular Value Decomposition (SVD), and I think it is the formal name for the vector reduction process I am trying to describe here. According to Wikipedia:

http://en.wikipedia.org/wiki/Latent_semantic_indexing

the Singular Value Decomposition can be implemented by two methods: Lanczos methods or via neural networks. I think there exists a third method, a recurrent approach that could be rougher but quicker.
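
A minimal numpy sketch of truncated SVD on a small term-document matrix (the matrix values and the choice of k are invented for illustration; this is the standard rank-k approximation, not the recurrent method mentioned above):

```python
import numpy as np

# Invented term-document matrix A (rows: terms, columns: documents).
A = np.array([
    [1, 1, 0, 0],   # "text"
    [1, 0, 1, 1],   # "link"
    [1, 0, 0, 1],   # "hypertext"
    [0, 1, 0, 0],   # "retrieval"
    [0, 0, 1, 0],   # "graph"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep the k strongest "concept" axes
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
term_coords = U[:, :k] * s[:k]                 # each term as a point in the reduced space
print(term_coords.round(2))
```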

* When pushing the vector reduction process to its limit, we could end up with a final vector space of only a few dimensions (10, for example). What would be the meaning of these 10 dimensions? I expect they would be the canonical topics that give the best taxonomy for the corpus used. If the corpus is the WWW, we should get a final basis corresponding roughly to the main level-1 topics of the Yahoo directory. So, if that is true (I will have to get some positive experimental results before asserting it is true ...), we could obtain an elegant scheme for text categorization, including its taxonomy.
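
A sketch of how a new document might be categorized in such a reduced space, using the standard LSI fold-in on the invented matrix from the previous sketch (reading the strongest reduced axis as a rough category is an assumption for illustration, not an established rule):

```python
import numpy as np

# Same invented term-document matrix as in the previous sketch.
A = np.array([[1, 1, 0, 0], [1, 0, 1, 1], [1, 0, 0, 1],
              [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

q = np.array([2, 0, 1, 1, 0], dtype=float)   # term counts of a new document
q_k = q @ U[:, :k] / s[:k]                   # standard LSI fold-in
print("reduced coordinates:", q_k.round(2))
print("strongest axis (rough category):", int(np.argmax(np.abs(q_k))))
```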
