Tuesday, October 30, 2007

Vector treatment - Computing term weights

While looking for more information about the term-document matrix, I found an interesting and well-explained web page:

http://www.hirank.com/semantic-indexing-project/lsi/tdm.htm

There, the author explains how to interpret the vector space and the term-document matrix when assigning weights to terms in documents in order to build a weighted index.
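To make the matrix idea concrete, here is a minimal sketch of a term-document matrix built from raw counts. The function name term_document_matrix, the whitespace tokenizer and the three sample sentences are my own assumptions for illustration; they do not come from the linked page.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One row per distinct term, in alphabetical order.
    terms = sorted({t for tokens in tokenized for t in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:5} {row}")
```

Each row is a term, each column is a document, so a document can be read off as a column vector in the space spanned by the terms.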

Three factors seem to be important to establish a well-thought-out weighting function:

1.- Local weight or Term Frequency (TF).

2.- Global weight or Inverse Document Frequency (IDF): see point 5 in the previous post.

3.- Normalization factor: puts the different term weights on the same scale.

The third point is (I guess) merely cosmetic rather than functional, because rescaling every weight by the same factor does not change the order of the search results. The order matters; the absolute value does not. So the third point could be ignored, leaving only two factors: TF and IDF. This is precisely the so-called tf.idf formula.
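The tf.idf combination described above could be sketched as follows. This is only one possible reading: I assume the raw term count as the local weight and log(N / df) as the global weight, which is a common variant but not necessarily the one used on the linked page. The function name tf_idf and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local weight: raw count in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

sample_docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
weights = tf_idf(sample_docs)
for i, w in enumerate(weights):
    print(i, sorted(w.items(), key=lambda kv: -kv[1]))
```

A term that appears in every document (here "the") gets a weight of zero, while a term found in a single document gets the largest global weight.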

According to the page above, the three factors may be integrated into what is called the SVD algorithm. SVD stands for Singular Value Decomposition, a matrix factorization from linear algebra, but I still do not know how it relates to the three factors that define the term weighting function. I would appreciate any hint about this ...
