Latent Semantic Indexing is a newish technique for indexing documents; essentially, you get a list of all the important words in all your documents, then build an n-dimensional space (where n is the number of words in the list), and each document takes a place in that space according to whether or not it contains each of the words. You can then compare two documents for similarity by looking at each document's vector in the space (the line from the origin to the document's point) and seeing how close the cosine of the angle between the two vectors is to 1: the smaller the angle, the larger the cosine, and the more similar the documents.
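To make the comparison step concrete, here is a minimal sketch of the word-list, vector, and cosine steps described above. It is in Python rather than Perl, and it is not the code from the perl.com article or from anyone's module: the function names (build_word_list, to_vector, cosine) and the three sample documents are made up for illustration. It also covers only the plain vector comparison; full LSI additionally reduces the space with a singular value decomposition, which is left out here.

```python
import math

def build_word_list(docs):
    """Collect every distinct word across all documents, in sorted order.
    (Hypothetical helper: a real indexer would also drop stop words.)"""
    return sorted({word for text in docs.values() for word in text.lower().split()})

def to_vector(text, word_list):
    """Binary vector: 1 if the document contains the word, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in word_list]

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors.
    Returns 0.0 when either vector is all zeros, to avoid dividing by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up sample documents, purely for illustration.
docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the rug",
    "c": "stock prices fell sharply",
}

word_list = build_word_list(docs)
vectors = {name: to_vector(text, word_list) for name, text in docs.items()}

sim_ab = cosine(vectors["a"], vectors["b"])  # four of five distinct words shared
sim_ac = cosine(vectors["a"], vectors["c"])  # no words in common
print(sim_ab, sim_ac)
```

Documents "a" and "b" share four of their five distinct words, so their cosine comes out at about 0.8, while "a" and "c" share nothing and score 0. A search would build a vector for the query in the same way and rank documents by this score.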
This is basically how Autonomy works. I'm interested in this because, if it can be implemented, I don't have to buy Autonomy :)
Is there any chance I could see the code behind your implementation, Todd? I have something up and running, but every search returns all docs, each with a relevance of 1; clearly not the desired behavior. It also returns nothing for words that are actually in $self->{ word_list }, which seems very wrong as well.
The perl.com article has a few issues in it, btw… most notably the line in the final script that calls a method not available in the class:
$engine->set_threshold( 0.8 );
It’s a fascinating concept and I’m looking forward to seeing it return “real” results, but my non-math-genius head is having some trouble wrapping around the basic concepts.