Latent Semantic Indexing

Latent Semantic Indexing is a newish technique for indexing documents; essentially, you get a list of all the important words in all your documents, then build an n-dimensional space (where n is the number of words in the list), and each document takes a place in the space according to whether it contains each of the words or not. You can then compare two documents for similarity by looking at each document’s vector in the space (the line between the origin and the document’s point) and seeing how small the cosine of the angle between the two vectors is.

This is basically how Autonomy works. I’m interested in this because, if it can be implemented, I don’t have to buy Autonomy :)

More in the discussion (powered by webmentions)

  • (no mentions, yet.)