Latent Semantic Indexing

Latent Semantic Indexing is a newish technique for indexing documents. Essentially, you get a list of all the important words in all your documents, then build an n-dimensional space (where n is the number of words in the list), and each document takes a place in the space according to whether it contains each of the words or not. You can then compare two documents for similarity by looking at each document’s vector in the space (the line between the origin and the document’s point) and seeing how small the angle between the two vectors is: the closer the cosine of that angle is to 1, the more similar the documents.
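
A minimal sketch of that comparison, in Python for brevity (the perl.com module discussed in the comments is Perl). The function names and the three sample documents are made up for illustration, and it treats every word as "important" rather than filtering stop words. It only covers the plain vector-space step, not the SVD reduction that the comments bring up.

```python
import math

def build_word_list(documents):
    """Collect every distinct word across all documents, in sorted order."""
    return sorted({word for text in documents for word in text.lower().split()})

def to_vector(text, word_list):
    """One dimension per word: 1.0 if the document contains it, else 0.0."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in word_list]

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample documents, invented for this sketch.
documents = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
]
word_list = build_word_list(documents)
vectors = [to_vector(d, word_list) for d in documents]

# Documents 0 and 1 share most of their words; 0 and 2 share none.
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

The first pair scores close to 1 and the second scores 0, which is the "small angle means similar" idea in numbers. A real index would usually weight words by frequency instead of using plain 0/1 values.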

This is basically how Autonomy works. I’m interested in this because, if it can be implemented, I don’t have to buy Autonomy :)

6 Responses to “Latent Semantic Indexing”

  1. Is there any chance I could see the code behind your implementation, Todd? I have something up and running, but everything that I search returns all docs with a relevance of 1 in each case; clearly not desired behavior. It also returns nothing on words that are in the actual $self->{ word_list }, which seems very wrong as well.

    The perl.com article has a few issues in it, btw… most notably the line on the final script that uses a method not available in the class: $engine->set_threshold( 0.8 );

    It’s a fascinating concept and I’m looking forward to seeing it return “real” results, but my non-math-genius head is having some trouble wrapping around the basic concepts.

    Mark Guckeyson
  2. From what I can tell, Autonomy uses Bayesian networks, which is a rather different model than the vector-space approach underlying LSI. Colin is right to say that the key step in LSI is the dimensionality reduction, which squishes things down and creates the expanded-recall magic. You might want to take a look at Search::ContextGraph for a different approach that gives similar behavior to LSI, without all the overhead of the vector stuff. And send in patches :-)

    Maciej Ceglowski
  3. I’m using a search engine trivially built on top of VectorSpace.pm from the perl.com story (with one change [1] to avoid a warning; still not sure this is the right fix). I’ve been pretty happy with it after not quite 2 months’ use.

    [1]
    - my $offset = $self->{'word_index'}->{$w};
    - index( $vector, $offset ) .= $value;
    + if (defined($w) && defined($self->{word_index}->{$w})) {
    +     my $offset = $self->{'word_index'}->{$w};
    +     index( $vector, $offset ) .= $value;
    + }

    Todd Larason
  4. I’m not an expert at all, so I may be confusing matters here, but I think your description misses the most important aspect of LSI, which is that the SVD stage finds similarities between terms, allowing searches to return results which do not contain the terms in the query. Your description sounds like a standard vector-space system without LSI. As far as Autonomy is concerned, I’ve always got the impression that the important thing about their systems is the user-interface and the degree of integration into whatever else it is that you’re doing, rather than the particular algorithms they use. (I also somewhat distrust them generally, though you probably shouldn’t base your buying decisions on the hunches of random readers of your site. :) )

    colin_zr
  6. I find this particularly fascinating because I did a course on vectors last term at Uni and was given the impression that they were mainly used for graphics-related stuff – seeing how they could be used for search as well was really interesting.

    Simon Willison
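
The SVD step that colin_zr and Maciej Ceglowski point to can be written compactly. This is a sketch in standard linear-algebra notation; every symbol is an assumption introduced here and none appears in the post. A is the n-by-m matrix with one row per word and one column per document, d_j is column j of A, q is a query written as a word vector, and k is the number of dimensions kept.

```latex
A = U \Sigma V^{\top}, \qquad
A_k = U_k \Sigma_k V_k^{\top} \quad (k \ll \min(n, m))

\hat{d}_j = U_k^{\top} d_j, \qquad
\hat{q} = U_k^{\top} q, \qquad
\operatorname{sim}(q, d_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Here U_k keeps only the k columns of U that belong to the largest singular values. Words that tend to occur together load on the same columns, so after projection a query can score well against a document that shares none of its words, which is the behavior the plain vector-space comparison cannot give.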
