Posted to user@mahout.apache.org by Robert Stewart <bs...@gmail.com> on 2011/10/27 20:46:18 UTC

Generating a semantic thesaurus from a SOLR/Lucene index

I have some documents indexed in SOLR (about 200 million news articles).

I want to create a semantic thesaurus from my SOLR index for use in query expansion.

For example, I'd like a query for "green cars" to be expanded to something like ("green cars" OR "green vehicles" OR "low-emission vehicles") etc.
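A minimal sketch of that expansion step, assuming a thesaurus already exists as a simple word-to-related-words map. The names `THESAURUS` and `expand_query` are hypothetical and the entries are illustrative only. Each word in the phrase is swapped for itself or one of its related terms, and the resulting phrases are OR'd together:

```python
from itertools import product

# Hypothetical thesaurus: each word maps to related words (illustrative data only).
THESAURUS = {
    "green": ["low-emission"],
    "cars": ["vehicles"],
}

def expand_query(phrase, thesaurus):
    """Return an OR query over all per-word substitutions of the phrase."""
    words = phrase.split()
    # For each word, the candidates are the word itself plus its related terms.
    options = [[w] + thesaurus.get(w, []) for w in words]
    variants = [" ".join(combo) for combo in product(*options)]
    return "(" + " OR ".join('"%s"' % v for v in variants) + ")"

print(expand_query("green cars", THESAURUS))
# ("green cars" OR "green vehicles" OR "low-emission cars" OR "low-emission vehicles")
```

The number of variants grows multiplicatively with phrase length, so a real expansion would probably need to cap the related terms per word.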

I suppose some sort of LSA or SVD could be used to accomplish this.

I know how to get vectors out of SOLR using the lucene.vector job.

I have also generated some LDA output from those vectors, but that is not getting me where I need to be.

I'd eventually like to have some output to create a map of words to other semantically related words in the index, such that:

lookup("green") ==> "environmental", "low-emission", etc.

lookup("car") ==> "vehicle", "truck", "SUV", "driving", etc.
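To make the desired output concrete, here is a toy sketch of such a lookup. It is not LSA and not Mahout: it simply ranks terms by cosine similarity of their raw term-document count vectors, on the assumption that terms occurring in the same documents are related. The documents and the helper names (`term_vectors`, `cosine`) are made up for illustration.

```python
import math
from collections import defaultdict

def term_vectors(docs):
    """Map each term to a sparse {doc_id: count} vector (a row of a term-document matrix)."""
    vectors = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            vectors[term][doc_id] = vectors[term].get(doc_id, 0) + 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[d] for d, weight in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lookup(term, vectors, top_n=3):
    """Return up to top_n other terms, most similar first (ties broken alphabetically)."""
    if term not in vectors:
        return []
    target = vectors[term]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != term]
    scored = [(s, t) for s, t in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:top_n]]

# Made-up toy corpus standing in for the indexed articles.
docs = [
    "car vehicle engine",
    "car vehicle road",
    "green environmental policy",
    "green environmental energy",
]
vectors = term_vectors(docs)
print(lookup("car", vectors))    # ['vehicle', 'engine', 'road']
print(lookup("green", vectors))  # ['environmental', 'energy', 'policy']
```

Comparing every pair of terms like this would not scale to 200 million articles; reducing the dimensionality first (for example with SVD) and comparing terms in the reduced space is the usual motivation for LSA.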

Is this possible with Mahout? And specifically, how could it be done using the vectors output from Lucene by the lucene.vector job?

Thanks
Bob