Posted to user@mahout.apache.org by Robert Stewart <bs...@gmail.com> on 2011/10/27 20:46:18 UTC
Generating a semantic thesaurus from a SOLR/Lucene index
I have some documents indexed in SOLR (about 200 million news articles).
I want to create a semantic thesaurus from my SOLR index for use in query expansion.
For example, I'd like a query for "green cars" to be expanded to something like ("green cars" OR "green vehicles" OR "low-emission vehicles") etc.
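To make the expansion step concrete, here is a minimal Python sketch (not Mahout or SOLR code). The THESAURUS dictionary and the expand_query function are hypothetical names used only for illustration, and the entries are made up; a real thesaurus would be derived from the index.

```python
from itertools import product

# Hypothetical thesaurus: term -> related terms (illustrative entries only).
THESAURUS = {
    "green": ["low-emission"],
    "cars": ["vehicles"],
}

def expand_query(phrase, thesaurus, max_variants=10):
    """Build an OR query from every combination of each term and its related terms."""
    # For each term, the original comes first, followed by its related terms.
    options = [[term] + thesaurus.get(term, []) for term in phrase.split()]
    variants = []
    for combo in product(*options):
        variants.append(" ".join(combo))
        if len(variants) >= max_variants:
            break  # cap the expansion so long phrases do not explode
    return "(" + " OR ".join('"%s"' % v for v in variants) + ")"

print(expand_query("green cars", THESAURUS))
```

With these toy entries the call prints ("green cars" OR "green vehicles" OR "low-emission cars" OR "low-emission vehicles"). Substituting every term independently can produce odd combinations, so a real expansion would probably need to filter variants by how often they occur in the index.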
I suppose some sort of LSA or SVD could be used to accomplish this.
I know how to get vectors out of SOLR using the lucene.vector job.
I have also generated some LDA output from those vectors, but that is not getting me where I need to be.
I'd eventually like to have some output to create a map of words to other semantically related words in the index, such that:
lookup("green") ==> "environmental", "low-emission", etc.
lookup("car") ==> "vehicle", "truck", "SUV", "driving", etc.
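As a rough illustration of the kind of map I mean, here is a minimal Python sketch (not Mahout code). It uses plain cosine similarity between term-by-document count vectors, which is a simpler stand-in for LSA/SVD, and runs on a tiny made-up corpus. The names DOCS, term_vectors, build_thesaurus, and the top_n and min_sim parameters are all assumptions for illustration.

```python
import math
from collections import defaultdict

# Toy corpus standing in for the indexed articles (made-up data).
DOCS = [
    "green car low-emission vehicle",
    "green vehicle environmental",
    "car truck vehicle driving",
    "environmental low-emission green",
    "stock market shares",
]

def term_vectors(docs):
    """Map each term to a sparse vector {doc_id: count}."""
    vectors = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            vectors[term][doc_id] = vectors[term].get(doc_id, 0) + 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_thesaurus(docs, top_n=3, min_sim=0.1):
    """For each term, keep the top_n other terms whose vectors are most similar."""
    vectors = term_vectors(docs)
    thesaurus = {}
    for term, vec in vectors.items():
        scored = [(cosine(vec, other), t) for t, other in vectors.items() if t != term]
        scored = [(s, t) for s, t in scored if s >= min_sim]
        # Highest similarity first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        thesaurus[term] = [t for _, t in scored[:top_n]]
    return thesaurus

def lookup(thesaurus, term):
    """Return the related terms for a word, or an empty list if it is unknown."""
    return thesaurus.get(term, [])

thesaurus = build_thesaurus(DOCS)
print(lookup(thesaurus, "green"))
print(lookup(thesaurus, "car"))
```

The all-pairs loop is quadratic in the number of terms, so it only works for a toy vocabulary; at the scale of my index the same computation would have to be distributed, which is the part I am hoping Mahout can handle.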
Is this possible with Mahout? And specifically, how could it be done using the vectors output from Lucene by the lucene.vector job?
Thanks
Bob