Posted to dev@lucene.apache.org by Andy Liu <an...@gmail.com> on 2005/11/15 00:50:44 UTC

Lucene and Latent Semantic Indexing

I'm currently experimenting with latent semantic indexing (LSI) techniques
and Lucene. I need to extract term frequencies from a Lucene index, construct
a document/term matrix, and then run some mathematical algorithms on this
matrix that produce floating-point and potentially negative term frequency
values. Extracting the tf's from the Lucene index is easy. The hard part is
importing the modified tf's back into the index, since Lucene stores tf's as
integer values.

Does anybody who knows the Lucene codebase well have any tips? Has anybody
even tried performing LSI on a Lucene index?
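
A minimal Java sketch of the matrix-building step described above. It does not touch Lucene: the hypothetical `String[][] docs` argument stands in for the per-document terms that would be read from the index, and the class and method names are illustrative only. The `logWeight` helper is an assumed example of a reweighting step, included only to show that the result is no longer an integer and so cannot be stored as a raw tf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: builds a document/term count matrix from already-tokenized documents. */
final class TermDocumentMatrix {
    /** Distinct terms in first-seen order; the list position is the column index. */
    final List<String> terms = new ArrayList<>();
    /** counts[d][t] = raw frequency of term t in document d. */
    final int[][] counts;

    TermDocumentMatrix(String[][] docs) {
        // First pass: assign a column to every distinct term.
        Map<String, Integer> column = new HashMap<>();
        for (String[] doc : docs) {
            for (String term : doc) {
                if (!column.containsKey(term)) {
                    column.put(term, column.size());
                    terms.add(term);
                }
            }
        }
        // Second pass: count occurrences per document.
        counts = new int[docs.length][terms.size()];
        for (int d = 0; d < docs.length; d++) {
            for (String term : docs[d]) {
                counts[d][column.get(term)]++;
            }
        }
    }

    /** Assumed example reweighting (log-scaled tf); the thread does not specify one. */
    static double logWeight(int tf) {
        return tf == 0 ? 0.0 : 1.0 + Math.log(tf);
    }

    public static void main(String[] args) {
        TermDocumentMatrix m = new TermDocumentMatrix(new String[][] {
            {"car", "engine", "car"},
            {"auto", "engine"}
        });
        System.out.println(m.terms);
        for (int[] row : m.counts) {
            System.out.println(Arrays.toString(row));
        }
        // A reweighted value is a double, which is the write-back problem in a nutshell.
        System.out.println(logWeight(m.counts[0][0]));
    }
}
```

Rows are documents and columns are terms here; transposing gives the term/document layout that SVD-based LSI descriptions usually use.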

Andy

Re: Lucene and Latent Semantic Indexing

Posted by Ken Krugler <kk...@transpac.com>.
I just heard about a student project in which the student wrapped a
C-based LSI tool for Java and then integrated it into Lucene. I'll see
if he's going to post about it any time soon.

-- Ken

>I am also willing to contribute.
>
>- Tarjei
>
>On 11/22/05, Sebastian Menge <se...@uni-dortmund.de> wrote:
>>
>>  On Tuesday, 22.11.2005, at 13:26 +0100, Tarjei Lægreid wrote:
>>  > Are there any plans to include LSI as a Lucene feature in the future?
>>
>>  I am looking for this too, and I really do think there would be a
>>  "market" for such a component.
>>
>>  I am not an "IR" man, but if we could find a small team (and of
>>  course, a lead), I would love to contribute ...
>>
>>  Sebastian.

-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Lucene and Latent Semantic Indexing

Posted by Tarjei Lægreid <ta...@gmail.com>.
I am also willing to contribute.


- Tarjei

On 11/22/05, Sebastian Menge <se...@uni-dortmund.de> wrote:
>
> On Tuesday, 22.11.2005, at 13:26 +0100, Tarjei Lægreid wrote:
> > Are there any plans to include LSI as a Lucene feature in the future?
>
> I am looking for this too, and I really do think there would be a
> "market" for such a component.
>
> I am not an "IR" man, but if we could find a small team (and of
> course, a lead), I would love to contribute ...
>
> Sebastian.

Re: Lucene and Latent Semantic Indexing

Posted by Sebastian Menge <se...@uni-dortmund.de>.
On Tuesday, 22.11.2005, at 13:26 +0100, Tarjei Lægreid wrote:
> Are there any plans to include LSI as a Lucene feature in the future?

I am looking for this too, and I really do think there would be a
"market" for such a component.

I am not an "IR" man, but if we could find a small team (and of
course, a lead), I would love to contribute ...

Sebastian.




Re: Lucene and Latent Semantic Indexing

Posted by Tarjei Lægreid <ta...@gmail.com>.
Hi Andy,

I am also very interested in such approaches. I have tried a hack to
simulate the effects of LSI in a Lucene index. As you suggested, I
extracted the term frequencies from the index, constructed a
term/document matrix, and performed SVD on the matrix. Then I multiplied the
resulting values by a constant factor to simulate term frequencies in the
LSI space (that is, I created a new field "lsi" in the documents and added
the words with their corresponding frequencies). However, this is a pretty
nasty hack, and I would appreciate it if anyone knows a good way of applying
LSI to Lucene.
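
The hack described above could look roughly like the following Java sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical, a simple power iteration with deflation stands in for a real SVD routine, the scaling factor of 10 is arbitrary, and clamping negative values to zero is just one possible way to deal with them (the message does not say how they were handled). No Lucene calls are shown; `fieldText` only produces the text that would be placed in the "lsi" field.

```java
import java.util.Random;

/** Illustrative only: rank-k approximation by power iteration, then integer pseudo-frequencies. */
final class LsiFieldSketch {

    /** Rank-k approximation of a (terms x documents), extracted one singular triplet at a time. */
    static double[][] rankK(double[][] a, int k, int iterations) {
        int m = a.length, n = a[0].length;
        double[][] residual = new double[m][];
        for (int i = 0; i < m; i++) residual[i] = a[i].clone();
        double[][] approx = new double[m][n];
        Random random = new Random(42); // fixed seed keeps the sketch deterministic
        for (int r = 0; r < k; r++) {
            double[] v = new double[n];
            for (int j = 0; j < n; j++) v[j] = random.nextDouble() + 0.1;
            double[] u = new double[m];
            double sigma = 0.0;
            for (int it = 0; it < iterations; it++) {
                u = normalize(times(residual, v, false)); // u = R v / |R v|
                v = times(residual, u, true);             // v = R^T u
                sigma = norm(v);                          // singular value estimate
                if (sigma < 1e-12) break;
                for (int j = 0; j < n; j++) v[j] /= sigma;
            }
            if (sigma < 1e-12) break; // nothing left to extract
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    double c = sigma * u[i] * v[j];
                    approx[i][j] += c;
                    residual[i][j] -= c; // deflate so the next pass finds the next triplet
                }
            }
        }
        return approx;
    }

    /** Computes mat * x, or mat^T * x when transposed is true. */
    static double[] times(double[][] mat, double[] x, boolean transposed) {
        int m = mat.length, n = mat[0].length;
        double[] y = new double[transposed ? n : m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                if (transposed) y[j] += mat[i][j] * x[i];
                else y[i] += mat[i][j] * x[j];
            }
        }
        return y;
    }

    static double norm(double[] x) {
        double sum = 0.0;
        for (double value : x) sum += value * value;
        return Math.sqrt(sum);
    }

    /** Scales x to unit length in place; a zero vector is returned unchanged. */
    static double[] normalize(double[] x) {
        double length = norm(x);
        if (length > 0.0) for (int i = 0; i < x.length; i++) x[i] /= length;
        return x;
    }

    /** Scales a real-valued weight and rounds it; negatives are clamped to 0 (an assumption). */
    static int toPseudoFrequency(double value, double factor) {
        return (int) Math.max(0L, Math.round(value * factor));
    }

    /** Repeats each term counts[i] times, giving text whose tf's equal the pseudo-frequencies. */
    static String fieldText(String[] terms, int[] counts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < terms.length; i++) {
            for (int c = 0; c < counts[i]; c++) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(terms[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] terms = {"car", "auto", "engine"};
        double[][] a = {{2, 0}, {0, 1}, {1, 1}}; // rows = terms, columns = documents
        double[][] approx = rankK(a, 1, 100);
        int[] counts = new int[terms.length];
        for (int i = 0; i < terms.length; i++) counts[i] = toPseudoFrequency(approx[i][0], 10.0);
        System.out.println(fieldText(terms, counts)); // candidate "lsi" field text for document 0
    }
}
```

Rounding and clamping both discard information, which is one reason this remains a hack: the integer pseudo-frequencies only approximate the real-valued LSI weights.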

Are there any plans to include LSI as a Lucene feature in the future?


Regards,
Tarjei

On 11/15/05, Andy Liu <an...@gmail.com> wrote:
>
> I'm currently experimenting with latent semantic indexing (LSI) techniques
> and Lucene. I need to extract term frequencies from a Lucene index,
> construct a document/term matrix, and then run some mathematical
> algorithms on this matrix that produce floating-point and potentially
> negative term frequency values. Extracting the tf's from the Lucene index
> is easy. The hard part is importing the modified tf's back into the index,
> since Lucene stores tf's as integer values.
>
> Does anybody who knows the Lucene codebase well have any tips? Has anybody
> even tried performing LSI on a Lucene index?
>
> Andy
>
>