Posted to java-user@lucene.apache.org by mariolone <ma...@hotmail.com> on 2006/12/13 19:13:23 UTC

Lucene & LSA

Hi!
I have a problem:
I must create a term-by-document matrix in which every element
represents the number of occurrences of that term in the document.
How can I do this?
Can someone help me?
Thanks to all.

P.S. I must apply LSA to this matrix.
-- 
View this message in context: http://www.nabble.com/Lucene---LSA-tf2815727.html#a7858202
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene & LSA

Posted by Miles Efron <me...@metalab.unc.edu>.
U of Tennessee professor Michael Berry maintains a good site regarding 
software for computing SVD on large, sparse matrices:

 	http://www.cs.utk.edu/~lsi/

The site also points to the LSI patent.

FWIW, it's very easy to extract term-doc counts from a Lucene index and 
format them for software such as SVDPACK

 	http://www.netlib.org/svdpack/index.html

or

 	http://tedlab.mit.edu:16080/~dr/SVDLIBC/

Of course, using the resulting matrices isn't so trivial if you want to 
stay within Lucene.  Also, the decomposed matrices are dense, so even at 
relatively low dimensionality you're still storing a lot of data.
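
To put rough numbers on the storage point: a rank-k SVD yields a dense terms-by-k matrix and a dense documents-by-k matrix. Below is a minimal sketch of the arithmetic, assuming 8-byte doubles. The class name, the collection size, and k = 300 are illustrative assumptions, not figures from this message.

```java
// Rough storage estimate for the dense factors of a rank-k SVD.
// All sizes passed in main() are illustrative assumptions.
class DenseFactorSize {

    // Bytes needed for the term factor (terms x k) plus the document
    // factor (docs x k), stored as 8-byte doubles. The k singular values
    // are left out because they are negligible by comparison.
    static long denseFactorBytes(long terms, long docs, long k) {
        return (terms + docs) * k * 8L;
    }

    public static void main(String[] args) {
        long bytes = denseFactorBytes(30_000, 15_000, 300);
        System.out.println(bytes + " bytes for the two dense factors");
    }
}
```

The cost grows with k times the number of terms plus documents, regardless of how sparse the original term-document counts were.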

-Miles

On Thu, 14 Dec 2006, Marvin Humphrey wrote:

>
> On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:
>
>>> it is possible to extract the matrix from the indexing file?
>> 
>> I don’t know any API to extract the matrix from the index file directly.
>
> How could we make it work to write an open source decomposed vector model 
> search engine a la LSA without running afoul of the LSA patents?  Maybe use 
> an algorithm other than SVD for the decomposition?

__________________________________
Miles Efron
http://www.ibiblio.org/mefron
mefron@ibiblio.org



Re: Lucene & LSA

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:

>> it is possible to extract the matrix from the indexing file?
>
> I don’t know any API to extract the matrix from the index file  
> directly.

How could we make it work to write an open source decomposed vector  
model search engine a la LSA without running afoul of the LSA  
patents?  Maybe use an algorithm other than SVD for the decomposition?

I'm only superficially familiar with LSA, but I'm always looking for  
ways to improve relevance.  In theory it would be nice to factor in a  
decomposed similarity measure, so that on a search for 'napoleonic  
war', documents which contained a lot of words which were similar to  
either 'napoleon' or 'war' would score higher than documents which  
had only a passing mention.

Personally, I'm less interested in "more like this" queries, because  
the precision of search results based solely on similar document  
vectors is so poor -- proper names and other rare tokens unrelated to  
the original query wreak havoc on the relevance scores.  But maybe  
there's a way in the original keyword search to juice up the scores  
of documents which not only contain the original terms, but also a  
lot of terms which are similar to them.
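
One way such a boost might look, as a purely hypothetical sketch: assume each term already has a vector in some reduced space (how those vectors are produced is left open here), and add up, for every term in a document, its frequency times its best cosine similarity to any query term. The class, the method names, and the scoring formula are all assumptions made for illustration; none of this is a Lucene API.

```java
import java.util.Map;

// Hypothetical sketch: measure how strongly a document's terms resemble
// the query terms in a reduced vector space. Not a Lucene API.
class SimilarTermBoost {

    // Cosine similarity of two equal-length vectors; 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Sum, over the document's terms, of term frequency times the best
    // similarity to any query term. Negative similarities count as 0,
    // and terms without a vector contribute nothing.
    static double boost(Map<String, Integer> docTermFreqs,
                        Iterable<String> queryTerms,
                        Map<String, double[]> termVectors) {
        double total = 0;
        for (Map.Entry<String, Integer> e : docTermFreqs.entrySet()) {
            double[] dv = termVectors.get(e.getKey());
            if (dv == null) continue;
            double best = 0;
            for (String q : queryTerms) {
                double[] qv = termVectors.get(q);
                if (qv != null) best = Math.max(best, cosine(dv, qv));
            }
            total += e.getValue() * best;
        }
        return total;
    }
}
```

The result would presumably be mixed into the ordinary keyword score with some small weight. Note that a long document accumulates boost from every loosely related term, so the concern below about diffuse documents would still need a normalization step.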

I dunno if it would be worth the computational effort, though.  A  
decomposed matrix is going to be inherently expensive to generate,  
because you have to start from a complete matrix.  That doesn't jibe  
well with incremental indexing.

Also, it's not clear to me how much of a gain we'd get in relevance.   
My hunch is that shorter, tightly focused documents would benefit  
some and that longer more diffuse documents -- which might contain  
passages which were just as useful as those in a shorter document --  
would lose.  That wouldn't be helpful for a common case in naive web  
search, where impossible-to-exclude navigational and advertising text  
could end up diluting the scores of perfectly good material.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Lucene & LSA

Posted by Soeren Pekrul <so...@gmx.de>.
mariolone wrote:
> I was able to extract the matrix. 
> But for collections of large documents, isn't this too expensive a
> solution? 

I have quite a small collection with 14,960 documents and 29,828 unique 
terms. If I remember right, it took a few minutes on a normal laptop 
to iterate over the terms and documents. I stored the matrix in MySQL:

CREATE TABLE term_document_matrix (
	term VARCHAR(32) NOT NULL,
	document INT NOT NULL,
	weight DOUBLE NOT NULL DEFAULT 0,
	PRIMARY KEY (term, document)
);

As you can see, it is not a real matrix, just a normal table in the 
relational model. I stored only the weights greater than 0, so I have 
far fewer entries than 14,960 x 29,828 = 446,226,880 (in my case 159,407).
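
The figures above work out to a density of roughly 0.036%, which is why storing only the non-zero cells pays off. A small sketch of that arithmetic follows; the class and method names are made up for illustration.

```java
// Density of the term-document matrix, using the figures quoted above.
class MatrixDensity {

    // Fraction of cells that hold a non-zero weight.
    static double density(long documents, long terms, long nonZero) {
        return (double) nonZero / (documents * terms);
    }

    public static void main(String[] args) {
        long cells = 14_960L * 29_828L; // total cells in the full matrix
        double d = density(14_960, 29_828, 159_407);
        System.out.printf("%d cells, %.4f%% non-zero%n", cells, 100 * d);
    }
}
```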

> Is it possible to extract the matrix from the index file? 

I don’t know of any API to extract the matrix from the index file directly.

Sören



Re: Lucene & LSA

Posted by mariolone <ma...@hotmail.com>.
Thanks for the help, Sören!
I was able to extract the matrix. 
But for collections of large documents, isn't this too expensive a
solution? 
Is it possible to extract the matrix from the index file? 

Mario




Re: Lucene & LSA

Posted by Soeren Pekrul <so...@gmx.de>.
Hello Mario,

I had a similar problem a few weeks ago (thread "How to get Term Weights 
(document term matrix)?", 2006-11-02, 
http://www.gossamer-threads.com/lists/lucene/java-user/41726).

I don't think there is a simple function for creating or accessing a 
document-term matrix. I extracted the matrix from my index and stored it 
in a database.

To create the matrix, I iterated over the terms and, for each term, over its documents:
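// "reader" is an open IndexReader instance, e.g. from IndexReader.open(...);
// terms() and termDocs() are instance methods, not static ones.
TermEnum terms = reader.terms();
while (terms.next()) {
     Term term = terms.term();
     TermDocs docs = reader.termDocs(term);
     while (docs.next()) {
         // store the term, the document and the weight
         // document number: docs.doc()
         // document frequency: reader.docFreq(term)
         // term frequency: docs.freq()
     }
     docs.close();
}
terms.close();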
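
As a rough illustration of the "store" step, here is a minimal in-memory sketch that keeps only non-zero counts. It is plain Java with no Lucene dependency. The class and its methods are hypothetical and are not what was actually used here, where the matrix went into a database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory store for (term, document, frequency) triples.
// Only non-zero counts are kept, so the matrix stays sparse.
class SparseTermDocMatrix {

    // term -> (document number -> term frequency)
    private final Map<String, Map<Integer, Integer>> rows = new TreeMap<>();

    // Record that `term` occurs `freq` times in document `doc`.
    void put(String term, int doc, int freq) {
        if (freq > 0) {
            rows.computeIfAbsent(term, t -> new HashMap<>()).put(doc, freq);
        }
    }

    // Term frequency, or 0 if the term does not occur in the document.
    int get(String term, int doc) {
        Map<Integer, Integer> row = rows.get(term);
        return row == null ? 0 : row.getOrDefault(doc, 0);
    }

    // Number of documents containing the term (its document frequency).
    int docFreq(String term) {
        Map<Integer, Integer> row = rows.get(term);
        return row == null ? 0 : row.size();
    }

    // Number of stored (non-zero) cells.
    int nonZeroCount() {
        int n = 0;
        for (Map<Integer, Integer> row : rows.values()) n += row.size();
        return n;
    }
}
```

Inside the inner loop, one would presumably call something like matrix.put(terms.term().text(), docs.doc(), docs.freq()).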

Sören


