Posted to general@lucene.apache.org by William Koscho <wk...@gmail.com> on 2010/10/05 20:01:13 UTC

Weight for all Terms in all Documents

How do I get the weights for all terms in all documents?

For a given set of documents, what series of API calls do I need to
make to get the following type of information:

doc1, termA_weight, termB_weight, etc.
doc2, termC_weight, termD_weight, etc.
doc3, termE_weight, termZ_weight, etc.

It seems that I have to start with a Query object, which is typically
provided by an end user.  However, in my case, I don't have an end user or a
specific query.  Instead, I am trying to analyze the documents and am
interested in getting the weights of all terms so that I can compute some
statistics about the similarity among documents.

Thanks in advance,
Bill

Re: Weight for all Terms in all Documents

Posted by Grant Ingersoll <gs...@apache.org>.
Have a look at the TermVectors.  If you are a Solr user, look at the TermVectorComponent.  In either case, you will have to reassemble some things to get the weights Lucene actually uses for scoring.  You can, however, get a simple TF-IDF weight without too much work.
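
To make the "simple TF-IDF weight" concrete, here is a minimal sketch in plain Python.  It does not call Lucene at all: it assumes you have already pulled each document's terms out of the term vectors into a list of tokens, and the function name tfidf_weights and the sample documents are made up for illustration.  The formula, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), is my understanding of what Lucene's DefaultSimilarity uses; it leaves out norms and boosts, so these are not the full scoring weights.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return {doc_id: {term: weight}} for docs given as {doc_id: [tokens]}.

    Hypothetical helper, not a Lucene API. Uses tf = sqrt(freq) and
    idf = 1 + ln(num_docs / (doc_freq + 1)); norms and boosts are ignored.
    """
    num_docs = len(docs)

    # Per-document term frequencies (what a term vector would give you).
    term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())

    weights = {}
    for doc_id, counts in term_freqs.items():
        weights[doc_id] = {
            term: math.sqrt(freq)
            * (1.0 + math.log(num_docs / (doc_freq[term] + 1)))
            for term, freq in counts.items()
        }
    return weights


# Made-up sample documents, already tokenized.
docs = {
    "doc1": ["lucene", "index", "lucene"],
    "doc2": ["solr", "index"],
    "doc3": ["mahout", "vector"],
}

weights = tfidf_weights(docs)
for doc_id in sorted(weights):
    row = ", ".join(
        "%s=%.3f" % (term, w) for term, w in sorted(weights[doc_id].items())
    )
    print("%s, %s" % (doc_id, row))
```

This prints one line per document in the "doc, term_weight, term_weight" shape asked for above.  Against a real index, the token lists would be replaced by the terms and frequencies read from each document's term vector.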



--------------------------
Grant Ingersoll
http://www.lucidimagination.com


Re: Weight for all Terms in all Documents

Posted by Ted Dunning <te...@gmail.com>.
There is a utility in the Apache Mahout project that dumps documents as
weight vectors.
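
Once you have each document as a weight vector, a common similarity statistic is cosine similarity.  The sketch below is plain Python and is not the Mahout utility or its output format; the cosine_similarity helper and the weight values are hypothetical, chosen only to show the computation on sparse {term: weight} vectors.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[term] for term, w in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors; the numbers are illustrative only.
vectors = {
    "doc1": {"termA": 2.0, "termB": 1.0},
    "doc2": {"termB": 1.0, "termC": 3.0},
    "doc3": {"termE": 1.5},
}

# Compare every pair of documents once.
for left, right in combinations(sorted(vectors), 2):
    sim = cosine_similarity(vectors[left], vectors[right])
    print("%s vs %s: %.4f" % (left, right, sim))
```

Documents that share no terms score 0, and a document compared with itself scores 1, so the values are easy to sanity-check.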
