Posted to java-user@lucene.apache.org by Ziqi Zhang <zi...@sheffield.ac.uk> on 2015/09/20 15:35:40 UTC

get frequency of each term from a document

Hi

Is it possible to get a list of terms within a document, and also TF of 
each of these terms *in that document only*? (Lucene 5.3)

IndexReader has a method "Terms getTermVector(int docID, String field)", 
which gives me a "Terms" object, on which I can get a TermsEnum. But I 
do not know where to go then.

thanks


Re: get frequency of each term from a document

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

For a term vector's TermsEnum, docFreq() is always 1 and the term frequency (totalTermFreq()) is the one from the document whose term vector you retrieved.

Term vectors implement the same interface as the main index, but they can be seen as a small index per document. They are designed this way so that queries can be executed against a single document for highlighting.

Uwe

On 20 September 2015 at 16:28:12 CEST, Ziqi Zhang <zi...@sheffield.ac.uk> wrote:
>Thanks, but TermsEnum has two methods that return frequency-related
>info, and both are corpus-level, not document-specific:
>
>-docFreq() Returns the number of documents containing the current term.
>-totalTermFreq() Returns the total number of occurrences of this term
>across all documents (the sum of the freq() for each doc that has this
>term).
>
>However, I need the document-specific frequency, i.e., the freq of
>term A in Doc 1, 2, ... N
>
>Thanks

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
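
A minimal sketch of what the reply above describes, assuming Lucene 5.3 with lucene-core and lucene-analyzers-common on the classpath. The class name, the "body" field name, and the sample texts are hypothetical and chosen only for illustration. The field is indexed with term vectors enabled, and totalTermFreq() is read from the term vector's TermsEnum as the per-document frequency.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
public class TermVectorFrequencies {

    /** Returns term -> frequency for one field of one document, read from its term vector. */
    static Map<String, Long> termFrequencies(IndexReader reader, int docId, String field)
            throws IOException {
        Map<String, Long> frequencies = new LinkedHashMap<>();
        Terms vector = reader.getTermVector(docId, field);
        if (vector == null) {
            return frequencies; // no term vector was stored for this document and field
        }
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            // A term vector acts as a one-document index, so the "total" count
            // here is the count within this document only.
            frequencies.put(term.utf8ToString(), termsEnum.totalTermFreq());
        }
        return frequencies;
    }

    /** Builds a small in-memory index whose "body" field stores term vectors. */
    static Directory buildIndex(String... texts) throws IOException {
        Directory dir = new RAMDirectory();
        FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
        withVectors.setStoreTermVectors(true); // required, otherwise getTermVector returns null
        withVectors.freeze();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (String text : texts) {
                Document doc = new Document();
                doc.add(new Field("body", text, withVectors));
                writer.addDocument(doc);
            }
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex("apple banana apple", "banana cherry");
        try (IndexReader reader = DirectoryReader.open(dir)) {
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                System.out.println("doc " + docId + ": " + termFrequencies(reader, docId, "body"));
            }
        }
    }
}
```

The terms come back in the form the analyzer produced (lowercased by StandardAnalyzer here), so the keys may differ from the raw text.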

Re: get frequency of each term from a document

Posted by Ziqi Zhang <zi...@sheffield.ac.uk>.
Thanks, but TermsEnum has two methods that return frequency-related
info, and both are corpus-level, not document-specific:

-docFreq() Returns the number of documents containing the current term.
-totalTermFreq() Returns the total number of occurrences of this term
across all documents (the sum of the freq() for each doc that has this
term).

However, I need the document-specific frequency, i.e., the freq of term A in
Doc 1, 2, ... N

Thanks

On 20/09/2015 15:07, Uwe Schindler wrote:
> Hi,
>
> With the TermsEnum you can iterate over all terms. Each one returns its term frequency. Of course, you need to enable term vectors during indexing. Examples of how to use TermsEnum can be found in various places in the Lucene source code. It's a very expert API, but it is the way to go here.
>
> Uwe


-- 
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield
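
To make the three statistics in this message concrete, here is a small plain-Java sketch over a toy tokenized corpus. It does not use Lucene. The class name and the sample data are hypothetical, and the method names only mirror the Lucene names for readability.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model, not Lucene code.
public class TermStatistics {

    /** Occurrences of the term within a single tokenized document (per-document TF). */
    static long termFreq(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    /** Number of documents in the corpus that contain the term at least once. */
    static long docFreq(List<List<String>> corpus, String term) {
        return corpus.stream().filter(doc -> doc.contains(term)).count();
    }

    /** Sum of the per-document frequencies of the term across the corpus. */
    static long totalTermFreq(List<List<String>> corpus, String term) {
        return corpus.stream().mapToLong(doc -> termFreq(doc, term)).sum();
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("apple", "banana", "apple"),
                Arrays.asList("banana", "cherry"),
                Arrays.asList("apple"));

        System.out.println("docFreq(apple)       = " + docFreq(corpus, "apple"));
        System.out.println("totalTermFreq(apple) = " + totalTermFreq(corpus, "apple"));
        for (int i = 0; i < corpus.size(); i++) {
            System.out.println("termFreq(apple) in doc " + i + " = " + termFreq(corpus.get(i), "apple"));
        }

        // A corpus holding only one document: docFreq is 1 and totalTermFreq
        // equals that document's own term frequency.
        List<List<String>> single = Arrays.asList(corpus.get(0));
        System.out.println("single-doc docFreq(apple)       = " + docFreq(single, "apple"));
        System.out.println("single-doc totalTermFreq(apple) = " + totalTermFreq(single, "apple"));
    }
}
```

The last two lines show why a per-document view changes the picture: when the "corpus" is a single document, the corpus-level counts collapse to that document's own counts.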




Re: get frequency of each term from a document

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

With the TermsEnum you can iterate over all terms. Each one returns its term frequency. Of course, you need to enable term vectors during indexing. Examples of how to use TermsEnum can be found in various places in the Lucene source code. It's a very expert API, but it is the way to go here.

Uwe

On 20 September 2015 at 15:35:40 CEST, Ziqi Zhang <zi...@sheffield.ac.uk> wrote:
>Hi
>
>Is it possible to get a list of terms within a document, and also TF of
>each of these terms *in that document only*? (Lucene 5.3)
>
>IndexReader has a method "Terms getTermVector(int docID, String field)",
>which gives me a "Terms" object, on which I can get a TermsEnum. But I
>do not know where to go then.
>
>thanks

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de