
Posted to pylucene-dev@lucene.apache.org by Gmehlin Floran <fg...@student.ethz.ch> on 2016/08/18 14:31:19 UTC

Total number of terms in index for collection frequency

Hi,

I am struggling to compute the collection frequency of a term (PyLucene 4.10.1).
So far, I can get the collection count of each term in a document with:

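import lucene
from java.io import File
from org.apache.lucene.index import IndexReader, Term, TermsEnum
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import BytesRefIterator

lucene.initVM()  # the JVM must be started before any Lucene class is used

# LUCENE_INDEX (index path) and docID are defined elsewhere
reader = IndexReader.open(SimpleFSDirectory(File(LUCENE_INDEX)))
termVector = reader.getTermVector(docID, "contents")
termsEnumvar = termVector.iterator(None)
termsref = BytesRefIterator.cast_(termsEnumvar)
cf_dict = {}
try:
    while (termsref.next()):
        termval = TermsEnum.cast_(termsref)
        fg = termval.term().utf8ToString()
        cf = reader.totalTermFreq(Term("contents", termval.term()))    # collection count
        cf_dict[fg] = cf
except StopIteration, e:
    print ''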

I would like to have the "frequency" in cf_dict instead of the count. For this, I need to divide each count by the total number of terms in the index (all occurrences, not just distinct terms).

Does anyone know how to get this?

Thank you for your help,

Floran
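
For reference, the normalization described in the question can be sketched in plain Python, independent of Lucene. The function name and the example counts are made up for illustration. The total is assumed to be known already; in Lucene 4.x, `reader.getSumTotalTermFreq("contents")` is documented to return the sum of `totalTermFreq` over all terms of a field, which may be the figure needed here (it can return -1 when the codec does not store that statistic), but that call is not verified against this index.

```python
def to_relative_frequencies(cf_dict, total_terms):
    """Turn raw collection counts into relative frequencies (count / total)."""
    if total_terms <= 0:
        # Also guards against the -1 "statistic unavailable" case.
        raise ValueError("total_terms must be positive")
    total = float(total_terms)  # force true division on Python 2 as well
    return {term: count / total for term, count in cf_dict.items()}


# Made-up counts, for illustration only.
example_counts = {"lucene": 6, "index": 3, "term": 1}
example_freqs = to_relative_frequencies(example_counts, 10)
```

If the counts cover every term in the field and the total is the field-wide sum, the resulting values add up to 1.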

Re: Total number of terms in index for collection frequency

Posted by Dirk Rothe <d....@semantics.de>.

Hi Floran,

We loop over all Lucene docs, apply the appropriate analyzer, then
iterate over the tokens and collect the distinct ones. It is probably
pretty inefficient, but you also get the frequency of each unique token,
which is nice for checking Zipf's law:
https://en.wikipedia.org/wiki/Zipf%27s_law

--dirk
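
The approach described in this reply might look roughly like the following sketch. It is not Dirk's actual code: `simple_analyzer` is a hypothetical stand-in for a real Lucene analyzer (it only lowercases and splits on whitespace), and the documents are plain strings rather than stored Lucene fields.

```python
from collections import Counter


def simple_analyzer(text):
    # Hypothetical stand-in for a Lucene analyzer: lowercase, split on whitespace.
    return text.lower().split()


def collection_frequencies(docs, analyzer=simple_analyzer):
    """Count every token over all docs; return counts, total, and relative frequencies."""
    counts = Counter()
    for text in docs:
        counts.update(analyzer(text))
    total = sum(counts.values())  # all token occurrences, not distinct tokens
    freqs = {}
    if total:
        freqs = {tok: n / float(total) for tok, n in counts.items()}
    return counts, total, freqs


# Toy documents, for illustration only.
docs = ["the quick brown fox", "the lazy dog", "The fox"]
counts, total, freqs = collection_frequencies(docs)
ranked = counts.most_common()  # tokens by descending count, e.g. for a Zipf rank plot
```

The single pass yields both the per-token counts and the total number of token occurrences, so the relative frequency falls out as a by-product, at the cost of re-analyzing every document.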
