Posted to solr-user@lucene.apache.org by "tech.vronk" <te...@vronk.net> on 2012/08/09 16:20:25 UTC

how to retrieve total token count per collection/index

Hello,

I wonder how to figure out the total token count in a collection (per 
index), i.e. the size of a corpus/collection measured in tokens.

The statistics in /admin give the number of distinct terms,
and the frequency list per index gives the number of documents containing a
given term. So even if I summed all the frequencies, I wouldn't get
the result I need.

Thank you.

best,
Matej

Re: how to retrieve total token count per collection/index

Posted by Walter Underwood <wu...@wunderwood.org>.
For a rough estimate, square the number of unique terms to get the total number of tokens. Vocabulary size usually grows as the square root of the corpus size in words.
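
The rule of thumb above can be sketched in a few lines of Python. It is the special case of Heaps' law, V = k * N**beta, with k = 1 and beta = 0.5. The function name and the k and beta parameters are assumptions added for illustration and do not come from the thread; real corpora need fitted values, so treat the result as an order-of-magnitude guess only.

```python
def estimate_total_tokens(unique_terms: int, k: float = 1.0, beta: float = 0.5) -> float:
    """Invert V = k * N**beta to estimate the corpus size N from the vocabulary size V.

    With the defaults (k=1, beta=0.5) this reduces to squaring the
    number of unique terms. k and beta are illustrative assumptions.
    """
    if unique_terms < 0 or k <= 0 or beta <= 0:
        raise ValueError("unique_terms must be >= 0; k and beta must be > 0")
    return (unique_terms / k) ** (1.0 / beta)


# 50,000 distinct terms suggest roughly 2.5 billion tokens under this rule.
print(estimate_total_tokens(50_000))  # prints 2500000000.0
```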

wunder

On Aug 9, 2012, at 7:20 AM, tech.vronk wrote:

> Hello,
> 
> I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens.
> 
> The statistics in /admin tell the number of distinct terms,
> and the frequency list per index reveals the number of documents with given term. So even if I would sum all the frequencies, I wouldn't get the result I need.
> 
> Thank you.
> 
> best,
> Matej

Re: how to retrieve total token count per collection/index

Posted by "tech.vronk" <te...@vronk.net>.
On 09.08.2012 18:02, Robert Muir wrote:
> On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk <te...@vronk.net> wrote:
>> Hello,
>>
>> I wonder how to figure out the total token count in a collection (per
>> index), i.e. the size of a corpus/collection measured in tokens.
>>
> You want to use this statistic, which tells you number of tokens for
> an indexed field:
> http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29
>

just to say:
thank you, this seems to work well!

matej

Re: how to retrieve total token count per collection/index

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Aug 9, 2012 at 4:24 PM, tech.vronk <te...@vronk.net> wrote:

> Is there any 3.6 equivalent for this, before I install and run 4.0?
> I can't seem to find a corresponding class (org.apache.lucene.index.Terms)
> in 3.6.
>

Unfortunately, 3.6 does not carry this statistic. There is really no
clear delineation of 'field' in 3.x; all the terms are just one big
sorted list of field+term.
So there are no field-level statistics at all! These are new in 4.0.
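
To make the "one big sorted list of field+term" picture concrete, here is a toy Python model. It is not Lucene code, and this workaround is not suggested anywhere in the thread. It assumes the postings for each entry expose a within-document frequency, so a per-field token count can only be obtained by scanning every entry and summing. The data layout and all names are hypothetical.

```python
# Hypothetical flat term dictionary: entries are sorted by (field, term),
# and each carries postings as (doc_id, within-document frequency) pairs.
flat_terms = [
    (("body", "be"), [(0, 2), (1, 1), (2, 1)]),
    (("body", "to"), [(0, 2), (1, 2)]),
    (("title", "hamlet"), [(0, 1)]),
]


def token_counts_per_field(entries):
    """Scan the whole list once, summing posting frequencies per field.

    Nothing is precomputed per field, so the cost is proportional to the
    total number of postings in the index.
    """
    totals = {}
    for (field, _term), postings in entries:
        totals[field] = totals.get(field, 0) + sum(freq for _doc, freq in postings)
    return totals


counts = token_counts_per_field(flat_terms)
print(counts)
```

On a large index such a full scan would likely be slow, which is one reason a stored field-level statistic is convenient.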

-- 
lucidimagination.com

Re: how to retrieve total token count per collection/index

Posted by "tech.vronk" <te...@vronk.net>.
On 09.08.2012 18:02, Robert Muir wrote:
> On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk <te...@vronk.net> wrote:
>> Hello,
>>
>> I wonder how to figure out the total token count in a collection (per
>> index), i.e. the size of a corpus/collection measured in tokens.
>>
> You want to use this statistic, which tells you number of tokens for
> an indexed field:
> http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29
>
thank you.

Is there any 3.6 equivalent for this, before I install and run 4.0?
I can't seem to find a corresponding class (org.apache.lucene.index.Terms)
in 3.6.


Re: how to retrieve total token count per collection/index

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk <te...@vronk.net> wrote:
> Hello,
>
> I wonder how to figure out the total token count in a collection (per
> index), i.e. the size of a corpus/collection measured in tokens.
>

You want to use this statistic, which tells you the number of tokens for
an indexed field:
http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29
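
As a rough illustration of why this statistic differs from the two numbers mentioned in the question, the following Python sketch computes all three over a tiny in-memory corpus. It is a toy model, not the Lucene API: tokenization is a plain whitespace split standing in for a real analyzer, and the function and key names are made up for this example.

```python
from collections import Counter


def field_statistics(docs):
    """Compute three statistics for one field over a list of document strings."""
    doc_freq = Counter()         # term -> number of documents containing it
    total_term_freq = Counter()  # term -> occurrences across all documents
    for text in docs:
        tokens = text.lower().split()  # stand-in for a real analyzer
        total_term_freq.update(tokens)
        doc_freq.update(set(tokens))   # count each term once per document
    return {
        "distinct_terms": len(total_term_freq),
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_total_term_freq": sum(total_term_freq.values()),
    }


docs = ["to be or not to be", "to do is to be", "be"]
stats = field_statistics(docs)
print(stats)
```

Here "distinct_terms" corresponds to what the admin page reports, "sum_doc_freq" is what summing the per-term document counts would give, and "sum_total_term_freq" is the token count the question asks for.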

-- 
lucidimagination.com