
Posted to dev@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2010/09/07 08:34:17 UTC

solr getUniqueTermCount() when multiple segments?

Hello-

I'm looking at using the new terms.getUniqueTermCount() to give a
quick count for the LukeRequestHandler rather than needing to walk all
the terms.

When the Solr index reader has just one segment, it works great.  However,
with more segments I get:

java.lang.UnsupportedOperationException: this reader does not
implement getUniqueTermCount()
	at org.apache.lucene.index.Terms.getUniqueTermCount(Terms.java:84)

Is this expected?  Is there any way around that?

I am getting the terms using:

          Terms terms = MultiFields.getTerms(reader, fieldName);
          long cnt = (terms==null) ? 0 : terms.getUniqueTermCount();

Thanks
ryan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: solr getUniqueTermCount() when multiple segments?

Posted by Ryan McKinley <ry...@gmail.com>.

Ahh -- this makes sense.  I thought it was too good to be true!


On Tue, Sep 7, 2010 at 4:45 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> This is expected/intentional, because computing the "true" unique term
> count across multiple segments is exceptionally costly (you have to do
> the merge sort to de-dup).
>
> If you really want the true count, you can pull the TermsEnum and
> .next() until exhaustion.
>
> Alternatively, you can use IndexReader.getSequentialSubReaders(), then
> step through each SegmentReader, calling its getUniqueTermCount(), and
> then somehow "approximate" (e.g., the sum will be an upper bound on the
> total unique count, since a term that appears in several segments is
> counted once per segment).
>
> Mike

Re: solr getUniqueTermCount() when multiple segments?

Posted by Michael McCandless <lu...@mikemccandless.com>.

This is expected/intentional, because computing the "true" unique term
count across multiple segments is exceptionally costly (you have to do
the merge sort to de-dup).

If you really want the true count, you can pull the TermsEnum and
.next() until exhaustion.
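
A minimal sketch of why that walk is costly, assuming each segment's term
dictionary can be modeled as a sorted set of strings. The class and method
names (MergedTermCount, countUniqueTerms) are hypothetical stand-ins and are
not Lucene API; the point is only that the merge has to visit every term in
every segment to drop the duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

public class MergedTermCount {

    /** One cursor per segment: the current term plus the rest of that segment's sorted terms. */
    private static final class Cursor implements Comparable<Cursor> {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.current = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return current.compareTo(other.current);
        }
    }

    /** Counts distinct terms across all segments with a k-way merge; every term is visited. */
    public static long countUniqueTerms(List<TreeSet<String>> segments) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (TreeSet<String> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment.iterator()));
            }
        }
        long count = 0;
        String previous = null;
        while (!queue.isEmpty()) {
            // The smallest remaining term across all segments comes out first.
            Cursor top = queue.poll();
            if (!top.current.equals(previous)) {
                count++; // first time this term is seen in merged order
                previous = top.current;
            }
            if (top.rest.hasNext()) {
                top.current = top.rest.next();
                queue.add(top);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two hypothetical segments that share "banana" and "cherry".
        List<TreeSet<String>> segments = new ArrayList<>();
        segments.add(new TreeSet<>(Arrays.asList("apple", "banana", "cherry")));
        segments.add(new TreeSet<>(Arrays.asList("banana", "cherry", "date")));
        System.out.println(countUniqueTerms(segments)); // prints 4
    }
}
```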

Alternatively, you can use IndexReader.getSequentialSubReaders(), then
step through each SegmentReader, calling its getUniqueTermCount(), and
then somehow "approximate" (e.g., the sum will be an upper bound on the
total unique count, since a term that appears in several segments is
counted once per segment).
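
One possible way to turn per-segment counts into an approximation is
sketched here, assuming the counts have already been collected from each
sub-reader into a long[]. TermCountBounds and its methods are hypothetical
names, not Lucene API. The lower bound is an addition to the suggestion
above: the index must contain at least as many unique terms as its largest
segment does.

```java
public class TermCountBounds {

    /** Sum of per-segment counts: an upper bound, because shared terms are counted once per segment. */
    public static long upperBound(long[] perSegmentCounts) {
        long sum = 0;
        for (long c : perSegmentCounts) {
            sum += c;
        }
        return sum;
    }

    /** Largest per-segment count: a lower bound, because the index holds at least that segment's terms. */
    public static long lowerBound(long[] perSegmentCounts) {
        long max = 0;
        for (long c : perSegmentCounts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical counts for three segments.
        long[] counts = {1200, 300, 45};
        System.out.println(lowerBound(counts) + " <= unique terms <= " + upperBound(counts));
    }
}
```

The true count falls anywhere in that range depending on how much the
segments overlap, so the interval is tight only when one segment dominates.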

Mike
