Posted to dev@lucene.apache.org by John Wang <jo...@gmail.com> on 2009/09/21 15:14:54 UTC

TermCount per field

Hi guys:
     Not sure if this would be a better fit on the users or the dev list.

     It would be very useful to be able to get the term count for a given
field, e.g.

     int IndexReader.termCount(String field)

     I wanted to get your opinion on the best way to approach this.
After looking through the code, it seems we already have it stored
in TermsHashPerField.numPostingInt (hopefully I am reading it correctly).

    Is it possible to add it to the FieldInfo class and write it out?

Thanks

-John

Re: TermCount per field

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, Sep 21, 2009 at 8:11 PM, John Wang <jo...@gmail.com> wrote:

> Makes lotta sense to me to wait for LUCENE-1458 then. Should I create an
> issue with a dependency on 1458?

Yes please open a new issue.

> One application for this is within FieldCache construction of StringIndex:
>
> If we know the number of terms is small, the orderArray using an int per doc
> is wasteful. In the case where we have 10 terms but 100M docs for a given
> field, the orderArray would take up 400MB, whereas half a byte per doc is
> sufficient, which means 50MB is enough. (Keep in mind this is per field!)
>
> To do such a memory optimization now requires iterating the term table twice
> to get the number, hence the motivation for this feature.

That sounds like a great improvement too!

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: TermCount per field

Posted by John Wang <jo...@gmail.com>.

Thanks Michael!

Makes lotta sense to me to wait for LUCENE-1458 then. Should I create an
issue with a dependency on 1458?

One application for this is within FieldCache construction of StringIndex:

If we know the number of terms is small, the orderArray using an int per doc
is wasteful. In the case where we have 10 terms but 100M docs for a given
field, the orderArray would take up 400MB, whereas half a byte per doc is
sufficient, which means 50MB is enough. (Keep in mind this is per field!)

To do such a memory optimization now requires iterating the term table twice
to get the number, hence the motivation for this feature.
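
The arithmetic above can be checked with a small sketch. This is only an
illustration of the idea, not Lucene code: PackedOrds, bitsRequired and
packedBytes are hypothetical names, and the layout (fixed-width ordinals
packed into a long[]) is an assumption about how such an order array might
be stored once the term count is known up front.

```java
/** Hypothetical fixed-width packed array of term ordinals; not a Lucene class. */
final class PackedOrds {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    /** Holds numDocs ordinals, each in the range [0, termCount). */
    PackedOrds(int numDocs, int termCount) {
        this.bitsPerValue = bitsRequired(termCount);
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) numDocs * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    /** Smallest width (at least 1 bit) that can represent termCount distinct ordinals. */
    static int bitsRequired(int termCount) {
        if (termCount < 1) {
            throw new IllegalArgumentException("termCount must be >= 1");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(termCount - 1));
    }

    /** Bytes needed for the packed layout, rounded up to whole 64-bit blocks. */
    static long packedBytes(long numDocs, int termCount) {
        long totalBits = numDocs * bitsRequired(termCount);
        return ((totalBits + 63) / 64) * 8;
    }

    void set(int doc, int ord) {
        if (ord < 0 || ord > mask) {
            throw new IllegalArgumentException("ord out of range: " + ord);
        }
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = ord;
        // Clear the slot in this block, then write the low bits of the value.
        blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two blocks: write its high bits into the next one.
            int lowBits = bitsPerValue - spill;
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> lowBits)) | (v >>> lowBits);
        }
    }

    int get(int doc) {
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // Pull the remaining high bits from the next block.
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (v & mask);
    }
}
```

With 10 terms, bitsRequired returns 4, so packedBytes(100000000L, 10) comes
to 50,000,000 bytes against 400,000,000 for one int per doc. If an extra
ordinal is reserved for docs that have no term in the field, passing
termCount + 1 (11 values here) still fits in 4 bits.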

Thanks

-John

Re: TermCount per field

Posted by Michael McCandless <lu...@mikemccandless.com>.

MultiReaders can't quickly compute the exact term count.  Would they
be allowed to throw UnsupportedOperationException (UOE), like
IndexReader.getUniqueTermCount does?
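
To illustrate why a reader over several segments has no cheap exact answer:
the same term can appear in more than one segment, so per-segment counts
cannot simply be added up, and an exact count needs a merge over every
segment's sorted terms. The sketch below is a hypothetical illustration
using plain sorted String arrays as stand-ins for per-segment term lists;
MergedTermCount and its methods are not Lucene APIs.

```java
/** Hypothetical illustration; each inner array stands in for one segment's sorted, duplicate-free terms. */
final class MergedTermCount {

    /** Cheap, but overcounts any term that occurs in more than one segment. */
    static int sumOfSegmentCounts(String[][] segments) {
        int sum = 0;
        for (String[] segment : segments) {
            sum += segment.length;
        }
        return sum;
    }

    /** Exact count of distinct terms; has to walk every term of every segment. */
    static int uniqueTermCount(String[][] segments) {
        int[] pos = new int[segments.length];
        int count = 0;
        while (true) {
            // Find the smallest term among the current heads of all segments.
            String smallest = null;
            for (int i = 0; i < segments.length; i++) {
                if (pos[i] < segments[i].length) {
                    String head = segments[i][pos[i]];
                    if (smallest == null || head.compareTo(smallest) < 0) {
                        smallest = head;
                    }
                }
            }
            if (smallest == null) {
                return count; // every segment is exhausted
            }
            count++;
            // Advance every segment currently positioned on that term.
            for (int i = 0; i < segments.length; i++) {
                if (pos[i] < segments[i].length && segments[i][pos[i]].equals(smallest)) {
                    pos[i]++;
                }
            }
        }
    }
}
```

For two segments holding {blue, green, red} and {green, red, yellow}, the
sum of the segment counts is 6 while the merged count is 4.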

TermsHashPerField.numPostings (not .numPostingsInt) tells you the #
unique terms currently in IndexWriter's RAM buffer, so I think we
could save that out with FieldInfo.  That seems reasonable?

We could also compute it at search time, because the SegmentTermEnum
knows its position.  I.e., you could seek to the first term of field X,
then to the first term of the field after X, and subtract the positions.
But the position is not exposed publicly now, and this would be more
costly to do (though we could cache & reuse the result).  It wouldn't
involve changing the index format.
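
A minimal sketch of the seek-and-subtract idea, under stated assumptions: the
term dictionary is modeled as an in-memory array of {field, text} pairs
sorted by field and then by text, and "position" is just the array index.
TermPositionCount and its methods are hypothetical names for illustration,
not the SegmentTermEnum API.

```java
/** Hypothetical stand-in for a segment's sorted term dictionary; not Lucene's SegmentTermEnum. */
final class TermPositionCount {

    /** Position of the first term whose field is >= the given field. */
    static int firstPositionAtOrAfter(String[][] sortedTerms, String field) {
        int lo = 0;
        int hi = sortedTerms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedTerms[mid][0].compareTo(field) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Position of the first term whose field sorts strictly after the given field. */
    static int firstPositionAfter(String[][] sortedTerms, String field) {
        int lo = 0;
        int hi = sortedTerms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedTerms[mid][0].compareTo(field) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Number of terms in the field: the difference between the two seek positions. */
    static int termCount(String[][] sortedTerms, String field) {
        return firstPositionAfter(sortedTerms, field)
                - firstPositionAtOrAfter(sortedTerms, field);
    }
}
```

No per-field count is stored in this approach; the result falls out of the
two positions, which is why it would not need an index format change.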

With LUCENE-1458 this becomes simple (it already keeps track of each
field's terms separately, including the total number of terms for that
field).

Mike

