
Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2010/04/30 19:25:17 UTC

Re: questions about DocsEnum.read() in flex api

On Fri, Apr 30, 2010 at 1:15 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> I’m a bit confused about the DocsEnum.read() in the flex API.   I have three
> questions:
>
>
> DocsEnum.read() currently delegates to nextDoc() in the base class and there
> is a note that subclasses may do this more efficiently.  Is there currently
> a more efficient implementation in a subclass?  I didn’t see one in
> MultiDocsEnum or MappingMultiDocsEnum, but perhaps I’m not understanding the
> code.

Yes, the standard codec does so (StandardPostingsReaderImpl.java).

MultiDocsEnum doesn't... but you should not use that (if performance
is important).  Instead you should go segment by segment.
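
For illustration, here is a minimal self-contained sketch of what "delegates to nextDoc()" means for a bulk read. The classes SketchDocsEnum, BulkResult and ArrayDocsEnum are hypothetical stand-ins written for this example, not the actual flex API classes; only the 64-entry buffer size and the method names read(), nextDoc(), freq() and getBulkResult() come from this thread, and their signatures here are assumptions.

```java
// Hypothetical stand-ins for illustration; these are NOT the real Lucene flex classes.
final class BulkResult {
    final int[] docs = new int[64];   // 64 entries, the buffer size mentioned in this thread
    final int[] freqs = new int[64];
}

abstract class SketchDocsEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final BulkResult bulk = new BulkResult();

    abstract int nextDoc();   // advances; returns the doc id or NO_MORE_DOCS
    abstract int freq();      // valid only after nextDoc() returned a real doc

    BulkResult getBulkResult() {
        return bulk;
    }

    // Naive bulk read: one nextDoc() call per buffered entry.
    // Returns how many entries of the bulk result were filled (0 when exhausted).
    int read() {
        int count = 0;
        while (count < bulk.docs.length) {
            int doc = nextDoc();
            if (doc == NO_MORE_DOCS) {
                break;
            }
            bulk.docs[count] = doc;
            bulk.freqs[count] = freq();
            count++;
        }
        return count;
    }
}

// Toy postings list backed by arrays, so the sketch can run without an index.
final class ArrayDocsEnum extends SketchDocsEnum {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1;

    ArrayDocsEnum(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    @Override
    int nextDoc() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    @Override
    int freq() {
        return freqs[pos];
    }
}
```

A codec that stores postings in blocks could presumably override read() to copy a whole block at once instead of looping over nextDoc(). Note that in this sketch read() is called directly, without a prior nextDoc() call.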

> DocsEnum.read reads 64 docs/freqs at a time as set up in initBulkResult().
> Would it make sense to have this configurable as an argument somewhere?
> I’m looking at very large indexes where a common term might occur in 100,000
> or more docs.

We could do that... maybe .getBulkResult should take a "suggested
size"?  It'd just be a suggestion though, since e.g. block-based codecs
would presumably return a direct slice into their underlying
int[] buffers.

> At the very top of the JavaDoc there is a warning “you must first call
> nextDoc”   It seems that this applies to calling DocsEnum.docID() or
> DocsEnum.freq() but not to DocsEnum.read().  Is that correct?

That's right -- I just committed a small fix to the javadoc to clarify this.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: questions about DocsEnum.read() in flex api

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Fri, Apr 30, 2010 at 2:37 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Thanks Mike!
>
> A follow-up question:
>
>> DocsEnum.read() currently delegates to nextDoc() in the base class and there
>> is a note that subclasses may do this more efficiently.  Is there currently
>> a more efficient implementation in a subclass?
>>>Yes, the standard codec does so (StandardPostingsReaderImpl.java).
>
> I assume that the standard codec is the default.

Right.  Only if the app uses its own codec in IndexWriter will it be
different...

> Will what I'm using in HighFreqTermsWithTF to instantiate an IndexReader (below) eventually end up instantiating the StandardPostingsReaderImpl or do I need to do something explicitly that will cause it to be instantiated?
>
> dir = FSDirectory.open(new File(args[0]));
> reader = IndexReader.open(dir, true);

Yes, this will use the standard codec (assuming your IndexWriter didn't
use a different codec).

Actually, for HighFreqTerms, what I said before ("you should go
segment by segment") is not a good idea, since that'd mean you'd have
to aggregate each term across segments, i.e. when the same term
appears in multiple segments.

I would say you should just use MultiTermsEnum, and use the bulk API,
and not worry for now that MultiDocsEnum doesn't override the bulk
read impl.  Or your patch could fix this -- it's just a matter of
calling the current segment's bulk read, and then doing a 2nd pass to
add the offset to each doc.
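
A rough sketch of that idea, under stated assumptions: SegmentBulkReader, ArraySegment and MultiSegmentBulkReader are hypothetical names made up for this example (they are not flex API classes), and docBases stands for each segment's starting doc id in the top-level reader. The multi-segment read calls the current segment's bulk read and then makes a second pass to add that segment's offset to every returned doc id.

```java
// Hypothetical segment-level bulk reader: fills docs/freqs, returns the count (0 when exhausted).
interface SegmentBulkReader {
    int read(int[] docs, int[] freqs);
}

// Toy segment backed by arrays, holding segment-local doc ids.
final class ArraySegment implements SegmentBulkReader {
    private final int[] docs;
    private final int[] freqs;
    private int pos = 0;

    ArraySegment(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    @Override
    public int read(int[] outDocs, int[] outFreqs) {
        int count = Math.min(outDocs.length, docs.length - pos);
        System.arraycopy(docs, pos, outDocs, 0, count);
        System.arraycopy(freqs, pos, outFreqs, 0, count);
        pos += count;
        return count;
    }
}

// Bulk read across segments: delegate to the current segment, then rebase doc ids.
final class MultiSegmentBulkReader {
    private final SegmentBulkReader[] segments;
    private final int[] docBases;   // assumed: starting doc id of each segment
    private int current = 0;

    MultiSegmentBulkReader(SegmentBulkReader[] segments, int[] docBases) {
        this.segments = segments;
        this.docBases = docBases;
    }

    int read(int[] docs, int[] freqs) {
        while (current < segments.length) {
            int count = segments[current].read(docs, freqs);
            if (count > 0) {
                int base = docBases[current];
                // Second pass: turn segment-local doc ids into top-level doc ids.
                for (int i = 0; i < count; i++) {
                    docs[i] += base;
                }
                return count;
            }
            current++;   // this segment is exhausted, move on to the next one
        }
        return 0;
    }
}
```

Each call returns entries from a single segment only, so a caller keeps calling read() until it returns 0.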

Mike

RE: questions about DocsEnum.read() in flex api

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Thanks Mike!

A follow-up question: 

> DocsEnum.read() currently delegates to nextDoc() in the base class and there
> is a note that subclasses may do this more efficiently.  Is there currently
> a more efficient implementation in a subclass?  
>>Yes, the standard codec does so (StandardPostingsReaderImpl.java).

I assume that the standard codec is the default.  Will what I'm using in HighFreqTermsWithTF to instantiate an IndexReader (below) eventually end up instantiating the StandardPostingsReaderImpl or do I need to do something explicitly that will cause it to be instantiated?
 
dir = FSDirectory.open(new File(args[0]));
reader = IndexReader.open(dir, true); 

Tom