
Posted to java-user@lucene.apache.org by Stephen GRAY <st...@immi.gov.au> on 2013/10/30 07:21:00 UTC

splitting docIds from a search by segment [SEC=UNOFFICIAL]

UNOFFICIAL
Hi everyone,

I am trying to write an application that loops through the 500,000 to 1,000,000 documents returned by a search and calculates some statistics from a per-document field value. This needs to be as fast as possible, so I am using a NumericDocValues field to store the value.

What I don't know is how to get the NumericDocValues value for each docId returned by the search. What I've been told to do in a previous thread was:

1. Split the docIds according to the segment they belong to

2. Get a per-segment NumericDocValues instance and use this to extract the values

Can someone tell me how to do 1 and 2? I don't know how to discover what segment a given docId is in, or how to convert a segment into a NumericDocValues array.

By the way, it's also been suggested that I just use MultiDocValues.getNumericValues, but I gather that this will be much slower.

I'd appreciate any help,

Thanks,
Steve

UNOFFICIAL

--------------------------------------------------------------------
Important Notice: If you have received this email by mistake, please advise
the sender and delete the message and attachments immediately. This email,
including attachments, may contain confidential, sensitive, legally privileged
and/or copyright information. Any review, retransmission, dissemination
or other use of this information by persons or entities other than the
intended recipient is prohibited. DIAC respects your privacy and has
obligations under the Privacy Act 1988. The official departmental privacy
policy can be viewed on the department's website at www.immi.gov.au. See:
http://www.immi.gov.au/functional/privacy.htm

---------------------------------------------------------------------

Re: splitting docIds from a search by segment [SEC=UNOFFICIAL]

Posted by Kyle Judson <kv...@hotmail.com>.

All,

Is the best way to get the docIDs in a case like this to use
IndexSearcher.search to get TopDocs and then get the ScoreDoc[] from
TopDocs.scoreDocs?

Thanks

Kyle


On 10/30/13 4:56 AM, "Michael McCandless" <lu...@mikemccandless.com>
wrote:

>You should try MultiDocValues first; it's trivial to use and may not
>be horribly slow.
>
>It must do a binary-search for every docID lookup.
>
>And then if this is too slow, assuming you traverse the docIDs in
>order, you can use IndexReader.leaves() to get the sub-readers.  The
>docIDs are just "appended" from these sub-readers, so you'd walk your
>docIDs and also walk your sub-readers, moving to the next sub-reader
>once you have a docID that's beyond its end.  Each sub-reader spans
>AtomicReaderContext.docBase to docBase +
>AtomicReaderContext.reader.maxDoc().
>
>Mike McCandless
>
>http://blog.mikemccandless.com
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: splitting docIds from a search by segment [SEC=UNOFFICIAL]

Posted by Michael McCandless <lu...@mikemccandless.com>.

You should try MultiDocValues first; it's trivial to use and may not
be horribly slow.

The catch is that it must do a binary search for every docID lookup,
to find the sub-reader (segment) that holds the document.
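
To make the per-lookup cost concrete, here is a minimal sketch of that
kind of search, written against plain arrays rather than Lucene classes.
The class name SegmentLookup, the method segmentOf and the docBases array
are hypothetical names for illustration only; docBases stands in for the
sorted list of each sub-reader's starting docID. It assumes no sub-reader
is empty, so the start offsets are all distinct.

```java
import java.util.Arrays;

public class SegmentLookup {
    /**
     * Returns the index of the segment that contains the global docId.
     * docBases[i] is the first global docID of segment i, in ascending
     * order with docBases[0] == 0 (an assumption of this sketch).
     */
    public static int segmentOf(int[] docBases, int docId) {
        int pos = Arrays.binarySearch(docBases, docId);
        // Exact hit: docId is the first document of segment pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owning
        // segment is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 100, 250};
        for (int docId : new int[] {0, 99, 100, 300}) {
            System.out.println(docId + " -> segment " + segmentOf(docBases, docId));
        }
    }
}
```

Each call costs O(log S) for S segments, which is why it may still be
acceptable for a few hundred thousand lookups.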

If that is too slow, and assuming you traverse the docIDs in
order, you can use IndexReader.leaves() to get the sub-readers.  The
docIDs are just "appended" from these sub-readers, so you'd walk your
docIDs and also walk your sub-readers, moving to the next sub-reader
once you have a docID that's beyond its end.  Each sub-reader spans
AtomicReaderContext.docBase (inclusive) to docBase +
AtomicReaderContext.reader().maxDoc() (exclusive).
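
The walk described above can be sketched as follows. This is only an
illustration under stated assumptions, not Lucene code: the class
SegmentWalk, the method sumValues and the array parameters are
hypothetical. docBases[i] plays the role of AtomicReaderContext.docBase,
perSegmentValues[i].length plays the role of reader().maxDoc(), and
perSegmentValues[i][localDoc] stands in for a per-segment
NumericDocValues lookup. The sketch sorts the docIDs first, since hits
sorted by relevance are not in docID order, and it assumes every docID
is valid for the index.

```java
import java.util.Arrays;

public class SegmentWalk {
    /**
     * Sums one value per hit by walking sorted docIDs and segments together.
     * All three parameters are stand-ins for Lucene structures (see lead-in).
     */
    public static long sumValues(int[] docIds, int[] docBases, long[][] perSegmentValues) {
        int[] sorted = docIds.clone();
        Arrays.sort(sorted); // the walk below requires ascending docIDs

        long sum = 0;
        int seg = 0;
        for (int docId : sorted) {
            // Move to the next segment while docId lies at or beyond the
            // end (docBase + maxDoc) of the current one.
            while (docId >= docBases[seg] + perSegmentValues[seg].length) {
                seg++;
            }
            // Convert the global docID to a segment-local one.
            int localDoc = docId - docBases[seg];
            sum += perSegmentValues[seg][localDoc];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 3, 5};
        long[][] values = {{10, 20, 30}, {40, 50}, {60, 70, 80, 90}};
        int[] hits = {7, 0, 4, 2};
        System.out.println(sumValues(hits, docBases, values));
    }
}
```

Because the segment index only ever moves forward, the whole pass is
linear in the number of hits plus the number of segments, with no
per-document search.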

Mike McCandless

http://blog.mikemccandless.com
