Posted to java-user@lucene.apache.org by Christoph Kaser <lu...@iconparc.de> on 2017/06/01 10:42:57 UTC
Re: searchAfter is missing results when custom noncontinuous slices are used

Hello Mike,

thank you for the explanation!
I created a jira issue: LUCENE-7861

Best regards,
Christoph

On 25.05.2017 at 16:11, Michael McCandless wrote:
> Yes, there is a (hidden) assumption in TopDocs.merge that the hits it's
> merging are logically non-overlapping, sequential slices of the index, but
> in your case they are "interleaved".
>
> TopDocs.merge doesn't otherwise trust the incoming docID to be from the
> same docID space, and in your case it is.
>
> Maybe we could improve TopDocs.merge to optionally use the already global
> docID for tie breaking?
>
> Yes, please open an issue.  Maybe we just improve the javadocs as you
> suggested, but the situation sure is trappy today.
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, May 24, 2017 at 10:06 AM, Christoph Kaser <lu...@iconparc.de>
> wrote:
>
>> Hello everybody,
>>
>> I have observed an unexpected behavior in Lucene, and I am unsure whether
>> this is a bug, or a missing warning in the documentation:
>>
>> I am using the IndexSearcher with an ExecutorService in order to take
>> advantage of multiple CPU cores during the searches. I want to limit the
>> number of cores a single search can occupy, so I have overridden the
>> IndexSearcher method
>>      protected LeafSlice[] slices(List<LeafReaderContext> leaves)
>> to return a fixed number of slices (e.g. 4).
>>
>> I tried to create slices that are about the same size by looping over the
>> leaves (ordered by size descending) and adding the current leaf to the
>> slice with the smallest number of documents.
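
A minimal sketch of the balancing strategy described above. It models each leaf only by a hypothetical document count instead of a Lucene LeafReaderContext, and the class and method names (GreedySlices, assign) are illustrative, not part of Lucene:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GreedySlices {
    /**
     * Distributes leaves (given by their document counts) over a fixed number
     * of slices: largest leaf first, always into the currently smallest slice.
     * Returns, per slice, the indexes of the leaves assigned to it.
     */
    static List<List<Integer>> assign(int[] leafDocCounts, int numSlices) {
        Integer[] order = new Integer[leafDocCounts.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Visit the largest leaves first.
        Arrays.sort(order,
            Comparator.comparingInt((Integer i) -> leafDocCounts[i]).reversed());

        List<List<Integer>> slices = new ArrayList<>();
        long[] sliceDocs = new long[numSlices];
        for (int s = 0; s < numSlices; s++) {
            slices.add(new ArrayList<>());
        }

        for (int leaf : order) {
            // Find the slice that currently holds the fewest documents.
            int smallest = 0;
            for (int s = 1; s < numSlices; s++) {
                if (sliceDocs[s] < sliceDocs[smallest]) {
                    smallest = s;
                }
            }
            slices.get(smallest).add(leaf);
            sliceDocs[smallest] += leafDocCounts[leaf];
        }
        return slices;
    }

    public static void main(String[] args) {
        // Hypothetical segment sizes in index order (leaf 0 has the lowest docIDs).
        int[] leafDocCounts = {100, 900, 300, 800, 200};
        System.out.println(assign(leafDocCounts, 2)); // [[1, 4, 0], [3, 2]]
    }
}
```

With these sizes, the first slice ends up with leaves 0, 1 and 4 and the second with leaves 2 and 3, so neither slice covers one contiguous docID range.
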
>>
>> This worked well, until I stumbled upon a query for which searchAfter
>> seemed to skip hits, so that the total number of hits obtained by multiple
>> calls to searchAfter was lower than TopDocs.totalHits.
>>
>> The issue seems to be how searchAfter works vs how TopDocs.merge works:
>>
>> searchAfter skips every document with a higher score than the "after"
>> document. In case of equal scores, it uses the document id and skips every
>> document whose id is less than or equal to that of the "after" document
>> (see PagingFieldCollector).
>>
>> TopDocs.merge uses the score to determine which hits should be part of the
>> merged TopDocs. In case of equal scores, it uses the shard index (this
>> corresponds to the slices the IndexSearcher uses) to break ties (see
>> ScoreMergeSortQueue.lessThan)
>>
>> So if the slices are non-contiguous (as they are in my case), searchAfter
>> orders equally scored documents differently from TopDocs.merge, and
>> therefore hits are skipped.
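
The mismatch can be reproduced in a toy model that does not use Lucene at all. The sketch below assumes a simplified merge order (score descending, then slice index, then docID) and a simplified paging filter (on equal score, keep only docIDs greater than the "after" docID). All names (InterleavedSlicePaging, Hit, pageAll) are hypothetical, and the logic approximates the behavior described above rather than copying Lucene's code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class InterleavedSlicePaging {
    /** A hit in the toy model: global docID, score, and the slice it came from. */
    static final class Hit {
        final int doc;
        final float score;
        final int slice;

        Hit(int doc, float score, int slice) {
            this.doc = doc;
            this.score = score;
            this.slice = slice;
        }
    }

    /** Merge order in this model: score descending, then slice index, then docID. */
    static final Comparator<Hit> MERGE_ORDER = (a, b) -> {
        if (a.score != b.score) return Float.compare(b.score, a.score);
        if (a.slice != b.slice) return Integer.compare(a.slice, b.slice);
        return Integer.compare(a.doc, b.doc);
    };

    /** Returns one page; "after" is the last hit of the previous page, or null. */
    static List<Hit> page(List<Hit> hits, Hit after, int size) {
        return hits.stream()
            // Paging filter: drop higher scores; on equal score drop docID <= after.doc.
            .filter(h -> after == null
                || h.score < after.score
                || (h.score == after.score && h.doc > after.doc))
            .sorted(MERGE_ORDER)
            .limit(size)
            .collect(Collectors.toList());
    }

    /** Pages through all hits and returns the docIDs in the order they were seen. */
    static List<Integer> pageAll(List<Hit> hits, int size) {
        List<Integer> seen = new ArrayList<>();
        Hit after = null;
        while (true) {
            List<Hit> p = page(hits, after, size);
            if (p.isEmpty()) {
                return seen;
            }
            p.forEach(h -> seen.add(h.doc));
            after = p.get(p.size() - 1);
        }
    }

    /** Four equally scored hits; slice 0 holds docs 0 and 2, slice 1 holds docs 1 and 3. */
    static List<Hit> interleavedHits() {
        return Arrays.asList(new Hit(0, 1.0f, 0), new Hit(1, 1.0f, 1),
                             new Hit(2, 1.0f, 0), new Hit(3, 1.0f, 1));
    }

    /** The same hits, but each slice is a contiguous docID range. */
    static List<Hit> contiguousHits() {
        return Arrays.asList(new Hit(0, 1.0f, 0), new Hit(1, 1.0f, 0),
                             new Hit(2, 1.0f, 1), new Hit(3, 1.0f, 1));
    }

    public static void main(String[] args) {
        System.out.println(pageAll(interleavedHits(), 2)); // [0, 2, 3]
        System.out.println(pageAll(contiguousHits(), 2));  // [0, 1, 2, 3]
    }
}
```

In the interleaved case the first page ends with doc 2, so the next page only considers docIDs above 2 and doc 1 is never returned. With contiguous slices the two orderings agree and all four hits are seen.
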
>>
>> Here are my questions:
>>
>> * Are slices meant to be contiguous "sublists" of the passed leaves-list?
>> Or is my way of slicing meant to be supported?
>> * If my way of slicing is not supported, could you either add a warning to
>> the javadocs of the slices method or maybe even add a check for a legal
>> return value of slices()?
>> * Should I create a jira issue for this?
>>
>> Sorry for the wall of text, I hope I explained the problem in an
>> understandable way!
>>
>> Thank you and best regards
>> Christoph
>>
>>
>>