Posted to solr-user@lucene.apache.org by Arthur Gavlyukovskiy <ag...@gmail.com> on 2020/07/21 20:25:58 UTC

Rerank for distributed requests

Hi.

We're using LTR, and after switching to multiple shards we found that
reranking happens on individual shards and the first pass score isn't used
during the merge phase. Currently our LTR model doesn't use textual match
and assumes that the documents being reranked are already reasonably good
in terms of textual score, which is not always the case when documents are
distributed across shards.

To avoid this I tried sorting by a function that replicates the actual
query, and the results are somewhat interesting: on individual shards the
first pass is ordered by my sort, then documents are reranked, and during
the merge documents from the same shard are compared by "orderInShard"
while documents from different shards are compared by the sort value, so
the final order follows neither the sort value nor the score.
For example let's assume that documents coming from shard 1 are:
    doc1(first_pass_score = 1, second_pass_score = 2)
    doc2(first_pass_score = 4, second_pass_score = 1)
and documents coming from shard 2 are:
    doc4(first_pass_score = 3, second_pass_score = 4)
    doc3(first_pass_score = 2, second_pass_score = 3)
where first_pass_score is doc.sort_values[0] and second_pass_score is
doc.score

When we try to merge all documents, this happens:
    queue.insertWithOverflow(doc1)
    queue.insertWithOverflow(doc2)
        queue.lessThan(doc1, doc2) -> false (doc1.orderInShard = 1 < doc2.orderInShard = 2)
    queue.insertWithOverflow(doc4)
        queue.lessThan(doc2, doc4) -> false (doc2.first_pass_score = 4 > doc4.first_pass_score = 3)
    queue.insertWithOverflow(doc3)
        queue.lessThan(doc4, doc3) -> false (doc4.orderInShard = 1 < doc3.orderInShard = 2)

and the final document order will be:
    doc1(first_pass_score = 1, second_pass_score = 2)
    doc2(first_pass_score = 4, second_pass_score = 1)
    doc4(first_pass_score = 3, second_pass_score = 4)
    doc3(first_pass_score = 2, second_pass_score = 3)
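
A quick way to confirm that this merged order follows neither key is to
check both score columns for descending order. The snippet below is only an
illustrative check on the four example documents above; the helper name
is_descending is made up for this sketch and is not part of Solr.

```python
# Final merged order from the example: (name, first_pass_score, second_pass_score)
merged = [
    ("doc1", 1, 2),
    ("doc2", 4, 1),
    ("doc4", 3, 4),
    ("doc3", 2, 3),
]


def is_descending(values):
    """Return True if every value is >= the value that follows it."""
    return all(a >= b for a, b in zip(values, values[1:]))


first_pass = [doc[1] for doc in merged]
second_pass = [doc[2] for doc in merged]

sorted_by_first_pass = is_descending(first_pass)
sorted_by_second_pass = is_descending(second_pass)

print("sorted by first pass score: ", sorted_by_first_pass)
print("sorted by second pass score:", sorted_by_second_pass)
```

Both checks come out false, which matches the observation that the page is
ordered by neither the sort value nor the rerank score.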

Ideally I would want the rerank to be based on the global order across all
shards. I've implemented a custom component that asks each shard to return
*Math.max(reRankDocs, offset + rows)* documents, which are first sorted by
first pass score, and then only the top *reRankDocs* are sorted by second
pass score. I understand that it might not be the best approach in terms of
performance (we rerank only the top 60 documents, so it's not that big of a
deal), but it's functionally equivalent to the single shard behavior.
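
The idea can be sketched roughly as follows. This is a standalone Python
simulation, not the actual component and not Solr code: the Doc class and
the global_rerank function are hypothetical names, and it assumes every
document already carries both its first pass and second pass score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    first_pass_score: float
    second_pass_score: float


def global_rerank(shard_results, re_rank_docs, offset, rows):
    """Merge shard results by first pass score, then rerank only the global top."""
    # Each shard contributes at most max(reRankDocs, offset + rows) documents,
    # taken in first pass order.
    per_shard = max(re_rank_docs, offset + rows)
    candidates = []
    for docs in shard_results:
        by_first_pass = sorted(docs, key=lambda d: d.first_pass_score, reverse=True)
        candidates.extend(by_first_pass[:per_shard])

    # Global first pass order across all shards.
    candidates.sort(key=lambda d: d.first_pass_score, reverse=True)

    # Only the global top reRankDocs are reordered by second pass score;
    # the tail keeps its first pass order.
    head = sorted(candidates[:re_rank_docs],
                  key=lambda d: d.second_pass_score, reverse=True)
    tail = candidates[re_rank_docs:]
    return (head + tail)[offset:offset + rows]


shard1 = [Doc("doc1", 1, 2), Doc("doc2", 4, 1)]
shard2 = [Doc("doc4", 3, 4), Doc("doc3", 2, 3)]

page = global_rerank([shard1, shard2], re_rank_docs=2, offset=0, rows=4)
print([d.name for d in page])
```

With reRankDocs = 2 the example documents come back as doc4, doc2, doc3,
doc1: doc2 and doc4 are the global top two by first pass score and get
reordered by second pass score, while doc3 and doc1 stay in first pass
order behind them.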

I'm curious whether the current behavior is intended or not. I would expect
either what I described above, or at least that the sort is ignored during
the merge and only the doc.score generated by the LTR rescorer is used.
Maybe the community would be interested in the approach I've implemented?
Or is it considered bad design to rely on the first pass score, and should
our LTR model use fields from the first pass / use OriginalScoreFeature?

Re: Rerank for distributed requests

Posted by Dmitry Kan <so...@gmail.com>.
Hi Arthur,

I'm facing a similar issue with an LTR query over multiple collections in
SolrCloud. The issue is that the documents returned and merged into a
single page have scores that don't look sorted at all.

For example (this is a single page of results):

// collection1
-2.1818457
-2.1818457
...
4.2359614

// collection2
-2.224318

// collection1
2.7780528

// collection3
2.807676

// collection1
-1.3967791


The expectation I had while testing against a single collection: the
reranked N documents are placed at the top of the page and the tail of
documents will be sorted by the original non-LTR scoring model (like TF-IDF
or BM25).
And this is how a single shard returned the results.

The expectation for multiple queried collections: all reranked documents
form the top of the page (and the question here is: should this top be of
size N or N*number of collections), with the tail of the documents
interleaved and sorted by the non-LTR scoring model.
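
To make that expectation concrete, here is a small Python sketch of such a
merge. It describes the expected outcome only, not what Solr currently
does; the function merge_collections, the "reranked" flag, and the sample
documents are all made up for illustration. It keeps every reranked
document at the top, i.e. the N*number of collections variant.

```python
def merge_collections(collection_results):
    """Put all reranked docs first (by LTR score), then the tail (by original score)."""
    all_docs = [doc for docs in collection_results for doc in docs]
    # For reranked docs "score" is the LTR score; for the rest it is the
    # original first pass score (e.g. BM25).
    head = sorted((d for d in all_docs if d["reranked"]),
                  key=lambda d: d["score"], reverse=True)
    tail = sorted((d for d in all_docs if not d["reranked"]),
                  key=lambda d: d["score"], reverse=True)
    return head + tail


# Hypothetical results from two collections, N = 1 reranked doc each.
collection1 = [
    {"id": "a", "reranked": True, "score": 4.2},
    {"id": "b", "reranked": False, "score": 1.5},
]
collection2 = [
    {"id": "c", "reranked": True, "score": -2.2},
    {"id": "d", "reranked": False, "score": 2.8},
]

merged_ids = [d["id"] for d in merge_collections([collection1, collection2])]
print(merged_ids)
```

Here "c" stays above "d" even though its LTR score is lower than the
original score of "d", because the two scores come from different models
and are not comparable. Trimming the head to N instead would be the other
option raised in the question.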

Would you mind sharing the details of your component, provided that you
would still be interested in sharing your implementation with the
community? Thanks!





-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: https://semanticanalyzer.info