You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2021/06/04 18:39:59 UTC

[GitHub] [solr] tomglk commented on a change in pull request #151: SOLR-15437: ReRanking/LTR does not work in combination with custom sort and SolrCloud

tomglk commented on a change in pull request #151:
URL: https://github.com/apache/solr/pull/151#discussion_r645777966



##########
File path: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
##########
@@ -999,20 +1013,34 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest sreq) {
 
           shardDoc.sortFieldValues = unmarshalledSortFieldValues;
 
-          queue.insertWithOverflow(shardDoc);
+          if(reRankQueue != null && docCounter++ <= reRankDocsSize) {
+              ShardDoc droppedShardDoc = reRankQueue.insertWithOverflow(shardDoc);
+              // FIXME: Only works if the original request does not sort by score

Review comment:
       Hi @cpoerschke please excuse the delay, I am very sorry!
   I just found time to look at your commit.
   The tests fail because they assume, that the number of reRankDocs is applied to the whole result. Your change applies it to each shard.
   So with e.g. 3 shards and reRankDocs=2 we expect the first 2 docs to be reranked, but instead 6 docs get reranked. Then the assertion on sort by id fails.
   (Actually, the 6 is variable. I ran tests where 5 docs were reRanked in total, because shard 3 only had one doc.)
   
   Because the client cannot (and should not have to) know, how many shards the cluster has, we would have to convert the reRankDocs that were passed in the `{!ltr...}`-query to a per shard value. 
   So basically `<value that was passed in> / numShards`.
   
   However, I don't think that we can convert the global setting for reRankDocs into a per shard setting because we cannot know how the top reRankDocs are distributed across the shards. 
   So when the client wants 6 docs to be reRanked and we have 3 shards, we cannot tell each shard that it should reReank the first 2 documents.
   This is because the top 6 reRanked docs may be all be on one shard.
   That was our reasoning behind the `droppedShardDoc`-logic.
   
   So referring to your examples (thank you for the visualization! :) ):
   ```
   dual sharded collection with reRankDocs=5 (per shard)
   
   shard1: a c e g i   y # without reranking
   shard2: b d f h j x z # without reranking
   
   shard1: I G E C A   y
   shard2: J H F D B x z
   
   reRankedQueue: J I H G F E D C B A
   queue: x y z
   
   overall results: J I H G F E D C B A x y z
   ```
   The top 10 docs by their reRanked score could be:
   ```
   dual sharded collection with reRankDocs=5 (across the whole result)
   
   shard1: a c e g i   y # without reranking
   shard2: b d f h j x z # without reranking
   
   shard1: I G E C A   y
   shard2: J H F D B x z
   
   reRankedQueue: J H F D B
   queue: a c e g i x y z
   
   overall results: J H F D B a c e g i x y z
   ```
   if the reRankScore for all docs on shard2 is higher then the score for each reRanked doc on shard1.
   
   That's a bit sad, because if we could just convert the global param for the reRankDocs to a per shard one, we would not have the problem with the missing originalScore for documents, that were reRanked on a shard but not in the whole result. :/




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org