Posted to solr-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2013/09/03 01:42:41 UTC

distributed query result order tie break question

My question is about how query results are ordered in a distributed 
query when sorting by "relevance" and all the documents have the same 
score, for example, when querying for "*:*".

It looks to me as if score ties are broken by shard and then within each 
shard, by docid.  So for example, if I were to iterate over all the 
documents using such a query, I would expect to get all the documents 
from one shard first, then all the documents from another shard, etc.  
Is that right?

Thanks

-Mike Sokolov

Re: distributed query result order tie break question

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

On 09/03/2013 12:50 PM, Chris Hostetter wrote:
> : like to understand how the ordering is defined so that I can compute an
> : integer that is sorted in the same way.  For example (shard "id" << 24) |
> : docid or something like that.
>
> If you want to ensure a consistent ordering, you have to index a
> (unique) value that you use as a secondary sort -- you can't trust the
> internal docids will remain unchanged.
>
Thanks, Hoss - that was the conclusion that I was coming to. It's good 
to have it confirmed.

-Mike

Re: distributed query result order tie break question

Posted by Chris Hostetter <ho...@fucit.org>.

: like to understand how the ordering is defined so that I can compute an
: integer that is sorted in the same way.  For example (shard "id" << 24) |
: docid or something like that.

If you want to ensure a consistent ordering, you have to index a 
(unique) value that you use as a secondary sort -- you can't trust the 
internal docids will remain unchanged.
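
A minimal sketch of what such a secondary sort could look like on the request side, assuming the unique value is indexed in a field called "id" and the core lives at a hypothetical local URL (both are illustrative assumptions, not details from this thread). It only builds the request URL for Solr's standard "sort" parameter and does not contact a server:

```python
from urllib.parse import urlencode


def build_select_url(base_url, query="*:*", tiebreak_field="id", rows=10, start=0):
    """Build a /select URL that sorts by score, then by a unique field.

    tiebreak_field is assumed to be an indexed, single-valued unique field;
    the name "id" is only a placeholder for whatever the schema defines.
    """
    params = {
        "q": query,
        # Score first, then the unique field, so ties are broken the same
        # way regardless of shard or internal docid.
        "sort": f"score desc,{tiebreak_field} asc",
        "rows": rows,
        "start": start,
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical base URL, for illustration only.
url = build_select_url("http://localhost:8983/solr/collection1")
```

Because the tie-break value is part of the document rather than the index, the resulting order does not depend on which replica answers.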


-Hoss

Re: distributed query result order tie break question

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

Mostly I'm just trying to understand. For the moment I'm putting 
together a design for distributed Lux (XQuery backed by Solr Cloud).  My 
motivation is that I am feeding results into a separate XQuery system, 
and that requires a consistent global document ordering.  The ordering 
can be arbitrary, it just has to be stable for the duration of a single 
query (but this could span multiple Lucene/Solr queries).  In the 
non-distributed version of this, I just use the docid directly, which is 
convenient.  In the distributed case, I'd like to understand how the 
ordering is defined so that I can compute an integer that is sorted in 
the same way.  For example (shard "id" << 24) | docid or something like 
that.
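
To make the bit-packing idea concrete, here is a small sketch. The function names and the numeric shard index are hypothetical, and the 24-bit width is simply the figure from the example above. Note that 24 bits only covers docids below 16,777,216, while Lucene docids are 32-bit ints, so a real design would probably need a wider field:

```python
def composite_key(shard_index: int, docid: int, docid_bits: int = 24) -> int:
    """Pack a shard index and a per-shard docid into one sortable integer.

    Keys sort by shard first, then by docid within the shard. The result is
    only as stable as the docids themselves.
    """
    if shard_index < 0 or docid < 0:
        raise ValueError("shard_index and docid must be non-negative")
    if docid >= (1 << docid_bits):
        raise ValueError("docid does not fit in docid_bits")
    return (shard_index << docid_bits) | docid


def split_key(key: int, docid_bits: int = 24) -> tuple:
    """Recover (shard_index, docid) from a composite key."""
    return key >> docid_bits, key & ((1 << docid_bits) - 1)


# Sorting the packed keys orders documents by shard, then docid.
keys = sorted(composite_key(s, d) for s, d in [(1, 0), (0, 100), (0, 3)])
ordered = [split_key(k) for k in keys]
```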

I can see that there might be perturbations in the ordering if there are 
updates (Lucene can reassign docids, etc).  With Lucene I'm able to 
control this by keeping a Searcher/Reader open for the duration of the 
query.  It seems that in Solr (cloud or not), I can't really get this 
kind of guarantee.  I guess I'm willing to live with this since the time 
window is very small and the likelihood of a problem is small (most 
XQueries only use a single underlying Solr query anyway, so this whole 
concern is a little bit pathological).  I've been considering using a 
global ordering based on my unique id (document uri), although of course 
an update can still happen and mess things up mid-query, so ultimately 
it's not a bulletproof solution either.
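
As a rough illustration of the unique-id ordering, the sketch below merges per-shard result lists that are each already sorted by score (descending) and then by document uri (ascending). The data and the function name are made up for the example, and this is not a description of how Solr merges shard responses internally:

```python
import heapq


def merge_shard_results(shard_results):
    """Merge per-shard lists of (score, uri) into one global order.

    Each input list must already be sorted by descending score, then
    ascending uri. The uri acts as the tie-breaker when scores are equal.
    """
    return list(heapq.merge(*shard_results, key=lambda hit: (-hit[0], hit[1])))


# Hypothetical results for "*:*", where every document has the same score.
shard_a = [(1.0, "/a.xml"), (1.0, "/c.xml")]
shard_b = [(1.0, "/b.xml"), (1.0, "/d.xml")]
merged = merge_shard_results([shard_a, shard_b])
```

With equal scores the merged order interleaves the shards by uri instead of returning one shard's documents before the next.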

Thanks, Jack

-Mike

On 9/2/2013 8:26 PM, Jack Krupansky wrote:
> "*:*" is a constant score query - every document has the same score, 
> so the concept of relevancy has no relevance.
>
> But, in theory, you could apply boost queries and function queries to 
> scale or offset those constant scores. If so, then you should see 
> relevancy sorting, otherwise the concept of relevancy does not apply.
>
> I don't think Solr offers any "contract" as to ordering of constant 
> score documents or merging of same score documents across shards. At 
> least I have never seen such a contract published. So, if you are 
> merely observing the actual behavior of Solr, fine, but if you are 
> expecting that such behavior will persist in future releases, there 
> can be no such guarantee.
>
> I don't think Solr will necessarily guarantee that Lucene doc IDs will 
> be the same between replicas (they depend on the order in which 
> distributed updates are received), so there is no guarantee that the 
> behavior you see from one round-robin iteration will necessarily be 
> the same on a repeat of the same distributed query.
>
> The bottom line is: What exactly are you after, simply an explanation 
> for what you are seeing, or a guarantee that you will always see that 
> behavior?
>
> -- Jack Krupansky

Re: distributed query result order tie break question

Posted by Jack Krupansky <ja...@basetechnology.com>.

"*:*" is a constant score query - every document has the same score, so the 
concept of relevancy has no relevance.

But, in theory, you could apply boost queries and function queries to scale 
or offset those constant scores. If so, then you should see relevancy 
sorting, otherwise the concept of relevancy does not apply.

I don't think Solr offers any "contract" as to ordering of constant score 
documents or merging of same score documents across shards. At least I have 
never seen such a contract published. So, if you are merely observing the 
actual behavior of Solr, fine, but if you are expecting that such behavior 
will persist in future releases, there can be no such guarantee.

I don't think Solr will necessarily guarantee that Lucene doc IDs will be 
the same between replicas (they depend on the order in which distributed 
updates are received), so there is no guarantee that the behavior you see 
from one round-robin iteration will necessarily be the same on a repeat of 
the same distributed query.

The bottom line is: What exactly are you after, simply an explanation for 
what you are seeing, or a guarantee that you will always see that behavior?

-- Jack Krupansky
