Posted to solr-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2013/09/03 01:42:41 UTC
distributed query result order tie break question
My question is about how query results are ordered in a distributed
query when sorting by "relevance" and all the documents have the same
score, for example, when querying for "*:*".
It looks to me as if score ties are broken by shard and then within each
shard, by docid. So for example, if I were to iterate over all the
documents using such a query, I would expect to get all the documents
from one shard first, then all the documents from another shard, etc.
Is that right?
Thanks
-Mike Sokolov
Re: distributed query result order tie break question
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
On 09/03/2013 12:50 PM, Chris Hostetter wrote:
> : like to understand how the ordering is defined so that I can compute an
> : integer that is sorted in the same way. For example (shard "id" << 24) |
> : docid or something like that.
>
> If you want to ensure a consistent ordering, you have to index a
> (unique) value that you use as a secondary sort -- you can't trust the
> internal docids will remain unchanged.
>
Thanks, Hoss - that was the conclusion that I was coming to. It's good
to have it confirmed.
-Mike
Re: distributed query result order tie break question
Posted by Chris Hostetter <ho...@fucit.org>.
: like to understand how the ordering is defined so that I can compute an
: integer that is sorted in the same way. For example (shard "id" << 24) |
: docid or something like that.
If you want to ensure a consistent ordering, you have to index a
(unique) value that you use as a secondary sort -- you can't trust the
internal docids will remain unchanged.
-Hoss
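A minimal sketch of the secondary-sort suggestion above, written as request parameters for a Solr select handler. The unique field name `id` and the helper `build_select_params` are assumptions for illustration; substitute whatever uniqueKey the schema actually defines.

```python
from urllib.parse import urlencode


def build_select_params(query, unique_field="id", start=0, rows=10):
    """Build select parameters that break score ties on a unique field.

    `unique_field` is assumed to be an indexed, single-valued unique key.
    """
    return {
        "q": query,
        # Primary sort is relevance; the unique field is the tie breaker,
        # so documents with equal scores come back in a defined order.
        "sort": f"score desc, {unique_field} asc",
        "start": start,
        "rows": rows,
        "wt": "json",
    }


params = build_select_params("*:*")
query_string = urlencode(params)
print(query_string)
```

Because the tie breaker is an indexed value rather than an internal docid, the order does not depend on which shard or replica answers.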
Re: distributed query result order tie break question
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Mostly I'm just trying to understand. For the moment I'm putting
together a design for distributed Lux (XQuery backed by Solr Cloud). My
motivation is that I am feeding results into its separate XQuery system,
and that requires a consistent global document ordering. The ordering
can be arbitrary; it just has to be stable for the duration of a single
query (but this could span multiple Lucene/Solr queries). In the
non-distributed version of this, I just use the docid directly, which is
convenient. In the distributed case, I'd like to understand how the
ordering is defined so that I can compute an integer that is sorted in
the same way. For example (shard "id" << 24) | docid or something like
that.
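The packing idea above can be sketched as follows. This is only an illustration of the arithmetic, not anything Solr provides: `pack_key`, `unpack_key`, and the 24-bit split are assumptions, and the shard index is whatever stable small integer the application assigns to each shard.

```python
DOCID_BITS = 24                      # assumed width reserved for the docid
DOCID_MASK = (1 << DOCID_BITS) - 1   # largest docid that fits in that width


def pack_key(shard_index, docid):
    """Combine a shard index and a per-shard docid into one sortable integer."""
    if shard_index < 0:
        raise ValueError("shard_index must be non-negative")
    if not 0 <= docid <= DOCID_MASK:
        raise ValueError("docid does not fit in DOCID_BITS bits")
    return (shard_index << DOCID_BITS) | docid


def unpack_key(key):
    """Recover (shard_index, docid) from a packed key."""
    return key >> DOCID_BITS, key & DOCID_MASK


pairs = [(1, 7), (0, 9), (1, 2), (0, 3)]
keys = sorted(pack_key(s, d) for s, d in pairs)
print([unpack_key(k) for k in keys])
```

Sorting the packed keys orders documents by shard first and docid second, which is the ordering described in the question. With 24 bits the docid is capped at 16,777,215 per shard; Lucene docids are ints, so a large shard would need a wider split.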
I can see that there might be perturbations in the ordering if there are
updates (Lucene can reassign docids, etc). With Lucene I'm able to
control this by keeping a Searcher/Reader open for the duration of the
query. It seems that in Solr (cloud or not), I can't really get this
kind of guarantee. I guess I'm willing to live with this since the time
window is very small and the likelihood of a problem is small (most
XQueries only use a single underlying Solr query anyway, so this whole
concern is a little bit pathological). I've been considering using a
global ordering based on my unique id (document uri), although of course
an update can still happen and mess things up mid-query, so ultimately
it's not a bulletproof solution either.
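As a rough sketch of what a global ordering keyed on the unique id looks like, the following merges per-shard result lists that are each already sorted by score and then uri. It is a client-side illustration only and does not describe how Solr merges shard responses internally; the `score` and `uri` field names and the `merge_shard_results` helper are assumptions.

```python
import heapq


def merge_shard_results(shard_results):
    """Merge per-shard lists, each pre-sorted by (score desc, uri asc).

    The same key is used for the merge, so ties on score are broken by uri
    and the result does not depend on the order the shards are listed in.
    """
    def sort_key(doc):
        return (-doc["score"], doc["uri"])

    return list(heapq.merge(*shard_results, key=sort_key))


shard_a = [{"score": 1.0, "uri": "/a"}, {"score": 1.0, "uri": "/c"}]
shard_b = [{"score": 1.0, "uri": "/b"}, {"score": 1.0, "uri": "/d"}]

merged = merge_shard_results([shard_a, shard_b])
print([doc["uri"] for doc in merged])
```

With equal scores the documents interleave across shards by uri instead of arriving one shard at a time. As noted above, an update between two underlying queries can still change the set of documents, so this gives a defined order rather than a snapshot.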
Thanks, Jack
-Mike
On 9/2/2013 8:26 PM, Jack Krupansky wrote:
> "*:*" is a constant score query - every document has the same score,
> so the concept of relevancy has no relevance.
>
> But, in theory, you could apply boost queries and function queries to
> scale or offset those constant scores. If so, then you should see
> relevancy sorting, otherwise the concept of relevancy does not apply.
>
> I don't think Solr offers any "contract" as to ordering of constant
> score documents or merging of same score documents across shards. At
> least I have never seen such a contract published. So, if you are
> merely observing the actual behavior of Solr, fine, but if you are
> expecting that such behavior will persist in future releases, there
> can be no such guarantee.
>
> I don't think Solr will necessarily guarantee that Lucene doc IDs will
> be the same between replicas (the order in which distributed updates
> are received), so there is no guarantee that behavior you see from one
> round-robin iteration will necessarily be the same on a repeat of the
> same distributed query.
>
> The bottom line is: What exactly are you after, simply an explanation
> for what you are seeing, or a guarantee that you will always see that
> behavior?
>
> -- Jack Krupansky
Re: distributed query result order tie break question
Posted by Jack Krupansky <ja...@basetechnology.com>.
"*:*" is a constant score query - every document has the same score, so the
concept of relevancy has no relevance.
But, in theory, you could apply boost queries and function queries to scale
or offset those constant scores. If so, then you should see relevancy
sorting, otherwise the concept of relevancy does not apply.
I don't think Solr offers any "contract" as to ordering of constant score
documents or merging of same score documents across shards. At least I have
never seen such a contract published. So, if you are merely observing the
actual behavior of Solr, fine, but if you are expecting that such behavior
will persist in future releases, there can be no such guarantee.
I don't think Solr will necessarily guarantee that Lucene doc IDs will be
the same between replicas (the order in which distributed updates are
received), so there is no guarantee that behavior you see from one
round-robin iteration will necessarily be the same on a repeat of the same
distributed query.
The bottom line is: What exactly are you after, simply an explanation for
what you are seeing, or a guarantee that you will always see that behavior?
-- Jack Krupansky