Posted to solr-user@lucene.apache.org by Steven White <sw...@gmail.com> on 2016/04/04 19:34:27 UTC

Sort order for *:* query

Hi everyone,

When I send Solr the query *:* the result I get back is sorted based on
Lucene's internal DocID which is oldest to most recent (can someone correct
me if I get this wrong?)  Given this, the most recently added / updated
document is at the bottom of the list.  Is there a way to reverse this sort
order?  If so, how can I make this the default in Solr's solrconfig.xml
file?

Thanks

Steve

Re: Sort order for *:* query

Posted by Steven White <sw...@gmail.com>.
This is all good stuff.  Thank you all for your insight.

Steve

On Mon, Apr 4, 2016 at 6:15 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Mon, Apr 4, 2016 at 6:06 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
> > :
> > : Not sure I understand... _version_ is time based and hence will give
> > : roughly the same accuracy as something like
> > : TimestampUpdateProcessorFactory that you recommend below.  Both
> >
> > Hmmm... last time i looked, i thought _version_ numbers were allocated &
> > incremented on a per-shard basis and "time" was only used for initial
> > seeding when the leader started up
>
> No, time is used for every version generated.  Upper bits are
> milliseconds and lower bits are incremented only if needed for
> uniqueness in the shard (i.e. two documents indexed at the same
> millisecond).  We have 20 lower bits, so one would need a sustained
> indexing rate of over 1M documents per millisecond (or 1B docs/sec) to
> introduce a permanent skew due to indexing.
>
> There is system clock skew between shards of course, but an update
> processor that added a date field would include that as well.
>
> The code in VersionInfo is:
>
> public long getNewClock() {
>   synchronized (clockSync) {
>     long time = System.currentTimeMillis();
>     long result = time << 20;
>     if (result <= vclock) {
>       result = vclock + 1;
>     }
>     vclock = result;
>     return vclock;
>   }
> }
>
>
> -Yonik
>
> > -- so in a stable system running for
> > a long time, if shardA gets significantly more updates than shardB the
> > _version_ numbers can get skewed and a new doc in shardB might be updated
> > with a _version_ less than the _version_ of a document added to shardA
> > well before that.
> >
> > But maybe I'm remembering wrong?
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
>

Re: Sort order for *:* query

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Apr 4, 2016 at 6:06 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> :
> : Not sure I understand... _version_ is time based and hence will give
> : roughly the same accuracy as something like
> : TimestampUpdateProcessorFactory that you recommend below.  Both
>
> Hmmm... last time i looked, i thought _version_ numbers were allocated &
> incremented on a per-shard basis and "time" was only used for initial
> seeding when the leader started up

No, time is used for every version generated.  Upper bits are
milliseconds and lower bits are incremented only if needed for
uniqueness in the shard (i.e. two documents indexed at the same
millisecond).  We have 20 lower bits, so one would need a sustained
indexing rate of over 1M documents per millisecond (or 1B docs/sec) to
introduce a permanent skew due to indexing.

There is system clock skew between shards of course, but an update
processor that added a date field would include that as well.

The code in VersionInfo is:

public long getNewClock() {
  synchronized (clockSync) {
    long time = System.currentTimeMillis();
    long result = time << 20;
    if (result <= vclock) {
      result = vclock + 1;
    }
    vclock = result;
    return vclock;
  }
}
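
To make the bit layout concrete, here is a minimal standalone sketch (not Solr code; the class name VersionClockSketch and the helpers millisOf and counterOf are made up for illustration). It reuses the getNewClock() logic quoted above and shows how the millisecond timestamp and the 20-bit uniqueness counter could be read back out of a generated value, assuming the layout is exactly as described.

```java
import java.time.Instant;

public class VersionClockSketch {
    // Number of low-order bits reserved for the per-millisecond counter.
    static final int COUNTER_BITS = 20;

    private final Object clockSync = new Object();
    private long vclock;

    // Same logic as the getNewClock() snippet quoted above.
    public long getNewClock() {
        synchronized (clockSync) {
            long time = System.currentTimeMillis();
            long result = time << COUNTER_BITS;
            if (result <= vclock) {
                // Same millisecond (or clock went backwards): bump the last value.
                result = vclock + 1;
            }
            vclock = result;
            return vclock;
        }
    }

    // Hypothetical helper: the millisecond timestamp held in the upper bits.
    static long millisOf(long version) {
        return version >>> COUNTER_BITS;
    }

    // Hypothetical helper: the uniqueness counter held in the lower 20 bits.
    static long counterOf(long version) {
        return version & ((1L << COUNTER_BITS) - 1);
    }

    public static void main(String[] args) {
        VersionClockSketch clock = new VersionClockSketch();
        long a = clock.getNewClock();
        long b = clock.getNewClock();
        System.out.println("strictly increasing: " + (a < b));
        System.out.println("time part of b: " + Instant.ofEpochMilli(millisOf(b)));
        System.out.println("counter part of b: " + counterOf(b));
        // 2^20 = 1048576 values per millisecond, i.e. roughly 1B per second.
        System.out.println("slots per millisecond: " + (1L << COUNTER_BITS));
    }
}
```

If more than 2^20 values were issued within a single millisecond, the counter would spill into the time bits, which is the skew scenario mentioned above.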


-Yonik

> -- so in a stable system running for
> a long time, if shardA gets significantly more updates than shardB the
> _version_ numbers can get skewed and a new doc in shardB might be updated
> with a _version_ less than the _version_ of a document added to shardA
> well before that.
>
> But maybe I'm remembering wrong?
>
>
>
> -Hoss
> http://www.lucidworks.com/

Re: Sort order for *:* query

Posted by Chris Hostetter <ho...@fucit.org>.
: 
: Not sure I understand... _version_ is time based and hence will give
: roughly the same accuracy as something like
: TimestampUpdateProcessorFactory that you recommend below.  Both

Hmmm... last time i looked, i thought _version_ numbers were allocated & 
incremented on a per-shard basis and "time" was only used for initial 
seeding when the leader started up -- so in a stable system running for 
a long time, if shardA gets significantly more updates than shardB the
_version_ numbers can get skewed and a new doc in shardB might be updated
with a _version_ less than the _version_ of a document added to shardA
well before that.

But maybe I'm remembering wrong?



-Hoss
http://www.lucidworks.com/

Re: Sort order for *:* query

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Apr 4, 2016 at 2:24 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : You can sort like this (I believe that _version_ is the internal id/index
> : number for the document, but you might want to verify)
>
> that is not true, and i strongly advise you not to try to sort on the
> _version_ field ... for some queries/testing it may deceptively *look*
> like it's sorting by the order the documents are added, but it will not
> actually sort in any useful way -- two documents added in sequence A, B
> may have version values that are not in ascending sequence (depending on
> the hash bucket their uniqueKeys fall in for routing purposes) so sorting
> on that field will not give you any sort of meaningful order

Not sure I understand... _version_ is time based and hence will give
roughly the same accuracy as something like
TimestampUpdateProcessorFactory that you recommend below.  Both
methods will not be strictly equivalent to indexed order due to
parallelism / thread scheduling, etc., but will generally be pretty
close.
_version_ has the added benefit of being unique in an index (hence a
sort on _version_ won't resort to a tie-break by unstable
internal-id).

-Yonik


> If you want to sort by "recency" or "date added", you need to add a
> date based field to capture this.  see for example the
> TimestampUpdateProcessorFactory...
>
> https://lucene.apache.org/solr/5_5_0/solr-core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html
>
>
>
> -Hoss
> http://www.lucidworks.com/

Re: Sort order for *:* query

Posted by Chris Hostetter <ho...@fucit.org>.
: You can sort like this (I believe that _version_ is the internal id/index
: number for the document, but you might want to verify)

that is not true, and i strongly advise you not to try to sort on the 
_version_ field ... for some queries/testing it may deceptively *look* 
like it's sorting by the order the documents are added, but it will not 
actually sort in any useful way -- two documents added in sequence A, B
may have version values that are not in ascending sequence (depending on 
the hash bucket their uniqueKeys fall in for routing purposes) so sorting 
on that field will not give you any sort of meaningful order

If you want to sort by "recency" or "date added", you need to add a
date based field to capture this.  see for example the 
TimestampUpdateProcessorFactory...

https://lucene.apache.org/solr/5_5_0/solr-core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html



-Hoss
http://www.lucidworks.com/

Re: Sort order for *:* query

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
You can sort like this (I believe that _version_ is the internal id/index
number for the document, but you might want to verify)

In the Admin UI, enter the following in the sort field:

_version_ asc

You could also put an entry in the default searchHandler in solrconfig.xml
to do this to every incoming query...

This is the one that gets hit from "/select"

It would look something like this although I haven't tested...  Don't know
if a colon is necessary or not between the fieldname and desc.

<str name="sort">_version_ desc</str>

And, of course, you can put it on the URL you are hitting if that's what
you need to do.
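
For the URL variant, a minimal sketch of how the request could be built follows. The host, port and collection name (localhost:8983, mycollection) and the class SortUrlSketch are placeholders, not anything from this thread. In the sort parameter the field name and direction are separated by whitespace, with no colon. Note that other replies in this thread disagree about whether _version_ is a sensible field to sort on.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SortUrlSketch {
    // URL-encode a single query-string value (space becomes '+', ':' becomes %3A).
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Build a /select URL carrying the q and sort parameters.
    static String selectUrl(String baseUrl, String query, String sort) {
        return baseUrl + "/select?q=" + encode(query) + "&sort=" + encode(sort);
    }

    public static void main(String[] args) {
        // Hypothetical base URL; substitute your own host and collection.
        String base = "http://localhost:8983/solr/mycollection";
        System.out.println(selectUrl(base, "*:*", "_version_ desc"));
    }
}
```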




Re: Sort order for *:* query

Posted by Chris Hostetter <ho...@fucit.org>.
1) The hard coded implicit default sort order is "score desc" 

2) Whenever a sort results in ties, the final ordering of tied documents 
is non-deterministic

3) currently the behavior is that tied documents are returned in "index 
order" but that can change as segments are merged

4) if you wish to change the behavior when there is a tie, just add 
additional deterministic sort clauses to your sort param.  This can be 
done at the request level, or as a user specified "default" for the 
request handler...

https://cwiki.apache.org/confluence/display/solr/InitParams+in+SolrConfig
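
The effect of point 4 can be illustrated outside Solr with a small sketch. The Hit class and its fields are hypothetical stand-ins for search results, and this is plain Java collection sorting, not Solr's implementation. With a *:* query every document ties on score, so the secondary clause alone decides the order. In request terms this would correspond to a sort param along the lines of "score desc, id asc", assuming id is your uniqueKey field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {
    // Hypothetical stand-in for a search hit: a unique id plus a score.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Sort by score descending, then by unique id ascending as the tie-break.
    static List<String> sortedIds(List<Hit> hits) {
        Comparator<Hit> byScoreDesc =
                Comparator.comparingDouble((Hit h) -> h.score).reversed();
        Comparator<Hit> byIdAsc = Comparator.comparing((Hit h) -> h.id);

        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(byScoreDesc.thenComparing(byIdAsc));

        List<String> ids = new ArrayList<>();
        for (Hit h : copy) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // All scores tie, as they would for a *:* query.
        List<Hit> oneOrder = Arrays.asList(
                new Hit("b", 1.0), new Hit("a", 1.0), new Hit("c", 1.0));
        List<Hit> otherOrder = Arrays.asList(
                new Hit("c", 1.0), new Hit("b", 1.0), new Hit("a", 1.0));

        // Both input orders produce the same output order.
        System.out.println(sortedIds(oneOrder));
        System.out.println(sortedIds(otherOrder));
    }
}
```

Because the tie-break key is unique, the result no longer depends on the order in which the tied documents happen to be stored.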


-Hoss
http://www.lucidworks.com/