You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2013/11/12 18:00:31 UTC

Sorting memory-efficiently by any numeric field (dates too?)

Before I go and pat myself on the back, what do people think about this
trick? The base problem is "Is there a space-efficient way to return the
top N documents, sorted by a numeric field". The numeric field includes
dates.

It come to me in a vision in a flash! (The Pickle Song, Arlo Guthrie). If
we could return the numeric field in question as the score of a document it
should work without allocating the internal arrays for holding all the
timestamps.

So what about something like this?
/select?q={!boost b=manufacturedate_dt}text:*
and reverse order by
/select?q={!boost b=div(1,manufacturedate_dt)}text:*

It works on the test data. So let's assume that we're space constrained. It
_seems_ like this would only allocate enough space for the top N documents
in the result set which is insignificant in terms of memory consumption for
a large number of documents in a core. Any obvious problems that people see?

I see a couple of shortcomings:

1>  You only get one field. Unless you can create a really clever function
that incorporates all the values in multiple fields, this is going to be
hard to use with more than one field.

2> The boost syntax doesn't allow for a *:*, so you have to specify an
existing field. If there happen to be documents that don't have anything in
the field, you'll miss them.

3> I'm not sure what the performance issues are, especially in the case
where _every_ document scores better than the current top-N

Erick

RE: Sorting memory-efficiently by any numeric field (dates too?)

Posted by "Petersen, Robert" <ro...@mail.rakuten.com>.

Hi Erick,

I like your idea, FWIW please also leave room for boost by function query which takes many numeric fields as input but results in a single value.  I don't know if this counts as a really clever function but here's one that I currently use:

{!boost b=pow(sum(log(sum(product(boosted,9000),product(product(image,stocked),300),product(product(image,taxonomyCategoryTypeId),300),product(product(image,sales),150),product(stocked,2),product(sales,2),views)),1),3)}

Note, image is an int/bool field:  1=has image, 0=no image, hence all the product(product(image,...),...) terms above as they negate the boosts if there isn't an image!

Thanks
Robi

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, November 12, 2013 9:01 AM
To: solr-user@lucene.apache.org
Subject: Sorting memory-efficiently by any numeric field (dates too?)

Before I go and pat myself on the back, what do people think about this trick? The base problem is "Is there a space-efficient way to return the top N documents, sorted by a numeric field". The numeric field includes dates.

It come to me in a vision in a flash! (The Pickle Song, Arlo Guthrie). If we could return the numeric field in question as the score of a document it should work without allocating the internal arrays for holding all the timestamps.

So what about something like this?
/select?q={!boost b=manufacturedate_dt}text:* and reverse order by /select?q={!boost b=div(1,manufacturedate_dt)}text:*

It works on the test data. So let's assume that we're space constrained. It _seems_ like this would only allocate enough space for the top N documents in the result set which is insignificant in terms of memory consumption for a large number of documents in a core. Any obvious problems that people see?

I see a couple of shortcomings:

1>  You only get one field. Unless you can create a really clever 
1> function
that incorporates all the values in multiple fields, this is going to be hard to use with more than one field.

2> The boost syntax doesn't allow for a *:*, so you have to specify an
existing field. If there happen to be documents that don't have anything in the field, you'll miss them.

3> I'm not sure what the performance issues are, especially in the case
where _every_ document scores better than the current top-N

Erick

Re: Sorting memory-efficiently by any numeric field (dates too?)

Posted by Erick Erickson <er...@gmail.com>.

Siiigggh. Yet another "brilliant" idea bites the dust.

Thanks!
Erick


On Tue, Nov 12, 2013 at 8:49 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

> On Tue, Nov 12, 2013 at 7:01 PM, Erick Erickson <er...@gmail.com>
> wrote:
> > Yonik:
> >
> > Of course I'm not really up on the details of sorting, but aren't there
> > various control structures that are allocated for a sort but not for
> > scoring? I'm thinking of long[maxDoc] type structures in addition to
> > the actual values in the FieldCache.
>
> Nope, the source for the values is the same (FieldCache).
> There will be extra temporary space for each sort field value, but
> that's only O(topN) and not O(maxDoc)
>
> -Yonik
> http://heliosearch.com -- making solr shine
>

Re: Sorting memory-efficiently by any numeric field (dates too?)

Posted by Yonik Seeley <yo...@heliosearch.com>.

On Tue, Nov 12, 2013 at 7:01 PM, Erick Erickson <er...@gmail.com> wrote:
> Yonik:
>
> Of course I'm not really up on the details of sorting, but aren't there
> various control structures that are allocated for a sort but not for
> scoring? I'm thinking of long[maxDoc] type structures in addition to
> the actual values in the FieldCache.

Nope, the source for the values is the same (FieldCache).
There will be extra temporary space for each sort field value, but
that's only O(topN) and not O(maxDoc)

-Yonik
http://heliosearch.com -- making solr shine

Re: Sorting memory-efficiently by any numeric field (dates too?)

Posted by Erick Erickson <er...@gmail.com>.

Yonik:

Of course I'm not really up on the details of sorting, but aren't there
various control structures that are allocated for a sort but not for
scoring? I'm thinking of long[maxDoc] type structures in addition to
the actual values in the FieldCache.

I've been thinking about docValues for this as well, which may make it
all moot anyway.

Erick


On Tue, Nov 12, 2013 at 4:16 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

> For a reasonable top-N, the space efficiency should still be the same
> as it is really just dominated by the FieldCache representation (is it
> in-memory or disk-docvalue based).  Directly sorting on that numeric
> field vs deriving a score from the field and sorting on that shouldn't
> really be that different.
>
> -Yonik
> http://heliosearch.com -- making solr shine
>
>
> On Tue, Nov 12, 2013 at 12:00 PM, Erick Erickson
> <er...@gmail.com> wrote:
> > Before I go and pat myself on the back, what do people think about this
> > trick? The base problem is "Is there a space-efficient way to return the
> > top N documents, sorted by a numeric field". The numeric field includes
> > dates.
> >
> > It come to me in a vision in a flash! (The Pickle Song, Arlo Guthrie). If
> > we could return the numeric field in question as the score of a document
> it
> > should work without allocating the internal arrays for holding all the
> > timestamps.
> >
> > So what about something like this?
> > /select?q={!boost b=manufacturedate_dt}text:*
> > and reverse order by
> > /select?q={!boost b=div(1,manufacturedate_dt)}text:*
> >
> > It works on the test data. So let's assume that we're space constrained.
> It
> > _seems_ like this would only allocate enough space for the top N
> documents
> > in the result set which is insignificant in terms of memory consumption
> for
> > a large number of documents in a core. Any obvious problems that people
> see?
> >
> > I see a couple of shortcomings:
> >
> > 1>  You only get one field. Unless you can create a really clever
> function
> > that incorporates all the values in multiple fields, this is going to be
> > hard to use with more than one field.
> >
> > 2> The boost syntax doesn't allow for a *:*, so you have to specify an
> > existing field. If there happen to be documents that don't have anything
> in
> > the field, you'll miss them.
> >
> > 3> I'm not sure what the performance issues are, especially in the case
> > where _every_ document scores better than the current top-N
> >
> > Erick
>

Re: Sorting memory-efficiently by any numeric field (dates too?)

Posted by Yonik Seeley <yo...@heliosearch.com>.

For a reasonable top-N, the space efficiency should still be the same
as it is really just dominated by the FieldCache representation (is it
in-memory or disk-docvalue based).  Directly sorting on that numeric
field vs deriving a score from the field and sorting on that shouldn't
really be that different.

-Yonik
http://heliosearch.com -- making solr shine


On Tue, Nov 12, 2013 at 12:00 PM, Erick Erickson
<er...@gmail.com> wrote:
> Before I go and pat myself on the back, what do people think about this
> trick? The base problem is "Is there a space-efficient way to return the
> top N documents, sorted by a numeric field". The numeric field includes
> dates.
>
> It come to me in a vision in a flash! (The Pickle Song, Arlo Guthrie). If
> we could return the numeric field in question as the score of a document it
> should work without allocating the internal arrays for holding all the
> timestamps.
>
> So what about something like this?
> /select?q={!boost b=manufacturedate_dt}text:*
> and reverse order by
> /select?q={!boost b=div(1,manufacturedate_dt)}text:*
>
> It works on the test data. So let's assume that we're space constrained. It
> _seems_ like this would only allocate enough space for the top N documents
> in the result set which is insignificant in terms of memory consumption for
> a large number of documents in a core. Any obvious problems that people see?
>
> I see a couple of shortcomings:
>
> 1>  You only get one field. Unless you can create a really clever function
> that incorporates all the values in multiple fields, this is going to be
> hard to use with more than one field.
>
> 2> The boost syntax doesn't allow for a *:*, so you have to specify an
> existing field. If there happen to be documents that don't have anything in
> the field, you'll miss them.
>
> 3> I'm not sure what the performance issues are, especially in the case
> where _every_ document scores better than the current top-N
>
> Erick