Posted to solr-user@lucene.apache.org by Andrew Ingram <an...@tangentlabs.co.uk> on 2011/11/21 12:22:41 UTC

Efficient title sorting on large result sets.

Hi everyone,

We have a large product catalogue (currently 9 million products, soon to grow to around 25 million), and each product has a Unicode title. We offer the facility to sort by title, but often within quite large result sets, e.g. 1 million fiction books (we are correctly using filters). Aside from the obviously questionable value of sorting over such a large set of results, I'm wondering if there are any steps I can take to optimise title sorting and minimise memory use.

Solr also crashes with OutOfMemoryErrors every couple of days. Could this be related to the sorting by title, or should I be looking for another cause? The machine Solr runs on has 8 GB of RAM, 7 GB of which is given to Solr. We have other sites with larger catalogues and similar spec hardware that aren't having any issues; the title sorting seems to be the only major difference in functionality.

I'll be very grateful for any assistance.

Regards,
Andy Ingram

Re: Efficient title sorting on large result sets.

Posted by Andrew Ingram <an...@tangentlabs.co.uk>.

On 21 Nov 2011, at 23:17, Chris Hostetter wrote:

> 
> : The way that I've solved this in the past is to make a field
> : specifically for sorting and then truncate the string to a small number
> : of characters and sort on that. You have to accept that in some cases
> 
> Something to consider is the ICUCollationKeyFilterFactory.  As noted on 
> the wiki...
> 
> 	This filter works like CollationKeyFilterFactory, except it uses ICU 
> 	for collation. This makes smaller and faster sort keys, ...
> 
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory
> 
> 
> -Hoss

Thanks for your help. So it seems that accurate string sorting over a large result set is always going to be problematic. My preferred solution is to not expose sorting functionality until the number of results is sufficiently small (e.g. fewer than 1000). I'll feed this back to the powers that be.
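
A minimal sketch of that gating in Python, assuming the result count (numFound) is already known from an earlier count-only query. The function name build_query_params and the field name title_sort are hypothetical; q, fq and sort are the standard Solr request parameters.

```python
from typing import Dict, List

# Threshold from the message above; tune it to what the index can handle.
MAX_SORTABLE_RESULTS = 1000


def build_query_params(query: str, filters: List[str], num_found: int,
                       sort_field: str = "title_sort") -> Dict[str, object]:
    """Build Solr request parameters, adding a title sort only for small result sets.

    num_found is assumed to come from a previous count-only request for the
    same query and filters. sort_field is a hypothetical sort-only field.
    """
    params: Dict[str, object] = {"q": query, "fq": list(filters)}
    if num_found < MAX_SORTABLE_RESULTS:
        # Small enough: expose title sorting.
        params["sort"] = f"{sort_field} asc"
    # Otherwise leave "sort" out so Solr falls back to its default ordering.
    return params


large = build_query_params("*:*", ["category:fiction"], num_found=1_000_000)
small = build_query_params("*:*", ["category:fiction"], num_found=250)
```

Here large carries no sort parameter, while small includes sort=title_sort asc.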

Regards,
Andrew Ingram

RE: Efficient title sorting on large result sets.

Posted by Chris Hostetter <ho...@fucit.org>.

: The way that I've solved this in the past is to make a field
: specifically for sorting and then truncate the string to a small number
: of characters and sort on that. You have to accept that in some cases

Something to consider is the ICUCollationKeyFilterFactory.  As noted on 
the wiki...

	This filter works like CollationKeyFilterFactory, except it uses ICU 
	for collation. This makes smaller and faster sort keys, ...

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory
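
To illustrate the general idea of a sort key (a value computed once at index time so that sorting becomes a plain byte comparison), here is a rough Python sketch. It is NOT ICU collation and does not reproduce what ICUCollationKeyFilterFactory emits; it is a crude stand-in that only strips accents and case, and the function name simple_sort_key is hypothetical.

```python
import unicodedata


def simple_sort_key(title: str) -> bytes:
    """Return a crude accent- and case-insensitive sort key.

    This only approximates the idea of a collation key; real ICU
    collation keys follow locale-specific rules that this ignores.
    """
    # Decompose characters so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop the combining marks, keeping the base characters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fold case and encode, so keys compare as plain bytes.
    return stripped.casefold().encode("utf-8")


titles = ["Zebra", "Émile", "apple", "eclair"]
by_code_point = sorted(titles)
by_sort_key = sorted(titles, key=simple_sort_key)
```

Sorting by raw code point puts "Zebra" first and "Émile" last, whereas sorting by the precomputed key gives apple, eclair, Émile, Zebra.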


-Hoss

RE: Efficient title sorting on large result sets.

Posted by "Young, Cody" <Co...@move.com>.

Hi Andrew,

When you request a sort on a field, Lucene stores every unique value in
a field cache, which stays in RAM. If you have a large index and you're
sorting on a Unicode string field, this can be very memory intensive.
The way that I've solved this in the past is to make a field
specifically for sorting, truncate the string to a small number of
characters, and sort on that. You have to accept that in some cases the
sort order will be wrong: if you truncate to 6 characters and then sort
Thisisastring and Thisisnotastring, both reduce to Thisis, so you're not
guaranteed to get the correct sort order.

The memory benefits of this are two-fold, though: you have a shorter
string, which takes up less memory, and you have fewer unique values.
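
A small Python sketch of the trade-off, just to show the effect on sample data. The helper names and the sample titles are hypothetical, and in practice the truncation would happen when populating the sort-only field at index time, not at query time.

```python
from typing import List

# Truncation length from the example above.
SORT_KEY_LENGTH = 6


def truncated_sort_key(title: str, length: int = SORT_KEY_LENGTH) -> str:
    """Return the prefix that would be stored in the sort-only field."""
    return title[:length]


def sort_by_truncated_title(titles: List[str],
                            length: int = SORT_KEY_LENGTH) -> List[str]:
    """Sort using only the truncated key; titles with equal keys keep input order."""
    return sorted(titles, key=lambda t: truncated_sort_key(t, length))


titles = ["Thisisnotastring", "Thisisastring", "Another title", "Thisisastring too"]

# Four titles collapse to two distinct keys: fewer unique values to cache.
unique_keys = {truncated_sort_key(t) for t in titles}

# The three "Thisis..." titles tie, so their relative order is not the true order.
ordered = sort_by_truncated_title(titles)
exact = sorted(titles)
```

With these titles, unique_keys holds only "Thisis" and "Anothe", and ordered places Thisisnotastring before Thisisastring, which is the kind of wrong ordering you have to accept.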

Cody
