You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Bill Dueber <bi...@dueber.com> on 2009/06/23 17:54:45 UTC

Trie vs long string for sorting

I've having trouble understanding how the Trie type compares (speed- and
memory-wise) with dealing with long *string* (as opposed to integers).

My data are library call numbers, normalized to be comparable, resulting in
(maximum) 21-character strings of the form "RK 052180H359~999~999"

Now, these are fine -- they work for sorting and ranges and the whole thing,
but right now I can't use them because I've got two or three for each of my
6M documents and on a 32-bit machine I run out of heap.

Another option would be to turn them into longs (using roughly 56 bits of
the 64 bit space) and use a trie type. Is there any sort of a win involved
there?

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library

Re: Trie vs long string for sorting

Posted by Mark Miller <ma...@gmail.com>.
Trie has a custom parser that can load the FieldCache for sorting. Its
basically a built in type now, that supports fieldcache, sorting, stored
fields, etc.

On Sat, Jul 4, 2009 at 3:27 PM, Chris Hostetter <ho...@fucit.org>wrote:

>
> : My data are library call numbers, normalized to be comparable, resulting
> in
> : (maximum) 21-character strings of the form "RK 052180H359~999~999"
> :
> : Now, these are fine -- they work for sorting and ranges and the whole
> thing,
> : but right now I can't use them because I've got two or three for each of
> my
> : 6M documents and on a 32-bit machine I run out of heap.
> :
> : Another option would be to turn them into longs (using roughly 56 bits of
> : the 64 bit space) and use a trie type. Is there any sort of a win
> involved
> : there?
>
> I don't think Trie fields can be used for sorting (because they result in
> multiple terms per doc) but i could be wrong about that, smarter people
> then me may have done something cool with the TreiField that i'm not aware
> of.
>
> As a general rule: if you have character data that fits a rigid enough set
> of constraints that you can encode any legal value into a single
> numberic value (either int, or long) such that they still sort properly,
> sorting on those encoded values is going to be more memory efficient (and
> probably just as fast) as sorting on the string values.
>
>
> -Hoss
>
>


-- 
-- 
- Mark

http://www.lucidimagination.com

Re: Trie vs long string for sorting

Posted by Chris Hostetter <ho...@fucit.org>.
: My data are library call numbers, normalized to be comparable, resulting in
: (maximum) 21-character strings of the form "RK 052180H359~999~999"
: 
: Now, these are fine -- they work for sorting and ranges and the whole thing,
: but right now I can't use them because I've got two or three for each of my
: 6M documents and on a 32-bit machine I run out of heap.
: 
: Another option would be to turn them into longs (using roughly 56 bits of
: the 64 bit space) and use a trie type. Is there any sort of a win involved
: there?

I don't think Trie fields can be used for sorting (because they result in 
multiple terms per doc) but i could be wrong about that, smarter people 
then me may have done something cool with the TreiField that i'm not aware 
of.

As a general rule: if you have character data that fits a rigid enough set 
of constraints that you can encode any legal value into a single 
numberic value (either int, or long) such that they still sort properly, 
sorting on those encoded values is going to be more memory efficient (and 
probably just as fast) as sorting on the string values.


-Hoss