Posted to dev@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2008/07/29 21:05:45 UTC

Sort suggestion

Guys.

I've noticed many people having trouble with sorting and OOM errors. Eventually
they solve it by throwing more memory at the problem.

Shouldn't a solution which can sort on disk when necessary be implemented
in core Lucene?
Something like this:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194

Since you obviously know the result size, you can calculate how much memory
is needed for the sort; if the calculated value is higher than a
configurable threshold, an external on-disk sort is performed, and perhaps a
log message is emitted at WARN level.
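
Conceptually something like this; just a sketch, where the class, the per-value
byte estimate and the threshold are all made up for illustration and are not
existing Lucene settings:

/** Sketch only: decide per query whether the sort can stay in memory. */
public class SortSpillPolicy {

    private final long maxInMemorySortBytes;  // the configurable threshold
    private final int bytesPerSortValue;      // rough cost estimate per hit (an assumption)

    public SortSpillPolicy(long maxInMemorySortBytes, int bytesPerSortValue) {
        this.maxInMemorySortBytes = maxInMemorySortBytes;
        this.bytesPerSortValue = bytesPerSortValue;
    }

    /** True if sorting 'numHits' values is estimated to fit under the threshold. */
    public boolean fitsInMemory(int numHits) {
        long estimatedBytes = (long) numHits * bytesPerSortValue;
        return estimatedBytes <= maxInMemorySortBytes;
    }
}

The search path would then use the normal in-memory sort when fitsInMemory()
returns true, and otherwise log the WARN message and fall back to the on-disk
sort sketched below.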

Just a thought, since I'm about to implement something which can sort any
Comparable object, but on disk.
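
Here is a rough sketch of the on-disk part, an external merge sort. To keep it
short it sorts lines of text (Strings) instead of arbitrary Comparables, and
the class and method names are just placeholders; a real version would need a
serialization scheme for the objects it sorts:

import java.io.*;
import java.util.*;

/** Sketch: external merge sort over lines of text. */
public class ExternalSort {

    /** Sorts the lines of 'input' into 'output', holding at most
     *  'maxLinesInMemory' lines in RAM at any time. */
    public static void sort(File input, File output, int maxLinesInMemory) throws IOException {
        List<File> runs = writeSortedRuns(input, maxLinesInMemory);
        mergeRuns(runs, output);
    }

    /** Phase 1: read chunks, sort each chunk in memory, spill it as a sorted run file. */
    private static List<File> writeSortedRuns(File input, int maxLines) throws IOException {
        List<File> runs = new ArrayList<File>();
        BufferedReader in = new BufferedReader(new FileReader(input));
        try {
            List<String> chunk = new ArrayList<String>(maxLines);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= maxLines) {
                    runs.add(spill(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(spill(chunk));
        } finally {
            in.close();
        }
        return runs;
    }

    /** Sorts one chunk in memory and writes it to a temporary run file. */
    private static File spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        File run = File.createTempFile("sortrun", ".tmp");
        run.deleteOnExit();
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(run)));
        try {
            for (String s : chunk) out.println(s);
        } finally {
            out.close();
        }
        return run;
    }

    /** Phase 2: k-way merge of the sorted runs via a priority queue of readers. */
    private static void mergeRuns(List<File> runs, File output) throws IOException {
        PriorityQueue<RunReader> queue = new PriorityQueue<RunReader>();
        for (File run : runs) {
            RunReader reader = new RunReader(run);
            if (reader.current != null) queue.add(reader); else reader.close();
        }
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)));
        try {
            while (!queue.isEmpty()) {
                RunReader smallest = queue.poll();
                out.println(smallest.current);
                if (smallest.advance()) queue.add(smallest); else smallest.close();
            }
        } finally {
            out.close();
        }
    }

    /** One open run file plus its current (smallest unread) line. */
    private static class RunReader implements Comparable<RunReader> {
        final BufferedReader reader;
        String current;
        RunReader(File f) throws IOException {
            reader = new BufferedReader(new FileReader(f));
            current = reader.readLine();
        }
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
        void close() throws IOException {
            reader.close();
        }
        public int compareTo(RunReader other) {
            return current.compareTo(other.current);
        }
    }
}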

I guess the Hadoop project has the perfect tools for this, since all the
MapReduce input files are sorted, on disk, and huge.

Kindly

//Marcus


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Sort suggestion

Posted by Marcus Herou <ma...@tailsweep.com>.
Yep, a disk sort is slow as hell compared to an in-memory sort. What I was
thinking of was something along the lines of what a database does.

MySQL, for example, does exactly this: if the result set does not fit in
memory, it spools it to disk and sorts it there.

The thing is that it would allow you to keep adding docs to the index,
even though you should invest in more memory ASAP.

Kindly

//Marcus

On Tue, Jul 29, 2008 at 9:17 PM, Mark Miller <ma...@gmail.com> wrote:

> I think you'll find it slow to add disk seeks to the sort on each search.
> Something you might be able to work from though (though I doubt it still
> applies cleanly) is Hoss' issue
> https://issues.apache.org/jira/browse/LUCENE-831. This allows for a
> pluggable cache implementation for sorting. It also allows for much faster
> reopening in most cases. It hasn't seen any activity, and I think they are
> looking to get the reopen gains elsewhere, but it may be worth playing with.
>
> - Mark
>
>
> Marcus Herou wrote:
>
>> Guys.
>>
>> I've noticed many people having trouble with sorting and OOM errors. Eventually
>> they solve it by throwing more memory at the problem.
>>
>> Shouldn't a solution which can sort on disk when necessary be implemented
>> in core Lucene?
>> Something like this:
>> http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
>>
>> Since you obviously know the result size, you can calculate how much memory
>> is needed for the sort; if the calculated value is higher than a
>> configurable threshold, an external on-disk sort is performed, and perhaps a
>> log message is emitted at WARN level.
>>
>> Just a thought, since I'm about to implement something which can sort any
>> Comparable object, but on disk.
>>
>> I guess the Hadoop project has the perfect tools for this, since all the
>> MapReduce input files are sorted, on disk, and huge.
>>
>> Kindly
>>
>> //Marcus
>>
>>
>>
>>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Sort suggestion

Posted by Mark Miller <ma...@gmail.com>.
I think you'll find it slow to add disk seeks to the sort on each
search. Something you might be able to work from though (though I doubt
it still applies cleanly) is Hoss' issue
https://issues.apache.org/jira/browse/LUCENE-831. This allows for a
pluggable cache implementation for sorting. It also allows for much faster
reopening in most cases. It hasn't seen any activity, and I think they are
looking to get the reopen gains elsewhere, but it may be worth playing with.

- Mark

Marcus Herou wrote:
> Guys.
>
> I've noticed many people having trouble with sorting and OOM errors. Eventually
> they solve it by throwing more memory at the problem.
>
> Shouldn't a solution which can sort on disk when necessary be implemented
> in core Lucene?
> Something like this:
> http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
>
> Since you obviously know the result size, you can calculate how much memory
> is needed for the sort; if the calculated value is higher than a
> configurable threshold, an external on-disk sort is performed, and perhaps a
> log message is emitted at WARN level.
>
> Just a thought, since I'm about to implement something which can sort any
> Comparable object, but on disk.
>
> I guess the Hadoop project has the perfect tools for this, since all the
> MapReduce input files are sorted, on disk, and huge.
>
> Kindly
>
> //Marcus
>
>
>   


Re: Sort suggestion

Posted by Yonik Seeley <yo...@apache.org>.
The problem isn't sorting per se... the problem is quickly retrieving
the sort value for a document. For that, we currently have the
FieldCache... that's what takes up the memory. There are more
memory-efficient ways, but they just haven't been implemented yet.
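
To make that concrete, this is roughly what gets pulled into RAM when you sort
on a field with the current FieldCache (the field names and the size arithmetic
here are only illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Sketch of the memory cost behind a sorted search (2.x-era API).
public class FieldCacheCost {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);

        // Sorting on an int field caches one int per document: roughly 4 bytes per doc.
        int[] prices = FieldCache.DEFAULT.getInts(reader, "price");
        System.out.println("int sort field: ~" + (4L * prices.length) + " bytes");

        // Sorting on a string field caches one String per document (plus the
        // String objects themselves), which is where large indexes blow up.
        String[] titles = FieldCache.DEFAULT.getStrings(reader, "title");
        System.out.println("string sort field: " + titles.length + " cached values, one per doc");

        reader.close();
    }
}

Each sorted search then only pays the cost of array lookups, but the whole
array has to fit in the heap, which is why people end up just adding memory.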

-Yonik

On Tue, Jul 29, 2008 at 3:05 PM, Marcus Herou
<ma...@tailsweep.com> wrote:
> Guys.
>
> I've noticed many people having trouble with sorting and OOM errors. Eventually
> they solve it by throwing more memory at the problem.
>
> Shouldn't a solution which can sort on disk when necessary be implemented
> in core Lucene?
> Something like this:
> http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
>
> Since you obviously know the result size, you can calculate how much memory
> is needed for the sort; if the calculated value is higher than a
> configurable threshold, an external on-disk sort is performed, and perhaps a
> log message is emitted at WARN level.
>
> Just a thought, since I'm about to implement something which can sort any
> Comparable object, but on disk.
>
> I guess the Hadoop project has the perfect tools for this, since all the
> MapReduce input files are sorted, on disk, and huge.
>
> Kindly
>
> //Marcus
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org