Posted to java-user@lucene.apache.org by Sandeep Khanzode <sa...@yahoo.com.INVALID> on 2014/06/24 19:31:33 UTC

Custom Sorting

Hi,

I am trying to implement a sort order for search results in Lucene 4.7.2.

Is there any way to order results by data that is not stored in Lucene as Fields?
Basically, I have certain data that is logically associated with a document but stored elsewhere, such as in a DB. Can I write a custom sort function, along the lines of a FieldComparator, that sorts on this data?

I have tested the performance of Sort for String and numeric types and, as mentioned in some blog, the numeric type seems much faster than the String type. However, if I sort on a number of fields from multiple clients, the memory footprint, due to the FieldCache.DEFAULT impl, increases approximately 5-6 times. If I run this on a machine that does not have this capacity, will I get an OOM, or will there be intense memory thrashing?

 
-----------------------
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Custom Sorting

Posted by Vitaly Funstein <vf...@gmail.com>.
As a compromise, you can base your custom sort function on values of stored
fields in the same index - as opposed to fetching them from an external
data store, or relying on the internal sorting implementation in Lucene. It
will still be relatively slow, but not nearly as slow as going out to a
DB... though you can also do some smart lazy caching (and selective
removal) of sort field values, as you go along with the sorting.
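
For illustration, the lazy caching with selective removal described above could be sketched roughly as follows. This is plain Java rather than Lucene API: the class name LazySortValueCache, the loader function and the maxEntries limit are all hypothetical, and the loader stands in for whatever reads the stored field value for a document.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntToLongFunction;

/** Hypothetical helper: loads per-document sort values on demand and keeps at most maxEntries. */
class LazySortValueCache {
    private final IntToLongFunction loader; // assumption: stands in for a stored-field read
    private final Map<Integer, Long> cache;
    private int loads = 0;

    LazySortValueCache(IntToLongFunction loader, final int maxEntries) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first, so it is evicted first.
        this.cache = new LinkedHashMap<Integer, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Long> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the sort value for docId, loading it only if it is not currently cached. */
    long valueFor(int docId) {
        Long cached = cache.get(docId);
        if (cached == null) {
            cached = loader.applyAsLong(docId);
            loads++;
            cache.put(docId, cached);
        }
        return cached;
    }

    /** Number of times the loader has been called so far. */
    int loads() { return loads; }

    /** Number of values currently held. */
    int size() { return cache.size(); }

    /** Orders document ids by their (lazily loaded) sort value, ascending. */
    Comparator<Integer> comparator() {
        return (a, b) -> Long.compare(valueFor(a), valueFor(b));
    }
}
```

A comparator built this way stays correct after evictions, because an evicted value is simply reloaded; the price is an extra loader call each time that happens.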

If I understand this correctly, FieldCache slurps in the value for every
sort field, for each document in the index up front, and holds on to them
at least for the duration of the search (or until the reader is closed,
which may actually be even later) ... although there's probably more to it
than I am describing which I'll leave up to the experts to elaborate on.


On Wed, Jun 25, 2014 at 5:10 PM, Erick Erickson <er...@gmail.com>
wrote:

> Sure, you can write a custom function, see:
> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>
> And you can invoke your custom function since sorting by function is
> supported.
>
> But my point remains. To be performant, you'll have to cache the
> results. Which is what's happening already.
> If you do something clever that tries to purge old values that you're
> sorting by, then you'll probably run into
> performance issues. At least that's my guess.
>
> I think this will be a dead-end for you, but would love to be proved
> wrong about that....
>
> Best
> Erick
>

Re: Custom Sorting

Posted by Erick Erickson <er...@gmail.com>.
Sure, you can write a custom function, see:
https://cwiki.apache.org/confluence/display/solr/Function+Queries

And you can invoke your custom function since sorting by function is supported.

But my point remains. To be performant, you'll have to cache the
results, which is what's happening already.
If you do something clever that tries to purge old values that you're
sorting by, then you'll probably run into performance issues. At least
that's my guess.
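
One way to see why purging may hurt, sketched in plain Java under assumed names (LruScanDemo and hitsOverScans are hypothetical, not part of Lucene or Solr): if every sort walks all documents in the same order and an LRU cache holds fewer entries than there are documents, each entry is evicted before the next pass needs it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruScanDemo {
    /** Counts cache hits over `passes` sequential scans of docs 0..numDocs-1. */
    static int hitsOverScans(final int capacity, int numDocs, int passes) {
        // Access-ordered LinkedHashMap that drops its least recently used entry when over capacity.
        Map<Integer, Long> lru = new LinkedHashMap<Integer, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Long> eldest) {
                return size() > capacity;
            }
        };
        int hits = 0;
        for (int p = 0; p < passes; p++) {
            for (int doc = 0; doc < numDocs; doc++) {
                if (lru.get(doc) != null) {
                    hits++;
                } else {
                    lru.put(doc, (long) doc); // stand-in for recomputing the sort value
                }
            }
        }
        return hits;
    }
}
```

With a capacity of 99 and 100 documents, five passes produce no hits at all, while a capacity of 100 gives 400 hits (every lookup after the first pass). Real access patterns are less regular than this, so treat it as a worst case rather than a prediction.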

I think this will be a dead-end for you, but would love to be proved
wrong about that....

Best
Erick

On Wed, Jun 25, 2014 at 4:34 AM, Sandeep Khanzode
<sa...@yahoo.com.invalid> wrote:
> Hi,
>
> Thanks for your reply.
> Actually, I am evaluating both approaches.
>
> With the sort being performed on a field indexed in Lucene itself, my concern is with the FieldCache. Very quickly, with multiple clients executing in parallel, it grows to 8-10GB. This is for 4-5 different Sort fields on an index of 50M documents. The problem is not so much the memory consumption as controlling it. If the max heap argument for the JVM is scaled back to 2-3GB, then all clients throw an OOM. Can the FieldCache scale with the maximum memory available to the JVM, can it be selectively turned off, or can it use an LRU-type algorithm to purge old entries?
>
> Secondly, on the DB approach: yes, it will not perform. However, I just wanted to know whether a custom sort function exists that allows one to write their own sort on a field that is not indexed by Lucene.
>
> Thanks again,
>
> -----------------------
> Thanks n Regards,
> Sandeep Ramesh Khanzode
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Custom Sorting

Posted by Sandeep Khanzode <sa...@yahoo.com.INVALID>.
Hi,

Thanks for your reply. 
Actually, I am evaluating both approaches.

With the sort being performed on a field indexed in Lucene itself, my concern is with the FieldCache. Very quickly, with multiple clients executing in parallel, it grows to 8-10GB. This is for 4-5 different Sort fields on an index of 50M documents. The problem is not so much the memory consumption as controlling it. If the max heap argument for the JVM is scaled back to 2-3GB, then all clients throw an OOM. Can the FieldCache scale with the maximum memory available to the JVM, can it be selectively turned off, or can it use an LRU-type algorithm to purge old entries?
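
As a rough cross-check of the figures above, here is a back-of-envelope estimate in plain Java. It assumes one fixed-width value per document per sort field held in a flat array, which is a simplification: the actual FieldCache layout may differ, and String fields need additional structures. The class and method names are hypothetical.

```java
/** Back-of-envelope estimate only; not a measurement of what FieldCache really allocates. */
class SortCacheEstimate {
    /** Bytes needed for one fixed-width value per document for each sort field. */
    static long bytesFor(long numDocs, int numFields, int bytesPerValue) {
        return numDocs * numFields * bytesPerValue;
    }

    /** Converts a byte count to GiB (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }
}
```

For 50M documents and 5 sort fields, bytesFor(50_000_000L, 5, 8) gives 2,000,000,000 bytes (about 1.86 GiB) with 8-byte values, and half that with 4-byte values. Under these assumptions the flat numeric arrays alone would not add up to 8-10GB.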

Secondly, on the DB approach: yes, it will not perform. However, I just wanted to know whether a custom sort function exists that allows one to write their own sort on a field that is not indexed by Lucene.

Thanks again,

-----------------------
Thanks n Regards,
Sandeep Ramesh Khanzode


On Wednesday, June 25, 2014 1:21 AM, Erick Erickson <er...@gmail.com> wrote:
 


I'm a little confused here. Sure, sorting on a number of fields will
increase memory; the basic idea here is that you need to cache all the
sort values (plus support structures) for performance reasons.

If you create your own custom sort that goes out to a DB and gets the
doc, you have to be prepared for
q=*:*&sort=custom_function
Which means you'll have to fetch the value for each and every document
in the index. If this is a DB call, it will NOT perform.

In order to be performant, you'll need to cache the values. Which is
what is being done _for_ you by the FieldCache.

So I think this is really a false path, or an "XY" problem. Why do you
think you need to do this?

Best,
Erick



Re: Custom Sorting

Posted by Erick Erickson <er...@gmail.com>.
I'm a little confused here. Sure, sorting on a number of fields will
increase memory; the basic idea here is that you need to cache all the
sort values (plus support structures) for performance reasons.

If you create your own custom sort that goes out to a DB and gets the
doc, you have to be prepared for
q=*:*&sort=custom_function
Which means you'll have to fetch the value for each and every document
in the index. If this is a DB call, it will NOT perform.
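
The point about q=*:* can be sketched in plain Java (hypothetical names; externalValue stands in for the DB call and is not a Lucene or Solr API): even when only the top n results are wanted, the value has to be looked up once for every matching document before the best n are known.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntToLongFunction;

class TopNByExternalValue {
    /** Returns the n doc ids with the smallest external values, ascending, for a match-all query. */
    static List<Integer> topN(int maxDoc, int n, IntToLongFunction externalValue) {
        // Max-heap on value: the worst of the current best n sits at the head and is dropped first.
        PriorityQueue<long[]> heap = new PriorityQueue<>((x, y) -> Long.compare(y[0], x[0]));
        for (int doc = 0; doc < maxDoc; doc++) {
            long value = externalValue.applyAsLong(doc); // one lookup per matching document
            heap.add(new long[] {value, doc});
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add((int) heap.poll()[1]); // polled largest-first
        }
        Collections.reverse(result); // flip to ascending order
        return result;
    }
}
```

Counting calls inside the supplied function shows maxDoc lookups for a match-all query regardless of n, which is why an uncached remote call per lookup does not scale.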

In order to be performant, you'll need to cache the values. Which is
what is being done _for_ you by the FieldCache.

So I think this is really a false path, or an "XY" problem. Why do you
think you need to do this?

Best,
Erick

On Tue, Jun 24, 2014 at 10:31 AM, Sandeep Khanzode
<sa...@yahoo.com.invalid> wrote:
> Hi,
>
> I am trying to implement a sort order for search results in Lucene 4.7.2.
>
> Is there any way to order results by data that is not stored in Lucene as Fields?
> Basically, I have certain data that is logically associated with a document but stored elsewhere, such as in a DB. Can I write a custom sort function, along the lines of a FieldComparator, that sorts on this data?
>
> I have tested the performance of Sort for String and numeric types and, as mentioned in some blog, the numeric type seems much faster than the String type. However, if I sort on a number of fields from multiple clients, the memory footprint, due to the FieldCache.DEFAULT impl, increases approximately 5-6 times. If I run this on a machine that does not have this capacity, will I get an OOM, or will there be intense memory thrashing?
>
>
> -----------------------
> Thanks n Regards,
> Sandeep Ramesh Khanzode
