You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Claudio ." <sp...@hotmail.com> on 2009/06/22 01:39:19 UTC

Optimization of memory usage in PriorityQueue

Hi,

    
The PriorityQueue used in TopDocCollector does not optimize the memory usage. If I do this search:

TopDocs topDocs = searcher.search(query, null, 10000);    
    
And only 1 document is returned. The PriorityQueue will create an Object[] of size 10000 + 1, but only one position of the array is used.
    
The Lucene could implement a PriorityQueue with size extensible.

The PriorityQueue of the Sun is size extensible. Why does not use it?

Thanks
_________________________________________________________________
Drag n’ drop—Get easy photo sharing with Windows Live™ Photos.

http://www.microsoft.com/windows/windowslive/products/photos.aspx

Re: Optimization of memory usage in PriorityQueue

Posted by Shai Erera <se...@gmail.com>.
You mean that if we add it to Collector all developers will need to impl it?
You're right, therefore I'm not sure it's such a good idea. Also, a unlike
PQ, a Collector seems less reusable, as it doesn't declare to hold any data
structures that can/should be reused.

What about PQ?

Shai

On Mon, Jun 22, 2009 at 5:22 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Mon, Jun 22, 2009 at 10:05 AM, Michael
> McCandless<lu...@mikemccandless.com> wrote:
> > On Mon, Jun 22, 2009 at 9:40 AM, Shai Erera<se...@gmail.com> wrote:
> >> Do you think it's worth it? If so, should we also add reset() to
> Collector?
> >
> > Seems like it'd be good to have the option?  But it's certainly a very
> > expert thing since it's presumably rare that one makes such a large
> > queue to make re-use worthwhile.
>
> Adding the option isn't free though - all developers must then
> implement it, and it possibly adds a runtime cost to those not
> reusing.  Attempting to reuse Collector seems super special case -
> those that want to can always develop their own custom reusable
> Collector.
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Optimization of memory usage in PriorityQueue

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Jun 22, 2009 at 10:29 AM, Shai Erera<se...@gmail.com> wrote:
> You mean that if we add it to Collector all developers will need to impl it?
> You're right, therefore I'm not sure it's such a good idea. Also, a unlike
> PQ, a Collector seems less reusable, as it doesn't declare to hold any data
> structures that can/should be reused.
>
> What about PQ?

PQ already has a clear().

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Optimization of memory usage in PriorityQueue

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Jun 22, 2009 at 10:05 AM, Michael
McCandless<lu...@mikemccandless.com> wrote:
> On Mon, Jun 22, 2009 at 9:40 AM, Shai Erera<se...@gmail.com> wrote:
>> Do you think it's worth it? If so, should we also add reset() to Collector?
>
> Seems like it'd be good to have the option?  But it's certainly a very
> expert thing since it's presumably rare that one makes such a large
> queue to make re-use worthwhile.

Adding the option isn't free though - all developers must then
implement it, and it possibly adds a runtime cost to those not
reusing.  Attempting to reuse Collector seems super special case -
those that want to can always develop their own custom reusable
Collector.

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Optimization of memory usage in PriorityQueue

Posted by Shai Erera <se...@gmail.com>.
Agreed. I'll see if I can patch it sometime soon.

Shai

On Mon, Jun 22, 2009 at 5:05 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, Jun 22, 2009 at 9:40 AM, Shai Erera<se...@gmail.com> wrote:
> >> Though, are Lucene's core collectors reusable?  If you did really want
> >> say 100K results out of each search (very unusual), it'd be nice to
> >> not have to throw away the Collector/PQ each time.
> >
> > PQ has clear(), but it does not really allow you to reuse it, it just
> > removes all the elements. So if we want to reuse a PQ, then we need to
> add a
> > reset() (with a default impl of calling clear()) to both TSDC and PQ.
> TSDC
> > will delegate the call to PQ, which is actually HitQueue, which will
> iterate
> > on all the elements and reset them to sentinel values.
> >
> > TopFieldCollector's reset() will do the same, delegating the call to its
> own
> > PQ (FieldValueHitQueue), which will do nothing (call clear()).
> >
> > Do you think it's worth it? If so, should we also add reset() to
> Collector?
>
> Seems like it'd be good to have the option?  But it's certainly a very
> expert thing since it's presumably rare that one makes such a large
> queue to make re-use worthwhile.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Optimization of memory usage in PriorityQueue

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jun 22, 2009 at 9:40 AM, Shai Erera<se...@gmail.com> wrote:
>> Though, are Lucene's core collectors reusable?  If you did really want
>> say 100K results out of each search (very unusual), it'd be nice to
>> not have to throw away the Collector/PQ each time.
>
> PQ has clear(), but it does not really allow you to reuse it, it just
> removes all the elements. So if we want to reuse a PQ, then we need to add a
> reset() (with a default impl of calling clear()) to both TSDC and PQ. TSDC
> will delegate the call to PQ, which is actually HitQueue, which will iterate
> on all the elements and reset them to sentinel values.
>
> TopFieldCollector's reset() will do the same, delegating the call to its own
> PQ (FieldValueHitQueue), which will do nothing (call clear()).
>
> Do you think it's worth it? If so, should we also add reset() to Collector?

Seems like it'd be good to have the option?  But it's certainly a very
expert thing since it's presumably rare that one makes such a large
queue to make re-use worthwhile.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Optimization of memory usage in PriorityQueue

Posted by Shai Erera <se...@gmail.com>.
>
> Though, are Lucene's core collectors reusable?  If you did really want
> say 100K results out of each search (very unusual), it'd be nice to
> not have to throw away the Collector/PQ each time.
>

PQ has clear(), but it does not really allow you to reuse it, it just
removes all the elements. So if we want to reuse a PQ, then we need to add a
reset() (with a default impl of calling clear()) to both TSDC and PQ. TSDC
will delegate the call to PQ, which is actually HitQueue, which will iterate
on all the elements and reset them to sentinel values.

TopFieldCollector's reset() will do the same, delegating the call to its own
PQ (FieldValueHitQueue), which will do nothing (call clear()).

Do you think it's worth it? If so, should we also add reset() to Collector?

Shai

On Mon, Jun 22, 2009 at 12:29 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, Jun 22, 2009 at 3:25 AM, Shai Erera<se...@gmail.com> wrote:
>
> > Or ... we can do nothing, and say that one can write his own Collector,
> and
> > use Sun's PQ (or any other), if one has a case like "I need 10K results,
> but
> > I don't know how many are there, and specifically I want to optimize for
> the
> > case of 1 result".
>
> +1
>
> I think Lucene's current PQ is optimized for the [very] common case,
> and if someone would like to eg swap to Sun's PQ impl, the custom
> Collector API is the best route.  (And, I'd love to hear back on how
> the performance compares!  If Sun has a faster PQ than Lucene, we
> should fix that ;) ).
>
> Though, are Lucene's core collectors reusable?  If you did really want
> say 100K results out of each search (very unusual), it'd be nice to
> not have to throw away the Collector/PQ each time.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Optimization of memory usage in PriorityQueue

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jun 22, 2009 at 3:25 AM, Shai Erera<se...@gmail.com> wrote:

> Or ... we can do nothing, and say that one can write his own Collector, and
> use Sun's PQ (or any other), if one has a case like "I need 10K results, but
> I don't know how many are there, and specifically I want to optimize for the
> case of 1 result".

+1

I think Lucene's current PQ is optimized for the [very] common case,
and if someone would like to eg swap to Sun's PQ impl, the custom
Collector API is the best route.  (And, I'd love to hear back on how
the performance compares!  If Sun has a faster PQ than Lucene, we
should fix that ;) ).

Though, are Lucene's core collectors reusable?  If you did really want
say 100K results out of each search (very unusual), it'd be nice to
not have to throw away the Collector/PQ each time.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Optimization of memory usage in PriorityQueue

Posted by Simon Willnauer <si...@googlemail.com>.
See r787628

--- lucene/java/trunk/src/java/org/apache/lucene/search/FieldValueHitQueue.java
(original)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/FieldValueHitQueue.java
Tue Jun 23 10:48:55 2009
@@ -173,6 +173,9 @@
  /**
   * Creates a hit queue sorted by the given list of fields.
   *
+   * <p><b>NOTE</b>: The instances returned by this method
+   * pre-allocate a full array of length <code>numHits</code>.
+   *
   * @param fields
   *          SortField array we are sorting by in priority order (highest
   *          priority first); cannot be <code>null</code> or empty

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/HitQueue.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/HitQueue.java?rev=787628&r1=787627&r2=787628&view=diff

you looked at the 2.4.1 javadoc.

simon

On Tue, Jun 23, 2009 at 9:58 PM, Claudio .<sp...@hotmail.com> wrote:
> Hi Mike,
>
> I looked at in javadoc:
>
> http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Searcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int)
>
> and in trunk version of Lucene and I did not find your NOTE.
> Can you confirm?
> Thanks!
>
>
>> Date: Tue, 23 Jun 2009 06:49:00 -0400
>> Subject: Re: Optimization of memory usage in PriorityQueue
>> From: lucene@mikemccandless.com
>> To: java-dev@lucene.apache.org
>>
>> OK I've added a NOTE to the javadocs for this...
>>
>> Mike
>>
>> On Mon, Jun 22, 2009 at 8:57 PM, Claudio .<sp...@hotmail.com> wrote:
>> >>> I think the common use case of TopScoreDocCollector is to request 10
>> >>> results, then ask for 20 and so on. You ask for N results because you
>> >>> want
>> >>> to display them, or manipulate them in some way.
>> >
>> >>> However, if we do add this to the base PQ, it means an extra check for
>> >>> every put() call. We've tried very hard recently to remove 'if' checks
>> >>> in
>> >>> TopScoreDocCollector and TopFieldCollector, so adding one back does
>> >>> not seem
>> >>> very appealing to me, especially since I IMO the use case of
>> >>> requesting 10K
>> >>> results w/o knowing first there are more or so that match the query,
>> >>> is not
>> >>> so common.
>> >
>> > Hi serera,
>> >
>> > Ok, I understood.
>> > I think the Lucene team could add a warning in the javadoc, warning that
>> > will be created internally an Object[] with the size of top.
>> > I thought that the Lucene would create internally an array with the
>> > number
>> > of results.
>> > The users can set high values of top (like me) without knowing that they
>> > are
>> > wasting memory.
>> > Thanks!
>> >
>> > ________________________________
>> > check out the rest of the Windows Live™. More than mail–Windows Live™
>> > goes
>> > way beyond your inbox. More than messages
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
> ________________________________
> See all the ways you can stay connected to friends and family

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: Optimization of memory usage in PriorityQueue

Posted by "Claudio ." <sp...@hotmail.com>.
Hi Mike,

I looked at in javadoc:

http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Searcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int)

and in trunk version of Lucene and I did not find your NOTE.
Can you confirm?
Thanks!


> Date: Tue, 23 Jun 2009 06:49:00 -0400
> Subject: Re: Optimization of memory usage in PriorityQueue
> From: lucene@mikemccandless.com
> To: java-dev@lucene.apache.org
> 
> OK I've added a NOTE to the javadocs for this...
> 
> Mike
> 
> On Mon, Jun 22, 2009 at 8:57 PM, Claudio .<sp...@hotmail.com> wrote:
> >>> I think the common use case of TopScoreDocCollector is to request 10
> >>> results, then ask for 20 and so on. You ask for N results because you want
> >>> to display them, or manipulate them in some way.
> >
> >>> However, if we do add this to the base PQ, it means an extra check for
> >>> every put() call. We've tried very hard recently to remove 'if' checks in
> >>> TopScoreDocCollector and TopFieldCollector, so adding one back does not seem
> >>> very appealing to me, especially since I IMO the use case of requesting 10K
> >>> results w/o knowing first there are more or so that match the query, is not
> >>> so common.
> >
> > Hi serera,
> >
> > Ok, I understood.
> > I think the Lucene team could add a warning in the javadoc, warning that
> > will be created internally an Object[] with the size of top.
> > I thought that the Lucene would create internally an array with the number
> > of results.
> > The users can set high values of top (like me) without knowing that they are
> > wasting memory.
> > Thanks!
> >
> > ________________________________
> > check out the rest of the Windows Live™. More than mail–Windows Live™ goes
> > way beyond your inbox. More than messages
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 

_________________________________________________________________
Show them the way! Add maps and directions to your party invites. 
http://www.microsoft.com/windows/windowslive/products/events.aspx

Re: Optimization of memory usage in PriorityQueue

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I've added a NOTE to the javadocs for this...

Mike

On Mon, Jun 22, 2009 at 8:57 PM, Claudio .<sp...@hotmail.com> wrote:
>>> I think the common use case of TopScoreDocCollector is to request 10
>>> results, then ask for 20 and so on. You ask for N results because you want
>>> to display them, or manipulate them in some way.
>
>>> However, if we do add this to the base PQ, it means an extra check for
>>> every put() call. We've tried very hard recently to remove 'if' checks in
>>> TopScoreDocCollector and TopFieldCollector, so adding one back does not seem
>>> very appealing to me, especially since I IMO the use case of requesting 10K
>>> results w/o knowing first there are more or so that match the query, is not
>>> so common.
>
> Hi serera,
>
> Ok, I understood.
> I think the Lucene team could add a warning in the javadoc, warning that
> will be created internally an Object[] with the size of top.
> I thought that the Lucene would create internally an array with the number
> of results.
> The users can set high values of top (like me) without knowing that they are
> wasting memory.
> Thanks!
>
> ________________________________
> check out the rest of the Windows Live™. More than mail–Windows Live™ goes
> way beyond your inbox. More than messages

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: Optimization of memory usage in PriorityQueue

Posted by "Claudio ." <sp...@hotmail.com>.
>> I think the common use case of TopScoreDocCollector is to request
10 results, then ask for 20 and so on. You ask for N results because
you want to display them, or manipulate them in some way.

>>
However, if we do add this to the base PQ, it means an extra check for
every put() call. We've tried very hard recently to remove 'if' checks
in TopScoreDocCollector and TopFieldCollector, so adding one back does
not seem very appealing to me, especially since I IMO the use case
of requesting 10K results w/o knowing first there are more or so that
match the query, is not so common.

Hi serera,

Ok, I understood.
I think the Lucene team could add a warning in the javadoc, warning that will be created internally an Object[] with the size of top. 
I thought that the Lucene would create internally an array with the number of results. 
The users can set high values of top (like me) without knowing that they are wasting memory.
Thanks!

_________________________________________________________________
More than messages–check out the rest of the Windows Live™.
http://www.microsoft.com/windows/windowslive/

Re: Optimization of memory usage in PriorityQueue

Posted by Shai Erera <se...@gmail.com>.
I think the common use case of TopScoreDocCollector is to request 10
results, then ask for 20 and so on. You ask for N results because you want
to display them, or manipulate them in some way.

However, if we do add this to the base PQ, it means an extra check for every
put() call. We've tried very hard recently to remove 'if' checks in
TopScoreDocCollector and TopFieldCollector, so adding one back does not seem
very appealing to me, especially since I IMO the use case of requesting 10K
results w/o knowing first there are more or so that match the query, is not
so common.

Of course, for the what I call common case, if I do know there are 10K
results, and request for 10K, but PQ starts w/ 10 (for example) and
gradually grows its array, I also lose performance, not just because of
extra 'if', but I also allocate ~20K slots (counting re-allocations).

You could write your own PQ which does that, although to achieve that we'll
need to make some methods on PQ not final. You will also need to write your
own TopScoreDocCollector version, since the latter does not accept PQ and is
a final class.

But perhaps others think it's worth it to add to PQ an extra 'if' and avoid
unnecessary large allocations? Or ... can we make PQ's methods not final,
and let others extend it and do what they want, not just impl lessThan()?
We can also have a factory method for PQ which returns a gradually growing
PQ, or an "init once" PQ, the latter will be used by TSDC and TFC and the
former can be asked for specifically.

Or ... we can do nothing, and say that one can write his own Collector, and
use Sun's PQ (or any other), if one has a case like "I need 10K results, but
I don't know how many are there, and specifically I want to optimize for the
case of 1 result".

BTW, notice that Sun's PQ does not have some of the methods we use today,
such as insertWithOverflow(), add() and updateTop(). Those were also
recently added for better performance.

Shai

On Mon, Jun 22, 2009 at 2:39 AM, Claudio . <sp...@hotmail.com> wrote:

>  Hi,
>
>
> The PriorityQueue used in TopDocCollector does not optimize the memory
> usage. If I do this search:
>
> TopDocs topDocs = searcher.search(query, null, 10000);
>
> And only 1 document is returned. The PriorityQueue will create an Object[]
> of size 10000 + 1, but only one position of the array is used.
>
> The Lucene could implement a PriorityQueue with size extensible.
>
> The PriorityQueue of the Sun is size extensible. Why does not use it?
>
> Thanks
> ------------------------------
> What can you do with the new Windows Live? Find out<http://www.microsoft.com/windows/windowslive/default.aspx>
>