Posted to dev@lucene.apache.org by Jake Mannix <ja...@gmail.com> on 2009/10/26 18:14:23 UTC

Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

> Also, I really do not like the large perf hits at highish topN sizes.
> Admittedly it's unusual (though I suspect not THAT rare) to use such
> a large topN, but, I especially don't like that it doesn't degrade
> gracefully – it's surprising. Likewise I'd like to see with highish
> number of segments how things degrade.

Well the degradation should not be surprising - it's coded into the
subleading complexity: redoing the complexity argument carefully
for the average case, probabilistically, the flop count for single PQ,
for top T docs out of S segments with H hits in each, is [1]

  O(S*H + T*log(T)*log(S*H))

(note that it is a *sum* of the two terms)

while inserting into S different queues is [2]

  O(S*H + S*T*log(T)*log(H))

This means that the multi-PQ subleading term is larger by a
factor of S*log(H) / log(S*H), at any given T.
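
If you want a sanity check on those insertion counts, a toy
simulation along these lines (plain java.util, random scores,
made-up sizes - not Lucene code) counts the actual insertions;
the multi/single insertion ratio should come out near
S*log(H)/log(S*H), order 15 for these sizes:

  import java.util.PriorityQueue;
  import java.util.Random;

  // Toy simulation (not Lucene code): count actual PQ insertions
  // for one queue over S*H hits vs. one size-T queue per segment.
  public class PQInsertionSim {
    static long countInserts(int T, int hits, Random rnd) {
      PriorityQueue<Double> pq = new PriorityQueue<>(); // min-heap: head = bottom
      long inserts = 0;
      for (int i = 0; i < hits; i++) {
        double score = rnd.nextDouble();
        if (pq.size() < T) {             // queue not yet full: always insert
          pq.add(score); inserts++;
        } else if (score > pq.peek()) {  // beats the bottom of the queue
          pq.poll(); pq.add(score); inserts++;
        }
      }
      return inserts;
    }

    public static void main(String[] args) {
      int S = 20, H = 100_000, T = 100;
      Random rnd = new Random(42);
      long single = countInserts(T, S * H, rnd);  // expect ~T*(1 + ln(S*H/T))
      long multi = 0;
      for (int s = 0; s < S; s++) {
        multi += countInserts(T, H, rnd);         // expect ~S*T*(1 + ln(H/T)) total
      }
      System.out.println("single-PQ insertions: " + single);
      System.out.println("multi-PQ insertions:  " + multi);
    }
  }

(Each insertion costs log(T) in either scheme, so the insertion-count
ratio is also the subleading flop ratio.)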

To see whether this is "bad", we have to consider the full
range of S = numSegments, H = numHitsPerSegment
and T = topNumHitsToReturn.

If you fix S and H (which also fixes the dominating factor as a
constant), then as T increases multi-PQ diverges, getting worse
relative to single-PQ - exactly what the tests show.

On the other hand, if you fix T, and instead let S and H
vary, then the flop-count ratio is the following:

flops_singlePQ / flops_multiPQ =

S*H + T*log(T)*log(S*H)
--------------------------- =
S*H + S*T*log(T)*log(H)

1 + T*log(T)*log(S*H)/(S*H)
---------------------------
1 + T*log(T)*log(H)/H

under the always true (for us) assumption that S*H is the
dominating factor, and using 1/(1+x) =~ 1-x for small
x, we have:

(1 + T*log(T)*log(S*H)/(S*H)) * (1 - T*log(T)*log(H)/H)

using (1+x)*(1-y) =~ 1+x-y for small x and y, and pulling out
the common factor of T*log(T)/H, we have:

1 - ( T*log(T)/H ) * ( log(H) - log(S*H)/S )

What's the end result?  If you fix the queue size (T),
then as the number of hits and the number of hits
per segment grows, the percentage performance difference
should go to *zero* proportional to 1/H =
1/numHitsPerSegment.
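
To put rough numbers on that (illustrative sizes only, matching
the hit counts discussed below), evaluating the exact flop-count
ratio from the two expressions above:

  // Illustrative only: flops_singlePQ / flops_multiPQ, using the
  // two average-case counts above with natural logs.
  public class RatioEstimate {
    static double ratio(double S, double H, double T) {
      double single = S * H + T * Math.log(T) * Math.log(S * H);
      double multi  = S * H + S * T * Math.log(T) * Math.log(H);
      return single / multi;
    }
    public static void main(String[] args) {
      // S = 20 segments, T = 100; H chosen so S*H matches the
      // total hit counts used in the text (made up, not measured).
      System.out.println(ratio(20, 500, 100));     // 10,000 hits: ~0.21
      System.out.println(ratio(20, 250_000, 100)); // 5M hits:     ~0.98
    }
  }

(At 10,000 total hits the S*H term no longer dominates, so treat
that first number as directional; the point is that by 5 million
hits the two counts are within a couple percent.)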

This to me is pretty important to note, and should be
pretty easy to see in a test run: a run where the query
hits 5 million documents should look much better for
multiPQ relative to singlePQ, whereas *decreasing* the
hit count to, say, 10,000 hits should penalize multiPQ
versus singlePQ.

The average user, I would imagine, doesn't see *tons*
of variation in queue size or number of segments
(I doubt either often grows above 100, and very rarely
up to 1000), but the total number of hits varies
enormously, and as even recent discussions on java-user
show, people are looking at multi-TB indexes, so
performance as a function of index size is probably
the most important.

I can run these numbers on my mac desktop, because
this analysis is independent of hardware and even
java version - it's a scaling argument. But we should
really see what the results are like on reasonable
hardware, for both the small hit-count case (where
performance might be really bad for multiPQ, ruling
it out altogether) and the very high hit-count case
(where the difference should be much more in favor
of multiPQ, and could in fact rule out singlePQ if the
constant factors (like the cost of the extra string
compares that singlePQ has to do) become more
important than the complexity, because the asymptotics
are the same in this regime).

  -jake

------

[1] To see this scaling, if you have a queue of size
T, and you have currently looked at X hits, then the
probability in the average case of the next hit beating
the bottom element of the queue is T/X.  Then the
total number of insertions in the queue you have to
do in a set of H hits is sum_X=1...H [ T/X ].  Replacing
the sum with an integral, it's just T log(H).  The
cost of an insertion into a PQ of size T is log(T), giving
the flop count based on insertions of log(T) * T * log(H)
for insertions while walking H hits, and log(T)*T*log(S*H)
for insertions while walking S*H hits.
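
A quick numeric check of that sum-to-integral step (illustrative):

  // Check of the approximation in [1]: sum_X=1..H (T/X) vs. T*ln(H).
  public class HarmonicCheck {
    public static void main(String[] args) {
      int T = 100, H = 1_000_000;
      double sum = 0;
      for (int X = 1; X <= H; X++) sum += (double) T / X;
      System.out.printf("sum = %.0f, T*ln(H) = %.0f%n",
          sum, T * Math.log(H));
      // ~1439 vs ~1382: the gap is T*gamma (Euler's constant),
      // independent of H, so only the T*log(H) scaling matters.
    }
  }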

[2] We're neglecting the final merge for the multiPQ case
because it's sub-sub-leading in complexity: merging S sorted
lists of T elements each down to the top T is O(T*log(S)) on
average, which is smaller than the next bigger term in this
counting by at least a multiplicative factor of log(T)*log(H).

Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks for the thorough analysis Jake.

This makes intuitive sense.  Early on, there are many insertions to
the PQ, but over time, it "converges" and the insertions eventually
become rare.

So, the total number of insertions that must be handled (= the
sub-leading cost) will necessarily be higher for the multi-PQ case,
because each PQ in the multi-PQ case must separately converge per
segment.  The single-PQ converges faster.

But that cost becomes nearly a constant once the number of hits gets
high enough (ie, insertions become rare), and so is a vanishing
contribution vs the overall leading S*H cost.

With a low number of hits, both single and multi PQ are handling many
insertions.  Going to a higher number of hits, single PQ has
converged but multi PQ hasn't and is thus more costly.  Going to an
even higher number of hits, both have converged and the insertion cost
of either becomes "in the noise".

I'm testing on 5M doc indexes (wikipedia & random) now to test queries
with more hits.

Also, I think the constant factors for single-PQ insertion are higher,
because it has to manage indirection (slots) and so must use a
separate class to track that (Entry), plus it makes 4 method calls in
the insertion case (compareBottom, copy, updateBottom, setBottom).
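
For reference, the insertion path being described looks roughly
like this (a schematic in the spirit of TopFieldCollector, using the
method names above - not the actual source):

  // Schematic only: values live in arrays owned by the comparator,
  // and the queue holds Entry objects pointing at "slots" in them.
  final class Entry {
    int slot;   // index into the comparator's value arrays
    int doc;
    Entry(int slot, int doc) { this.slot = slot; this.doc = doc; }
  }

  // Per competitive hit, four calls plus the heap reshuffle:
  //
  //   if (comparator.compareBottom(doc) > 0) { // 1: is it competitive?
  //     comparator.copy(bottom.slot, doc);     // 2: overwrite evicted slot
  //     updateBottom(doc);                     // 3: bottom.doc = doc;
  //                                            //    bottom = pq.updateTop();
  //     comparator.setBottom(bottom.slot);     // 4: announce the new bottom
  //   }
  //
  // A per-segment queue ordered directly on values can do the same
  // work with one compare and a swap, hence the lower constant factor.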

Mike

