Posted to dev@lucene.apache.org by John Wang <jo...@gmail.com> on 2009/10/15 05:12:18 UTC

lucene 2.9 sorting algorithm

Hi guys:
    While looking at the 2.9 sorting algorithm and trying to understand the
FieldComparator class, I was wondering about the following optimization (I
am using StringOrdValComparator as an example):

Currently we have one instance of the per-segment data structures (ords, vals,
etc.), and we keep track of the reader context using the readerGen
array. We keep track of the top numHits elements by keeping a
reference to the bottom element. To convert from readerContext1 to
readerContext2, we need a mapping from value->ord, and the cost is a binary
search (this can be done N*numHits times). For some instances of custom
sort extension, this lookup may be expensive.

How about, instead, we keep N instances of the segment data structure, and all
comparisons within a segment are done with ordinal compares. We never do a
cross-segment compare until the end, where a merge is done on the N instances
using the Comparator. This way, we avoid the lookup from value->ord, and can
keep the simpler API, ScoreDocComparator, which is much easier to extend for
custom sorting. It has a higher memory cost, but only an extra
(N-1)*(numHits) copies of the data structure, where N is the number of
segments, which is not too large.
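A rough, self-contained sketch of this multi-PQ idea (the class and method names here are hypothetical, not actual Lucene code): each segment collects its own top-K using cheap int ordinal compares only, and actual values are compared once, in the final cross-segment merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the proposed multi-PQ sort: per-segment top-K by
// ordinal (int compares only), then one cross-segment merge by value.
public class MultiPQSketch {
    // A hit as seen inside one segment: its per-segment ordinal and the
    // value that ordinal resolves to (ordinals are sorted by value).
    public record Hit(int ord, String value) {}

    public static List<String> topK(List<List<Hit>> segments, int k) {
        List<Hit> candidates = new ArrayList<>();
        for (List<Hit> segment : segments) {
            // Max-heap on ordinal: the head is the worst of the current top-K,
            // so every in-segment comparison is an int compare, never a string.
            PriorityQueue<Hit> pq =
                new PriorityQueue<>(Comparator.comparingInt(Hit::ord).reversed());
            for (Hit hit : segment) {
                pq.offer(hit);
                if (pq.size() > k) pq.poll();  // evict current worst
            }
            candidates.addAll(pq);  // at most k survivors per segment
        }
        // Only here, on at most N*k entries, do we compare actual values.
        candidates.sort(Comparator.comparing(Hit::value));
        return candidates.subList(0, Math.min(k, candidates.size()))
                         .stream().map(Hit::value).toList();
    }
}
```

The extra memory is exactly the (N-1) additional small queues described above; no value->ord lookup ever happens.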

I am probably missing some intricacies with the current sorting algorithm.
If so, please shine some light.

Thanks

-John

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 15, 2009 at 5:51 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>
> On Thu, Oct 15, 2009 at 2:12 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Nice results!  Comments below...
>>
>> > Here are the numbers (times are measured in nanoseconds):
>> >
>> > numHits = 50:
>> >
>> > Lucene 2.9/OneComparatorNonScoringCollector:
>> > num string compares: 251
>> > num conversions: 34
>> > time: 15272049
>> > cpu time:  161105
>> > num inserts into PQ: 211
>> >
>> > My test sort algorithm:
>> > num string compares: 51
>> > num conversions: 0
>> > time: 14943172
>> > cpu time: 153722
>> > num inserts into PQ: 1500
>> >
>> >
>> > numHits = 100:
>> >
>> >
>> > Lucene 2.9/OneComparatorNonScoringCollector:
>> > num string compares: 2285
>> > num conversions: 172
>> > time: 17703866
>> > cpu time:  187073
>> > num inserts into PQ: 982
>> >
>> > My test sort algorithm:
>> > num string compares: 210
>> > num conversions: 0
>> > time: 15823473
>> > cpu time: 160737
>> > num inserts into PQ: 6011
>>
>> These are nice results.  Single PQ does look faster for this case.
>
> You mean multiPQ, right?

Woops, sorry, right!

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 15, 2009 at 2:12 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Nice results!  Comments below...
>
> > Here are the numbers (times are measured in nanoseconds):
> >
> > numHits = 50:
> >
> > Lucene 2.9/OneComparatorNonScoringCollector:
> > num string compares: 251
> > num conversions: 34
> > time: 15272049
> > cpu time:  161105
> > num inserts into PQ: 211
> >
> > My test sort algorithm:
> > num string compares: 51
> > num conversions: 0
> > time: 14943172
> > cpu time: 153722
> > num inserts into PQ: 1500
> >
> >
> > numHits = 100:
> >
> >
> > Lucene 2.9/OneComparatorNonScoringCollector:
> > num string compares: 2285
> > num conversions: 172
> > time: 17703866
> > cpu time:  187073
> > num inserts into PQ: 982
> >
> > My test sort algorithm:
> > num string compares: 210
> > num conversions: 0
> > time: 15823473
> > cpu time: 160737
> > num inserts into PQ: 6011
>
> These are nice results.  Single PQ does look faster for this case.
>

You mean multiPQ, right?

Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Mike:

     Here are the results for numHits = 10:

Lucene 2.9:
num string compares: 86
num conversions: 21
num inserts: 115
time: 15069705
cpu: 174294

my test sort:
num string compares: 49
num conversions: 0
num inserts: 778
time: 14665375
cpu: 156442

     This is how the test data is indexed, basically a random positive
number assigned to each doc:

            int num = Math.abs(rand.nextInt());
            String val = _formatter.get().format(num);
            Document doc = new Document();

            Field f = new Field("id", String.valueOf(docNum),
                                Store.YES, Index.NOT_ANALYZED_NO_NORMS);
            f.setOmitTf(true);

            Field f2 = new Field("num", val,
                                 Store.NO, Index.NOT_ANALYZED_NO_NORMS);
            f2.setOmitTf(true);

            String filterVal;
            if (docNum % 2 == 0) {
                filterVal = "even";
            } else {
                filterVal = "odd";
            }
            Field f3 = new Field("filter", filterVal,
                                 Store.NO, Index.NOT_ANALYZED_NO_NORMS);
            f3.setOmitTf(true);

            doc.add(f);
            doc.add(f2);
            doc.add(f3);

Hope this helps

about reverse mapping and conversion:

basically the contract is to convert a document's meta information that is
associated with one segment to a new segment. The only comparable part of the
data is the value. For fast sorting, an ordinal compare is usually used, which
requires a reverse mapping.

As for the "mystery" of why the old all-FieldCache implementation was slower, I
think it was the lack of tuning in the collector, as well as the use of the PQ,
which means a larger constant on the "dominating" part. (I must say, the
level of code tuning in the 2.9 code is impressive!)
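As a hedged illustration of the reverse mapping discussed above (not the actual StringOrdValComparator code), the value->ord conversion is essentially a binary search of the old segment's value against the new segment's sorted value table:

```java
// Illustrative sketch of the value -> ord conversion the single-PQ approach
// pays on a segment switch: binary-search the value against the new
// segment's sorted, deduplicated value table.  Not the actual Lucene code.
public class OrdLookup {
    // Returns the ordinal if the value exists in this segment, otherwise a
    // negative encoding of the insertion point (same convention as
    // java.util.Arrays.binarySearch), which still orders correctly.
    public static int valueToOrd(String[] sortedValues, String value) {
        int lo = 0, hi = sortedValues.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = sortedValues[mid].compareTo(value);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;        // value present: exact ordinal
        }
        return -(lo + 1);           // value absent in this segment
    }
}
```

Each such lookup costs O(log V) string compares against the new segment's table, which is exactly the per-conversion cost the multi-PQ variant avoids.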

-John

On Thu, Oct 15, 2009 at 2:12 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Nice results!  Comments below...
>
> On Thu, Oct 15, 2009 at 3:58 PM, John Wang <jo...@gmail.com> wrote:
> > Hi guys:
> >
> >     I did some Big O math a few times and reached the same conclusion
> Jake
> > had.
> >
> >     I was not sure about the code tuning opportunities we could have done
> > with the MergeAtTheEnd method as Yonik mentioned and the internal
> behavior
> > with PQ Mike suggested, so I went ahead and implemented the algorithm and
> > compared against the current Lucene 2.9 implementation:
> >
> > set up:
> > index size: 999992 docs
> > segment count: 8
> > sorting on 1 field, no score, e.g. comparing against
> > "OneComparatorNonScoringCollector" class.
> > TermQuery used on field with only 2 values (even/odd), returning 499992
> > docs.
>
> What text field are you sorting on?  (Ie where did it come from -- is
> it a natural or synthetic source?).
>
> > I broke the tests into two parts, numHits = 20 and numHits = 50, for
> each, I
> > ran 100 iterations, took the time from 6 - 100 (to avoid field cache load
> > time) and take the average. Did it 3 times and took that average.
>
> Can you try numHits=10 and 20?  I'm curious how this difference scales
> down.  I believe most searches don't go beyond page 1.
>
> It'd be great to see sorting by other typed (simple numerics) fields,
> and queries matching different # results, too.
>
> > Here are the numbers (times are measured in nanoseconds):
> >
> > numHits = 50:
> >
> > Lucene 2.9/OneComparatorNonScoringCollector:
> > num string compares: 251
> > num conversions: 34
> > time: 15272049
> > cpu time:  161105
> > num inserts into PQ: 211
> >
> > My test sort algorithm:
> > num string compares: 51
> > num conversions: 0
> > time: 14943172
> > cpu time: 153722
> > num inserts into PQ: 1500
> >
> >
> > numHits = 100:
> >
> >
> > Lucene 2.9/OneComparatorNonScoringCollector:
> > num string compares: 2285
> > num conversions: 172
> > time: 17703866
> > cpu time:  187073
> > num inserts into PQ: 982
> >
> > My test sort algorithm:
> > num string compares: 210
> > num conversions: 0
> > time: 15823473
> > cpu time: 160737
> > num inserts into PQ: 6011
>
> These are nice results.  Single PQ does look faster for this case.
>
> > Conclusion:
> >
> > My measurement is consistent with my and Jake's Big O analysis, where the
> > dominating factor is the same between both algorithms. And in terms of
> > timing/cpu, they are very similar, with the second algorithm consistently
> > performing slightly better than Lucene 2.9.
> > Mike is absolutely correct about the number of PQ insertions, but that win
> > seems to be shadowed by the cost of extra string compares and conversions.
> > If you look at the "growth" between numHits=20 and 100, the time gap
> > increases, e.g. as the PQ gets larger (the case where the user navigates to
> > a higher page of the search results) the number of string compares
> > increases faster than the increase of the size of the PQ.
> > There is an assumption that the conversion cost is low, which may not be
> > true for custom sort extensions, and could easily dominate the entire
> > query time. (more analysis below)
> >
> > Given this analysis, it seems like at the least the second algorithm is not
> > worse in terms of performance, and I would argue it is actually better.
>
> I agree, so far!
>
> > With the current API which ties into this algorithm in Lucene 2.9
> > (FieldComparator), there is actually a subtle behavior change: to
> > implement a custom sort, the user has to be able to do this conversion,
> > which involves a reverse mapping from value to ordinals. This is a change
> > in the contract for writing custom sort comparisons; before, the
> > ordinal->value mapping could be one to many, and now, with the requirement
> > for this reverse mapping, it is much more difficult and expensive to do.
>
> But a custom comparator need not do the reverse conversion when
> implementing FieldComparator?  Ie, it can implement the conversion
> however it wants?  Also one can always implement their own Collector
> to do the sorting.
>
> > If you agree, I think it is worthwhile to open a jira ticket for this
> while
> > this FieldComparator API is still fresh. And I'd be happy to provide my
> > implementation, test code and my index.
>
> +1
>
> Mike
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
Nice results!  Comments below...

On Thu, Oct 15, 2009 at 3:58 PM, John Wang <jo...@gmail.com> wrote:
> Hi guys:
>
>     I did some Big O math a few times and reached the same conclusion Jake
> had.
>
>     I was not sure about the code tuning opportunities we could have done
> with the MergeAtTheEnd method as Yonik mentioned and the internal behavior
> with PQ Mike suggested, so I went ahead and implemented the algorithm and
> compared against the current Lucene 2.9 implementation:
>
> set up:
> index size: 999992 docs
> segment count: 8
> sorting on 1 field, no score, e.g. comparing against
> "OneComparatorNonScoringCollector" class.
> TermQuery used on field with only 2 values (even/odd), returning 499992
> docs.

What text field are you sorting on?  (Ie where did it come from -- is
it a natural or synthetic source?).

> I broke the tests into two parts, numHits = 20 and numHits = 50, for each, I
> ran 100 iterations, took the time from 6 - 100 (to avoid field cache load
> time) and take the average. Did it 3 times and took that average.

Can you try numHits=10 and 20?  I'm curious how this difference scales
down.  I believe most searches don't go beyond page 1.

It'd be great to see sorting by other typed (simple numerics) fields,
and queries matching different # results, too.

> Here are the numbers (times are measured in nanoseconds):
>
> numHits = 50:
>
> Lucene 2.9/OneComparatorNonScoringCollector:
> num string compares: 251
> num conversions: 34
> time: 15272049
> cpu time:  161105
> num inserts into PQ: 211
>
> My test sort algorithm:
> num string compares: 51
> num conversions: 0
> time: 14943172
> cpu time: 153722
> num inserts into PQ: 1500
>
>
> numHits = 100:
>
>
> Lucene 2.9/OneComparatorNonScoringCollector:
> num string compares: 2285
> num conversions: 172
> time: 17703866
> cpu time:  187073
> num inserts into PQ: 982
>
> My test sort algorithm:
> num string compares: 210
> num conversions: 0
> time: 15823473
> cpu time: 160737
> num inserts into PQ: 6011

These are nice results.  Single PQ does look faster for this case.

> Conclusion:
>
> My measurement is consistent with my and Jake's Big O analysis, where the
> dominating factor is the same between both algorithms. And in terms of
> timing/cpu, they are very similar, with the second algorithm consistently
> performing slightly better than Lucene 2.9.
> Mike is absolutely correct about the number of PQ insertions, but that win
> seems to be shadowed by the cost of extra string compares and conversions.
> If you look at the "growth" between numHits=20 and 100, the time gap
> increases, e.g. as the PQ gets larger (the case where the user navigates to a
> higher page of the search results) the number of string compares increases
> faster than the increase of the size of the PQ.
> There is an assumption that the conversion cost is low, which may not be
> true for custom sort extensions, and could easily dominate the entire query
> time. (more analysis below)
>
> Given this analysis, it seems like at the least the second algorithm is not
> worse in terms of performance, and I would argue it is actually better.

I agree, so far!

> With the current API which ties into this algorithm in Lucene 2.9
> (FieldComparator), there is actually a subtle behavior change: to
> implement a custom sort, the user has to be able to do this conversion,
> which involves a reverse mapping from value to ordinals. This is a change in
> the contract for writing custom sort comparisons; before, the
> ordinal->value mapping could be one to many, and now, with the requirement
> for this reverse mapping, it is much more difficult and expensive to do.

But a custom comparator need not do the reverse conversion when
implementing FieldComparator?  Ie, it can implement the conversion
however it wants?  Also one can always implement their own Collector
to do the sorting.

> If you agree, I think it is worthwhile to open a jira ticket for this while
> this FieldComparator API is still fresh. And I'd be happy to provide my
> implementation, test code and my index.

+1

Mike



Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi guys:

    I did some Big O math a few times and reached the same conclusion Jake
had.

    I was not sure about the code tuning opportunities we could have done
with the MergeAtTheEnd method as Yonik mentioned and the internal behavior
with PQ Mike suggested, so I went ahead and implemented the algorithm and
compared against the current Lucene 2.9 implementation:

set up:
index size: 999992 docs
segment count: 8
sorting on 1 field, no score, e.g. comparing against
"OneComparatorNonScoringCollector" class.
TermQuery used on field with only 2 values (even/odd), returning 499992
docs.

I broke the tests into two parts, numHits = 20 and numHits = 50, for each, I
ran 100 iterations, took the time from 6 - 100 (to avoid field cache load
time) and take the average. Did it 3 times and took that average.

Here are the numbers (times are measured in nanoseconds):

numHits = 50:

Lucene 2.9/OneComparatorNonScoringCollector:
num string compares: 251
num conversions: 34
time: 15272049
cpu time:  161105
num inserts into PQ: 211

My test sort algorithm:
num string compares: 51
num conversions: 0
time: 14943172
cpu time: 153722
num inserts into PQ: 1500


numHits = 100:


Lucene 2.9/OneComparatorNonScoringCollector:
num string compares: 2285
num conversions: 172
time: 17703866
cpu time:  187073
num inserts into PQ: 982

My test sort algorithm:
num string compares: 210
num conversions: 0
time: 15823473
cpu time: 160737
num inserts into PQ: 6011

Conclusion:

My measurement is consistent with my and Jake's Big O analysis, where the
dominating factor is the same between both algorithms. And in terms of
timing/cpu, they are very similar, with the second algorithm consistently
performing slightly better than Lucene 2.9.
Mike is absolutely correct about the number of PQ insertions, but that win
seems to be shadowed by the cost of extra string compares and conversions.
If you look at the "growth" between numHits=20 and 100, the time gap
increases, e.g. as the PQ gets larger (the case where the user navigates to a
higher page of the search results) the number of string compares increases
faster than the increase of the size of the PQ.
There is an assumption that the conversion cost is low, which may not be
true for custom sort extensions, and could easily dominate the entire query
time. (more analysis below)

Given this analysis, it seems like at the least the second algorithm is not
worse in terms of performance, and I would argue it is actually better.

With the current API which ties into this algorithm in Lucene 2.9
(FieldComparator), there is actually a subtle behavior change: to
implement a custom sort, the user has to be able to do this conversion,
which involves a reverse mapping from value to ordinals. This is a change in
the contract for writing custom sort comparisons; before, the
ordinal->value mapping could be one to many, and now, with the requirement
for this reverse mapping, it is much more difficult and expensive to do.

If you agree, I think it is worthwhile to open a jira ticket for this while
this FieldComparator API is still fresh. And I'd be happy to provide my
implementation, test code and my index.

Thanks

-John




On Thu, Oct 15, 2009 at 10:06 AM, Jake Mannix <ja...@gmail.com> wrote:

> On Thu, Oct 15, 2009 at 9:12 AM, Yonik Seeley <yo...@lucidimagination.com>wrote:
>
>> On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>> > And it seems like a PQ per segment simply delays many of the slow
>> > lookups until the end where the PQs must be merged.
>>
>> Actually, I'm wrong about that part - one can simply merge on
>> values... there will be lots of string comparisons (and a new merge
>> queue) but no binary searches.
>>
>
> The number of string comparisons in the final merge of queues is
> O(K log(M) ) if you do as you're describing (have basically a PQ of sorted
> lists, the same way BooleanScorer works), where K is going to be roughly
> what, 10, 100 (numResultsToReturn), and M is 10-50?  This is like a couple
> thousand string compares, tops, and independent of the overall size of the
> segments - it's measured in microseconds of CPU time.
>
>   -jake
>

Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 15, 2009 at 9:12 AM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
> > And it seems like a PQ per segment simply delays many of the slow
> > lookups until the end where the PQs must be merged.
>
> Actually, I'm wrong about that part - one can simply merge on
> values... there will be lots of string comparisons (and a new merge
> queue) but no binary searches.
>

The number of string comparisons in the final merge of queues is
O(K log(M) ) if you do as you're describing (have basically a PQ of sorted
lists, the same way BooleanScorer works), where K is going to be roughly
what, 10, 100 (numResultsToReturn), and M is 10-50?  This is like a couple
thousand string compares, tops, and independent of the overall size of the
segments - it's measured in microseconds of CPU time.
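For concreteness, that final merge can be sketched as a small heap over the M per-segment sorted result lists (a hypothetical standalone version, not Lucene's BooleanScorer code); each emitted hit costs O(log M) value compares, giving the O(K log(M)) total:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the end-of-search merge: a heap holding one cursor per segment's
// already-sorted result list; popping the global minimum K times gives the
// final top-K with O(K log M) comparisons.
public class SegmentMerger {
    public static List<String> mergeTopK(List<List<String>> sortedSegments, int k) {
        // Each heap entry is {segmentIndex, positionInSegment}; the heap is
        // ordered by the value that cursor currently points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedSegments.get(e[0]).get(e[1])));
        for (int s = 0; s < sortedSegments.size(); s++) {
            if (!sortedSegments.get(s).isEmpty()) heads.offer(new int[]{s, 0});
        }
        List<String> out = new ArrayList<>();
        while (out.size() < k && !heads.isEmpty()) {
            int[] head = heads.poll();
            out.add(sortedSegments.get(head[0]).get(head[1]));
            if (head[1] + 1 < sortedSegments.get(head[0]).size()) {
                heads.offer(new int[]{head[0], head[1] + 1});  // advance cursor
            }
        }
        return out;
    }
}
```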

  -jake

Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> And it seems like a PQ per segment simply delays many of the slow
> lookups until the end where the PQs must be merged.

Actually, I'm wrong about that part - one can simply merge on
values... there will be lots of string comparisons (and a new merge
queue) but no binary searches.

Of course a different part of the proposal  is problematic: "This way,
we avoid the lookup from value->ord, and can keep the simpler API:
ScoreDocComparator, which is much easier to extend for custom
sorting."

We really don't want two different sort API's right?  So could you
still make this faster using the new comparator APIs?

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Oct 15, 2009 at 4:31 AM, Jake Mannix <ja...@gmail.com> wrote:
>> Conversion from one segment to another is only
>> done as needed... only the bottom slot is converted automatically when
>> the segment is switched.
>
> That's not what it looks like, actually: you convert the bottom slot, and
> as soon as you get another hit from the new segment which is
> competitive, it pushes what was the old bottom out, and there's a new
> bottom, which often will be from one of the previous segments, in which
> case *it* must be converted to the new ordering as well.

That's exactly what I said... on a segment switch you *only* convert
the bottom slot.  Other conversions are done "as needed" when you get
competitive hits.

>> Filling a complete priority queue for each segment is also more
>> expensive (from the big-O perspective).
>
> Huh?  If you have M roughly balanced segments of N docs each, and
> you want to find the top K (out of all M*N) sorted docs
[...]

I meant on a much simpler level (big-O doesn't normally consider
constants, that's what I was trying to get across).
A single priority queue means fewer priority queue insertions (with
the inserts being more expensive in general).

And it seems like a PQ per segment simply delays many of the slow
lookups until the end where the PQs must be merged.
And as Mike points out, you throw in things like branch prediction and
everything else, and one really just needs to try it out to know for
sure.

> After all - the multiple queues approach is what
> lucene has always done for the multi-reader case

Lucene has always used a single PQ for the collector for MultiReader.
Prior to 2.9, the merge was done at a lower level in the
TermEnum/TermDocs implementation, and the IndexSearcher code was
oblivious to whether it was searching a SegmentReader or a MultiReader.

-Yonik
http://www.lucidimagination.com



Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 15, 2009 at 5:59 PM, Jake Mannix <ja...@gmail.com> wrote:

>> I don't think we do any branch tuning on the PQ insertion -- the ifs
>> involved in re-heapifying the PQ are simply hard for the CPU to
>> predict (though, apparently, not as hard as comparing strings ;).
>
>> > But it does look like you do some nice short-circuiting where you compare
> to bottom instead of doing the naive insertWithOverflow() and check the
> result, which is cool.

Right, every little bit helps :)

>> > The single-PQ approach requires only O(K * log(K) * log(M))
>> > insertions, but each one requires, instead of O(log(K)) int compares,
>> > O(log(K)) *string* compares, and a segment conversion.
>>
>> Right, except, for a large segment, if it has enough insertions, as
>> more entries in the PQ belong to that segment, more of the comparisons
>> become ints.
>
> Assuming you're processing the larger segments first, as Yonik said
> was done, then this won't typically help very much, right?  If the
> second segment is much smaller than the first, then not as many
> replacements will happen, leaving the majority of the PQ having
> entries from the previous segment.  Additionally - as you move to further
> and further segments, whether they are the same size or smaller than
> previous, the bottom element of the PQ is getting better and better, and
>> > so your chances of actually filling up even a significant minority of it
>> > with entries from the current segment are very small.

True, though as the segments get smaller, and the PQ
"converges", there is less insertion being done anyway.  So the
first few insertions are costly (all string comparisons), but as
more insertions go in, more of the comparisons become ints.

>> Looking back through LUCENE-1483, we were seeing much more sizable
>> gains in the new API vs the old single-PQ-only-int-comparison approach
>> (which should be faster than both the single and multi PQ approaches
>> we are discussing here).  BUT: those tests combined searching by
>> segment AND the new FieldCollector API, plus other more basic code
>> speedups.  Though it'd be odd if the switch to searching by segment
>> really was most of the gains here.  I'm confused (for now) on where
>> the gains we were seeing in that issue came from...
>
> You don't give yourself enough credit here: it could certainly be that the
> perf improvement was from more basic code-level speedups, as we're
> only dealing with the sub-leading behavior here, and the constants
> look like they most certainly matter!

Well let's get to the bottom of it, and if the simpler API is indeed faster
across the board, we should switch back.

Mike



Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 15, 2009 at 2:33 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> I don't think we do any branch tuning on the PQ insertion -- the ifs
> involved in re-heapifying the PQ are simply hard for the CPU to
> predict (though, apparently, not as hard as comparing strings ;).
>

But it does look like you do some nice short-circuiting where you compare
to bottom instead of doing the naive insertWithOverflow() and check the
result, which is cool.
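That short-circuit can be illustrated generically (a hypothetical top-K collector, not Lucene's actual collector code): once the queue is full, every candidate first pays a single compare against the current bottom, and only competitive hits pay for heap work.

```java
import java.util.PriorityQueue;

// Sketch of the compare-to-bottom short-circuit: once the queue is full,
// each candidate is first compared against the current bottom (the weakest
// of the top-K); only competitive hits pay for a heap insertion.
public class TopKCollector {
    private final int k;
    // Min-heap of scores: the head is the "bottom" of the current top-K.
    private final PriorityQueue<Integer> pq = new PriorityQueue<>();

    public TopKCollector(int k) { this.k = k; }

    public void collect(int score) {
        if (pq.size() < k) {
            pq.offer(score);               // queue not yet full: always insert
        } else if (score > pq.peek()) {    // one cheap compare against bottom
            pq.poll();                     // evict the old bottom
            pq.offer(score);               // re-heapify only for competitive hits
        }
        // non-competitive hits cost exactly one comparison and no heap work
    }

    public int bottom() { return pq.peek(); }
}
```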


> > The single-PQ approach requires only O(K * log(K) * log(M))
> > insertions, but each one requires, instead of O(log(K)) int compares,
> > O(log(K)) *string* compares, and a segment conversion.
>
> Right, except, for a large segment, if it has enough insertions, as
> more entries in the PQ belong to that segment, more of the comparisons
> become ints.
>

Assuming you're processing the larger segments first, as Yonik said
was done, then this won't typically help very much, right?  If the
second segment is much smaller than the first, then not as many
replacements will happen, leaving the majority of the PQ having
entries from the previous segment.  Additionally - as you move to further
and further segments, whether they are the same size or smaller than
previous, the bottom element of the PQ is getting better and better, and
so your chances of actually filling up even a significant minority of it
with entries from the current segment are very small.

> Looking back through LUCENE-1483, we were seeing much more sizable
> gains in the new API vs the old single-PQ-only-int-comparison approach
> (which should be faster than both the single and multi PQ approaches
> we are discussing here).  BUT: those tests combined searching by
> segment AND the new FieldCollector API, plus other more basic code
> speedups.  Though it'd be odd if the switch to searching by segment
> really was most of the gains here.  I'm confused (for now) on where
> the gains we were seeing in that issue came from...
>

You don't give yourself enough credit here: it could certainly be that the
perf improvement was from more basic code-level speedups, as we're
only dealing with the sub-leading behavior here, and the constants
look like they most certainly matter!

  -jake

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 15, 2009 at 6:04 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Though it'd be odd if the switch to searching by segment
>> really was most of the gains here.
>
> I had assumed that much of the improvement was due to ditching
> MultiTermEnum/MultiTermDocs.
> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
> with queries that use a TermEnum (range, prefix, etc).

Yeah, and all the tests I did in LUCENE-1483 were either TermQuery or
AllDocsQuery so LUCENE-1596 shouldn't have helped.

Ditching MultiTermDocs just means saving one method call, and also
saving having to do 2nd pass on each bulk-read of docs/freqs adding in
the base to the docs.  Maybe that 2nd pass was a big part of the gain.

Mike



Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
They are included in my last patch on LUCENE-1997.  It's somewhat
hacked up though :)  We'd have to redo it "for real" if we go forward
with this...

Mike

2009/10/23 John Wang <jo...@gmail.com>:
> Hi Mike:
>     Thank you! It would be really nice to get the optimizations you have
> done.
> -John
>
> 2009/10/23 Michael McCandless <lu...@mikemccandless.com>
>>
>> Agreed: so far I'm seeing serious performance loss with MultiPQ,
>> especially as topN gets larger, and for int sorting.
>>
>> For small queue, String sort, it sometimes wins.
>>
>> So if I were forced to decide now based on the current results, I
>> think we should keep the single PQ API.
>>
>> But: I am right now optimizing John's patch to see how fast Multi PQ
>> can get.  I'll post it once I get it working, and post output from
>> re-running on my opensolaris box.
>>
>> Mike
>>
>> 2009/10/23 Mark Miller <ma...@gmail.com>:
>> >>>I still think we should if performance is no
>> >>>better with the new one.
>> >
>> > Where is there any indication performance is not better with the new
>> > one?
>> >
>> > The benchmarks are clearly against switching back. At best they could
>> > argue for two API's - even then it depends - a loss of 10% on Java 1.5
>> > with the most recent linux for a topn:10 ? I'm all for more results, but
>> > its not looking like a good switch to me. What API do I use? Well, it
>> > depends - how many docs will you ask for back, what OS are you running,
>> > how hard is it for you to grok one API over the other?
>> >
>> > And then as we make changes in the future we have to manage both APIs.
>> >
>> > bq. digging in deep and running thorough perf tests makes sense
>> >
>> > Again - no one is arguing against - dig all year - I'll help - but I
>> > don't see the treasure yet, and the hole is starting to look deep.
>> >
>> > bq. removing that if from the Multi PQ patch makes sense
>> >
>> > I didn't have a problem with that either - or other code changes - but
>> > jeeze, mention what you are seeing with the switch. I'll tell you what I
>> > saw - not that much - a bit of improvement, but take a look at the
>> > Java 1.5 run - it ended up being a blade of grass holding up a boulder
>> > on Linux.
>> >
>> >
>> >
>> > Michael McCandless wrote:
>> >> Sheesh I go to bed and so much all of a sudden happens!!
>> >>
>> >> Sorry Mark; I should've called out "PATCH IS ON 2.9 BRANCH" more
>> >> clearly ;)
>> >>
>> >> There's no question in my mind that the new comparator API is more
>> >> complex than the old one, and I really don't like that.  I had to
>> >> rewrite the section of LIA that gives an example of a [simple] custom
>> >> sort and it wasn't pleasant!  Two compare methods (compare,
>> >> compareBottom)?  Two copy methods (copy, setBottom)?  Sure, you can
>> >> grok it and get through it if you have to, but it is more complex
>> >> because it's conflated with the PQ API.
>> >>
>> >> Ease on consumption of our APIs is very important, so, only when
>> >> performance clearly warrants it should we adopt a more complex API.
>> >>
>> >> Also, yeah, it would suck to have to switch back to the old API at
>> >> this point, but net/net I still think we should if performance is no
>> >> better with the new one.
>> >>
>> >> The old API also fits cleanly with per-segment searching (John's
>> >> initial patch shows that -- it's simply another per-segment Colletor).
>> >> The two APIs (collection, comparator) are well decoupled.
>> >>
>> >> So, digging in deep and running thorough perf tests makes sense; we
>> >> need to understand the performance to make the API switch decision.
>> >> And definitely we should tune both approaches as much as possible
>> >> (removing that if from the Multi PQ patch makes sense).
>> >>
>> >> But... Multi PQ's performance isn't better in many cases... though,
>> >> we're clearly still iterating.  I'll run a 1.5 (32 & 64 bit) test,
>> >> with the if statement removed.
>> >>
>> >> Mike
>> >>
>> >> On Fri, Oct 23, 2009 at 3:53 AM, Earwin Burrfoot <ea...@gmail.com>
>> >> wrote:
>> >>
>> >>> I did.
>> >>>
>> >>> On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> Yes - I've seen a handful of non core devs report back that they
>> >>>>> upgraded with no complaints on the difficulty. Its in the mailing
>> >>>>> list
>> >>>>> archives. The only core dev I've seen say its easy is Uwe. He's
>> >>>>> super
>> >>>>> sharp though, so I wasn't banking my comment on him ;)
>> >>>>>
>> >>>> Upgrade custom sorting?  Where has anyone talked about this?
>> >>>>
>> >>>> 2.9 is great, I like the new apis, they're great in general.  It's
>> >>>> just this
>> >>>> multi-segment sorting we're talking about here.
>> >>>>
>> >>>>   -jake
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>> --
>> >>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>> >>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> >>> ICQ: 104465785
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>>
>> >>>
>> >>>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > - Mark
>> >
>> > http://www.lucidimagination.com
>> >
>> >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Mike:
    Thank you! It would be really nice to get the optimizations you have
done.

-John

2009/10/23 Michael McCandless <lu...@mikemccandless.com>

> Agreed: so far I'm seeing serious performance loss with MultiPQ,
> especially as topN gets larger, and for int sorting.
>
> For small queue, String sort, it sometimes wins.
>
> So if I were forced to decide now based on the current results, I
> think we should keep the single PQ API.
>
> But: I am right now optimizing John's patch to see how fast Multi PQ
> can get.  I'll post it once I get it working, and post output from
> re-running on my opensolaris box.
>
> Mike
>
> 2009/10/23 Mark Miller <ma...@gmail.com>:
> >>>I still think we should if performance is no
> >>>better with the new one.
> >
> > Where is there any indication performance is not better with the new one?
> >
> > The benchmarks are clearly against switching back. At best they could
> argue for two API's - even then it depends - a loss of 10% on Java 1.5
> > with the most recent linux for a topn:10 ? I'm all for more results, but
> its not looking like a good switch to me. What API do I use? Well, it
> depends - how many docs will you ask for back, what OS are running, how hard
> is it for you to grok one API over the other?
> >
> > And then as we make changes in the future we have to manage both APIs.
> >
> > bq. digging in deep and running thorough perf tests makes sense
> >
> > Again - no one is arguing against - dig all year - I'll help - but I
> don't see the treasure yet, and the hole is starting to look deep.
> >
> > bq. removing that if from the Multi PQ patch makes sense
> >
> > I didn't have a problem with that either - or other code changes - but
> > jeeze, mention what you are seeing with the switch. I'll tell you what I
> > saw it - not that much - a bit of improvement, but take a look at the
> > Java 1.5 run - it ended up being a blade of grass holding up a boulder
> > on Linux.
> >
> >
> >
> > Michael McCandless wrote:
> >> Sheesh I go to bed and so much all of a sudden happens!!
> >>
> >> Sorry Mark; I should've called out "PATCH IS ON 2.9 BRANCH" more
> >> clearly ;)
> >>
> >> There's no question in my mind that the new comparator API is more
> >> complex than the old one, and I really don't like that.  I had to
> >> rewrite the section of LIA that gives an example of a [simple] custom
> >> sort and it wasn't pleasant!  Two compare methods (compare,
> >> compareBottom)?  Two copy methods (copy, setBottom)?  Sure, you can
> >> grok it and get through it if you have to, but it is more complex
> >> because it's conflated with the PQ API.
> >>
> >> Ease on consumption of our APIs is very important, so, only when
> >> performance clearly warrants it should we adopt a more complex API.
> >>
> >> Also, yeah, it would suck to have to switch back to the old API at
> >> this point, but net/net I still think we should if performance is no
> >> better with the new one.
> >>
> >> The old API also fits cleanly with per-segment searching (John's
> >> initial patch shows that -- it's simply another per-segment Colletor).
> >> The two APIs (collection, comparator) are well decoupled.
> >>
> >> So, digging in deep and running thorough perf tests makes sense; we
> >> need to understand the performance to make the API switch decision.
> >> And definitely we should tune both approaches as much as possible
> >> (removing that if from the Multi PQ patch makes sense).
> >>
> >> But... Multi PQ's performance isn't better in many cases... though,
> >> we're clearly still iterating.  I'll run a 1.5 (32 & 64 bit) test,
> >> with the if statement removed.
> >>
> >> Mike
> >>
> >> On Fri, Oct 23, 2009 at 3:53 AM, Earwin Burrfoot <ea...@gmail.com>
> wrote:
> >>
> >>> I did.
> >>>
> >>> On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com>
> wrote:
> >>>
> >>>> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>>>
> >>>>> Yes - I've seen a handful of non core devs report back that they
> >>>>> upgraded with no complaints on the difficulty. Its in the mailing
> list
> >>>>> archives. The only core dev I've seen say its easy is Uwe. He's super
> >>>>> sharp though, so I wasn't banking my comment on him ;)
> >>>>>
> >>>> Upgrade custom sorting?  Where has anyone talked about this?
> >>>>
> >>>> 2.9 is great, I like the new apis, they're great in general.  It's
> just this
> >>>> multi-segment sorting we're talking about here.
> >>>>
> >>>>   -jake
> >>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> >>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> >>> ICQ: 104465785
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>>
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
Agreed: so far I'm seeing serious performance loss with MultiPQ,
especially as topN gets larger, and for int sorting.

For small queues with String sort, it sometimes wins.

So if I were forced to decide now based on the current results, I
think we should keep the single PQ API.

But: I am right now optimizing John's patch to see how fast Multi PQ
can get.  I'll post it once I get it working, and post output from
re-running on my opensolaris box.
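
The Multi PQ approach under discussion keeps one top-N queue per segment, so within-segment compares can stay ord-based, and only compares across segments in a final merge. Here is a minimal sketch of just that merge step, after each segment's ords have already been resolved back to values -- this is an illustration, not John's actual patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of the Multi PQ merge step: combine per-segment top-N lists
// (each already sorted best-first) into a single global top-N. The
// cross-segment, value-based compare happens only here, once per search.
class MultiPQMergeSketch {
    static List<String> merge(List<List<String>> perSegmentTops, int n) {
        List<String> all = new ArrayList<String>();
        for (List<String> segTop : perSegmentTops) {
            all.addAll(segTop); // at most numSegments * n entries in total
        }
        Collections.sort(all);  // the only place values cross segment boundaries
        return all.subList(0, Math.min(n, all.size()));
    }

    public static void main(String[] args) {
        List<List<String>> tops = new ArrayList<List<String>>();
        tops.add(Arrays.asList("apple", "cherry"));
        tops.add(Arrays.asList("banana", "date"));
        System.out.println(merge(tops, 3)); // [apple, banana, cherry]
    }
}
```

The extra cost relative to the single PQ is the (numSegments - 1) * n extra queue entries; the saving is that no value lookup is needed while collecting hits.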

Mike

2009/10/23 Mark Miller <ma...@gmail.com>:
>>>I still think we should if performance is no
>>>better with the new one.
>
> Where is there any indication performance is not better with the new one?
>
> The benchmarks are clearly against switching back. At best they could argue for two API's - even then it depends - a loss of 10% on Java 1.5
> with the most recent linux for a topn:10 ? I'm all for more results, but its not looking like a good switch to me. What API do I use? Well, it depends - how many docs will you ask for back, what OS are running, how hard is it for you to grok one API over the other?
>
> And then as we make changes in the future we have to manage both APIs.
>
> bq. digging in deep and running thorough perf tests makes sense
>
> Again - no one is arguing against - dig all year - I'll help - but I don't see the treasure yet, and the hole is starting to look deep.
>
> bq. removing that if from the Multi PQ patch makes sense
>
> I didn't have a problem with that either - or other code changes - but
> jeeze, mention what you are seeing with the switch. I'll tell you what I
> saw it - not that much - a bit of improvement, but take a look at the
> Java 1.5 run - it ended up being a blade of grass holding up a boulder
> on Linux.
>
>
>
> Michael McCandless wrote:
>> Sheesh I go to bed and so much all of a sudden happens!!
>>
>> Sorry Mark; I should've called out "PATCH IS ON 2.9 BRANCH" more
>> clearly ;)
>>
>> There's no question in my mind that the new comparator API is more
>> complex than the old one, and I really don't like that.  I had to
>> rewrite the section of LIA that gives an example of a [simple] custom
>> sort and it wasn't pleasant!  Two compare methods (compare,
>> compareBottom)?  Two copy methods (copy, setBottom)?  Sure, you can
>> grok it and get through it if you have to, but it is more complex
>> because it's conflated with the PQ API.
>>
>> Ease on consumption of our APIs is very important, so, only when
>> performance clearly warrants it should we adopt a more complex API.
>>
>> Also, yeah, it would suck to have to switch back to the old API at
>> this point, but net/net I still think we should if performance is no
>> better with the new one.
>>
>> The old API also fits cleanly with per-segment searching (John's
>> initial patch shows that -- it's simply another per-segment Colletor).
>> The two APIs (collection, comparator) are well decoupled.
>>
>> So, digging in deep and running thorough perf tests makes sense; we
>> need to understand the performance to make the API switch decision.
>> And definitely we should tune both approaches as much as possible
>> (removing that if from the Multi PQ patch makes sense).
>>
>> But... Multi PQ's performance isn't better in many cases... though,
>> we're clearly still iterating.  I'll run a 1.5 (32 & 64 bit) test,
>> with the if statement removed.
>>
>> Mike
>>
>> On Fri, Oct 23, 2009 at 3:53 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>>
>>> I did.
>>>
>>> On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com> wrote:
>>>
>>>> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com> wrote:
>>>>
>>>>> Yes - I've seen a handful of non core devs report back that they
>>>>> upgraded with no complaints on the difficulty. Its in the mailing list
>>>>> archives. The only core dev I've seen say its easy is Uwe. He's super
>>>>> sharp though, so I wasn't banking my comment on him ;)
>>>>>
>>>> Upgrade custom sorting?  Where has anyone talked about this?
>>>>
>>>> 2.9 is great, I like the new apis, they're great in general.  It's just this
>>>> multi-segment sorting we're talking about here.
>>>>
>>>>   -jake
>>>>
>>>>
>>>>
>>>
>>> --
>>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>>> ICQ: 104465785
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Mark Miller wrote:
> bq. removing that if from the Multi PQ patch makes sense
>
> I didn't have a problem with that either - or other code changes - but
> jeeze, mention what you are seeing with the switch. I'll tell you what I
> saw it - not that much - a bit of improvement, but take a look at the
> Java 1.5 run - it ended up being a blade of grass holding up a boulder
> on Linux.
>   
In fact, looking closer at my runs, for me it didn't really help at all.
I had noticed a difference of a couple to a few percent here and there,
but now looking closer, it goes both ways on different runs - some
slightly better, some slightly worse - making me think that, at least
for my setup, it just doesn't matter.

-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
>>I still think we should if performance is no
>>better with the new one.

Where is there any indication performance is not better with the new one?

The benchmarks are clearly against switching back. At best they could argue for two APIs - even then it depends - a loss of 10% on Java 1.5
with the most recent Linux for a topn:10? I'm all for more results, but it's not looking like a good switch to me. Which API do I use? Well, it depends - how many docs will you ask for back, what OS you are running, and how hard is it for you to grok one API over the other?

And then as we make changes in the future we have to manage both APIs.

bq. digging in deep and running thorough perf tests makes sense

Again - no one is arguing against - dig all year - I'll help - but I don't see the treasure yet, and the hole is starting to look deep. 

bq. removing that if from the Multi PQ patch makes sense

I didn't have a problem with that either - or other code changes - but
jeeze, mention what you are seeing with the switch. I'll tell you what I
saw - not that much - a bit of improvement, but take a look at the
Java 1.5 run - it ended up being a blade of grass holding up a boulder
on Linux.



Michael McCandless wrote:
> Sheesh I go to bed and so much all of a sudden happens!!
>
> Sorry Mark; I should've called out "PATCH IS ON 2.9 BRANCH" more
> clearly ;)
>
> There's no question in my mind that the new comparator API is more
> complex than the old one, and I really don't like that.  I had to
> rewrite the section of LIA that gives an example of a [simple] custom
> sort and it wasn't pleasant!  Two compare methods (compare,
> compareBottom)?  Two copy methods (copy, setBottom)?  Sure, you can
> grok it and get through it if you have to, but it is more complex
> because it's conflated with the PQ API.
>
> Ease on consumption of our APIs is very important, so, only when
> performance clearly warrants it should we adopt a more complex API.
>
> Also, yeah, it would suck to have to switch back to the old API at
> this point, but net/net I still think we should if performance is no
> better with the new one.
>
> The old API also fits cleanly with per-segment searching (John's
> initial patch shows that -- it's simply another per-segment Colletor).
> The two APIs (collection, comparator) are well decoupled.
>
> So, digging in deep and running thorough perf tests makes sense; we
> need to understand the performance to make the API switch decision.
> And definitely we should tune both approaches as much as possible
> (removing that if from the Multi PQ patch makes sense).
>
> But... Multi PQ's performance isn't better in many cases... though,
> we're clearly still iterating.  I'll run a 1.5 (32 & 64 bit) test,
> with the if statement removed.
>
> Mike
>
> On Fri, Oct 23, 2009 at 3:53 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>   
>> I did.
>>
>> On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com> wrote:
>>     
>>> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com> wrote:
>>>       
>>>> Yes - I've seen a handful of non core devs report back that they
>>>> upgraded with no complaints on the difficulty. Its in the mailing list
>>>> archives. The only core dev I've seen say its easy is Uwe. He's super
>>>> sharp though, so I wasn't banking my comment on him ;)
>>>>         
>>> Upgrade custom sorting?  Where has anyone talked about this?
>>>
>>> 2.9 is great, I like the new apis, they're great in general.  It's just this
>>> multi-segment sorting we're talking about here.
>>>
>>>   -jake
>>>
>>>
>>>       
>>
>> --
>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> ICQ: 104465785
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Yup, I'm not against the testing or the thought - and it is clearly more
complicated - I'm not saying it's not. But I haven't seen anyone that's
come and said they haven't grokked it yet or that they had a hard time
with it (though they have run into limitations in what they have tried
to do). John and Jake are not the first to upgrade (and while they have
noted it's more complicated, they grokked it too).

It's a matter of how difficult it is, not that it's a little more
complicated than the old one. Doing this is an advanced Lucene op - a bunch
of people have done it already - these guys are not the first - no one
has really gotten tripped up.

I'm not in love with the new API, but I'm still waiting to see the list
of people that haven't grokked it.

I don't like the idea of supporting two APIs and I don't like the
idea of herding everyone back right after herding them over. If it
clearly made sense, I'd have no reason to be against it - but I'm just not
seeing that myself.

Michael McCandless wrote:
> Sheesh I go to bed and so much all of a sudden happens!!
>
> Sorry Mark; I should've called out "PATCH IS ON 2.9 BRANCH" more
> clearly ;)
>
> There's no question in my mind that the new comparator API is more
> complex than the old one, and I really don't like that.  I had to
> rewrite the section of LIA that gives an example of a [simple] custom
> sort and it wasn't pleasant!  Two compare methods (compare,
> compareBottom)?  Two copy methods (copy, setBottom)?  Sure, you can
> grok it and get through it if you have to, but it is more complex
> because it's conflated with the PQ API.
>
> Ease on consumption of our APIs is very important, so, only when
> performance clearly warrants it should we adopt a more complex API.
>
> Also, yeah, it would suck to have to switch back to the old API at
> this point, but net/net I still think we should if performance is no
> better with the new one.
>
> The old API also fits cleanly with per-segment searching (John's
> initial patch shows that -- it's simply another per-segment Colletor).
> The two APIs (collection, comparator) are well decoupled.
>
> So, digging in deep and running thorough perf tests makes sense; we
> need to understand the performance to make the API switch decision.
> And definitely we should tune both approaches as much as possible
> (removing that if from the Multi PQ patch makes sense).
>
> But... Multi PQ's performance isn't better in many cases... though,
> we're clearly still iterating.  I'll run a 1.5 (32 & 64 bit) test,
> with the if statement removed.
>
> Mike
>
> On Fri, Oct 23, 2009 at 3:53 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
>   
>> I did.
>>
>> On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com> wrote:
>>     
>>> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com> wrote:
>>>       
>>>> Yes - I've seen a handful of non core devs report back that they
>>>> upgraded with no complaints on the difficulty. Its in the mailing list
>>>> archives. The only core dev I've seen say its easy is Uwe. He's super
>>>> sharp though, so I wasn't banking my comment on him ;)
>>>>         
>>> Upgrade custom sorting?  Where has anyone talked about this?
>>>
>>> 2.9 is great, I like the new apis, they're great in general.  It's just this
>>> multi-segment sorting we're talking about here.
>>>
>>>   -jake
>>>
>>>
>>>       
>>
>> --
>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> ICQ: 104465785
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
Sheesh I go to bed and so much all of a sudden happens!!

Sorry Mark; I should've called out "PATCH IS ON 2.9 BRANCH" more
clearly ;)

There's no question in my mind that the new comparator API is more
complex than the old one, and I really don't like that.  I had to
rewrite the section of LIA that gives an example of a [simple] custom
sort and it wasn't pleasant!  Two compare methods (compare,
compareBottom)?  Two copy methods (copy, setBottom)?  Sure, you can
grok it and get through it if you have to, but it is more complex
because it's conflated with the PQ API.
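
For readers who haven't seen the 2.9 API, here is a stripped-down analogue of that contract for an int field -- a self-contained illustration of why there are two compare methods and two copy methods, not the real org.apache.lucene.search.FieldComparator:

```java
// The comparator caches the queue's per-slot values plus its current
// bottom, which is what forces the compare/compareBottom/copy/setBottom
// split: the comparator and the priority queue are coupled.
class MiniIntComparator {
    private final int[] fieldValues; // per-doc sort values (stand-in for the FieldCache)
    private final int[] slotValues;  // values currently held by the queue, indexed by slot
    private int bottom;              // cached value of the weakest (bottom) slot

    MiniIntComparator(int[] fieldValues, int numSlots) {
        this.fieldValues = fieldValues;
        this.slotValues = new int[numSlots];
    }
    /** Compare two hits already in the queue. */
    int compare(int slot1, int slot2) {
        int a = slotValues[slot1], b = slotValues[slot2];
        return a < b ? -1 : (a == b ? 0 : 1);
    }
    /** Called whenever the queue's weakest entry changes. */
    void setBottom(int slot) { bottom = slotValues[slot]; }
    /** Fast path: compare an incoming doc against the cached bottom. */
    int compareBottom(int doc) {
        int v = fieldValues[doc];
        return bottom < v ? -1 : (bottom == v ? 0 : 1);
    }
    /** Admit a doc into the queue by copying its value into a slot. */
    void copy(int slot, int doc) { slotValues[slot] = fieldValues[doc]; }
}
```

Compare that to the single compare(ScoreDoc, ScoreDoc) of the old ScoreDocComparator, and the extra surface area is plain.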

Ease of consumption of our APIs is very important, so, only when
performance clearly warrants it should we adopt a more complex API.

Also, yeah, it would suck to have to switch back to the old API at
this point, but net/net I still think we should if performance is no
better with the new one.

The old API also fits cleanly with per-segment searching (John's
initial patch shows that -- it's simply another per-segment Collector).
The two APIs (collection, comparator) are well decoupled.

So, digging in deep and running thorough perf tests makes sense; we
need to understand the performance to make the API switch decision.
And definitely we should tune both approaches as much as possible
(removing that if from the Multi PQ patch makes sense).

But... Multi PQ's performance isn't better in many cases... though,
we're clearly still iterating.  I'll run a 1.5 (32 & 64 bit) test,
with the if statement removed.

Mike

On Fri, Oct 23, 2009 at 3:53 AM, Earwin Burrfoot <ea...@gmail.com> wrote:
> I did.
>
> On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com> wrote:
>>
>> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com> wrote:
>>>
>>> Yes - I've seen a handful of non core devs report back that they
>>> upgraded with no complaints on the difficulty. Its in the mailing list
>>> archives. The only core dev I've seen say its easy is Uwe. He's super
>>> sharp though, so I wasn't banking my comment on him ;)
>>
>> Upgrade custom sorting?  Where has anyone talked about this?
>>
>> 2.9 is great, I like the new apis, they're great in general.  It's just this
>> multi-segment sorting we're talking about here.
>>
>>   -jake
>>
>>
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



Re: lucene 2.9 sorting algorithm

Posted by Earwin Burrfoot <ea...@gmail.com>.
I did.

On Fri, Oct 23, 2009 at 09:05, Jake Mannix <ja...@gmail.com> wrote:
>
> On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com> wrote:
>>
>> Yes - I've seen a handful of non core devs report back that they
>> upgraded with no complaints on the difficulty. Its in the mailing list
>> archives. The only core dev I've seen say its easy is Uwe. He's super
>> sharp though, so I wasn't banking my comment on him ;)
>
> Upgrade custom sorting?  Where has anyone talked about this?
>
> 2.9 is great, I like the new apis, they're great in general.  It's just this
> multi-segment sorting we're talking about here.
>
>   -jake
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785



Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller <ma...@gmail.com> wrote:

> Yes - I've seen a handful of non core devs report back that they
> upgraded with no complaints on the difficulty. Its in the mailing list
> archives. The only core dev I've seen say its easy is Uwe. He's super
> sharp though, so I wasn't banking my comment on him ;)
>

Upgrade custom sorting?  Where has anyone talked about this?

2.9 is great, I like the new apis, they're great in general.  It's just this
multi-segment sorting we're talking about here.

  -jake

RE: lucene 2.9 sorting algorithm

Posted by Uwe Schindler <uw...@thetaphi.de>.
> Yes - I've seen a handful of non core devs report back that they
> upgraded with no complaints on the difficulty. Its in the mailing list
> archives. The only core dev I've seen say its easy is Uwe. He's super
> sharp though, so I wasn't banking my comment on him ;)

I didn't say it's easy -- for me it was easy because I took the core
comparators as an example; just from reading the docs it is really not easy!
To make it easier, I already proposed some add-on helper classes for
Comparable<T> or JDK Comparator<T>s.
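
Something like the following is presumably what such a helper could look like: a base class that implements the four-method contract on top of a plain JDK Comparator<T>, so a user supplies only the per-document value. All names here are invented for illustration; this is not a real Lucene API:

```java
import java.util.Comparator;

// Hypothetical helper: adapts a JDK Comparator<T> to the
// compare/compareBottom/copy/setBottom contract of the 2.9-style API.
abstract class ComparableDocComparator<T> {
    private final Comparator<T> cmp;
    private final Object[] slotValues; // queue slot values, managed for the user
    private T bottom;                  // cached weakest slot value

    ComparableDocComparator(Comparator<T> cmp, int numSlots) {
        this.cmp = cmp;
        this.slotValues = new Object[numSlots];
    }
    /** The only method a user must implement: the sort value for a doc. */
    protected abstract T value(int doc);

    @SuppressWarnings("unchecked")
    final int compare(int slot1, int slot2) {
        return cmp.compare((T) slotValues[slot1], (T) slotValues[slot2]);
    }
    @SuppressWarnings("unchecked")
    final void setBottom(int slot) { bottom = (T) slotValues[slot]; }
    final int compareBottom(int doc) { return cmp.compare(bottom, value(doc)); }
    final void copy(int slot, int doc) { slotValues[slot] = value(doc); }
}
```

The trade-off is an Object-typed slot array (boxing for primitives), which is exactly the overhead the specialized core comparators avoid.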

Uwe




Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Jake Mannix wrote:
>
>
> On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     >> he new API is much harder for the
>     >> average user to use, and even for the experienced user, it's not
>     terribly fun,
>     >> and more importantly:
>
>     Do we have enough info to support that though? All the cases I
>     have seen
>     on the list, people have figured it out pretty easily - havn't really
>     seen any complaints in that regard (not counting you and John -
>     that is
>     two). The only other complaints I have noticed are those that happened
>     to count on unsupported behavior (eg people counting on no
>     MultiSearcher
>     use)
>
>
> John and I and TomS all found it both complex, and we're all pretty
> serious
> users of inner lucene apis.
>
> You see *core developers* saying the api seems fine.  Have you seen
> *any users*
> of the new sorting api say anything positive about it?  Do you know of
> *anyone* who
> has implemented the new comparator interface at all,  let alone
> *likes* it? 
> 3 negative votes by users, in comparison to *zero* positive votes by
> users
> together with a bunch of core developers saying, "yeah it looks easy,
> what are
> you guys complaining about?".
>
> Internal apis take a while to percolate out to the user base - we're
> only the first
> few running into this, and while the sample size is small, it
> shouldn't be discounted.
>
> Yes, of course it is possible to migrate to the new APIs - which is
> what we, as well
> as many others, were in the process of doing.  This is just an example
> of an API
> which got more complex in going to 2.9, and unlike the Collector API,
> it's possible
> that in this case it wasn't necessary for it to be as complex as it did.
>
>   -jake
>
Yes - I've seen a handful of non core devs report back that they
upgraded with no complaints on the difficulty. It's in the mailing list
archives. The only core dev I've seen say it's easy is Uwe. He's super
sharp though, so I wasn't banking my comment on him ;)

-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller <ma...@gmail.com> wrote:

> >> the new API is much harder for the
> >> average user to use, and even for the experienced user, it's not
> terribly fun,
> >> and more importantly:
>
> Do we have enough info to support that though? All the cases I have seen
> on the list, people have figured it out pretty easily - haven't really
> seen any complaints in that regard (not counting you and John - that is
> two). The only other complaints I have noticed are those that happened
> to count on unsupported behavior (eg people counting on no MultiSearcher
> use)
>

John and I and TomS all found it complex, and we're all pretty serious
users of Lucene's inner APIs.

You see *core developers* saying the api seems fine.  Have you seen *any
users*
of the new sorting api say anything positive about it?  Do you know of
*anyone* who
has implemented the new comparator interface at all,  let alone *likes* it?

3 negative votes by users, in comparison to *zero* positive votes by users
together with a bunch of core developers saying, "yeah it looks easy, what
are
you guys complaining about?".

Internal apis take a while to percolate out to the user base - we're only
the first
few running into this, and while the sample size is small, it shouldn't be
discounted.

Yes, of course it is possible to migrate to the new APIs - which is what we,
as well
as many others, were in the process of doing.  This is just an example of an
API
which got more complex in going to 2.9, and unlike the Collector API, it's
possible
that in this case it wasn't necessary for it to be as complex as it did.

  -jake

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
>> the new API is much harder for the
>> average user to use, and even for the experienced user, it's not
terribly fun,
>> and more importantly:

Do we have enough info to support that though? All the cases I have seen
on the list, people have figured it out pretty easily - haven't really
seen any complaints in that regard (not counting you and John - that is
two). The only other complaints I have noticed are those that happened
to count on unsupported behavior (eg people counting on no MultiSearcher
use)

I think Uwe had some good ideas for exposing an easier API with the new one.


Jake Mannix wrote:
> On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley
> <yonik@lucidimagination.com <ma...@lucidimagination.com>> wrote:
>
>     On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix
>     <jake.mannix@gmail.com <ma...@gmail.com>> wrote:
>     > It's hard to read the column format, but if you look up above in
>     the thread
>     > from tonight,
>     > you can see that yes, for PQ sizes less than 100 elements,
>     multiPQ is
>     > better, and only
>     > starts to be worse at around 100 for strings, and 50 for ints.
>
>     Ah, OK, I had missed John's followup with the numbers.
>
>     I assume this is for Java5 + optimizations?
>
>
> Yeah, this was for Java5 + optimizations.
>  
>
>     What does Java6 show?
>
>
> Java6 on Mac showed close to what Mike posted in his report on the
> Jira ticket -
> that single-PQ performs a little better for small pq, and more like
> 30-40% better
> for large pq. 
>  
>
>     My biggest reservation is that we've gone down the road of telling
>     people to implement a new style of comparators, and told them that the
>     old style comparators would be deleted in the next release (which is
>     where we are).  Reversing that will be a bit of a headache/question...
>     the new stuff isn't deprecated, and having *both* isn't desirable, but
>     that's a separate decision to be made apart from performance testing.
>
>
> Well the issue comes down to: if the performance is *basically comparable*
> between the two approaches, then the new API is much harder for the
> average user to use, and even for the experienced user, it's not
> terribly fun,
> and more importantly: for the user who has already implemented custom
> sorts on the old API, upgrading is enough trouble that people may decide
> it's not worth it.  It probably *is* worth it, but if you're going to
> even put that
> kind of thinking in the user's head, you've got to ask yourself:
> what's the
> reasoning for going with a more complex API if you can get equal (slightly
> better in some cases, slightly worse in others) performance with a
> simpler
> API?
>
> Yes, as Mike says, the new API is *not* breaking back-compat in a
> functional sense, but how many users have converted to the new sorting
> api already?  2.9 has barely just come out, and while it's work for the
> community as a whole to reconsider the multi-segment sorting api, and
> work to implement a change at this level, if it's the right thing to do,
> we shouldn't let the question of which method is deprecated dictate
> which one *should* be deprecated.
>
>
>     Is there also an option of using a multiPQ approach with the new style
>     comparators?
>
>
> For the record: that would be the worst of all worlds, in my view: harder
> API with only better performance in some cases, and sometimes worse
> performance.
>
>   -jake


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > It's hard to read the column format, but if you look up above in the
> thread
> > from tonight,
> > you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> > better, and only
> > starts to be worse at around 100 for strings, and 50 for ints.
>
> Ah, OK, I had missed John's followup with the numbers.
>
> I assume this is for Java5 + optimizations?
>

Yeah, this was for Java5 + optimizations.


> What does Java6 show?
>

Java6 on Mac showed close to what Mike posted in his report on the Jira
ticket -
that single-PQ performs a little better for small pq, and more like 30-40%
better
for large pq.


> My biggest reservation is that we've gone down the road of telling
> people to implement a new style of comparators, and told them that the
> old style comparators would be deleted in the next release (which is
> where we are).  Reversing that will be a bit of a headache/question...
> the new stuff isn't deprecated, and having *both* isn't desirable, but
> that's a separate decision to be made apart from performance testing.
>

Well the issue comes down to: if the performance is *basically comparable*
between the two approaches, then the new API is much harder for the
average user to use, and even for the experienced user, it's not terribly
fun,
and more importantly: for the user who has already implemented custom
sorts on the old API, upgrading is enough trouble that people may decide
it's not worth it.  It probably *is* worth it, but if you're going to even
put that
kind of thinking in the user's head, you've got to ask yourself: what's the
reasoning for going with a more complex API if you can get equal (slightly
better in some cases, slightly worse in others) performance with a simpler
API?

Yes, as Mike says, the new API is *not* breaking back-compat in a
functional sense, but how many users have converted to the new sorting
api already?  2.9 has barely just come out, and while it's work for the
community as a whole to reconsider the multi-segment sorting api, and
work to implement a change at this level, if it's the right thing to do,
we shouldn't let the question of which method is deprecated dictate
which one *should* be deprecated.


> Is there also an option of using a multiPQ approach with the new style
> comparators?
>

For the record: that would be the worst of all worlds, in my view: harder
API with only better performance in some cases, and sometimes worse
performance.

  -jake

Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Yonik
    I am, but I don't think I should. Even with branching etc., I shouldn't see
that much of a consistent difference.
    I am traveling with my MacBook Pro; I wanted to eliminate all variables.
It really does not make sense to me...

-John

On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Oct 22, 2009 at 10:35 PM, John Wang <jo...@gmail.com> wrote:
> >        Please be patient with me. I am seeing a difference and was
> wondering
> > if Mike would see the same thing.
>
> Some differences are bound to be seen... with your changes (JVM
> changes, branch optimizations), are you seeing better average
> performance with multiPQ?
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
Of course, John's running on his Mac laptop, which may also be a factor;
that's another reason why he wanted to see if these results carried over onto a
Linux desktop (for example).

  -jake

On Thu, Oct 22, 2009 at 8:11 PM, Jake Mannix <ja...@gmail.com> wrote:

> It's hard to read the column format, but if you look up above in the thread
> from tonight,
> you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> better, and only
> starts to be worse at around 100 for strings, and 50 for ints.
>
>   -jake
>
>
> On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:
>
>> On Thu, Oct 22, 2009 at 10:35 PM, John Wang <jo...@gmail.com> wrote:
>> >        Please be patient with me. I am seeing a difference and was
>> wondering
>> > if Mike would see the same thing.
>>
>> Some differences are bound to be seen... with your changes (JVM
>> changes, branch optimizations), are you seeing better average
>> performance with multiPQ?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Yonik:
    I have been head-deep in this, trying to find a good solution, for the
better part of the past two days; it's been hard because there are so many
variables:

1) how optimized the code is in either of the implementations
2) VM differences
3) HW, etc.

    Also, there are quite a few dimensions this issue is being discussed on:

Algorithm:

    I think we should NOT jump to the conclusion that my number on the
multiQ is valid until others reproduce it (which is one of the reasons I
asked Mike to run his benchmark again with 1.5). I am gonna try to run it on
server machines when I get back to my office next week.

    Overall, I think the single Q algorithm is better. (It does, however, pay
a price for some string compares, etc.) Its benefit becomes more and more
significant when the product of PQ size and segment count increases, which
makes complete sense from the algorithm. However, when PQ size is small
(which is in most of the cases, the multiplier on the segment count is also
small) the benefit is not as obvious. And sometimes the trade-off for the
constant string compare cost may not be worth it. (this remains a
hypothesis)

    With Java 1.6, maybe the singleQ approach is a winner in all cases.

     I will spend more time to find out a more definitive answer.

API:

    The new FieldComparator API is not difficult to understand (especially
for Lucene experts such as yourselves), but it is more involved than
the ScoreDocComparator API. I think anyone would agree with
that. Furthermore, when implementing some custom comparators, (examples I
have given earlier in this thread), it can be difficult to implement while
maintaining performance.

    I understand changing API is hard, that is why I am trying to raise this
as soon as possible, and it could very well be that the current API is fine.

    Lucene's collector API allows anyone to plug in any sorting algorithm,
kinda like what Mike has done with the tests. So it is OK if the selected API
does not fit everyone's needs.

    In conclusion, please understand I am not trying to be "right" on this,
just trying to learn and to understand, which I did from reading and trying
to understand the code, along with guidance from Mike and Yonik and I am
more than impressed with the thoughts and code tuning that went into it.

Thanks

-John

On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > It's hard to read the column format, but if you look up above in the
> thread
> > from tonight,
> > you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> > better, and only
> > starts to be worse at around 100 for strings, and 50 for ints.
>
> Ah, OK, I had missed John's followup with the numbers.
>
> I assume this is for Java5 + optimizations?
> What does Java6 show?
>
> bq. Point 2 does indeed make a difference, we've seen it, and it's
> only fair: the single pq comparator does this branch optimization but
> the current patch multi-pq does not, so let's level the playing field.
>
> Of course - it's not about leveling the playing field, but finding the
> best solution for the average case - so everything should be optimized
> as much as possible.  There are probably further optimizations
> possible in both the single and multi PQ cases.
>
> My biggest reservation is that we've gone down the road of telling
> people to implement a new style of comparators, and told them that the
> old style comparators would be deleted in the next release (which is
> where we are).  Reversing that will be a bit of a headache/question...
> the new stuff isn't deprecated, and having *both* isn't desirable, but
> that's a separate decision to be made apart from performance testing.
>
> Is there also an option of using a multiPQ approach with the new style
> comparators?
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
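
A rough sketch of the multi-PQ idea John is weighing above, in Java (hypothetical class and method names, not Lucene's actual API): each segment gets its own queue, all hot-loop comparisons are plain int ord compares, and string values are only materialized once, during the final cross-segment merge.

```java
import java.util.*;

// Hypothetical sketch of the multi-PQ approach: one ord-based queue per
// segment, with a single value-based merge across segments at the end.
class MultiPQSketch {
    // segmentValues: per-segment ord -> value tables (the FieldCache-style data)
    // segmentHits:   the ords of the hits collected in each segment
    static List<String> topN(List<String[]> segmentValues,
                             List<int[]> segmentHits,
                             int n) {
        List<PriorityQueue<Integer>> queues = new ArrayList<>();
        for (int seg = 0; seg < segmentHits.size(); seg++) {
            // Max-heap of size n keeps the n smallest ords in this segment;
            // every comparison here is a cheap int compare, no value lookups.
            PriorityQueue<Integer> pq = new PriorityQueue<>(Comparator.reverseOrder());
            for (int ord : segmentHits.get(seg)) {
                if (pq.size() < n) {
                    pq.add(ord);
                } else if (ord < pq.peek()) {
                    pq.poll();
                    pq.add(ord);
                }
            }
            queues.add(pq);
        }
        // Only now do we pay for value lookups: merge the per-segment
        // queues by actual value and keep the overall top n.
        List<String> merged = new ArrayList<>();
        for (int seg = 0; seg < queues.size(); seg++) {
            for (int ord : queues.get(seg)) {
                merged.add(segmentValues.get(seg)[ord]);
            }
        }
        Collections.sort(merged);
        return merged.subList(0, Math.min(n, merged.size()));
    }
}
```

This makes the trade-off the thread is debating concrete: roughly (N-1)*n extra queue entries in exchange for never converting a value back to an ord on a segment transition.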

Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix <ja...@gmail.com> wrote:
> It's hard to read the column format, but if you look up above in the thread
> from tonight,
> you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> better, and only
> starts to be worse at around 100 for strings, and 50 for ints.

Ah, OK, I had missed John's followup with the numbers.

I assume this is for Java5 + optimizations?
What does Java6 show?

bq. Point 2 does indeed make a difference, we've seen it, and it's
only fair: the single pq comparator does this branch optimization but
the current patch multi-pq does not, so let's level the playing field.

Of course - it's not about leveling the playing field, but finding the
best solution for the average case - so everything should be optimized
as much as possible.  There are probably further optimizations
possible in both the single and multi PQ cases.

My biggest reservation is that we've gone down the road of telling
people to implement a new style of comparators, and told them that the
old style comparators would be deleted in the next release (which is
where we are).  Reversing that will be a bit of a headache/question...
the new stuff isn't deprecated, and having *both* isn't desirable, but
that's a separate decision to be made apart from performance testing.

Is there also an option of using a multiPQ approach with the new style
comparators?

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
It's hard to read the column format, but if you look up above in the thread
from tonight,
you can see that yes, for PQ sizes less than 100 elements, multiPQ is
better, and only
starts to be worse at around 100 for strings, and 50 for ints.

  -jake

On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Oct 22, 2009 at 10:35 PM, John Wang <jo...@gmail.com> wrote:
> >        Please be patient with me. I am seeing a difference and was
> wondering
> > if Mike would see the same thing.
>
> Some differences are bound to be seen... with your changes (JVM
> changes, branch optimizations), are you seeing better average
> performance with multiPQ?
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Oct 22, 2009 at 10:35 PM, John Wang <jo...@gmail.com> wrote:
>        Please be patient with me. I am seeing a difference and was wondering
> if Mike would see the same thing.

Some differences are bound to be seen... with your changes (JVM
changes, branch optimizations), are you seeing better average
performance with multiPQ?

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
For some reason I guess this didn't go through and caused all the confusion.

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|<all>|1000000|rand string|10|91.76|108.63|{color:green}18.4%{color}|
|log|<all>|1000000|rand string|25|92.39|106.79|{color:green}15.6%{color}|
|log|<all>|1000000|rand string|50|91.30|104.02|{color:green}13.9%{color}|
|log|<all>|1000000|rand string|500|86.16|63.27|{color:red}-26.6%{color}|
|log|<all>|1000000|rand string|1000|76.92|64.85|{color:red}-15.7%{color}|
|log|<all>|1000000|country|10|92.42|108.78|{color:green}17.7%{color}|
|log|<all>|1000000|country|25|92.60|106.26|{color:green}14.8%{color}|
|log|<all>|1000000|country|50|92.64|103.76|{color:green}12.0%{color}|
|log|<all>|1000000|country|500|83.92|50.30|{color:red}-40.1%{color}|
|log|<all>|1000000|country|1000|74.78|46.59|{color:red}-37.7%{color}|
|log|<all>|1000000|rand int|10|114.03|114.85|{color:green}0.7%{color}|
|log|<all>|1000000|rand int|25|113.77|112.92|{color:red}-0.7%{color}|
|log|<all>|1000000|rand int|50|113.36|109.56|{color:red}-3.4%{color}|
|log|<all>|1000000|rand int|500|103.90|66.29|{color:red}-36.2%{color}|
|log|<all>|1000000|rand int|1000|89.52|70.67|{color:red}-21.1%{color}|

On Thu, Oct 22, 2009 at 7:43 PM, John Wang <jo...@gmail.com> wrote:

> Mike:
>        I did just post with what I saw, feel free to read and comment on
> it.
>
>        I am simply trying to work with Michael on this and trying to
> understand the code.
>
>        As I have expressed previously, I have seen a difference between 1.5
> and 1.6 that is significant. Since Mike has posted some numbers on jdk 1.6,
> I was hoping to eliminate all variables relating to the index and
> environment and see if he sees the same thing.
>
>         I guess I should be more clear in the email.
>
> -John
>
> On Thu, Oct 22, 2009 at 7:39 PM, Mark Miller <ma...@gmail.com>wrote:
>
>> I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm
>> wondering what your seeing. Asking Mike to rerun the tests without
>> giving any further info (you didn't even say that your seeing something
>> different) is unfair to the rest of us ;)
>>
>> Giving 0 info along with your request just makes 0 sense to me and I
>> said as much.
>>
>> John Wang wrote:
>> > Mark:
>> >
>> >        Please be patient with me. I am seeing a difference and was
>> > wondering if Mike would see the same thing. I thought Michael would be
>> > willing to because he expressed interest in understanding what the
>> > performance discrepancies are.
>> >
>> >        Again, it is only a request. It is perfectly fine if Michael
>> > refuses to. But it would be great if Michael speaks for himself.
>> >
>> > Thanks
>> >
>> > -John
>> >
>> > On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <markrmiller@gmail.com
>> > <ma...@gmail.com>> wrote:
>> >
>> >     Why? What might he find? Whats with the cryptic request?
>> >
>> >     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
>> >     gains?
>> >
>> >     I know point 2 certainly doesn't. Cards on the table?
>> >
>> >     John Wang wrote:
>> >     > Hey Michael:
>> >     >
>> >     >        Would you mind rerunning the test you have with jdk1.5?
>> >     >
>> >     >        Also, if you would, change the comparator method to avoid
>     >     > branching for int and string comparators, e.g.
>> >     >
>> >     >
>> >     >       return index.order[i.doc] - index.order[j.doc];
>> >     >
>> >     >
>> >     > Thanks
>> >     >
>> >     >
>> >     > -John
>> >     >
>> >     >
>> >     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
>> >     > <lucene@mikemccandless.com <ma...@mikemccandless.com>
>> >     <mailto:lucene@mikemccandless.com
>> >     <ma...@mikemccandless.com>>> wrote:
>> >     >
>> >     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
>> >     <john.wang@gmail.com <ma...@gmail.com>
>> >     >     <mailto:john.wang@gmail.com <ma...@gmail.com>>>
>> >     wrote:
>> >     >
>> >     >     >      I have been playing with the patch, and I think I
>> >     have some
>> >     >     information
>> >     >     > that you might like.
>> >     >     >      Let me spend sometime and gather some more numbers and
>> >     >     update in jira.
>> >     >
>> >     >     Excellent!
>> >     >
>> >     >     >      say bottom has ords 23, 45, 76, each corresponding to a
>> >     >     string. When
>> >     >     > moving to the next segment, you need to make bottom to
>> >     have ords
>> >     >     that can be
>> >     >     > comparable to other docs in this new segment, so you would
>> >     need
>> >     >     to find the
>> >     >     > new ords for the values in 23,45 and 76, don't you? To
>> >     find it,
>> >     >     assuming the
>> >     >     > values are s1,s2,s3, you would do a bin. search on the new
>> val
>> >     >     array, and
>> >     >     > find index for s1,s2,s3.
>> >     >
>> >     >     It's that inversion (from ord->Comparable in first seg, and
>> >     >     Comparable->ord in second seg) that I'm trying to avoid (w/
>> >     this new
>> >     >     proposal).
>> >     >
>> >     >     > Which is 3 bin searches per convert, I am not sure
>> >     >     > how you can short circuit it. Are you suggesting we call
>> >     >     Comparable on
>> >     >     > compareBottom until some doc beats it?
>> >     >
>> >     >     I'm saying on seg transition you indeed get the Comparable
>> >     for current
>> >     >     bottom, but, don't attempt to invert it.  Instead, as seg 2
>> >     finds a
>> >     >     hit, you get that hit's Comparables and compare to bottom.
>> >      If it
>> >     >     beats bottom, it goes into the queue.  If it does not, you
>> >     use the ord
>> >     >     (in seg 2's ord space) to "learn" a bottom in the ord space
>> >     of seg 2.
>> >     >
>     >     > That would hurt performance a lot though, no?
>> >     >
>> >     >     Yeah I think likely it would, since we're talking about a
>> binary
>> >     >     search on transition VS having to do possibly many
>     >     upgrade-to-Comparable and compare-Comparables to slowly learn the
>> the
>> >     >     equivalent ord in the new segment.  I was proposing it for
>> >     cases where
>> >     >     inversion is very difficult.  But realistically, since you
>> >     must keep
>     >     around the full ord -> Comparable for every segment anyway
>> >     (in order to
>> >     >     merge in the end), inversion shouldn't ever actually be
>> >     "difficult" --
>> >     >     it'd just be a binary search on presumably in-RAM storage.
>> >     >
>> >     >     Mike
>> >     >
>> >     >
>> >
>> ---------------------------------------------------------------------
>> >     >     To unsubscribe, e-mail:
>> >     java-dev-unsubscribe@lucene.apache.org
>> >     <ma...@lucene.apache.org>
>> >     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>> >     <ma...@lucene.apache.org>>
>> >     >     For additional commands, e-mail:
>> >     java-dev-help@lucene.apache.org
>> >     <ma...@lucene.apache.org>
>> >     >     <mailto:java-dev-help@lucene.apache.org
>> >     <ma...@lucene.apache.org>>
>> >     >
>> >     >
>> >
>> >
>> >     --
>> >     - Mark
>> >
>> >     http://www.lucidimagination.com
>> >
>> >
>> >
>> >
>> >
>> ---------------------------------------------------------------------
>> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >     <ma...@lucene.apache.org>
>> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >     <ma...@lucene.apache.org>
>> >
>> >
>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
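
The branch-avoidance change John asks Mike to try above (replacing a two-way comparison with a single subtraction) can be sketched as below. The class is hypothetical, loosely mirroring the per-doc ord array a comparator like StringOrdValComparator holds; the subtraction form is only safe because ords are small non-negative ints, so the difference cannot overflow.

```java
// Hypothetical sketch of the branching vs. branch-free ord comparison.
class OrdComparatorSketch {
    final int[] order; // per-doc ord table (assumed small, non-negative values)

    OrdComparatorSketch(int[] order) {
        this.order = order;
    }

    // Branching version: two compares, data-dependent branches in the hot loop.
    int compareBranchy(int docI, int docJ) {
        int a = order[docI], b = order[docJ];
        if (a > b) return 1;
        if (a < b) return -1;
        return 0;
    }

    // Branch-free version from the thread: one subtraction, same sign contract.
    int compareBranchFree(int docI, int docJ) {
        return order[docI] - order[docJ];
    }
}
```

For arbitrary ints the subtraction trick would be wrong (Integer.MIN_VALUE minus a positive value overflows), which is why it only applies to comparators whose keys are bounded ords.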

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
John Wang wrote:
> Mark:
>
>        There is no reason for me to withhold information. I just want
> to understand and share my findings.
Right, I didn't mean to accuse you of that ;) Not that you were doing it
on purpose. I was just trying to string out more :) Which I've managed
to do - in my usual awkward ending up email thread way. Success :)
>
>         My bad for not being clear.
>
>         Mike's test is actually very well written; I just followed the
> instructions in the Jira and got it running. I think the tests have
> good coverage and show the symptoms the algorithms would suggest.
Yeah, I'm not complaining about his tests - I'm just trying to find a
version of Lucene that it will patch into cleanly.
>
> -John
>
> On Thu, Oct 22, 2009 at 7:42 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Thanks - that's all I'm asking for. A simple explanation of why
>     you'd ask
>     for a retest with those two things changed. It just seems it's holding your
>     cards a little too close to say - please do this with 0 explanation.
>
>     As to point 2, thats fine - I'm sure it helps - I was just saying I
>     didn't buy it helps by 20-40%. Not arguing against doing it, but since
>     the request had no info, the only thing I could assume was that
>     that was
>     supposed to change things.
>
>     I was about to run some of these tests myself (if I can find which darn
>     revision to patch), and it's a bit frustrating to see you guys knew
>     something but were not telling ...
>
>     Jake Mannix wrote:
>     > Mark,
>     >
>     >   We're not seeing exactly the numbers that Mike is seeing in
>     his tests,
>     > running with jdk 1.5 on intel macs, so we're trying to eliminate
>     > factors of difference.
>     >
>     >   Point 2 does indeed make a difference, we've seen it, and it's
>     only
>     > fair: the
>     > single pq comparator does this branch optimization but the current
>     > patch multi-pq
>     > does not, so let's level the playing field.
>     >
>     >   John's on the road with limited net connectivity, but we'll have
>     > some numbers to
>     > compare more over the weekend for sure.
>     >
>     >   -jake
>     >
>     > On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller
>     <markrmiller@gmail.com <ma...@gmail.com>
>     > <mailto:markrmiller@gmail.com <ma...@gmail.com>>>
>     wrote:
>     >
>     >     Why? What might he find? Whats with the cryptic request?
>     >
>     >     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
>     >     gains?
>     >
>     >     I know point 2 certainly doesn't. Cards on the table?
>     >
>     >     John Wang wrote:
>     >     > Hey Michael:
>     >     >
>     >     >        Would you mind rerunning the test you have with jdk1.5?
>     >     >
>     >     >        Also, if you would, change the comparator method to
>     avoid
>     >     > branching for int and string comparators, e.g.
>     >     >
>     >     >
>     >     >       return index.order[i.doc] - index.order[j.doc];
>     >     >
>     >     >
>     >     > Thanks
>     >     >
>     >     >
>     >     > -John
>     >     >
>     >     >
>     >     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
>     >     > <lucene@mikemccandless.com
>     <ma...@mikemccandless.com>
>     <mailto:lucene@mikemccandless.com <ma...@mikemccandless.com>>
>     >     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>
>     >     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>>>> wrote:
>     >     >
>     >     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
>     >     <john.wang@gmail.com <ma...@gmail.com>
>     <mailto:john.wang@gmail.com <ma...@gmail.com>>
>     >     >     <mailto:john.wang@gmail.com
>     <ma...@gmail.com> <mailto:john.wang@gmail.com
>     <ma...@gmail.com>>>>
>     >     wrote:
>     >     >
>     >     >     >      I have been playing with the patch, and I think I
>     >     have some
>     >     >     information
>     >     >     > that you might like.
>     >     >     >      Let me spend sometime and gather some more
>     numbers and
>     >     >     update in jira.
>     >     >
>     >     >     Excellent!
>     >     >
>     >     >     >      say bottom has ords 23, 45, 76, each
>     corresponding to a
>     >     >     string. When
>     >     >     > moving to the next segment, you need to make bottom to
>     >     have ords
>     >     >     that can be
>     >     >     > comparable to other docs in this new segment, so you
>     would
>     >     need
>     >     >     to find the
>     >     >     > new ords for the values in 23,45 and 76, don't you? To
>     >     find it,
>     >     >     assuming the
>     >     >     > values are s1,s2,s3, you would do a bin. search on
>     the new val
>     >     >     array, and
>     >     >     > find index for s1,s2,s3.
>     >     >
>     >     >     It's that inversion (from ord->Comparable in first
>     seg, and
>     >     >     Comparable->ord in second seg) that I'm trying to
>     avoid (w/
>     >     this new
>     >     >     proposal).
>     >     >
>     >     >     > Which is 3 bin searches per convert, I am not sure
>     >     >     > how you can short circuit it. Are you suggesting we call
>     >     >     Comparable on
>     >     >     > compareBottom until some doc beats it?
>     >     >
>     >     >     I'm saying on seg transition you indeed get the Comparable
>     >     for current
>     >     >     bottom, but, don't attempt to invert it.  Instead, as
>     seg 2
>     >     finds a
>     >     >     hit, you get that hit's Comparables and compare to bottom.
>     >      If it
>     >     >     beats bottom, it goes into the queue.  If it does not, you
>     >     use the ord
>     >     >     (in seg 2's ord space) to "learn" a bottom in the ord
>     space
>     >     of seg 2.
>     >     >
>     >     >     > That would hurt performance I lot though, no?
>     >     >
>     >     >     Yeah I think likely it would, since we're talking
>     about a binary
>     >     >     search on transition VS having to do possibly many
>     >     >     upgrade-to-Comparable and compare-Comparabls to slowly
>     learn the
>     >     >     equivalent ord in the new segment.  I was proposing it for
>     >     cases where
>     >     >     inversion is very difficult.  But realistically, since you
>     >     must keep
>     >     >     around the ful ord -> Comparable for every segment anyway
>     >     (in order to
>     >     >     merge in the end), inversion shouldn't ever actually be
>     >     "difficult" --
>     >     >     it'd just be a binary search on presumably in-RAM storage.
>     >     >
>     >     >     Mike
>     >     >
>     >     >
>     >    
>     ---------------------------------------------------------------------
>     >     >     To unsubscribe, e-mail:
>     >     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>>
>     >     >     For additional commands, e-mail:
>     >     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>>
>     >     >
>     >     >
>     >
>     >
>     >     --
>     >     - Mark
>     >
>     >     http://www.lucidimagination.com
>     >
>     >
>     >
>     >
>     >    
>     ---------------------------------------------------------------------
>     >     To unsubscribe, e-mail:
>     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     For additional commands, e-mail:
>     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >
>     >
>
>
>     --
>     - Mark
>
>     http://www.lucidimagination.com
>
>
>
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>
>


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
bq. I just followed instructions in the jira and got it running.

Heh - I didn't read down far enough - first comment says 2.9 branch.
Thanks ; ) I've been flipping through revisions for a while now,
wondering how the heck the revs in the patch match up with trunk.


John Wang wrote:
> Mark:
>
>        There is no reason for me to withhold information. I just want
> to understand and share my findings.
>
>         My bad for not being clear.
>
>         Mike's test is actually very well written, I just followed
> instructions in the jira and got it running. I think the tests has
> good coverage and shows the symptoms the algorithms would suggest.
>
> -John
>
> On Thu, Oct 22, 2009 at 7:42 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Thanks - thats all I'm asking for. A simple explanation of why
>     you'd ask
>     for a retest with those two things changed. Just seems its hold your
>     cards a little to close to say - please do this with 0 explanation.
>
>     As to point 2, thats fine - I'm sure it helps - I was just saying I
>     didn't buy it helps by 20-40%. Not arguing against doing it, but since
>     the request had no info, the only thing I could assume was that
>     that was
>     supposed to change things.
>
>     I was about to run some of these tests myself (if i can find what darn
>     revision to patch), and its a bit frustrating to see you guys knew
>     something but were not telling ...
>
>     Jake Mannix wrote:
>     > Mark,
>     >
>     >   We're not seeing exactly the numbers that Mike is seeing in
>     his tests,
>     > running with jdk 1.5 on intel macs, so we're trying to eliminate
>     > factors of difference.
>     >
>     >   Point 2 does indeed make a difference, we've seen it, and it's
>     only
>     > fair: the
>     > single pq comparator does this branch optimization but the current
>     > patch multi-pq
>     > does not, so let's level the playing field.
>     >
>     >   John's on the road with limited net connectivity, but we'll have
>     > some numbers to
>     > compare more over the weekend for sure.
>     >
>     >   -jake
>     >
>     > On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller
>     <markrmiller@gmail.com <ma...@gmail.com>
>     > <mailto:markrmiller@gmail.com <ma...@gmail.com>>>
>     wrote:
>     >
>     >     Why? What might he find? Whats with the cryptic request?
>     >
>     >     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
>     >     gains?
>     >
>     >     I know point 2 certainly doesn't. Cards on the table?
>     >
>     >     John Wang wrote:
>     >     > Hey Michael:
>     >     >
>     >     >        Would you mind rerunning the test you have with jdk1.5?
>     >     >
>     >     >        Also, if you would, change the comparator method to
>     avoid
>     >     > brachning for int and string comparators, e.g.
>     >     >
>     >     >
>     >     >       return index.order[i.doc] - index.order[j.doc];
>     >     >
>     >     >
>     >     > Thanks
>     >     >
>     >     >
>     >     > -John
>     >     >
>     >     >
>     >     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
>     >     > <lucene@mikemccandless.com
>     <ma...@mikemccandless.com>
>     <mailto:lucene@mikemccandless.com <ma...@mikemccandless.com>>
>     >     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>
>     >     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>>>> wrote:
>     >     >
>     >     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
>     >     <john.wang@gmail.com <ma...@gmail.com>
>     <mailto:john.wang@gmail.com <ma...@gmail.com>>
>     >     >     <mailto:john.wang@gmail.com
>     <ma...@gmail.com> <mailto:john.wang@gmail.com
>     <ma...@gmail.com>>>>
>     >     wrote:
>     >     >
>     >     >     >      I have been playing with the patch, and I think I
>     >     have some
>     >     >     information
>     >     >     > that you might like.
>     >     >     >      Let me spend sometime and gather some more
>     numbers and
>     >     >     update in jira.
>     >     >
>     >     >     Excellent!
>     >     >
>     >     >     >      say bottom has ords 23, 45, 76, each
>     corresponding to a
>     >     >     string. When
>     >     >     > moving to the next segment, you need to make bottom to
>     >     have ords
>     >     >     that can be
>     >     >     > comparable to other docs in this new segment, so you
>     would
>     >     need
>     >     >     to find the
>     >     >     > new ords for the values in 23,45 and 76, don't you? To
>     >     find it,
>     >     >     assuming the
>     >     >     > values are s1,s2,s3, you would do a bin. search on
>     the new val
>     >     >     array, and
>     >     >     > find index for s1,s2,s3.
>     >     >
>     >     >     It's that inversion (from ord->Comparable in first
>     seg, and
>     >     >     Comparable->ord in second seg) that I'm trying to
>     avoid (w/
>     >     this new
>     >     >     proposal).
>     >     >
>     >     >     > Which is 3 bin searches per convert, I am not sure
>     >     >     > how you can short circuit it. Are you suggesting we call
>     >     >     Comparable on
>     >     >     > compareBottom until some doc beats it?
>     >     >
>     >     >     I'm saying on seg transition you indeed get the Comparable
>     >     for current
>     >     >     bottom, but, don't attempt to invert it.  Instead, as
>     seg 2
>     >     finds a
>     >     >     hit, you get that hit's Comparables and compare to bottom.
>     >      If it
>     >     >     beats bottom, it goes into the queue.  If it does not, you
>     >     use the ord
>     >     >     (in seg 2's ord space) to "learn" a bottom in the ord
>     space
>     >     of seg 2.
>     >     >
>     >     >     > That would hurt performance I lot though, no?
>     >     >
>     >     >     Yeah I think likely it would, since we're talking
>     about a binary
>     >     >     search on transition VS having to do possibly many
>     >     >     upgrade-to-Comparable and compare-Comparabls to slowly
>     learn the
>     >     >     equivalent ord in the new segment.  I was proposing it for
>     >     cases where
>     >     >     inversion is very difficult.  But realistically, since you
>     >     must keep
>     >     >     around the ful ord -> Comparable for every segment anyway
>     >     (in order to
>     >     >     merge in the end), inversion shouldn't ever actually be
>     >     "difficult" --
>     >     >     it'd just be a binary search on presumably in-RAM storage.
>     >     >
>     >     >     Mike
>     >     >
>     >     >
>     >    
>     ---------------------------------------------------------------------
>     >     >     To unsubscribe, e-mail:
>     >     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>>
>     >     >     For additional commands, e-mail:
>     >     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>>
>     >     >
>     >     >
>     >
>     >
>     >     --
>     >     - Mark
>     >
>     >     http://www.lucidimagination.com
>     >
>     >
>     >
>     >
>     >    
>     ---------------------------------------------------------------------
>     >     To unsubscribe, e-mail:
>     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     For additional commands, e-mail:
>     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >
>     >
>
>
>     --
>     - Mark
>
>     http://www.lucidimagination.com
>
>
>
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>
>


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Mark:
       There is no reason for me to withhold information. I just want to
understand and share my findings.

        My bad for not being clear.

        Mike's test is actually very well written, I just followed
instructions in the jira and got it running. I think the tests have good
coverage and show the symptoms the algorithms would suggest.

-John

On Thu, Oct 22, 2009 at 7:42 PM, Mark Miller <ma...@gmail.com> wrote:

> Thanks - thats all I'm asking for. A simple explanation of why you'd ask
> for a retest with those two things changed. Just seems its hold your
> cards a little to close to say - please do this with 0 explanation.
>
> As to point 2, thats fine - I'm sure it helps - I was just saying I
> didn't buy it helps by 20-40%. Not arguing against doing it, but since
> the request had no info, the only thing I could assume was that that was
> supposed to change things.
>
> I was about to run some of these tests myself (if i can find what darn
> revision to patch), and its a bit frustrating to see you guys knew
> something but were not telling ...
>
> Jake Mannix wrote:
> > Mark,
> >
> >   We're not seeing exactly the numbers that Mike is seeing in his tests,
> > running with jdk 1.5 on intel macs, so we're trying to eliminate
> > factors of difference.
> >
> >   Point 2 does indeed make a difference, we've seen it, and it's only
> > fair: the
> > single pq comparator does this branch optimization but the current
> > patch multi-pq
> > does not, so let's level the playing field.
> >
> >   John's on the road with limited net connectivity, but we'll have
> > some numbers to
> > compare more over the weekend for sure.
> >
> >   -jake
> >
> > On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <markrmiller@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     Why? What might he find? Whats with the cryptic request?
> >
> >     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
> >     gains?
> >
> >     I know point 2 certainly doesn't. Cards on the table?
> >
> >     John Wang wrote:
> >     > Hey Michael:
> >     >
> >     >        Would you mind rerunning the test you have with jdk1.5?
> >     >
> >     >        Also, if you would, change the comparator method to avoid
> >     > brachning for int and string comparators, e.g.
> >     >
> >     >
> >     >       return index.order[i.doc] - index.order[j.doc];
> >     >
> >     >
> >     > Thanks
> >     >
> >     >
> >     > -John
> >     >
> >     >
> >     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
> >     > <lucene@mikemccandless.com <ma...@mikemccandless.com>
> >     <mailto:lucene@mikemccandless.com
> >     <ma...@mikemccandless.com>>> wrote:
> >     >
> >     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
> >     <john.wang@gmail.com <ma...@gmail.com>
> >     >     <mailto:john.wang@gmail.com <ma...@gmail.com>>>
> >     wrote:
> >     >
> >     >     >      I have been playing with the patch, and I think I
> >     have some
> >     >     information
> >     >     > that you might like.
> >     >     >      Let me spend sometime and gather some more numbers and
> >     >     update in jira.
> >     >
> >     >     Excellent!
> >     >
> >     >     >      say bottom has ords 23, 45, 76, each corresponding to a
> >     >     string. When
> >     >     > moving to the next segment, you need to make bottom to
> >     have ords
> >     >     that can be
> >     >     > comparable to other docs in this new segment, so you would
> >     need
> >     >     to find the
> >     >     > new ords for the values in 23,45 and 76, don't you? To
> >     find it,
> >     >     assuming the
> >     >     > values are s1,s2,s3, you would do a bin. search on the new
> val
> >     >     array, and
> >     >     > find index for s1,s2,s3.
> >     >
> >     >     It's that inversion (from ord->Comparable in first seg, and
> >     >     Comparable->ord in second seg) that I'm trying to avoid (w/
> >     this new
> >     >     proposal).
> >     >
> >     >     > Which is 3 bin searches per convert, I am not sure
> >     >     > how you can short circuit it. Are you suggesting we call
> >     >     Comparable on
> >     >     > compareBottom until some doc beats it?
> >     >
> >     >     I'm saying on seg transition you indeed get the Comparable
> >     for current
> >     >     bottom, but, don't attempt to invert it.  Instead, as seg 2
> >     finds a
> >     >     hit, you get that hit's Comparables and compare to bottom.
> >      If it
> >     >     beats bottom, it goes into the queue.  If it does not, you
> >     use the ord
> >     >     (in seg 2's ord space) to "learn" a bottom in the ord space
> >     of seg 2.
> >     >
> >     >     > That would hurt performance I lot though, no?
> >     >
> >     >     Yeah I think likely it would, since we're talking about a
> binary
> >     >     search on transition VS having to do possibly many
> >     >     upgrade-to-Comparable and compare-Comparabls to slowly learn
> the
> >     >     equivalent ord in the new segment.  I was proposing it for
> >     cases where
> >     >     inversion is very difficult.  But realistically, since you
> >     must keep
> >     >     around the ful ord -> Comparable for every segment anyway
> >     (in order to
> >     >     merge in the end), inversion shouldn't ever actually be
> >     "difficult" --
> >     >     it'd just be a binary search on presumably in-RAM storage.
> >     >
> >     >     Mike
> >     >
> >     >
> >     ---------------------------------------------------------------------
> >     >     To unsubscribe, e-mail:
> >     java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     >     <mailto:java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>>
> >     >     For additional commands, e-mail:
> >     java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     >     <mailto:java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>>
> >     >
> >     >
> >
> >
> >     --
> >     - Mark
> >
> >     http://www.lucidimagination.com
> >
> >
> >
> >
> >     ---------------------------------------------------------------------
> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
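
[Editor's note: the branch-free comparator John asks Mike to test above
(return index.order[i.doc] - index.order[j.doc]) can be sketched next to
its branching counterpart as follows. The ScoreDoc/StringOrdIndex classes
here are simplified illustrations, not the actual Lucene 2.9 API.]

```java
// Hypothetical simplified sketch of the two comparator styles under
// discussion; these are illustrative stand-ins, not Lucene 2.9 classes.
public class OrdCompareSketch {

    static class ScoreDoc {
        final int doc;
        ScoreDoc(int doc) { this.doc = doc; }
    }

    static class StringOrdIndex {
        final int[] order; // per-docID ord into the sorted value array

        StringOrdIndex(int[] order) { this.order = order; }

        // Branching form: two conditional jumps per comparison.
        int compareBranching(ScoreDoc i, ScoreDoc j) {
            int a = order[i.doc], b = order[j.doc];
            if (a < b) return -1;
            if (a > b) return 1;
            return 0;
        }

        // Branch-free form from the thread: valid because ords are small
        // non-negative ints, so the subtraction cannot overflow.
        int compareBranchless(ScoreDoc i, ScoreDoc j) {
            return order[i.doc] - order[j.doc];
        }
    }

    public static void main(String[] args) {
        StringOrdIndex idx = new StringOrdIndex(new int[] {5, 2, 9});
        ScoreDoc d0 = new ScoreDoc(0), d1 = new ScoreDoc(1);
        // Both forms agree on sign: doc 1 (ord 2) sorts before doc 0 (ord 5).
        System.out.println(idx.compareBranching(d0, d1) > 0
            && idx.compareBranchless(d0, d1) > 0); // prints "true"
    }
}
```

The point of the request is that in a hot inner loop the JIT may compile
the subtraction without any conditional branches, which is the same trick
the single-PQ comparator already uses.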

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Thanks - that's all I'm asking for. A simple explanation of why you'd ask
for a retest with those two things changed. It just seems it's holding your
cards a little too close to say - please do this with zero explanation.

As to point 2, that's fine - I'm sure it helps - I was just saying I
didn't buy that it helps by 20-40%. Not arguing against doing it, but since
the request had no info, the only thing I could assume was that that was
supposed to change things.

I was about to run some of these tests myself (if I can find what darn
revision to patch), and it's a bit frustrating to see you guys knew
something but were not telling ...

Jake Mannix wrote:
> Mark,
>  
>   We're not seeing exactly the numbers that Mike is seeing in his tests,
> running with jdk 1.5 on intel macs, so we're trying to eliminate
> factors of difference.
>
>   Point 2 does indeed make a difference, we've seen it, and it's only
> fair: the
> single pq comparator does this branch optimization but the current
> patch multi-pq
> does not, so let's level the playing field.
>
>   John's on the road with limited net connectivity, but we'll have
> some numbers to
> compare more over the weekend for sure.
>
>   -jake
>
> On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Why? What might he find? Whats with the cryptic request?
>
>     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
>     gains?
>
>     I know point 2 certainly doesn't. Cards on the table?
>
>     John Wang wrote:
>     > Hey Michael:
>     >
>     >        Would you mind rerunning the test you have with jdk1.5?
>     >
>     >        Also, if you would, change the comparator method to avoid
>     > brachning for int and string comparators, e.g.
>     >
>     >
>     >       return index.order[i.doc] - index.order[j.doc];
>     >
>     >
>     > Thanks
>     >
>     >
>     > -John
>     >
>     >
>     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
>     > <lucene@mikemccandless.com <ma...@mikemccandless.com>
>     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>>> wrote:
>     >
>     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
>     <john.wang@gmail.com <ma...@gmail.com>
>     >     <mailto:john.wang@gmail.com <ma...@gmail.com>>>
>     wrote:
>     >
>     >     >      I have been playing with the patch, and I think I
>     have some
>     >     information
>     >     > that you might like.
>     >     >      Let me spend sometime and gather some more numbers and
>     >     update in jira.
>     >
>     >     Excellent!
>     >
>     >     >      say bottom has ords 23, 45, 76, each corresponding to a
>     >     string. When
>     >     > moving to the next segment, you need to make bottom to
>     have ords
>     >     that can be
>     >     > comparable to other docs in this new segment, so you would
>     need
>     >     to find the
>     >     > new ords for the values in 23,45 and 76, don't you? To
>     find it,
>     >     assuming the
>     >     > values are s1,s2,s3, you would do a bin. search on the new val
>     >     array, and
>     >     > find index for s1,s2,s3.
>     >
>     >     It's that inversion (from ord->Comparable in first seg, and
>     >     Comparable->ord in second seg) that I'm trying to avoid (w/
>     this new
>     >     proposal).
>     >
>     >     > Which is 3 bin searches per convert, I am not sure
>     >     > how you can short circuit it. Are you suggesting we call
>     >     Comparable on
>     >     > compareBottom until some doc beats it?
>     >
>     >     I'm saying on seg transition you indeed get the Comparable
>     for current
>     >     bottom, but, don't attempt to invert it.  Instead, as seg 2
>     finds a
>     >     hit, you get that hit's Comparables and compare to bottom.
>      If it
>     >     beats bottom, it goes into the queue.  If it does not, you
>     use the ord
>     >     (in seg 2's ord space) to "learn" a bottom in the ord space
>     of seg 2.
>     >
>     >     > That would hurt performance I lot though, no?
>     >
>     >     Yeah I think likely it would, since we're talking about a binary
>     >     search on transition VS having to do possibly many
>     >     upgrade-to-Comparable and compare-Comparabls to slowly learn the
>     >     equivalent ord in the new segment.  I was proposing it for
>     cases where
>     >     inversion is very difficult.  But realistically, since you
>     must keep
>     >     around the ful ord -> Comparable for every segment anyway
>     (in order to
>     >     merge in the end), inversion shouldn't ever actually be
>     "difficult" --
>     >     it'd just be a binary search on presumably in-RAM storage.
>     >
>     >     Mike
>     >
>     >    
>     ---------------------------------------------------------------------
>     >     To unsubscribe, e-mail:
>     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     For additional commands, e-mail:
>     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >
>     >
>
>
>     --
>     - Mark
>
>     http://www.lucidimagination.com
>
>
>
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>
>


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
Mark,

  We're not seeing exactly the numbers that Mike is seeing in his tests,
running with JDK 1.5 on Intel Macs, so we're trying to eliminate factors of
difference.

  Point 2 does indeed make a difference, we've seen it, and it's only fair:
the single-PQ comparator does this branch optimization but the current
multi-PQ patch does not, so let's level the playing field.

  John's on the road with limited net connectivity, but we'll have some
numbers to
compare more over the weekend for sure.

  -jake

On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <ma...@gmail.com> wrote:

> Why? What might he find? Whats with the cryptic request?
>
> Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains?
>
> I know point 2 certainly doesn't. Cards on the table?
>
> John Wang wrote:
> > Hey Michael:
> >
> >        Would you mind rerunning the test you have with jdk1.5?
> >
> >        Also, if you would, change the comparator method to avoid
> > brachning for int and string comparators, e.g.
> >
> >
> >       return index.order[i.doc] - index.order[j.doc];
> >
> >
> > Thanks
> >
> >
> > -John
> >
> >
> > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
> > <lucene@mikemccandless.com <ma...@mikemccandless.com>> wrote:
> >
> >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang <john.wang@gmail.com
> >     <ma...@gmail.com>> wrote:
> >
> >     >      I have been playing with the patch, and I think I have some
> >     information
> >     > that you might like.
> >     >      Let me spend sometime and gather some more numbers and
> >     update in jira.
> >
> >     Excellent!
> >
> >     >      say bottom has ords 23, 45, 76, each corresponding to a
> >     string. When
> >     > moving to the next segment, you need to make bottom to have ords
> >     that can be
> >     > comparable to other docs in this new segment, so you would need
> >     to find the
> >     > new ords for the values in 23,45 and 76, don't you? To find it,
> >     assuming the
> >     > values are s1,s2,s3, you would do a bin. search on the new val
> >     array, and
> >     > find index for s1,s2,s3.
> >
> >     It's that inversion (from ord->Comparable in first seg, and
> >     Comparable->ord in second seg) that I'm trying to avoid (w/ this new
> >     proposal).
> >
> >     > Which is 3 bin searches per convert, I am not sure
> >     > how you can short circuit it. Are you suggesting we call
> >     Comparable on
> >     > compareBottom until some doc beats it?
> >
> >     I'm saying on seg transition you indeed get the Comparable for
> current
> >     bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
> >     hit, you get that hit's Comparables and compare to bottom.  If it
> >     beats bottom, it goes into the queue.  If it does not, you use the
> ord
> >     (in seg 2's ord space) to "learn" a bottom in the ord space of seg 2.
> >
> >     > That would hurt performance I lot though, no?
> >
> >     Yeah I think likely it would, since we're talking about a binary
> >     search on transition VS having to do possibly many
> >     upgrade-to-Comparable and compare-Comparabls to slowly learn the
> >     equivalent ord in the new segment.  I was proposing it for cases
> where
> >     inversion is very difficult.  But realistically, since you must keep
> >     around the ful ord -> Comparable for every segment anyway (in order
> to
> >     merge in the end), inversion shouldn't ever actually be "difficult"
> --
> >     it'd just be a binary search on presumably in-RAM storage.
> >
> >     Mike
> >
> >     ---------------------------------------------------------------------
> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
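
[Editor's note: Mike's description above of re-basing the bottom value
into a new segment's ord space amounts to one binary search over that
segment's sorted value array per transition. A minimal sketch, with
hypothetical names rather than the actual StringOrdValComparator
internals:]

```java
import java.util.Arrays;

// Hypothetical sketch of the segment-transition step: an ord is only
// meaningful within its own segment, so to carry the bottom slot into
// the next segment you resolve its String value against that segment's
// sorted value array. Names are illustrative, not Lucene 2.9 code.
public class SegmentOrdRebase {

    static class SegmentOrds {
        final String[] vals; // sorted, deduped term values for one segment

        SegmentOrds(String[] sortedVals) { this.vals = sortedVals; }

        String valueOf(int ord) { return vals[ord]; }

        // Ord of value in this segment, or (-insertionPoint - 1) if the
        // exact value does not occur here (Arrays.binarySearch convention).
        int ordFor(String value) {
            return Arrays.binarySearch(vals, value);
        }
    }

    // Re-base the bottom value from the previous segment into the new
    // segment's ord space: one binary search per segment transition.
    static int rebaseBottom(SegmentOrds prev, int bottomOrd, SegmentOrds next) {
        return next.ordFor(prev.valueOf(bottomOrd));
    }

    public static void main(String[] args) {
        SegmentOrds seg1 = new SegmentOrds(new String[] {"ant", "bee", "cat"});
        SegmentOrds seg2 = new SegmentOrds(new String[] {"ant", "cat", "dog"});
        // "cat" is ord 2 in seg1 but ord 1 in seg2.
        System.out.println(rebaseBottom(seg1, 2, seg2)); // prints 1
    }
}
```

This is the inversion cost the multi-PQ approach avoids by keeping one
queue per segment and comparing raw ords until the final merge.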

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
>>   I guess I should be more clear in the email.

No - if you mentioned the other info before and I missed it, just say:
Mark, you don't know what you're talking about and you missed the info.
That's what I'd do.

You just caught me at a time when I'm trying to get these tests going
myself, and a little frustrated at the lack of info. I'd consider trying
Java 6 vs Java 1.5 or something on Linux, but with no reason why I
should, it's like .. come on - throw me a bone.

John Wang wrote:
> Mike:
>
>        I did just post with what I saw, feel free to read and comment
> on it.
>
>        I am simply trying to work with Michael on this and trying to
> understand the code.
>
>        As I have expressed previously, I have seen a difference
> between 1.5 and 1.6 that is significant. Since Mike has posted some
> numbers on jdk 1.6, I was hoping to eliminate all variables relating
> to the index and environment and see if he sees the same thing.
>
>         I guess I should be more clear in the email.
>
> -John
>
> On Thu, Oct 22, 2009 at 7:39 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     I am patient :) And I'm not speaking for Mike, I'm speaking for
>     me. I'm
>     wondering what your seeing. Asking Mike to rerun the tests without
>     giving any further info (you didn't even say that your seeing
>     something
>     different) is unfair to the rest of us ;)
>
>     Giving 0 info along with your request just makes 0 sense to me and I
>     said as much.
>
>     John Wang wrote:
>     > Mark:
>     >
>     >        Please be patient with me. I am seeing a difference and was
>     > wondering if Mike would see the same thing. I thought Michael
>     would be
>     > willing to because he expressed interest in understanding what the
>     > performance discrepancies are.
>     >
>     >        Again, it is only a request. It is perfectly fine if Michael
>     > refuses to. But it would be great if Michael speaks for himself.
>     >
>     > Thanks
>     >
>     > -John
>     >
>     > On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller
>     <markrmiller@gmail.com <ma...@gmail.com>
>     > <mailto:markrmiller@gmail.com <ma...@gmail.com>>>
>     wrote:
>     >
>     >     Why? What might he find? Whats with the cryptic request?
>     >
>     >     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
>     >     gains?
>     >
>     >     I know point 2 certainly doesn't. Cards on the table?
>     >
>     >     John Wang wrote:
>     >     > Hey Michael:
>     >     >
>     >     >        Would you mind rerunning the test you have with jdk1.5?
>     >     >
>     >     >        Also, if you would, change the comparator method to
>     avoid
>     >     > brachning for int and string comparators, e.g.
>     >     >
>     >     >
>     >     >       return index.order[i.doc] - index.order[j.doc];
>     >     >
>     >     >
>     >     > Thanks
>     >     >
>     >     >
>     >     > -John
>     >     >
>     >     >
>     >     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
>     >     > <lucene@mikemccandless.com
>     <ma...@mikemccandless.com>
>     <mailto:lucene@mikemccandless.com <ma...@mikemccandless.com>>
>     >     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>
>     >     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>>>> wrote:
>     >     >
>     >     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
>     >     <john.wang@gmail.com <ma...@gmail.com>
>     <mailto:john.wang@gmail.com <ma...@gmail.com>>
>     >     >     <mailto:john.wang@gmail.com
>     <ma...@gmail.com> <mailto:john.wang@gmail.com
>     <ma...@gmail.com>>>>
>     >     wrote:
>     >     >
>     >     >     >      I have been playing with the patch, and I think I
>     >     have some
>     >     >     information
>     >     >     > that you might like.
>     >     >     >      Let me spend sometime and gather some more
>     numbers and
>     >     >     update in jira.
>     >     >
>     >     >     Excellent!
>     >     >
>     >     >     >      say bottom has ords 23, 45, 76, each
>     corresponding to a
>     >     >     string. When
>     >     >     > moving to the next segment, you need to make bottom to
>     >     have ords
>     >     >     that can be
>     >     >     > comparable to other docs in this new segment, so you
>     would
>     >     need
>     >     >     to find the
>     >     >     > new ords for the values in 23,45 and 76, don't you? To
>     >     find it,
>     >     >     assuming the
>     >     >     > values are s1,s2,s3, you would do a bin. search on
>     the new val
>     >     >     array, and
>     >     >     > find index for s1,s2,s3.
>     >     >
>     >     >     It's that inversion (from ord->Comparable in first
>     seg, and
>     >     >     Comparable->ord in second seg) that I'm trying to
>     avoid (w/
>     >     this new
>     >     >     proposal).
>     >     >
>     >     >     > Which is 3 bin searches per convert, I am not sure
>     >     >     > how you can short circuit it. Are you suggesting we call
>     >     >     Comparable on
>     >     >     > compareBottom until some doc beats it?
>     >     >
>     >     >     I'm saying on seg transition you indeed get the Comparable
>     >     for current
>     >     >     bottom, but, don't attempt to invert it.  Instead, as
>     seg 2
>     >     finds a
>     >     >     hit, you get that hit's Comparables and compare to bottom.
>     >      If it
>     >     >     beats bottom, it goes into the queue.  If it does not, you
>     >     use the ord
>     >     >     (in seg 2's ord space) to "learn" a bottom in the ord
>     space
>     >     of seg 2.
>     >     >
>     >     >     > That would hurt performance I lot though, no?
>     >     >
>     >     >     Yeah I think likely it would, since we're talking
>     about a binary
>     >     >     search on transition VS having to do possibly many
>     >     >     upgrade-to-Comparable and compare-Comparabls to slowly
>     learn the
>     >     >     equivalent ord in the new segment.  I was proposing it for
>     >     cases where
>     >     >     inversion is very difficult.  But realistically, since you
>     >     must keep
>     >     >     around the ful ord -> Comparable for every segment anyway
>     >     (in order to
>     >     >     merge in the end), inversion shouldn't ever actually be
>     >     "difficult" --
>     >     >     it'd just be a binary search on presumably in-RAM storage.
>     >     >
>     >     >     Mike
>     >     >
>     >     >
>     >    
>     ---------------------------------------------------------------------
>     >     >     To unsubscribe, e-mail:
>     >     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>>
>     >     >     For additional commands, e-mail:
>     >     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>>
>     >     >
>     >     >
>     >
>     >
>     >     --
>     >     - Mark
>     >
>     >     http://www.lucidimagination.com
>     >
>     >
>     >
>     >
>     >    
>     ---------------------------------------------------------------------
>     >     To unsubscribe, e-mail:
>     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     For additional commands, e-mail:
>     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >
>     >
>
>
>     --
>     - Mark
>
>     http://www.lucidimagination.com
>
>
>
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>
>


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Mike:
       I did just post what I saw; feel free to read and comment on it.

       I am simply trying to work with Michael on this and trying to
understand the code.

       As I have expressed previously, I have seen a difference between 1.5
and 1.6 that is significant. Since Mike has posted some numbers on jdk 1.6,
I was hoping to eliminate all variables relating to the index and
environment and see if he sees the same thing.

        I guess I should be more clear in the email.

-John

On Thu, Oct 22, 2009 at 7:39 PM, Mark Miller <ma...@gmail.com> wrote:

> I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm
> wondering what your seeing. Asking Mike to rerun the tests without
> giving any further info (you didn't even say that your seeing something
> different) is unfair to the rest of us ;)
>
> Giving 0 info along with your request just makes 0 sense to me and I
> said as much.
>
> John Wang wrote:
> > Mark:
> >
> >        Please be patient with me. I am seeing a difference and was
> > wondering if Mike would see the same thing. I thought Michael would be
> > willing to because he expressed interest in understanding what the
> > performance discrepancies are.
> >
> >        Again, it is only a request. It is perfectly fine if Michael
> > refuses to. But it would be great if Michael speaks for himself.
> >
> > Thanks
> >
> > -John
> >
> > On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <markrmiller@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     Why? What might he find? Whats with the cryptic request?
> >
> >     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
> >     gains?
> >
> >     I know point 2 certainly doesn't. Cards on the table?
> >
> >     John Wang wrote:
> >     > Hey Michael:
> >     >
> >     >        Would you mind rerunning the test you have with jdk1.5?
> >     >
> >     >        Also, if you would, change the comparator method to avoid
> >     > brachning for int and string comparators, e.g.
> >     >
> >     >
> >     >       return index.order[i.doc] - index.order[j.doc];
> >     >
> >     >
> >     > Thanks
> >     >
> >     >
> >     > -John
> >     >
> >     >
> >     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
> >     > <lucene@mikemccandless.com <ma...@mikemccandless.com>
> >     <mailto:lucene@mikemccandless.com
> >     <ma...@mikemccandless.com>>> wrote:
> >     >
> >     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
> >     <john.wang@gmail.com <ma...@gmail.com>
> >     >     <mailto:john.wang@gmail.com <ma...@gmail.com>>>
> >     wrote:
> >     >
> >     >     >      I have been playing with the patch, and I think I
> >     have some
> >     >     information
> >     >     > that you might like.
> >     >     >      Let me spend sometime and gather some more numbers and
> >     >     update in jira.
> >     >
> >     >     Excellent!
> >     >
> >     >     >      say bottom has ords 23, 45, 76, each corresponding to a
> >     >     string. When
> >     >     > moving to the next segment, you need to make bottom to
> >     have ords
> >     >     that can be
> >     >     > comparable to other docs in this new segment, so you would
> >     need
> >     >     to find the
> >     >     > new ords for the values in 23,45 and 76, don't you? To
> >     find it,
> >     >     assuming the
> >     >     > values are s1,s2,s3, you would do a bin. search on the new
> val
> >     >     array, and
> >     >     > find index for s1,s2,s3.
> >     >
> >     >     It's that inversion (from ord->Comparable in first seg, and
> >     >     Comparable->ord in second seg) that I'm trying to avoid (w/
> >     this new
> >     >     proposal).
> >     >
> >     >     > Which is 3 bin searches per convert, I am not sure
> >     >     > how you can short circuit it. Are you suggesting we call
> >     >     Comparable on
> >     >     > compareBottom until some doc beats it?
> >     >
> >     >     I'm saying on seg transition you indeed get the Comparable
> >     for current
> >     >     bottom, but, don't attempt to invert it.  Instead, as seg 2
> >     finds a
> >     >     hit, you get that hit's Comparables and compare to bottom.
> >      If it
> >     >     beats bottom, it goes into the queue.  If it does not, you
> >     use the ord
> >     >     (in seg 2's ord space) to "learn" a bottom in the ord space
> >     of seg 2.
> >     >
> >     >     > That would hurt performance I lot though, no?
> >     >
> >     >     Yeah I think likely it would, since we're talking about a
> binary
> >     >     search on transition VS having to do possibly many
> >     >     upgrade-to-Comparable and compare-Comparabls to slowly learn
> the
> >     >     equivalent ord in the new segment.  I was proposing it for
> >     cases where
> >     >     inversion is very difficult.  But realistically, since you
> >     must keep
> >     >     around the ful ord -> Comparable for every segment anyway
> >     (in order to
> >     >     merge in the end), inversion shouldn't ever actually be
> >     "difficult" --
> >     >     it'd just be a binary search on presumably in-RAM storage.
> >     >
> >     >     Mike
> >     >
> >     >
> >     ---------------------------------------------------------------------
> >     >     To unsubscribe, e-mail:
> >     java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     >     <mailto:java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>>
> >     >     For additional commands, e-mail:
> >     java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     >     <mailto:java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>>
> >     >
> >     >
> >
> >
> >     --
> >     - Mark
> >
> >     http://www.lucidimagination.com
> >
> >
> >
> >
> >     ---------------------------------------------------------------------
> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm
wondering what you're seeing. Asking Mike to rerun the tests without
giving any further info (you didn't even say that you're seeing something
different) is unfair to the rest of us ;)

Giving 0 info along with your request just makes 0 sense to me and I
said as much.

John Wang wrote:
> Mark:
>
>        Please be patient with me. I am seeing a difference and was
> wondering if Mike would see the same thing. I thought Michael would be
> willing to because he expressed interest in understanding what the
> performance discrepancies are.
>
>        Again, it is only a request. It is perfectly fine if Michael
> refuses to. But it would be great if Michael speaks for himself.
>
> Thanks
>
> -John
>
> On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <markrmiller@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Why? What might he find? Whats with the cryptic request?
>
>     Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
>     gains?
>
>     I know point 2 certainly doesn't. Cards on the table?
>
>     John Wang wrote:
>     > Hey Michael:
>     >
>     >        Would you mind rerunning the test you have with jdk1.5?
>     >
>     >        Also, if you would, change the comparator method to avoid
>     > brachning for int and string comparators, e.g.
>     >
>     >
>     >       return index.order[i.doc] - index.order[j.doc];
>     >
>     >
>     > Thanks
>     >
>     >
>     > -John
>     >
>     >
>     > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
>     > <lucene@mikemccandless.com <ma...@mikemccandless.com>
>     <mailto:lucene@mikemccandless.com
>     <ma...@mikemccandless.com>>> wrote:
>     >
>     >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang
>     <john.wang@gmail.com <ma...@gmail.com>
>     >     <mailto:john.wang@gmail.com <ma...@gmail.com>>>
>     wrote:
>     >
>     >     >      I have been playing with the patch, and I think I
>     have some
>     >     information
>     >     > that you might like.
>     >     >      Let me spend sometime and gather some more numbers and
>     >     update in jira.
>     >
>     >     Excellent!
>     >
>     >     >      say bottom has ords 23, 45, 76, each corresponding to a
>     >     string. When
>     >     > moving to the next segment, you need to make bottom to
>     have ords
>     >     that can be
>     >     > comparable to other docs in this new segment, so you would
>     need
>     >     to find the
>     >     > new ords for the values in 23,45 and 76, don't you? To
>     find it,
>     >     assuming the
>     >     > values are s1,s2,s3, you would do a bin. search on the new val
>     >     array, and
>     >     > find index for s1,s2,s3.
>     >
>     >     It's that inversion (from ord->Comparable in first seg, and
>     >     Comparable->ord in second seg) that I'm trying to avoid (w/
>     this new
>     >     proposal).
>     >
>     >     > Which is 3 bin searches per convert, I am not sure
>     >     > how you can short circuit it. Are you suggesting we call
>     >     Comparable on
>     >     > compareBottom until some doc beats it?
>     >
>     >     I'm saying on seg transition you indeed get the Comparable
>     for current
>     >     bottom, but, don't attempt to invert it.  Instead, as seg 2
>     finds a
>     >     hit, you get that hit's Comparables and compare to bottom.
>      If it
>     >     beats bottom, it goes into the queue.  If it does not, you
>     use the ord
>     >     (in seg 2's ord space) to "learn" a bottom in the ord space
>     of seg 2.
>     >
>     >     > That would hurt performance I lot though, no?
>     >
>     >     Yeah I think likely it would, since we're talking about a binary
>     >     search on transition VS having to do possibly many
>     >     upgrade-to-Comparable and compare-Comparabls to slowly learn the
>     >     equivalent ord in the new segment.  I was proposing it for
>     cases where
>     >     inversion is very difficult.  But realistically, since you
>     must keep
>     >     around the ful ord -> Comparable for every segment anyway
>     (in order to
>     >     merge in the end), inversion shouldn't ever actually be
>     "difficult" --
>     >     it'd just be a binary search on presumably in-RAM storage.
>     >
>     >     Mike
>     >
>     >    
>     ---------------------------------------------------------------------
>     >     To unsubscribe, e-mail:
>     java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >     For additional commands, e-mail:
>     java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>     >     <mailto:java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>>
>     >
>     >
>
>
>     --
>     - Mark
>
>     http://www.lucidimagination.com
>
>
>
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>
>


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Mark:
       Please be patient with me. I am seeing a difference and was wondering
if Mike would see the same thing. I thought Michael would be willing to
because he expressed interest in understanding what the performance
discrepancies are.

       Again, it is only a request. It is perfectly fine if Michael refuses
to. But it would be great if Michael speaks for himself.

Thanks

-John

On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller <ma...@gmail.com> wrote:

> Why? What might he find? Whats with the cryptic request?
>
> Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains?
>
> I know point 2 certainly doesn't. Cards on the table?
>
> John Wang wrote:
> > Hey Michael:
> >
> >        Would you mind rerunning the test you have with jdk1.5?
> >
> >        Also, if you would, change the comparator method to avoid
> > brachning for int and string comparators, e.g.
> >
> >
> >       return index.order[i.doc] - index.order[j.doc];
> >
> >
> > Thanks
> >
> >
> > -John
> >
> >
> > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
> > <lucene@mikemccandless.com <ma...@mikemccandless.com>> wrote:
> >
> >     On Thu, Oct 22, 2009 at 2:17 AM, John Wang <john.wang@gmail.com
> >     <ma...@gmail.com>> wrote:
> >
> >     >      I have been playing with the patch, and I think I have some
> >     information
> >     > that you might like.
> >     >      Let me spend sometime and gather some more numbers and
> >     update in jira.
> >
> >     Excellent!
> >
> >     >      say bottom has ords 23, 45, 76, each corresponding to a
> >     string. When
> >     > moving to the next segment, you need to make bottom to have ords
> >     that can be
> >     > comparable to other docs in this new segment, so you would need
> >     to find the
> >     > new ords for the values in 23,45 and 76, don't you? To find it,
> >     assuming the
> >     > values are s1,s2,s3, you would do a bin. search on the new val
> >     array, and
> >     > find index for s1,s2,s3.
> >
> >     It's that inversion (from ord->Comparable in first seg, and
> >     Comparable->ord in second seg) that I'm trying to avoid (w/ this new
> >     proposal).
> >
> >     > Which is 3 bin searches per convert, I am not sure
> >     > how you can short circuit it. Are you suggesting we call
> >     Comparable on
> >     > compareBottom until some doc beats it?
> >
> >     I'm saying on seg transition you indeed get the Comparable for
> current
> >     bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
> >     hit, you get that hit's Comparables and compare to bottom.  If it
> >     beats bottom, it goes into the queue.  If it does not, you use the
> ord
> >     (in seg 2's ord space) to "learn" a bottom in the ord space of seg 2.
> >
> >     > That would hurt performance I lot though, no?
> >
> >     Yeah I think likely it would, since we're talking about a binary
> >     search on transition VS having to do possibly many
> >     upgrade-to-Comparable and compare-Comparabls to slowly learn the
> >     equivalent ord in the new segment.  I was proposing it for cases
> where
> >     inversion is very difficult.  But realistically, since you must keep
> >     around the ful ord -> Comparable for every segment anyway (in order
> to
> >     merge in the end), inversion shouldn't ever actually be "difficult"
> --
> >     it'd just be a binary search on presumably in-RAM storage.
> >
> >     Mike
> >
> >     ---------------------------------------------------------------------
> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
> >     <ma...@lucene.apache.org>
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Why? What might he find? What's with the cryptic request?

Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains?

I know point 2 certainly doesn't. Cards on the table?

John Wang wrote:
> Hey Michael:
>
>        Would you mind rerunning the test you have with jdk1.5?
>
>        Also, if you would, change the comparator method to avoid
> brachning for int and string comparators, e.g. 
>
>
>       return index.order[i.doc] - index.order[j.doc];
>
>
> Thanks
>
>
> -John
>
>
> On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
> <lucene@mikemccandless.com <ma...@mikemccandless.com>> wrote:
>
>     On Thu, Oct 22, 2009 at 2:17 AM, John Wang <john.wang@gmail.com
>     <ma...@gmail.com>> wrote:
>
>     >      I have been playing with the patch, and I think I have some
>     information
>     > that you might like.
>     >      Let me spend sometime and gather some more numbers and
>     update in jira.
>
>     Excellent!
>
>     >      say bottom has ords 23, 45, 76, each corresponding to a
>     string. When
>     > moving to the next segment, you need to make bottom to have ords
>     that can be
>     > comparable to other docs in this new segment, so you would need
>     to find the
>     > new ords for the values in 23,45 and 76, don't you? To find it,
>     assuming the
>     > values are s1,s2,s3, you would do a bin. search on the new val
>     array, and
>     > find index for s1,s2,s3.
>
>     It's that inversion (from ord->Comparable in first seg, and
>     Comparable->ord in second seg) that I'm trying to avoid (w/ this new
>     proposal).
>
>     > Which is 3 bin searches per convert, I am not sure
>     > how you can short circuit it. Are you suggesting we call
>     Comparable on
>     > compareBottom until some doc beats it?
>
>     I'm saying on seg transition you indeed get the Comparable for current
>     bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
>     hit, you get that hit's Comparables and compare to bottom.  If it
>     beats bottom, it goes into the queue.  If it does not, you use the ord
>     (in seg 2's ord space) to "learn" a bottom in the ord space of seg 2.
>
>     > That would hurt performance I lot though, no?
>
>     Yeah I think likely it would, since we're talking about a binary
>     search on transition VS having to do possibly many
>     upgrade-to-Comparable and compare-Comparabls to slowly learn the
>     equivalent ord in the new segment.  I was proposing it for cases where
>     inversion is very difficult.  But realistically, since you must keep
>     around the ful ord -> Comparable for every segment anyway (in order to
>     merge in the end), inversion shouldn't ever actually be "difficult" --
>     it'd just be a binary search on presumably in-RAM storage.
>
>     Mike
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     <ma...@lucene.apache.org>
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>     <ma...@lucene.apache.org>
>
>


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hey Michael:
       Would you mind rerunning the test you have with jdk1.5?

       Also, if you would, change the comparator method to avoid branching
for int and string comparators, e.g.


      return index.order[i.doc] - index.order[j.doc];


Thanks


-John

On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Oct 22, 2009 at 2:17 AM, John Wang <jo...@gmail.com> wrote:
>
> >      I have been playing with the patch, and I think I have some
> information
> > that you might like.
> >      Let me spend sometime and gather some more numbers and update in
> jira.
>
> Excellent!
>
> >      say bottom has ords 23, 45, 76, each corresponding to a string. When
> > moving to the next segment, you need to make bottom to have ords that can
> be
> > comparable to other docs in this new segment, so you would need to find
> the
> > new ords for the values in 23,45 and 76, don't you? To find it, assuming
> the
> > values are s1,s2,s3, you would do a bin. search on the new val array, and
> > find index for s1,s2,s3.
>
> It's that inversion (from ord->Comparable in first seg, and
> Comparable->ord in second seg) that I'm trying to avoid (w/ this new
> proposal).
>
> > Which is 3 bin searches per convert, I am not sure
> > how you can short circuit it. Are you suggesting we call Comparable on
> > compareBottom until some doc beats it?
>
> I'm saying on seg transition you indeed get the Comparable for current
> bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
> hit, you get that hit's Comparables and compare to bottom.  If it
> beats bottom, it goes into the queue.  If it does not, you use the ord
> (in seg 2's ord space) to "learn" a bottom in the ord space of seg 2.
>
> > That would hurt performance I lot though, no?
>
> Yeah I think likely it would, since we're talking about a binary
> search on transition VS having to do possibly many
> upgrade-to-Comparable and compare-Comparabls to slowly learn the
> equivalent ord in the new segment.  I was proposing it for cases where
> inversion is very difficult.  But realistically, since you must keep
> around the ful ord -> Comparable for every segment anyway (in order to
> merge in the end), inversion shouldn't ever actually be "difficult" --
> it'd just be a binary search on presumably in-RAM storage.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
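[Editor's note] John's branch-free comparator suggestion above can be sketched as follows. This is an illustrative standalone class, not Lucene's actual FieldComparator code; the `order` array (ord per docID) mirrors the `index.order` field John references.

```java
// Sketch of the branch-free ord comparison John suggests: compare two
// docs by their ords (positions in the sorted term table) with plain
// subtraction instead of explicit tests. Illustrative names only.
public class OrdCompare {

    // Branching version: three-way compare with explicit tests.
    static int compareBranching(int[] order, int docI, int docJ) {
        int oi = order[docI], oj = order[docJ];
        if (oi < oj) return -1;
        if (oi > oj) return 1;
        return 0;
    }

    // Branch-free version: plain subtraction. Safe here because ords are
    // non-negative and bounded by the number of unique terms, so the
    // difference cannot overflow (unlike subtracting arbitrary ints).
    static int compareBranchFree(int[] order, int docI, int docJ) {
        return order[docI] - order[docJ];
    }

    public static void main(String[] args) {
        int[] order = {5, 2, 9, 2};   // ord per docID
        System.out.println(compareBranchFree(order, 0, 1)); // positive: doc 0 sorts after doc 1
        System.out.println(compareBranchFree(order, 1, 3)); // 0: equal ords
    }
}
```

Note the overflow caveat: this shortcut is only valid because both operands are small non-negative ords; it would be wrong for arbitrary int field values.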

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 22, 2009 at 2:17 AM, John Wang <jo...@gmail.com> wrote:

>      I have been playing with the patch, and I think I have some information
> that you might like.
>      Let me spend sometime and gather some more numbers and update in jira.

Excellent!

>      say bottom has ords 23, 45, 76, each corresponding to a string. When
> moving to the next segment, you need to make bottom to have ords that can be
> comparable to other docs in this new segment, so you would need to find the
> new ords for the values in 23,45 and 76, don't you? To find it, assuming the
> values are s1,s2,s3, you would do a bin. search on the new val array, and
> find index for s1,s2,s3.

It's that inversion (from ord->Comparable in first seg, and
Comparable->ord in second seg) that I'm trying to avoid (w/ this new
proposal).

> Which is 3 bin searches per convert, I am not sure
> how you can short circuit it. Are you suggesting we call Comparable on
> compareBottom until some doc beats it?

I'm saying on seg transition you indeed get the Comparable for current
bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
hit, you get that hit's Comparables and compare to bottom.  If it
beats bottom, it goes into the queue.  If it does not, you use the ord
(in seg 2's ord space) to "learn" a bottom in the ord space of seg 2.

> That would hurt performance I lot though, no?

Yeah I think likely it would, since we're talking about a binary
search on transition VS having to do possibly many
upgrade-to-Comparable and compare-Comparables to slowly learn the
equivalent ord in the new segment.  I was proposing it for cases where
inversion is very difficult.  But realistically, since you must keep
around the full ord -> Comparable for every segment anyway (in order to
merge in the end), inversion shouldn't ever actually be "difficult" --
it'd just be a binary search on presumably in-RAM storage.

Mike
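[Editor's note] Mike's "learn a bottom in the new segment's ord space" idea can be sketched roughly like this. All names here are hypothetical; the real FieldComparator machinery is more involved. The key invariant is that ords are assigned in sorted value order, so any ord at or above a known-failing ord must also fail.

```java
// Hypothetical sketch of the "learned bottom" short-circuit: after a
// segment transition we only have bottom's value (a Comparable), not its
// ord in the new segment. Each time a hit fails to beat bottom we record
// its ord, tightening a bound that lets later hits be rejected with a
// cheap int comparison instead of a value comparison.
public class LearnedBottom {
    final String[] segValues;   // ord -> value for the current segment, sorted
    final String bottomValue;   // carried over from the previous segment
    int learnedBottomOrd = Integer.MAX_VALUE; // smallest ord seen so far that failed

    LearnedBottom(String[] segValues, String bottomValue) {
        this.segValues = segValues;
        this.bottomValue = bottomValue;
    }

    /** Returns true if the hit with this ord beats (sorts before) bottom. */
    boolean beatsBottom(int ord) {
        if (ord >= learnedBottomOrd) {
            return false;               // short-circuit: at or past a known-failing ord
        }
        if (segValues[ord].compareTo(bottomValue) < 0) {
            return true;                // competitive: would go into the queue
        }
        learnedBottomOrd = ord;         // learn a tighter bound
        return false;
    }

    public static void main(String[] args) {
        String[] vals = {"apple", "kiwi", "pear", "plum"}; // new segment's sorted terms
        LearnedBottom lb = new LearnedBottom(vals, "melon");
        System.out.println(lb.beatsBottom(2)); // "pear" > "melon": false, learns ord 2
        System.out.println(lb.beatsBottom(3)); // ord 3 >= 2: rejected without compareTo
        System.out.println(lb.beatsBottom(0)); // "apple" < "melon": true
    }
}
```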



Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Mike:
     I have been playing with the patch, and I think I have some information
that you might like.

     Let me spend sometime and gather some more numbers and update in jira.

Thanks

btw:

     About the conversion on multi-valued fields, I am not sure I get it
(sorry for being ignorant):

     say bottom has ords 23, 45, 76, each corresponding to a string. When
moving to the next segment, you need to make bottom to have ords that can be
comparable to other docs in this new segment, so you would need to find the
new ords for the values in 23,45 and 76, don't you? To find it, assuming the
values are s1,s2,s3, you would do a bin. search on the new val array, and
find index for s1,s2,s3. Which is 3 bin searches per convert, I am not sure
how you can short circuit it. Are you suggesting we call Comparable on
compareBottom until some doc beats it? That would hurt performance I lot
though, no?

-John

On Wed, Oct 21, 2009 at 3:11 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Tue, Oct 20, 2009 at 11:55 AM, John Wang <jo...@gmail.com> wrote:
>
> > the simpler api places less restriction on the type of custom
> > sorting that can be done.
>
> Just to verify: this is not a back-compat break, right?
>
> Because, in 2.4, such an interesting custom sort must've been
> operating at the top-level index reader level, which is easy to carry
> over to 2.9 (you just rebase the docIDs).
>
> But, of course in moving to 2.9, you would like to also switch your
> custom sort to be per-segment (for faster reopen/near real-time perf),
> but the new sort API makes this more difficult because it requires
> that you are able to compare hits across different segments during the
> search, not just at the end.
>
> But then I don't understand the difficulty of doing that: if we had a
> Collector with the MultiPQ approach, at the end during merge, you'd
> also have to compare results across segments, ie, upgrade your ords to
> their real values.  The MultiPQ approach does this by calling
> sortValue (returns Comparable) in the end.
>
> Putting performance aside for now... when comparing bottom, you don't
> actually have to "truly invert" Comparable -> ord on segment
> transition.  You could, instead, get the Comparable for each and
> compare, but then note the smallest ord for the current segment that
> has failed to compete, and short-circuit the compareBottom test by
> checking against that ord. That should enable carrying over the custom
> sort to the single-PQ API without needing to invert ord->value.
>
> We'd obviously have to test performance...
>
> Or, we could commit the MultiPQ approach as another sorting collector?
> I know it's not great having two wildly different sort APIs, but both
> APIs seem to have their strengths in different cases.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 20, 2009 at 8:21 AM, Mark Miller <ma...@gmail.com> wrote:
> Ahhh - I see - way at the top. Man that was early. Had forgotten about
> that stuff even before the issue was finished.

Tell me about it -- impossible to remember these things :)  I wish I
could upgrade the RAM in my brain the way I can in my computers...

Mike



Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 20, 2009 at 11:47 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Tue, Oct 20, 2009 at 10:49 AM, Mark Miller <ma...@gmail.com> wrote:
>> bq. One trivial thing that could be improved is to perhaps move all of
>> the methods to the top of the class?
>>
>> +1 - I think Mike and I silently fought on that one once in the patches :)
>> Though I don't know how conscious it was. I prefer the methods at the
>> top myself.
>
> +1
>
> I didn't mean to fight that :)

I'll move them up...

Mike



Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 20, 2009 at 10:49 AM, Mark Miller <ma...@gmail.com> wrote:
> bq. One trivial thing that could be improved is to perhaps move all of
> the methods to the top of the class?
>
> +1 - I think Mike and I silently fought on that one once in the patches :)
> Though I don't know how conscious it was. I prefer the methods at the
> top myself.

+1

I didn't mean to fight that :)

Mike



Re: lucene 2.9 sorting algorithm

Posted by Earwin Burrfoot <ea...@gmail.com>.
There are some advanced things that are plain impossible with the stock new API.
Like having more than one HitQueue in your Collector, and stashing
overflowing values from one of them into another. Once you cross the
segment border - BOOM!

Otherwise it may look intimidating, but is pretty simple in fact.

On Tue, Oct 20, 2009 at 17:41, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Tue, Oct 20, 2009 at 9:31 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> It is not bad, only harder to understand (for some people).
>
> The Javadoc is much improved since I made the switch.
> One trivial thing that could be improved is to perhaps move all of the
> methods to the top of the class?
> Right now, if I go and browse FieldComparator, I'm immediately hit
> with inner classes... and it takes time to find the actual methods one
> must override to create their own FieldComparator, and that make the
> class seem more complex at first blush (but maybe that's just me
> too...)
>
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org


-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785



Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 20, 2009 at 11:55 AM, John Wang <jo...@gmail.com> wrote:

> the simpler api places less restriction on the type of custom
> sorting that can be done.

Just to verify: this is not a back-compat break, right?

Because, in 2.4, such an interesting custom sort must've been
operating at the top-level index reader level, which is easy to carry
over to 2.9 (you just rebase the docIDs).

But, of course in moving to 2.9, you would like to also switch your
custom sort to be per-segment (for faster reopen/near real-time perf),
but the new sort API makes this more difficult because it requires
that you are able to compare hits across different segments during the
search, not just at the end.

But then I don't understand the difficulty of doing that: if we had a
Collector with the MultiPQ approach, at the end during merge, you'd
also have to compare results across segments, ie, upgrade your ords to
their real values.  The MultiPQ approach does this by calling
sortValue (returns Comparable) in the end.

Putting performance aside for now... when comparing bottom, you don't
actually have to "truly invert" Comparable -> ord on segment
transition.  You could, instead, get the Comparable for each and
compare, but then note the smallest ord for the current segment that
has failed to compete, and short-circuit the compareBottom test by
checking against that ord. That should enable carrying over the custom
sort to the single-PQ API without needing to invert ord->value.
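
A rough sketch of that short-circuit idea, with hypothetical names (the real
FieldComparator API and its sign conventions differ):

```java
// Hedged sketch: compare by value on segment transition, but remember the
// smallest ord that failed to beat the bottom so later docs with an equal
// or larger ord skip the (possibly expensive) value comparison.
public class ShortCircuitBottom {
    int[] ords;            // per-doc ord for the current segment
    String[] vals;         // ord -> value for the current segment
    String bottomValue;    // queue bottom, carried across segments as a value
    int smallestFailedOrd = Integer.MAX_VALUE;

    // true if doc beats the current bottom (ascending sort assumed)
    boolean competes(int doc) {
        int ord = ords[doc];
        // Any ord at or above the smallest known loser must also lose.
        if (ord >= smallestFailedOrd) return false;
        boolean wins = vals[ord].compareTo(bottomValue) < 0;
        if (!wins) smallestFailedOrd = ord; // ord < previous minimum here
        return wins;
    }

    public static void main(String[] args) {
        ShortCircuitBottom c = new ShortCircuitBottom();
        c.ords = new int[]{3, 0, 2};
        c.vals = new String[]{"a", "b", "c", "d"};
        c.bottomValue = "c";
        System.out.println(c.competes(0)); // false: "d" loses, remembers ord 3
        System.out.println(c.competes(2)); // false: "c" is not < "c", remembers ord 2
        System.out.println(c.competes(1)); // true:  "a" < "c"
    }
}
```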

We'd obviously have to test performance...

Or, we could commit the MultiPQ approach as another sorting collector?
I know it's not great having two wildly different sort APIs, but both
APIs seem to have their strengths in different cases.

Mike



Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Sorry, mistyped again: we have a multivalued field of STRINGS, not integers.
-John

On Tue, Oct 20, 2009 at 8:55 AM, John Wang <jo...@gmail.com> wrote:

> Hi guys:
>     I am not suggesting just simply changing the deprecated signatures.
> There is some work to be done, of course. In the beginning of the thread, we
> discussed two algorithms (both handling per-segment field loading), and the
> conclusion (still to be verified by Mike) is that both algorithms perform
> the same. (We do see that once the queue size increases, the performance cost
> increased more for the single Q approach, the one in the trunk, than the
> multiQ approach, please see the numbers I posted earlier in this thread.)
>
>     However, the multiQ algorithm would allow us to keep the old simpler
> api, and the simpler api places less restriction on the type of custom
> sorting that can be done.
>
> Let me provide an example:
>
>     We have a multi-valued field on integers; we define a sort on this set
> of strings by defining a comparator on each value, similar to lex
> order, but instead of comparing on characters we compare on strings. We also want to
> keep the multi-value representation, as we do filtering and facet counting on
> it. The in-memory representation is similar to the UnInvertedField in Solr.
>
>    Implementing a sort with the old API was rather simple, as we only
> needed a mapping from a docid to a set of ordinals. With the new API, we
> needed to do a "conversion", which would mean mapping a set of
> String/ordinals back to a doc, which to me is not trivial, let alone the
> performance implications.
>
>    That actually gave us the motivation to see if the old API can handle the
> segment level changes that was made in 2.9 (which in my opinion is the best
> thing in lucene since payloads :) )
>
>    So after some investigation, with code and big-O analysis, and
> discussions with Mike and Yonik on our end, we feel that, given the performance
> numbers, it is unnecessary to go with the more complicated API.
>
> Thanks
>
> -John
>
>
>
> On Tue, Oct 20, 2009 at 6:00 AM, Mark Miller <ma...@gmail.com>wrote:
>
>> Actually though - how are we supposed to get back there? I don't think
>> its as simple as just not removing the deprecated API's. Doesn't even
>> seem close to that simple. Its another nightmare. It would have to be
>> some serious wins to go through that pain starting at a 3.0 release
>> wouldn't it? We just had a ton of people switch. We would have to
>> deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
>> new API is certainly not that bad.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi guys:
    I am not suggesting just simply changing the deprecated signatures.
There is some work to be done, of course. In the beginning of the thread, we
discussed two algorithms (both handling per-segment field loading), and the
conclusion (still to be verified by Mike) is that both algorithms perform
the same. (We do see that once the queue size increases, the performance cost
increased more for the single Q approach, the one in the trunk, than the
multiQ approach, please see the numbers I posted earlier in this thread.)

    However, the multiQ algorithm would allow us to keep the old simpler
api, and the simpler api places less restriction on the type of custom
sorting that can be done.

Let me provide an example:

    We have a multi-valued field on integers; we define a sort on this set
of strings by defining a comparator on each value, similar to lex
order, but instead of comparing on characters we compare on strings. We also want to
keep the multi-value representation, as we do filtering and facet counting on
it. The in-memory representation is similar to the UnInvertedField in Solr.

   Implementing a sort with the old API was rather simple, as we only needed
a mapping from a docid to a set of ordinals. With the new API, we needed to do
a "conversion", which would mean mapping a set of String/ordinals back to a
doc, which to me is not trivial, let alone the performance implications.
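
The per-segment compare that the old docid->ordinals mapping makes easy could
look roughly like this; the names are illustrative, not from any actual
Lucene/Solr class:

```java
// Sketch of a lexicographic comparison over per-doc ordinal arrays within
// one segment: like comparing strings character by character, but each
// "character" is an ord whose order matches the underlying string values.
public class MultiValueOrdCompare {

    static int compare(int[] ordsA, int[] ordsB) {
        int n = Math.min(ordsA.length, ordsB.length);
        for (int i = 0; i < n; i++) {
            if (ordsA[i] != ordsB[i]) return ordsA[i] - ordsB[i];
        }
        return ordsA.length - ordsB.length; // a prefix sorts first
    }

    public static void main(String[] args) {
        System.out.println(compare(new int[]{1, 5}, new int[]{1, 7}) < 0);    // true
        System.out.println(compare(new int[]{1, 5}, new int[]{1, 5, 9}) < 0); // true
    }
}
```

The catch John describes is that ords like these are only meaningful within
one segment, so the single-PQ API would force a set-of-ords -> values
conversion at every segment boundary.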

   That actually gave us the motivation to see if the old API can handle the
segment level changes that was made in 2.9 (which in my opinion is the best
thing in lucene since payloads :) )

   So after some investigation, with code and big-O analysis, and
discussions with Mike and Yonik on our end, we feel that, given the performance
numbers, it is unnecessary to go with the more complicated API.

Thanks

-John



On Tue, Oct 20, 2009 at 6:00 AM, Mark Miller <ma...@gmail.com> wrote:

> Actually though - how are we supposed to get back there? I don't think
> its as simple as just not removing the deprecated API's. Doesn't even
> seem close to that simple. Its another nightmare. It would have to be
> some serious wins to go through that pain starting at a 3.0 release
> wouldn't it? We just had a ton of people switch. We would have to
> deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
> new API is certainly not that bad.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
bq. One trivial thing that could be improved is to perhaps move all of
the methods to the top of the class?

+1 - I think Mike and I silently fought on that one once in the patches :)
Though I don't know how conscious it was. I prefer the methods at the
top myself.

Yonik Seeley wrote:
> On Tue, Oct 20, 2009 at 9:31 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>   
>> It is not bad, only harder to understand (for some people).
>>     
>
> The Javadoc is much improved since I made the switch.
>
> Right now, if I go and browse FieldComparator, I'm immediately hit
> with inner classes... and it takes time to find the actual methods one
> must override to create their own FieldComparator, and that make the
> class seem more complex at first blush (but maybe that's just me
> too...)
>
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Oct 20, 2009 at 9:31 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
> It is not bad, only harder to understand (for some people).

The Javadoc is much improved since I made the switch.
One trivial thing that could be improved is to perhaps move all of the
methods to the top of the class?
Right now, if I go and browse FieldComparator, I'm immediately hit
with inner classes... and it takes time to find the actual methods one
must override to create their own FieldComparator, and that make the
class seem more complex at first blush (but maybe that's just me
too...)


-Yonik
http://www.lucidimagination.com



RE: lucene 2.9 sorting algorithm

Posted by Uwe Schindler <uw...@thetaphi.de>.
> Actually though - how are we supposed to get back there? I don't think
> its as simple as just not removing the deprecated API's. Doesn't even
> seem close to that simple. Its another nightmare. It would have to be
> some serious wins to go through that pain starting at a 3.0 release
> wouldn't it? We just had a ton of people switch. We would have to
> deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
> new API is certainly not that bad.

It is not bad, only harder to understand (for some people). As noted in my
previous mail, we could start the same discussion like with the Collectors:

Should we provide some easy-to-use FieldComparators that can be used as a
starting point? Maybe one that looks like the comparators used for
byte/int/short and so on, but has an abstract method
getComparable(IndexReader reader, int doc) [relative to the current reader] that
returns a Comparable<T>. All other methods and arrays would use the datatype
Comparable<T> and just implement the compare functions (inner slot and
bottom). Or maybe another wrapper that takes a java.lang.Comparator<T> and a
method T getSortValue(IndexReader reader, int doc) and implements the
comparison. We had something like this in the old API, too. It could be
implemented by a wrapper with this easy API.

These wrapper classes may be a lot slower than native FieldComparators
directly working on the native types, but would help to create simple
sorting for things like geo-distances, where the calculation of the field
values/Comparable<T> takes most time.
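
Such a wrapper might look roughly like this (hypothetical names; the real
FieldComparator has more hooks such as setNextReader and setScorer, and a
reader argument on the value method, all omitted here):

```java
// Rough sketch of a Comparable-based wrapper: subclasses only supply the
// sort value for a doc, and all queue bookkeeping works on Comparable<T>.
public abstract class ComparableComparator<T extends Comparable<T>> {
    private final T[] slots; // one stored value per queue slot
    private T bottom;

    @SuppressWarnings("unchecked")
    protected ComparableComparator(int numHits) {
        slots = (T[]) new Comparable[numHits];
    }

    // The single method a subclass must implement: the sort value for a
    // doc, relative to the current segment's reader.
    protected abstract T value(int doc);

    public int compare(int slot1, int slot2) { return slots[slot1].compareTo(slots[slot2]); }
    public void copy(int slot, int doc)      { slots[slot] = value(doc); }
    public void setBottom(int slot)          { bottom = slots[slot]; }
    public int compareBottom(int doc)        { return bottom.compareTo(value(doc)); }

    public static void main(String[] args) {
        ComparableComparator<Integer> c = new ComparableComparator<Integer>(2) {
            protected Integer value(int doc) { return doc * 10; }
        };
        c.copy(0, 3);
        c.copy(1, 1);
        System.out.println(c.compare(0, 1) > 0); // true: 30 vs 10
    }
}
```

As noted, the boxing through Comparable makes this slower than the native
per-type comparators, but for something like a geo-distance sort the value
computation dominates anyway.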

So some easy-access or sample comparators like this in contrib would be
fine.

Uwe




Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Actually though - how are we supposed to get back there? I don't think
its as simple as just not removing the deprecated API's. Doesn't even
seem close to that simple. Its another nightmare. It would have to be
some serious wins to go through that pain starting at a 3.0 release
wouldn't it? We just had a ton of people switch. We would have to
deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
new API is certainly not that bad.



Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Uwe Schindler wrote:
>> On Tue, Oct 20, 2009 at 8:08 AM, Mark Miller <ma...@gmail.com>
>> wrote:
>>     
>>> Hmm - perhaps I'm not remembering right. Or perhaps we had different
>>> motivations ;) I never did anything in 1483 based on search perf - and I
>>> took your tests as testing that we didn't lose perf, not that we gained
>>> any. The fact that there were some wins was just a nice surprise from my
>>> perspective.
>>>
>>> A quote from you in that issue:
>>>
>>> "I didn't expect such performance gain (I was hoping for not much
>>> performance loss, actually). I think it may be that although the
>>> initial value copy adds some cost, the within-queue comparisons are
>>> then faster because you don't have to deref back to the fieldcache
>>> array. It seems we keep accidentally discovering performance gains
>>> here"
>>>
>>> My whole memory of that issue is that we didn't do anything for
>>> performance gains. We just happened to measure a few. It was just to get
>>> to per segment. Was a long time ago though.
>>>       
>> Right, our original motivation was fast reopen time, by doing
>> searching (and collection) per-segment so that the field cache is only used
>> at the segment level.
>>
>> But, that required cutting over field sorting, which was tricky.
>>
>> Our first go at it was the multi PQ approach (copying MultiSearcher),
>> but I believe that showed poor performance.  I remember being
>> depressed about it :)  So that poor performance pushed us to work out
>> the new comparator API that use a single PQ, and, after much
>> iterating, we saw better performance net/net.
>>     
>
> And the new sorting API is in line with the new Collector API! You have a
> setNextReader() method, where you e.g. load the FieldCache for the next
> segment and provide the compare functions, also you can get the scorer. My
> question: What is so hard to use this API? OK, it's more work when
> implementing the Comparator, but it is more intuitive for me if you think in
> terms of per-segment searches. For new users only the bottom comparison and
> so on is strange, the other is straightforward.
>
> I do not know how this flexibility can be implemented with the old API
> (scorer, reader switch)? If we want to switch back to a simpler API,
> we should not switch back to the strange old one (I never understood it
> completely, the new one I understand)! Maybe we can provide an easy-to-use
> default implementation for Comparables in addition to custom sort, which may
> help lots of people that used Comparables with the old API. This impl may be
> slower and more memory intensive than directly implementing the new API, but
> may help.
>
> Uwe
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   
That wording had me caught up too, Uwe - "switch back to the old API" -
brings a lot of baggage with it :) More accurately, they mean switching to
using multiple p-queues rather than a single p-queue. The switch to
using a single p-queue is why we had to bring in setNextReader and all
of that to begin with. The original approach was actually to work as
MultiSearcher works and do a merge after using a p-queue for each
segment. So if we went back to that, many of the "new apis" wouldn't be
needed anymore.
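
The MultiSearcher-style endgame sketched above, in simplified form
(illustrative names; in real Lucene the merge would compare actual sort
values, since ords are only comparable within a segment):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the multi-PQ approach: collect a top-K queue per segment using
// cheap within-segment compares, then merge by real value at the very end.
public class MultiPQMerge {

    static List<String> mergeTopK(List<PriorityQueue<String>> perSegment, int k) {
        PriorityQueue<String> merged = new PriorityQueue<>();
        for (PriorityQueue<String> pq : perSegment) {
            merged.addAll(pq); // the only cross-segment comparison
        }
        List<String> top = new ArrayList<>();
        while (top.size() < k && !merged.isEmpty()) {
            top.add(merged.poll());
        }
        return top;
    }

    public static void main(String[] args) {
        PriorityQueue<String> seg1 = new PriorityQueue<>(Arrays.asList("banana", "fig"));
        PriorityQueue<String> seg2 = new PriorityQueue<>(Arrays.asList("apple", "kiwi"));
        System.out.println(mergeTopK(Arrays.asList(seg1, seg2), 3)); // [apple, banana, fig]
    }
}
```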

I agree that the new API is not too dreadful, but I think part of that
may be from being a more advanced user. Lets just not go back to caching
comparators :)

-- 
- Mark

http://www.lucidimagination.com






RE: lucene 2.9 sorting algorithm

Posted by Uwe Schindler <uw...@thetaphi.de>.
> On Tue, Oct 20, 2009 at 8:08 AM, Mark Miller <ma...@gmail.com>
> wrote:
> > Hmm - perhaps I'm not remembering right. Or perhaps we had different
> > motivations ;) I never did anything in 1483 based on search perf - and I
> > took your tests as testing that we didn't lose perf, not that we gained
> > any. The fact that there were some wins was just a nice surprise from my
> > perspective.
> >
> > A quote from you in that issue:
> >
> > "I didn't expect such performance gain (I was hoping for not much
> > performance loss, actually). I think it may be that although the
> > initial value copy adds some cost, the within-queue comparisons are
> > then faster because you don't have to deref back to the fieldcache
> > array. It seems we keep accidentally discovering performance gains
> > here"
> >
> > My whole memory of that issue is that we didn't do anything for
> > performance gains. We just happened to measure a few. It was just to get
> > to per segment. Was a long time ago though.
> 
> Right, our original motivation was fast reopen time, by doing
> searching (and collection) per-segment so that the field cache is only used
> at the segment level.
> 
> But, that required cutting over field sorting, which was tricky.
> 
> Our first go at it was the multi PQ approach (copying MultiSearcher),
> but I believe that showed poor performance.  I remember being
> depressed about it :)  So that poor performance pushed us to work out
> the new comparator API that use a single PQ, and, after much
> iterating, we saw better performance net/net.

And the new sorting API is in line with the new Collector API! You have a
setNextReader() method, where you e.g. load the FieldCache for the next
segment and provide the compare functions, also you can get the scorer. My
question: What is so hard to use this API? OK, it's more work when
implementing the Comparator, but it is more intuitive for me if you think in
terms of per-segment searches. For new users only the bottom comparison and
so on is strange, the other is straightforward.

I do not know how this flexibility can be implemented with the old API
(scorer, reader switch)? If we want to switch back to a simpler API,
we should not switch back to the strange old one (I never understood it
completely, the new one I understand)! Maybe we can provide an easy-to-use
default implementation for Comparables in addition to custom sort, which may
help lots of people that used Comparables with the old API. This impl may be
slower and more memory intensive than directly implementing the new API, but
may help.

Uwe





Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 20, 2009 at 8:08 AM, Mark Miller <ma...@gmail.com> wrote:
> Hmm - perhaps I'm not remembering right. Or perhaps we had different
> motivations ;) I never did anything in 1483 based on search perf - and I
> took your tests as testing that we didn't lose perf, not that we gained
> any. The fact that there were some wins was just a nice surprise from my
> perspective.
>
> A quote from you in that issue:
>
> "I didn't expect such performance gain (I was hoping for not much
> performance loss, actually). I think it may be that although the
> initial value copy adds some cost, the within-queue comparisons are
> then faster because you don't have to deref back to the fieldcache
> array. It seems we keep accidentally discovering performance gains
> here"
>
> My whole memory of that issue is that we didn't do anything for
> performance gains. We just happened to measure a few. It was just to get
> to per segment. Was a long time ago though.

Right, our original motivation was fast reopen time, by doing
searching (and collection) per-segment so that the field cache is only used
at the segment level.

But, that required cutting over field sorting, which was tricky.

Our first go at it was the multi PQ approach (copying MultiSearcher),
but I believe that showed poor performance.  I remember being
depressed about it :)  So that poor performance pushed us to work out
the new comparator API that use a single PQ, and, after much
iterating, we saw better performance net/net.

Mike



Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Ahhh - I see - way at the top. Man that was early. Had forgotten about
that stuff even before the issue was finished.

Mark Miller wrote:
> Hmm - perhaps I'm not remembering right. Or perhaps we had different
> motivations ;) I never did anything in 1483 based on search perf - and I
> took your tests as testing that we didn't lose perf, not that we gained
> any. The fact that there were some wins was just a nice surprise from my
> perspective.
>
> A quote from you in that issue:
>
> "I didn't expect such performance gain (I was hoping for not much
> performance loss, actually). I think it may be that although the
> initial value copy adds some cost, the within-queue comparisons are
> then faster because you don't have to deref back to the fieldcache
> array. It seems we keep accidentally discovering performance gains
> here"
>
> My whole memory of that issue is that we didn't do anything for
> performance gains. We just happened to measure a few. It was just to get
> to per segment. Was a long time ago though.
>
> Michael McCandless wrote:
>   
>> On Tue, Oct 20, 2009 at 6:51 AM, Mark Miller <ma...@gmail.com> wrote:
>>   
>>     
>>> I didn't really follow that thread either - but we didn't move to the new
>>> Comp API because of its performance vs the old.
>>>     
>>>       
>> We did (LUCENE-1483), but those perf tests mixed in a number of other
>> improvements (eg, searching by segment avoids the 2nd pass of
>> MultiTermDocs.read(int[], int[]), whereas John's tests more
>> specifically test the perf difference between single-PQ vs
>> multi-PQ-then-merge (much simpler comparator API).
>>
>> So we are re-scrutinizing that difference... and if the perf gains are
>> minimal or non-existent I think we should still consider going back to
>> the simpler API.
>>
>> I'm working now to set up a full benchmark across real (wikipedia) /
>> synthetic source, different queries, different sorts, balanced vs
>> unbalanced segment sizes, etc.
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>   
>>     
>
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
Hmm - perhaps I'm not remembering right. Or perhaps we had different
motivations ;) I never did anything in 1483 based on search perf - and I
took your tests as testing that we didn't lose perf, not that we gained
any. The fact that there were some wins was just a nice surprise from my
perspective.

A quote from you in that issue:

"I didn't expect such performance gain (I was hoping for not much
performance loss, actually). I think it may be that although the
initial value copy adds some cost, the within-queue comparisons are
then faster because you don't have to deref back to the fieldcache
array. It seems we keep accidentally discovering performance gains
here"

My whole memory of that issue is that we didn't do anything for
performance gains. We just happened to measure a few. It was just to get
to per segment. Was a long time ago though.

Michael McCandless wrote:
> On Tue, Oct 20, 2009 at 6:51 AM, Mark Miller <ma...@gmail.com> wrote:
>   
>> I didn't really follow that thread either - but we didn't move to the new
>> Comp API because of its performance vs the old.
>>     
>
> We did (LUCENE-1483), but those perf tests mixed in a number of other
> improvements (eg, searching by segment avoids the 2nd pass of
> MultiTermDocs.read(int[], int[]), whereas John's tests more
> specifically test the perf difference between single-PQ vs
> multi-PQ-then-merge (much simpler comparator API).
>
> So we are re-scrutinizing that difference... and if the perf gains are
> minimal or non-existent I think we should still consider going back to
> the simpler API.
>
> I'm working now to set up a full benchmark across real (wikipedia) /
> synthetic source, different queries, different sorts, balanced vs
> unbalanced segment sizes, etc.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 20, 2009 at 6:51 AM, Mark Miller <ma...@gmail.com> wrote:
> I didn't really follow that thread either - but we didn't move to the new
> Comp API because of its performance vs the old.

We did (LUCENE-1483), but those perf tests mixed in a number of other
improvements (eg, searching by segment avoids the 2nd pass of
MultiTermDocs.read(int[], int[]), whereas John's tests more
specifically test the perf difference between single-PQ vs
multi-PQ-then-merge (much simpler comparator API).

So we are re-scrutinizing that difference... and if the perf gains are
minimal or non-existent I think we should still consider going back to
the simpler API.

I'm working now to set up a full benchmark across real (wikipedia) /
synthetic source, different queries, different sorts, balanced vs
unbalanced segment sizes, etc.

Mike



Re: lucene 2.9 sorting algorithm

Posted by Mark Miller <ma...@gmail.com>.
I didn't really follow that thread either - but we didn't move to the
new Comp Api because of its performance vs the old.

- Mark

http://www.lucidimagination.com (mobile)

On Oct 20, 2009, at 4:22 AM, "Uwe Schindler" <uw...@thetaphi.de> wrote:

> I did not follow the whole thread, but I do not understand what's bad
> with the new API that justifies preserving the old one. The old API
> does not fit very well with the segment based search, and a lot of
> ugly stuff was done to make both APIs work the same.
>
> For me it is not very complicated to create a new-style Comparator.  
> The only difference is that you have to implement more methods for  
> the comparison, but if you e.g. take the provided comparators for  
> the basic data types as a base, it is easy to understand how it  
> works and you can modify the examples.
>
> And: as far as I know, the old API is not really segment wise, so  
> reopen() cost is much higher and FieldCache gets slower, because the  
> top level reader must be reloaded into cache not the segments.
>
> Uwe
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> From: Jake Mannix [mailto:jake.mannix@gmail.com]
> Sent: Tuesday, October 20, 2009 8:37 AM
> To: java-dev@lucene.apache.org
> Subject: Re: lucene 2.9 sorting algorithm
>
> Given that this new API is pretty unwieldy, and seems to not
> actually perform any better than the old one... are we going to  
> consider revisiting that?
>
>   -jake
>
> On Mon, Oct 19, 2009 at 11:27 PM, Uwe Schindler <uw...@thetaphi.de>  
> wrote:
> The old search API is already removed in trunk…
>
> Uwe
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> From: John Wang [mailto:john.wang@gmail.com]
> Sent: Tuesday, October 20, 2009 3:28 AM
> To: java-dev@lucene.apache.org
> Subject: Re: lucene 2.9 sorting algorithm
>
> Hi Michael:
>
>      Was wondering if you got a chance to take a look at this.
>
>      Since deprecated APIs are being removed in 3.0, I was wondering  
> if/when we would decide on keeping the ScoreDocComparator API and  
> thus would be kept for Lucene 3.0.
>
> Thanks
>
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless <lucene@mikemccandless.com 
> > wrote:
> Oh, no problem...
>
> Mike
>
> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com>  
> wrote:
> > Mike, just a clarification on my first perf report email.
> > The first section, numHits is incorrectly labeled, it should be 20  
> instead
> > of 50. Sorry about the possible confusion.
> > Thanks
> > -John
> >
> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Thanks John; I'll have a look.
> >>
> >> Mike
> >>
> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>  
> wrote:
> >> > Hi Michael:
> >> >     I added classes: ScoreDocComparatorQueue and  
> OneSortNoScoreCollector
> >> > as
> >> > a more general case. I think keeping the old api for  
> ScoreDocComparator
> >> > and
> >> > SortComparatorSource would work.
> >> >   Please take a look.
> >> > Thanks
> >> > -John
> >> >
> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang  
> <jo...@gmail.com> wrote:
> >> >>
> >> >> Hi Michael:
> >> >>      It is open, http://code.google.com/p/lucene-book/source/checkout
> >> >>      I think I sent the https url instead, sorry.
> >> >>     The multi PQ sorting is fairly self-contained, I have 2  
> versions, 1
> >> >> for string and 1 for int, each are Collector impls.
> >> >>      I shouldn't say the Multi Q is faster on int sort, it is  
> within
> >> >> the
> >> >> error boundary. The diff is very very small, I would stay they  
> are more
> >> >> equal.
> >> >>      If you think it is a good thing to go this way, (if not  
> for the
> >> >> perf,
> >> >> just for the simpler api) I'd be happy to work on a patch.
> >> >> Thanks
> >> >> -John
> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> >> <lu...@mikemccandless.com> wrote:
> >> >>>
> >> >>> John, looks like this requires login -- any plans to open  
> that up, or,
> >> >>> post the code on an issue?
> >> >>>
> >> >>> How self-contained is your Multi PQ sorting?  EG is it a  
> standalone
> >> >>> Collector impl that I can test?
> >> >>>
> >> >>> Mike
> >> >>>
> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang  
> <jo...@gmail.com>
> >> >>> wrote:
> >> >>> > BTW, we are have a little sandbox for these experiments.  
> And all my
> >> >>> > testcode
> >> >>> > are at. They are not very polished.
> >> >>> >
> >> >>> > https://lucene-book.googlecode.com/svn/trunk
> >> >>> >
> >> >>> > -John
> >> >>> >
> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <john.wang@gmail.com 
> >
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Numbers Mike requested for Int types:
> >> >>> >>
> >> >>> >> only the time/cputime are posted, others are all the same  
> since the
> >> >>> >> algorithm is the same.
> >> >>> >>
> >> >>> >> Lucene 2.9:
> >> >>> >> numhits: 10
> >> >>> >> time: 14619495
> >> >>> >> cpu: 146126
> >> >>> >>
> >> >>> >> numhits: 20
> >> >>> >> time: 14550568
> >> >>> >> cpu: 163242
> >> >>> >>
> >> >>> >> numhits: 100
> >> >>> >> time: 16467647
> >> >>> >> cpu: 178379
> >> >>> >>
> >> >>> >>
> >> >>> >> my test:
> >> >>> >> numHits: 10
> >> >>> >> time: 14101094
> >> >>> >> cpu: 144715
> >> >>> >>
> >> >>> >> numHits: 20
> >> >>> >> time: 14804821
> >> >>> >> cpu: 151305
> >> >>> >>
> >> >>> >> numHits: 100
> >> >>> >> time: 15372157
> >> >>> >> cpu time: 158842
> >> >>> >>
> >> >>> >> Conclusions:
> >> >>> >> The are very similar, the differences are all within error  
> bounds,
> >> >>> >> especially with lower PQ sizes, which second sort alg again
> >> >>> >> slightly
> >> >>> >> faster.
> >> >>> >>
> >> >>> >> Hope this helps.
> >> >>> >>
> >> >>> >> -John
> >> >>> >>
> >> >>> >>
> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >> >>> >> <yo...@lucidimagination.com>
> >> >>> >> wrote:
> >> >>> >>>
> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >> >>> >>> <lu...@mikemccandless.com> wrote:
> >> >>> >>> > Though it'd be odd if the switch to searching by segment
> >> >>> >>> > really was most of the gains here.
> >> >>> >>>
> >> >>> >>> I had assumed that much of the improvement was due to  
> ditching
> >> >>> >>> MultiTermEnum/MultiTermDocs.
> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that  
> only
> >> >>> >>> helps
> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
> >> >>> >>>
> >> >>> >>> -Yonik
> >> >>> >>> http://www.lucidimagination.com
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>  
> >> >>> >>>
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>>
> >> >>>  
> >> >>>
> >> >>
> >> >
> >> >
> >>
> >>  
> >>
> >
> >
>
>
>
>

RE: lucene 2.9 sorting algorithm

Posted by Uwe Schindler <uw...@thetaphi.de>.
I did not follow the whole thread, but I do not understand what's bad with
the new API that justifies preserving the old one. The old API does not fit
very well with the segment based search, and a lot of ugly stuff was done
to make both APIs work the same.

 

For me it is not very complicated to create a new-style Comparator. The only
difference is that you have to implement more methods for the comparison,
but if you e.g. take the provided comparators for the basic data types as a
base, it is easy to understand how it works and you can modify the examples.
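[The "more methods" Uwe mentions follow a slot-copy pattern: sort values are copied out of a per-segment array into numbered queue slots, and compares then run slot-to-slot or bottom-to-candidate. A standalone sketch of that shape — method names echo the FieldComparator discussed in this thread, but this class is only an illustration, not the actual Lucene signatures:]

```java
// Standalone illustration of the slot-based comparator shape: values are
// copied out of a per-segment array (standing in for a FieldCache) into
// numbered queue slots; compares then never touch the original docs again.
public class SlotComparatorSketch {
  private final int[] fieldValues; // pretend FieldCache array for one segment
  private final int[] slots;       // one copied value per priority-queue slot
  private int bottomSlot = -1;     // slot holding the queue's weakest entry

  public SlotComparatorSketch(int[] fieldValues, int numHits) {
    this.fieldValues = fieldValues;
    this.slots = new int[numHits];
  }

  /** Compare two hits already in the queue. */
  public int compare(int slot1, int slot2) {
    return Integer.compare(slots[slot1], slots[slot2]);
  }

  /** Record a hit's sort value when it enters the queue. */
  public void copy(int slot, int doc) {
    slots[slot] = fieldValues[doc];
  }

  /** The queue tells the comparator which slot is currently weakest. */
  public void setBottom(int slot) {
    bottomSlot = slot;
  }

  /** Cheap test of an incoming doc against the weakest queued hit. */
  public int compareBottom(int doc) {
    return Integer.compare(slots[bottomSlot], fieldValues[doc]);
  }
}
```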

 

And: as far as I know, the old API is not really segment-wise, so reopen()
cost is much higher and the FieldCache gets slower, because the top-level
reader must be reloaded into the cache rather than just the changed segments.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Jake Mannix [mailto:jake.mannix@gmail.com] 
Sent: Tuesday, October 20, 2009 8:37 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

 

Given that this new API is pretty unwieldy, and seems to not actually
perform any better than the old one... are we going to consider revisiting
that?

  -jake

On Mon, Oct 19, 2009 at 11:27 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

The old search API is already removed in trunk.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: John Wang [mailto:john.wang@gmail.com] 
Sent: Tuesday, October 20, 2009 3:28 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

 

Hi Michael:

 

     Was wondering if you got a chance to take a look at this.

 

     Since deprecated APIs are being removed in 3.0, I was wondering if/when
we would decide on keeping the ScoreDocComparator API and thus would be kept
for Lucene 3.0.

 

Thanks

 

-John

On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:

Oh, no problem...

Mike


On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
> Mike, just a clarification on my first perf report email.
> The first section, numHits is incorrectly labeled, it should be 20 instead
> of 50. Sorry about the possible confusion.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Thanks John; I'll have a look.
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com> wrote:
>> > Hi Michael:
>> >     I added classes: ScoreDocComparatorQueue and
OneSortNoScoreCollector
>> > as
>> > a more general case. I think keeping the old api for ScoreDocComparator
>> > and
>> > SortComparatorSource would work.
>> >   Please take a look.
>> > Thanks
>> > -John
>> >
>> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com> wrote:
>> >>
>> >> Hi Michael:
>> >>      It is open, http://code.google.com/p/lucene-book/source/checkout
>> >>      I think I sent the https url instead, sorry.
>> >>     The multi PQ sorting is fairly self-contained, I have 2 versions,
1
>> >> for string and 1 for int, each are Collector impls.
>> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> the
>> >> error boundary. The diff is very very small, I would say they are
more
>> >> equal.
>> >>      If you think it is a good thing to go this way, (if not for the
>> >> perf,
>> >> just for the simpler api) I'd be happy to work on a patch.
>> >> Thanks
>> >> -John
>> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >>>
>> >>> John, looks like this requires login -- any plans to open that up,
or,
>> >>> post the code on an issue?
>> >>>
>> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >>> Collector impl that I can test?
>> >>>
>> >>> Mike
>> >>>
>> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
>> >>> wrote:
>> >>> > BTW, we have a little sandbox for these experiments. And all my
>> >>> > testcode
>> >>> > are at. They are not very polished.
>> >>> >
>> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >>> >
>> >>> > -John
>> >>> >
>> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Numbers Mike requested for Int types:
>> >>> >>
>> >>> >> only the time/cputime are posted, others are all the same since
the
>> >>> >> algorithm is the same.
>> >>> >>
>> >>> >> Lucene 2.9:
>> >>> >> numhits: 10
>> >>> >> time: 14619495
>> >>> >> cpu: 146126
>> >>> >>
>> >>> >> numhits: 20
>> >>> >> time: 14550568
>> >>> >> cpu: 163242
>> >>> >>
>> >>> >> numhits: 100
>> >>> >> time: 16467647
>> >>> >> cpu: 178379
>> >>> >>
>> >>> >>
>> >>> >> my test:
>> >>> >> numHits: 10
>> >>> >> time: 14101094
>> >>> >> cpu: 144715
>> >>> >>
>> >>> >> numHits: 20
>> >>> >> time: 14804821
>> >>> >> cpu: 151305
>> >>> >>
>> >>> >> numHits: 100
>> >>> >> time: 15372157
>> >>> >> cpu time: 158842
>> >>> >>
>> >>> >> Conclusions:
>> >>> >> They are very similar, the differences are all within error bounds,
>> >>> >> especially with lower PQ sizes, which second sort alg again
>> >>> >> slightly
>> >>> >> faster.
>> >>> >>
>> >>> >> Hope this helps.
>> >>> >>
>> >>> >> -John
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >>> >> <yo...@lucidimagination.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >>> >>> <lu...@mikemccandless.com> wrote:
>> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >>> >>> > really was most of the gains here.
>> >>> >>>
>> >>> >>> I had assumed that much of the improvement was due to ditching
>> >>> >>> MultiTermEnum/MultiTermDocs.
>> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
>> >>> >>> helps
>> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >>> >>>
>> >>> >>> -Yonik
>> >>> >>> http://www.lucidimagination.com
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >
>> >
>>
>>
>
>


 

 


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
Given that this new API is pretty unwieldy, and seems to not actually
perform any better than the old one... are we going to consider revisiting
that?

  -jake

On Mon, Oct 19, 2009 at 11:27 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  The old search API is already removed in trunk…
>
>
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* John Wang [mailto:john.wang@gmail.com]
> *Sent:* Tuesday, October 20, 2009 3:28 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Re: lucene 2.9 sorting algorithm
>
>
>
> Hi Michael:
>
>
>
>      Was wondering if you got a chance to take a look at this.
>
>
>
>      Since deprecated APIs are being removed in 3.0, I was wondering
> if/when we would decide on keeping the ScoreDocComparator API and thus would
> be kept for Lucene 3.0.
>
>
>
> Thanks
>
>
>
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
> Oh, no problem...
>
> Mike
>
>
> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
> > Mike, just a clarification on my first perf report email.
> > The first section, numHits is incorrectly labeled, it should be 20
> instead
> > of 50. Sorry about the possible confusion.
> > Thanks
> > -John
> >
> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Thanks John; I'll have a look.
> >>
> >> Mike
> >>
> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>
> wrote:
> >> > Hi Michael:
> >> >     I added classes: ScoreDocComparatorQueue
> and OneSortNoScoreCollector
> >> > as
> >> > a more general case. I think keeping the old api for
> ScoreDocComparator
> >> > and
> >> > SortComparatorSource would work.
> >> >   Please take a look.
> >> > Thanks
> >> > -John
> >> >
> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com>
> wrote:
> >> >>
> >> >> Hi Michael:
> >> >>      It is open,
> http://code.google.com/p/lucene-book/source/checkout
> >> >>      I think I sent the https url instead, sorry.
> >> >>     The multi PQ sorting is fairly self-contained, I have 2 versions,
> 1
> >> >> for string and 1 for int, each are Collector impls.
> >> >>      I shouldn't say the Multi Q is faster on int sort, it is within
> >> >> the
> >> >> error boundary. The diff is very very small, I would say they are
> more
> >> >> equal.
> >> >>      If you think it is a good thing to go this way, (if not for the
> >> >> perf,
> >> >> just for the simpler api) I'd be happy to work on a patch.
> >> >> Thanks
> >> >> -John
> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> >> <lu...@mikemccandless.com> wrote:
> >> >>>
> >> >>> John, looks like this requires login -- any plans to open that up,
> or,
> >> >>> post the code on an issue?
> >> >>>
> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
> >> >>> Collector impl that I can test?
> >> >>>
> >> >>> Mike
> >> >>>
> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
> >> >>> wrote:
> >> >>> > BTW, we have a little sandbox for these experiments. And all
> my
> >> >>> > testcode
> >> >>> > are at. They are not very polished.
> >> >>> >
> >> >>> > https://lucene-book.googlecode.com/svn/trunk
> >> >>> >
> >> >>> > -John
> >> >>> >
> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Numbers Mike requested for Int types:
> >> >>> >>
> >> >>> >> only the time/cputime are posted, others are all the same since
> the
> >> >>> >> algorithm is the same.
> >> >>> >>
> >> >>> >> Lucene 2.9:
> >> >>> >> numhits: 10
> >> >>> >> time: 14619495
> >> >>> >> cpu: 146126
> >> >>> >>
> >> >>> >> numhits: 20
> >> >>> >> time: 14550568
> >> >>> >> cpu: 163242
> >> >>> >>
> >> >>> >> numhits: 100
> >> >>> >> time: 16467647
> >> >>> >> cpu: 178379
> >> >>> >>
> >> >>> >>
> >> >>> >> my test:
> >> >>> >> numHits: 10
> >> >>> >> time: 14101094
> >> >>> >> cpu: 144715
> >> >>> >>
> >> >>> >> numHits: 20
> >> >>> >> time: 14804821
> >> >>> >> cpu: 151305
> >> >>> >>
> >> >>> >> numHits: 100
> >> >>> >> time: 15372157
> >> >>> >> cpu time: 158842
> >> >>> >>
> >> >>> >> Conclusions:
> >> >>> >> They are very similar, the differences are all within error
> bounds,
> >> >>> >> especially with lower PQ sizes, which second sort alg again
> >> >>> >> slightly
> >> >>> >> faster.
> >> >>> >>
> >> >>> >> Hope this helps.
> >> >>> >>
> >> >>> >> -John
> >> >>> >>
> >> >>> >>
> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >> >>> >> <yo...@lucidimagination.com>
> >> >>> >> wrote:
> >> >>> >>>
> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >> >>> >>> <lu...@mikemccandless.com> wrote:
> >> >>> >>> > Though it'd be odd if the switch to searching by segment
> >> >>> >>> > really was most of the gains here.
> >> >>> >>>
> >> >>> >>> I had assumed that much of the improvement was due to ditching
> >> >>> >>> MultiTermEnum/MultiTermDocs.
> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
> >> >>> >>> helps
> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
> >> >>> >>>
> >> >>> >>> -Yonik
> >> >>> >>> http://www.lucidimagination.com
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >
> >> >
> >>
> >>
> >
> >
>
>
>
>

RE: lucene 2.9 sorting algorithm

Posted by Uwe Schindler <uw...@thetaphi.de>.
The old search API is already removed in trunk.

 

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: John Wang [mailto:john.wang@gmail.com] 
Sent: Tuesday, October 20, 2009 3:28 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

 

Hi Michael:

 

     Was wondering if you got a chance to take a look at this.

 

     Since deprecated APIs are being removed in 3.0, I was wondering if/when
we would decide on keeping the ScoreDocComparator API and thus would be kept
for Lucene 3.0.

 

Thanks

 

-John

On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:

Oh, no problem...

Mike


On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
> Mike, just a clarification on my first perf report email.
> The first section, numHits is incorrectly labeled, it should be 20 instead
> of 50. Sorry about the possible confusion.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Thanks John; I'll have a look.
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com> wrote:
>> > Hi Michael:
>> >     I added classes: ScoreDocComparatorQueue and
OneSortNoScoreCollector
>> > as
>> > a more general case. I think keeping the old api for ScoreDocComparator
>> > and
>> > SortComparatorSource would work.
>> >   Please take a look.
>> > Thanks
>> > -John
>> >
>> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com> wrote:
>> >>
>> >> Hi Michael:
>> >>      It is open, http://code.google.com/p/lucene-book/source/checkout
>> >>      I think I sent the https url instead, sorry.
>> >>     The multi PQ sorting is fairly self-contained, I have 2 versions,
1
>> >> for string and 1 for int, each are Collector impls.
>> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> the
>> >> error boundary. The diff is very very small, I would say they are
more
>> >> equal.
>> >>      If you think it is a good thing to go this way, (if not for the
>> >> perf,
>> >> just for the simpler api) I'd be happy to work on a patch.
>> >> Thanks
>> >> -John
>> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >>>
>> >>> John, looks like this requires login -- any plans to open that up,
or,
>> >>> post the code on an issue?
>> >>>
>> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >>> Collector impl that I can test?
>> >>>
>> >>> Mike
>> >>>
>> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
>> >>> wrote:
>> >>> > BTW, we have a little sandbox for these experiments. And all my
>> >>> > testcode
>> >>> > are at. They are not very polished.
>> >>> >
>> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >>> >
>> >>> > -John
>> >>> >
>> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Numbers Mike requested for Int types:
>> >>> >>
>> >>> >> only the time/cputime are posted, others are all the same since
the
>> >>> >> algorithm is the same.
>> >>> >>
>> >>> >> Lucene 2.9:
>> >>> >> numhits: 10
>> >>> >> time: 14619495
>> >>> >> cpu: 146126
>> >>> >>
>> >>> >> numhits: 20
>> >>> >> time: 14550568
>> >>> >> cpu: 163242
>> >>> >>
>> >>> >> numhits: 100
>> >>> >> time: 16467647
>> >>> >> cpu: 178379
>> >>> >>
>> >>> >>
>> >>> >> my test:
>> >>> >> numHits: 10
>> >>> >> time: 14101094
>> >>> >> cpu: 144715
>> >>> >>
>> >>> >> numHits: 20
>> >>> >> time: 14804821
>> >>> >> cpu: 151305
>> >>> >>
>> >>> >> numHits: 100
>> >>> >> time: 15372157
>> >>> >> cpu time: 158842
>> >>> >>
>> >>> >> Conclusions:
>> >>> >> They are very similar, the differences are all within error bounds,
>> >>> >> especially with lower PQ sizes, which second sort alg again
>> >>> >> slightly
>> >>> >> faster.
>> >>> >>
>> >>> >> Hope this helps.
>> >>> >>
>> >>> >> -John
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >>> >> <yo...@lucidimagination.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >>> >>> <lu...@mikemccandless.com> wrote:
>> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >>> >>> > really was most of the gains here.
>> >>> >>>
>> >>> >>> I had assumed that much of the improvement was due to ditching
>> >>> >>> MultiTermEnum/MultiTermDocs.
>> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
>> >>> >>> helps
>> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >>> >>>
>> >>> >>> -Yonik
>> >>> >>> http://www.lucidimagination.com
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >
>> >
>>
>>
>
>


 


Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
Sorry, I have been digging into it, just didn't get far enough to post
patch/results.  I'll try to do so today.

I did find one bug in OneSortNoScoreCollector, in the getTop() method
in the inner compare() method, to break ties it should be:

  if (v == 0) {
    v = o1.doc + o1.comparatorQueue._base - o2.doc - o2.comparatorQueue._base;
  }

not

  if (v==0) {
    v=o1.doc+o2.comparatorQueue._base-o2.doc+o2.comparatorQueue._base;
  }
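[The difference between the two versions only matters across segments: a tie on the sort value must fall back to the global docID, i.e. each side's segment-local doc plus its own segment base. A standalone check of that arithmetic — the class and method here are illustrative, mirroring the snippet above rather than the actual collector:]

```java
// Tie-break sketch: v is the primary comparison result; on a tie, order by
// global docID = segment-local doc + that segment's docBase. Adding o2's
// base on both sides (the buggy version above) breaks the ordering whenever
// the two hits come from different segments.
public class TieBreak {
  public static int compare(int v, int doc1, int base1, int doc2, int base2) {
    if (v == 0) {
      v = (doc1 + base1) - (doc2 + base2); // each doc paired with its own base
    }
    return v;
  }
}
```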

I've folded the "multi PQ" approach into contrib/benchmark so we can
run our tests...

Mike

On Mon, Oct 19, 2009 at 9:28 PM, John Wang <jo...@gmail.com> wrote:
> Hi Michael:
>      Was wondering if you got a chance to take a look at this.
>      Since deprecated APIs are being removed in 3.0, I was wondering if/when
> we would decide on keeping the ScoreDocComparator API and thus would be kept
> for Lucene 3.0.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Oh, no problem...
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
>> > Mike, just a clarification on my first perf report email.
>> > The first section, numHits is incorrectly labeled, it should be 20
>> > instead
>> > of 50. Sorry about the possible confusion.
>> > Thanks
>> > -John
>> >
>> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Thanks John; I'll have a look.
>> >>
>> >> Mike
>> >>
>> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>
>> >> wrote:
>> >> > Hi Michael:
>> >> >     I added classes: ScoreDocComparatorQueue
>> >> > and OneSortNoScoreCollector
>> >> > as
>> >> > a more general case. I think keeping the old api for
>> >> > ScoreDocComparator
>> >> > and
>> >> > SortComparatorSource would work.
>> >> >   Please take a look.
>> >> > Thanks
>> >> > -John
>> >> >
>> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi Michael:
>> >> >>      It is
>> >> >> open, http://code.google.com/p/lucene-book/source/checkout
>> >> >>      I think I sent the https url instead, sorry.
>> >> >>     The multi PQ sorting is fairly self-contained, I have 2
>> >> >> versions, 1
>> >> >> for string and 1 for int, each are Collector impls.
>> >> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> >> the
>> >> >> error boundary. The diff is very very small, I would say they are
>> >> >> more
>> >> >> equal.
>> >> >>      If you think it is a good thing to go this way, (if not for the
>> >> >> perf,
>> >> >> just for the simpler api) I'd be happy to work on a patch.
>> >> >> Thanks
>> >> >> -John
>> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> >> <lu...@mikemccandless.com> wrote:
>> >> >>>
>> >> >>> John, looks like this requires login -- any plans to open that up,
>> >> >>> or,
>> >> >>> post the code on an issue?
>> >> >>>
>> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >> >>> Collector impl that I can test?
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
>> >> >>> wrote:
>> >> >>> > BTW, we have a little sandbox for these experiments. And all
>> >> >>> > my
>> >> >>> > testcode
>> >> >>> > are at. They are not very polished.
>> >> >>> >
>> >> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >> >>> >
>> >> >>> > -John
>> >> >>> >
>> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Numbers Mike requested for Int types:
>> >> >>> >>
>> >> >>> >> only the time/cputime are posted, others are all the same since
>> >> >>> >> the
>> >> >>> >> algorithm is the same.
>> >> >>> >>
>> >> >>> >> Lucene 2.9:
>> >> >>> >> numhits: 10
>> >> >>> >> time: 14619495
>> >> >>> >> cpu: 146126
>> >> >>> >>
>> >> >>> >> numhits: 20
>> >> >>> >> time: 14550568
>> >> >>> >> cpu: 163242
>> >> >>> >>
>> >> >>> >> numhits: 100
>> >> >>> >> time: 16467647
>> >> >>> >> cpu: 178379
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> my test:
>> >> >>> >> numHits: 10
>> >> >>> >> time: 14101094
>> >> >>> >> cpu: 144715
>> >> >>> >>
>> >> >>> >> numHits: 20
>> >> >>> >> time: 14804821
>> >> >>> >> cpu: 151305
>> >> >>> >>
>> >> >>> >> numHits: 100
>> >> >>> >> time: 15372157
>> >> >>> >> cpu time: 158842
>> >> >>> >>
>> >> >>> >> Conclusions:
>> >> >>> >> They are very similar, the differences are all within error
>> >> >>> >> bounds,
>> >> >>> >> especially with lower PQ sizes, which second sort alg again
>> >> >>> >> slightly
>> >> >>> >> faster.
>> >> >>> >>
>> >> >>> >> Hope this helps.
>> >> >>> >>
>> >> >>> >> -John
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >> >>> >> <yo...@lucidimagination.com>
>> >> >>> >> wrote:
>> >> >>> >>>
>> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >> >>> >>> <lu...@mikemccandless.com> wrote:
>> >> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >> >>> >>> > really was most of the gains here.
>> >> >>> >>>
>> >> >>> >>> I had assumed that much of the improvement was due to ditching
>> >> >>> >>> MultiTermEnum/MultiTermDocs.
>> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
>> >> >>> >>> helps
>> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >> >>> >>>
>> >> >>> >>> -Yonik
>> >> >>> >>> http://www.lucidimagination.com
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> ---------------------------------------------------------------------
>> >> >>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >>> >>> For additional commands, e-mail:
>> >> >>> >>> java-dev-help@lucene.apache.org
>> >> >>> >>>
>> >> >>> >>
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>> ---------------------------------------------------------------------
>> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK, thanks.

I can help out if you've got questions on the python code... it's
rather straightforward: it just iterates over each set of params to
test, writes an alg file, runs it, opens the resulting output & parses
it for the best run, confirms both single & multi PQ gave precisely
the same doc IDs, and prints the results.
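A driver along those lines might look like the following sketch. This is purely illustrative, not the actual script: `best_run` and the `rec/s=... docIDs=...` output format are made-up stand-ins for however contrib/benchmark reports its runs, and actually writing the alg file and invoking the benchmark is elided.

```python
def best_run(output_lines):
    """Pick the fastest run from benchmark output lines of the
    (illustrative) form 'rec/s=<rate> docIDs=<id1>,<id2>,...'."""
    best = None
    for line in output_lines:
        fields = dict(f.split("=", 1) for f in line.split())
        rate = float(fields["rec/s"])
        ids = [int(d) for d in fields["docIDs"].split(",")]
        if best is None or rate > best[0]:
            best = (rate, ids)
    return best

def compare(single_pq_output, multi_pq_output):
    """Confirm single-PQ and multi-PQ returned identical doc IDs,
    then report multi's throughput relative to single's."""
    single_rate, single_ids = best_run(single_pq_output)
    multi_rate, multi_ids = best_run(multi_pq_output)
    assert single_ids == multi_ids, "hit lists differ!"
    return multi_rate / single_rate
```

The key step is the doc-ID assertion: a speed difference only means something if both collectors produced exactly the same hits in the same order.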

It's remotely possible the difference in the results is a bug/overhead
in contrib/benchmark itself, which'd be good to get to the bottom of
anyway.

Mike

On Tue, Oct 20, 2009 at 9:17 PM, John Wang <jo...@gmail.com> wrote:
> Hi Mike:
>     That's weird. Let me take a look at the patch. Need to brush up on
> python though :)
> Thanks
> -John
>
> On Tue, Oct 20, 2009 at 10:25 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> OK I posted a patch that folds the MultiPQ approach into
>> contrib/benchmark, plus a simple python wrapper to run old/new tests
>> across different queries, sort, topN, etc.
>>
>> But I got different results... MultiPQ looks generally slower than
>> SinglePQ.  So I think we now need to reconcile what's different
>> between our tests.
>>
>> Mike
>>
>> On Mon, Oct 19, 2009 at 9:28 PM, John Wang <jo...@gmail.com> wrote:
>> > Hi Michael:
>> >      Was wondering if you got a chance to take a look at this.
>> >      Since deprecated APIs are being removed in 3.0, I was wondering
>> > if/when
>> > we would decide on keeping the ScoreDocComparator API and thus would be
>> > kept
>> > for Lucene 3.0.
>> > Thanks
>> > -John
>> >
>> > On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Oh, no problem...
>> >>
>> >> Mike
>> >>
>> >> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com>
>> >> wrote:
>> >> > Mike, just a clarification on my first perf report email.
>> >> > The first section, numHits is incorrectly labeled, it should be 20
>> >> > instead
>> >> > of 50. Sorry about the possible confusion.
>> >> > Thanks
>> >> > -John
>> >> >
>> >> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>> >> > <lu...@mikemccandless.com> wrote:
>> >> >>
>> >> >> Thanks John; I'll have a look.
>> >> >>
>> >> >> Mike
>> >> >>
>> >> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi Michael:
>> >> >> >     I added classes: ScoreDocComparatorQueue
>> >> >> > and OneSortNoScoreCollector
>> >> >> > as
>> >> >> > a more general case. I think keeping the old api for
>> >> >> > ScoreDocComparator
>> >> >> > and
>> >> >> > SortComparatorSource would work.
>> >> >> >   Please take a look.
>> >> >> > Thanks
>> >> >> > -John
>> >> >> >
>> >> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Hi Michael:
>> >> >> >>      It is
>> >> >> >> open, http://code.google.com/p/lucene-book/source/checkout
>> >> >> >>      I think I sent the https url instead, sorry.
>> >> >> >>     The multi PQ sorting is fairly self-contained, I have 2
>> >> >> >> versions, 1
>> >> >> >> for string and 1 for int, each are Collector impls.
>> >> >> >>      I shouldn't say the Multi Q is faster on int sort, it is
>> >> >> >> within
>> >> >> >> the
>> >> >> >> error boundary. The diff is very very small, I would stay they
>> >> >> >> are
>> >> >> >> more
>> >> >> >> equal.
>> >> >> >>      If you think it is a good thing to go this way, (if not for
>> >> >> >> the
>> >> >> >> perf,
>> >> >> >> just for the simpler api) I'd be happy to work on a patch.
>> >> >> >> Thanks
>> >> >> >> -John
>> >> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> >> >> <lu...@mikemccandless.com> wrote:
>> >> >> >>>
>> >> >> >>> John, looks like this requires login -- any plans to open that
>> >> >> >>> up,
>> >> >> >>> or,
>> >> >> >>> post the code on an issue?
>> >> >> >>>
>> >> >> >>> How self-contained is your Multi PQ sorting?  EG is it a
>> >> >> >>> standalone
>> >> >> >>> Collector impl that I can test?
>> >> >> >>>
>> >> >> >>> Mike
>> >> >> >>>
>> >> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> > BTW, we are have a little sandbox for these experiments. And
>> >> >> >>> > all
>> >> >> >>> > my
>> >> >> >>> > testcode
>> >> >> >>> > are at. They are not very polished.
>> >> >> >>> >
>> >> >> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >> >> >>> >
>> >> >> >>> > -John
>> >> >> >>> >
>> >> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang
>> >> >> >>> > <jo...@gmail.com>
>> >> >> >>> > wrote:
>> >> >> >>> >>
>> >> >> >>> >> Numbers Mike requested for Int types:
>> >> >> >>> >>
>> >> >> >>> >> only the time/cputime are posted, others are all the same
>> >> >> >>> >> since
>> >> >> >>> >> the
>> >> >> >>> >> algorithm is the same.
>> >> >> >>> >>
>> >> >> >>> >> Lucene 2.9:
>> >> >> >>> >> numhits: 10
>> >> >> >>> >> time: 14619495
>> >> >> >>> >> cpu: 146126
>> >> >> >>> >>
>> >> >> >>> >> numhits: 20
>> >> >> >>> >> time: 14550568
>> >> >> >>> >> cpu: 163242
>> >> >> >>> >>
>> >> >> >>> >> numhits: 100
>> >> >> >>> >> time: 16467647
>> >> >> >>> >> cpu: 178379
>> >> >> >>> >>
>> >> >> >>> >>
>> >> >> >>> >> my test:
>> >> >> >>> >> numHits: 10
>> >> >> >>> >> time: 14101094
>> >> >> >>> >> cpu: 144715
>> >> >> >>> >>
>> >> >> >>> >> numHits: 20
>> >> >> >>> >> time: 14804821
>> >> >> >>> >> cpu: 151305
>> >> >> >>> >>
>> >> >> >>> >> numHits: 100
>> >> >> >>> >> time: 15372157
>> >> >> >>> >> cpu time: 158842
>> >> >> >>> >>
>> >> >> >>> >> Conclusions:
>> >> >> >>> >> The are very similar, the differences are all within error
>> >> >> >>> >> bounds,
>> >> >> >>> >> especially with lower PQ sizes, which second sort alg again
>> >> >> >>> >> slightly
>> >> >> >>> >> faster.
>> >> >> >>> >>
>> >> >> >>> >> Hope this helps.
>> >> >> >>> >>
>> >> >> >>> >> -John
>> >> >> >>> >>
>> >> >> >>> >>
>> >> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >> >> >>> >> <yo...@lucidimagination.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >>>
>> >> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >> >> >>> >>> <lu...@mikemccandless.com> wrote:
>> >> >> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >> >> >>> >>> > really was most of the gains here.
>> >> >> >>> >>>
>> >> >> >>> >>> I had assumed that much of the improvement was due to
>> >> >> >>> >>> ditching
>> >> >> >>> >>> MultiTermEnum/MultiTermDocs.
>> >> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that
>> >> >> >>> >>> only
>> >> >> >>> >>> helps
>> >> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >> >> >>> >>>
>> >> >> >>> >>> -Yonik
>> >> >> >>> >>> http://www.lucidimagination.com
>> >> >> >>> >>>
>> >> >> >>> >>>
>> >> >> >>> >>>
>> >> >> >>> >>>
>> >> >> >>> >>> ---------------------------------------------------------------------
>> >> >> >>> >>> To unsubscribe, e-mail:
>> >> >> >>> >>> java-dev-unsubscribe@lucene.apache.org
>> >> >> >>> >>> For additional commands, e-mail:
>> >> >> >>> >>> java-dev-help@lucene.apache.org
>> >> >> >>> >>>
>> >> >> >>> >>
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> ---------------------------------------------------------------------
>> >> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Mike:
    That's weird. Let me take a look at the patch. Need to brush up on
python though :)
Thanks
-John

On Tue, Oct 20, 2009 at 10:25 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> OK I posted a patch that folds the MultiPQ approach into
> contrib/benchmark, plus a simple python wrapper to run old/new tests
> across different queries, sort, topN, etc.
>
> But I got different results... MultiPQ looks generally slower than
> SinglePQ.  So I think we now need to reconcile what's different
> between our tests.
>
> Mike
>
> On Mon, Oct 19, 2009 at 9:28 PM, John Wang <jo...@gmail.com> wrote:
> > Hi Michael:
> >      Was wondering if you got a chance to take a look at this.
> >      Since deprecated APIs are being removed in 3.0, I was wondering
> if/when
> > we would decide on keeping the ScoreDocComparator API and thus would be
> kept
> > for Lucene 3.0.
> > Thanks
> > -John
> >
> > On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Oh, no problem...
> >>
> >> Mike
> >>
> >> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com>
> wrote:
> >> > Mike, just a clarification on my first perf report email.
> >> > The first section, numHits is incorrectly labeled, it should be 20
> >> > instead
> >> > of 50. Sorry about the possible confusion.
> >> > Thanks
> >> > -John
> >> >
> >> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> >> > <lu...@mikemccandless.com> wrote:
> >> >>
> >> >> Thanks John; I'll have a look.
> >> >>
> >> >> Mike
> >> >>
> >> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>
> >> >> wrote:
> >> >> > Hi Michael:
> >> >> >     I added classes: ScoreDocComparatorQueue
> >> >> > and OneSortNoScoreCollector
> >> >> > as
> >> >> > a more general case. I think keeping the old api for
> >> >> > ScoreDocComparator
> >> >> > and
> >> >> > SortComparatorSource would work.
> >> >> >   Please take a look.
> >> >> > Thanks
> >> >> > -John
> >> >> >
> >> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi Michael:
> >> >> >>      It is
> >> >> >> open, http://code.google.com/p/lucene-book/source/checkout
> >> >> >>      I think I sent the https url instead, sorry.
> >> >> >>     The multi PQ sorting is fairly self-contained, I have 2
> >> >> >> versions, 1
> >> >> >> for string and 1 for int, each are Collector impls.
> >> >> >>      I shouldn't say the Multi Q is faster on int sort, it is
> within
> >> >> >> the
> >> >> >> error boundary. The diff is very very small, I would stay they are
> >> >> >> more
> >> >> >> equal.
> >> >> >>      If you think it is a good thing to go this way, (if not for
> the
> >> >> >> perf,
> >> >> >> just for the simpler api) I'd be happy to work on a patch.
> >> >> >> Thanks
> >> >> >> -John
> >> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> >> >> <lu...@mikemccandless.com> wrote:
> >> >> >>>
> >> >> >>> John, looks like this requires login -- any plans to open that
> up,
> >> >> >>> or,
> >> >> >>> post the code on an issue?
> >> >> >>>
> >> >> >>> How self-contained is your Multi PQ sorting?  EG is it a
> standalone
> >> >> >>> Collector impl that I can test?
> >> >> >>>
> >> >> >>> Mike
> >> >> >>>
> >> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
> >> >> >>> wrote:
> >> >> >>> > BTW, we are have a little sandbox for these experiments. And
> all
> >> >> >>> > my
> >> >> >>> > testcode
> >> >> >>> > are at. They are not very polished.
> >> >> >>> >
> >> >> >>> > https://lucene-book.googlecode.com/svn/trunk
> >> >> >>> >
> >> >> >>> > -John
> >> >> >>> >
> >> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <
> john.wang@gmail.com>
> >> >> >>> > wrote:
> >> >> >>> >>
> >> >> >>> >> Numbers Mike requested for Int types:
> >> >> >>> >>
> >> >> >>> >> only the time/cputime are posted, others are all the same
> since
> >> >> >>> >> the
> >> >> >>> >> algorithm is the same.
> >> >> >>> >>
> >> >> >>> >> Lucene 2.9:
> >> >> >>> >> numhits: 10
> >> >> >>> >> time: 14619495
> >> >> >>> >> cpu: 146126
> >> >> >>> >>
> >> >> >>> >> numhits: 20
> >> >> >>> >> time: 14550568
> >> >> >>> >> cpu: 163242
> >> >> >>> >>
> >> >> >>> >> numhits: 100
> >> >> >>> >> time: 16467647
> >> >> >>> >> cpu: 178379
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> my test:
> >> >> >>> >> numHits: 10
> >> >> >>> >> time: 14101094
> >> >> >>> >> cpu: 144715
> >> >> >>> >>
> >> >> >>> >> numHits: 20
> >> >> >>> >> time: 14804821
> >> >> >>> >> cpu: 151305
> >> >> >>> >>
> >> >> >>> >> numHits: 100
> >> >> >>> >> time: 15372157
> >> >> >>> >> cpu time: 158842
> >> >> >>> >>
> >> >> >>> >> Conclusions:
> >> >> >>> >> The are very similar, the differences are all within error
> >> >> >>> >> bounds,
> >> >> >>> >> especially with lower PQ sizes, which second sort alg again
> >> >> >>> >> slightly
> >> >> >>> >> faster.
> >> >> >>> >>
> >> >> >>> >> Hope this helps.
> >> >> >>> >>
> >> >> >>> >> -John
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >> >> >>> >> <yo...@lucidimagination.com>
> >> >> >>> >> wrote:
> >> >> >>> >>>
> >> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >> >> >>> >>> <lu...@mikemccandless.com> wrote:
> >> >> >>> >>> > Though it'd be odd if the switch to searching by segment
> >> >> >>> >>> > really was most of the gains here.
> >> >> >>> >>>
> >> >> >>> >>> I had assumed that much of the improvement was due to
> ditching
> >> >> >>> >>> MultiTermEnum/MultiTermDocs.
> >> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
> >> >> >>> >>> helps
> >> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
> >> >> >>> >>>
> >> >> >>> >>> -Yonik
> >> >> >>> >>> http://www.lucidimagination.com
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> ---------------------------------------------------------------------
> >> >> >>> >>> To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >>> >>> For additional commands, e-mail:
> >> >> >>> >>> java-dev-help@lucene.apache.org
> >> >> >>> >>>
> >> >> >>> >>
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> >> >>>
> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >>>
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I posted a patch that folds the MultiPQ approach into
contrib/benchmark, plus a simple python wrapper to run old/new tests
across different queries, sort, topN, etc.

But I got different results... MultiPQ looks generally slower than
SinglePQ.  So I think we now need to reconcile what's different
between our tests.
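For reference, the idea under test, one queue per segment, cheap int (ord) compares within a segment, and value compares deferred to a final merge across segments, can be modeled in a few lines. This is a simplified pure-python sketch of the algorithm, not the patch itself; the `(ords, values)` segment representation is assumed for illustration.

```python
import heapq

def multi_pq_top_n(segments, n):
    """Model of the multi-PQ sort. segments is a list of (ords, values)
    pairs, where values is that segment's sorted term array and ords[doc]
    indexes into it. Returns the n smallest (value, global_doc) pairs in
    ascending order."""
    per_segment = []
    base = 0
    for ords, values in segments:
        pq = []  # heap of (-ord, global_doc): root holds the largest kept ord
        for doc, o in enumerate(ords):
            if len(pq) < n:
                heapq.heappush(pq, (-o, base + doc))
            elif -o > pq[0][0]:          # o beats the worst kept ord
                heapq.heapreplace(pq, (-o, base + doc))
        # the ord -> value lookup happens only for the n survivors
        per_segment.append(sorted((values[-o], d) for o, d in pq))
        base += len(ords)
    # the only cross-segment (value) comparisons: merging N sorted lists
    return list(heapq.merge(*per_segment))[:n]
```

Collection does only integer compares and never consults the value->ord mapping of another segment; the string compares are confined to the final merge of at most N*n entries.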

Mike

On Mon, Oct 19, 2009 at 9:28 PM, John Wang <jo...@gmail.com> wrote:
> Hi Michael:
>      Was wondering if you got a chance to take a look at this.
>      Since deprecated APIs are being removed in 3.0, I was wondering if/when
> we would decide on keeping the ScoreDocComparator API and thus would be kept
> for Lucene 3.0.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Oh, no problem...
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
>> > Mike, just a clarification on my first perf report email.
>> > The first section, numHits is incorrectly labeled, it should be 20
>> > instead
>> > of 50. Sorry about the possible confusion.
>> > Thanks
>> > -John
>> >
>> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> Thanks John; I'll have a look.
>> >>
>> >> Mike
>> >>
>> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>
>> >> wrote:
>> >> > Hi Michael:
>> >> >     I added classes: ScoreDocComparatorQueue
>> >> > and OneSortNoScoreCollector
>> >> > as
>> >> > a more general case. I think keeping the old api for
>> >> > ScoreDocComparator
>> >> > and
>> >> > SortComparatorSource would work.
>> >> >   Please take a look.
>> >> > Thanks
>> >> > -John
>> >> >
>> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi Michael:
>> >> >>      It is
>> >> >> open, http://code.google.com/p/lucene-book/source/checkout
>> >> >>      I think I sent the https url instead, sorry.
>> >> >>     The multi PQ sorting is fairly self-contained, I have 2
>> >> >> versions, 1
>> >> >> for string and 1 for int, each are Collector impls.
>> >> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> >> the
>> >> >> error boundary. The diff is very very small, I would stay they are
>> >> >> more
>> >> >> equal.
>> >> >>      If you think it is a good thing to go this way, (if not for the
>> >> >> perf,
>> >> >> just for the simpler api) I'd be happy to work on a patch.
>> >> >> Thanks
>> >> >> -John
>> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> >> <lu...@mikemccandless.com> wrote:
>> >> >>>
>> >> >>> John, looks like this requires login -- any plans to open that up,
>> >> >>> or,
>> >> >>> post the code on an issue?
>> >> >>>
>> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >> >>> Collector impl that I can test?
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
>> >> >>> wrote:
>> >> >>> > BTW, we are have a little sandbox for these experiments. And all
>> >> >>> > my
>> >> >>> > testcode
>> >> >>> > are at. They are not very polished.
>> >> >>> >
>> >> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >> >>> >
>> >> >>> > -John
>> >> >>> >
>> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Numbers Mike requested for Int types:
>> >> >>> >>
>> >> >>> >> only the time/cputime are posted, others are all the same since
>> >> >>> >> the
>> >> >>> >> algorithm is the same.
>> >> >>> >>
>> >> >>> >> Lucene 2.9:
>> >> >>> >> numhits: 10
>> >> >>> >> time: 14619495
>> >> >>> >> cpu: 146126
>> >> >>> >>
>> >> >>> >> numhits: 20
>> >> >>> >> time: 14550568
>> >> >>> >> cpu: 163242
>> >> >>> >>
>> >> >>> >> numhits: 100
>> >> >>> >> time: 16467647
>> >> >>> >> cpu: 178379
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> my test:
>> >> >>> >> numHits: 10
>> >> >>> >> time: 14101094
>> >> >>> >> cpu: 144715
>> >> >>> >>
>> >> >>> >> numHits: 20
>> >> >>> >> time: 14804821
>> >> >>> >> cpu: 151305
>> >> >>> >>
>> >> >>> >> numHits: 100
>> >> >>> >> time: 15372157
>> >> >>> >> cpu time: 158842
>> >> >>> >>
>> >> >>> >> Conclusions:
>> >> >>> >> The are very similar, the differences are all within error
>> >> >>> >> bounds,
>> >> >>> >> especially with lower PQ sizes, which second sort alg again
>> >> >>> >> slightly
>> >> >>> >> faster.
>> >> >>> >>
>> >> >>> >> Hope this helps.
>> >> >>> >>
>> >> >>> >> -John
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >> >>> >> <yo...@lucidimagination.com>
>> >> >>> >> wrote:
>> >> >>> >>>
>> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >> >>> >>> <lu...@mikemccandless.com> wrote:
>> >> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >> >>> >>> > really was most of the gains here.
>> >> >>> >>>
>> >> >>> >>> I had assumed that much of the improvement was due to ditching
>> >> >>> >>> MultiTermEnum/MultiTermDocs.
>> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
>> >> >>> >>> helps
>> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >> >>> >>>
>> >> >>> >>> -Yonik
>> >> >>> >>> http://www.lucidimagination.com
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> ---------------------------------------------------------------------
>> >> >>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >>> >>> For additional commands, e-mail:
>> >> >>> >>> java-dev-help@lucene.apache.org
>> >> >>> >>>
>> >> >>> >>
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>> ---------------------------------------------------------------------
>> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Michael:
     Was wondering if you got a chance to take a look at this.

     Since deprecated APIs are being removed in 3.0, I was wondering if/when
we would decide on keeping the ScoreDocComparator API, and thus whether it
would be kept for Lucene 3.0.

Thanks

-John

On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Oh, no problem...
>
> Mike
>
> On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
> > Mike, just a clarification on my first perf report email.
> > The first section, numHits is incorrectly labeled, it should be 20
> instead
> > of 50. Sorry about the possible confusion.
> > Thanks
> > -John
> >
> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Thanks John; I'll have a look.
> >>
> >> Mike
> >>
> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com>
> wrote:
> >> > Hi Michael:
> >> >     I added classes: ScoreDocComparatorQueue
> and OneSortNoScoreCollector
> >> > as
> >> > a more general case. I think keeping the old api for
> ScoreDocComparator
> >> > and
> >> > SortComparatorSource would work.
> >> >   Please take a look.
> >> > Thanks
> >> > -John
> >> >
> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com>
> wrote:
> >> >>
> >> >> Hi Michael:
> >> >>      It is open,
> http://code.google.com/p/lucene-book/source/checkout
> >> >>      I think I sent the https url instead, sorry.
> >> >>     The multi PQ sorting is fairly self-contained, I have 2 versions,
> 1
> >> >> for string and 1 for int, each are Collector impls.
> >> >>      I shouldn't say the Multi Q is faster on int sort, it is within
> >> >> the
> >> >> error boundary. The diff is very very small, I would stay they are
> more
> >> >> equal.
> >> >>      If you think it is a good thing to go this way, (if not for the
> >> >> perf,
> >> >> just for the simpler api) I'd be happy to work on a patch.
> >> >> Thanks
> >> >> -John
> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> >> <lu...@mikemccandless.com> wrote:
> >> >>>
> >> >>> John, looks like this requires login -- any plans to open that up,
> or,
> >> >>> post the code on an issue?
> >> >>>
> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
> >> >>> Collector impl that I can test?
> >> >>>
> >> >>> Mike
> >> >>>
> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
> >> >>> wrote:
> >> >>> > BTW, we are have a little sandbox for these experiments. And all
> my
> >> >>> > testcode
> >> >>> > are at. They are not very polished.
> >> >>> >
> >> >>> > https://lucene-book.googlecode.com/svn/trunk
> >> >>> >
> >> >>> > -John
> >> >>> >
> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Numbers Mike requested for Int types:
> >> >>> >>
> >> >>> >> only the time/cputime are posted, others are all the same since
> the
> >> >>> >> algorithm is the same.
> >> >>> >>
> >> >>> >> Lucene 2.9:
> >> >>> >> numhits: 10
> >> >>> >> time: 14619495
> >> >>> >> cpu: 146126
> >> >>> >>
> >> >>> >> numhits: 20
> >> >>> >> time: 14550568
> >> >>> >> cpu: 163242
> >> >>> >>
> >> >>> >> numhits: 100
> >> >>> >> time: 16467647
> >> >>> >> cpu: 178379
> >> >>> >>
> >> >>> >>
> >> >>> >> my test:
> >> >>> >> numHits: 10
> >> >>> >> time: 14101094
> >> >>> >> cpu: 144715
> >> >>> >>
> >> >>> >> numHits: 20
> >> >>> >> time: 14804821
> >> >>> >> cpu: 151305
> >> >>> >>
> >> >>> >> numHits: 100
> >> >>> >> time: 15372157
> >> >>> >> cpu time: 158842
> >> >>> >>
> >> >>> >> Conclusions:
> >> >>> >> The are very similar, the differences are all within error
> bounds,
> >> >>> >> especially with lower PQ sizes, which second sort alg again
> >> >>> >> slightly
> >> >>> >> faster.
> >> >>> >>
> >> >>> >> Hope this helps.
> >> >>> >>
> >> >>> >> -John
> >> >>> >>
> >> >>> >>
> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >> >>> >> <yo...@lucidimagination.com>
> >> >>> >> wrote:
> >> >>> >>>
> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >> >>> >>> <lu...@mikemccandless.com> wrote:
> >> >>> >>> > Though it'd be odd if the switch to searching by segment
> >> >>> >>> > really was most of the gains here.
> >> >>> >>>
> >> >>> >>> I had assumed that much of the improvement was due to ditching
> >> >>> >>> MultiTermEnum/MultiTermDocs.
> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
> >> >>> >>> helps
> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
> >> >>> >>>
> >> >>> >>> -Yonik
> >> >>> >>> http://www.lucidimagination.com
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> ---------------------------------------------------------------------
> >> >>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >>> >>> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >>> >>>
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>>
> >> >>>
> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >>>
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
Oh, no problem...

Mike

On Fri, Oct 16, 2009 at 12:33 PM, John Wang <jo...@gmail.com> wrote:
> Mike, just a clarification on my first perf report email.
> The first section, numHits is incorrectly labeled, it should be 20 instead
> of 50. Sorry about the possible confusion.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Thanks John; I'll have a look.
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com> wrote:
>> > Hi Michael:
>> >     I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector
>> > as
>> > a more general case. I think keeping the old api for ScoreDocComparator
>> > and
>> > SortComparatorSource would work.
>> >   Please take a look.
>> > Thanks
>> > -John
>> >
>> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com> wrote:
>> >>
>> >> Hi Michael:
>> >>      It is open, http://code.google.com/p/lucene-book/source/checkout
>> >>      I think I sent the https url instead, sorry.
>> >>     The multi PQ sorting is fairly self-contained, I have 2 versions, 1
>> >> for string and 1 for int, each are Collector impls.
>> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> the
>> >> error boundary. The diff is very very small, I would stay they are more
>> >> equal.
>> >>      If you think it is a good thing to go this way, (if not for the
>> >> perf,
>> >> just for the simpler api) I'd be happy to work on a patch.
>> >> Thanks
>> >> -John
>> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> <lu...@mikemccandless.com> wrote:
>> >>>
>> >>> John, looks like this requires login -- any plans to open that up, or,
>> >>> post the code on an issue?
>> >>>
>> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >>> Collector impl that I can test?
>> >>>
>> >>> Mike
>> >>>
>> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
>> >>> wrote:
>> >>> > BTW, we are have a little sandbox for these experiments. And all my
>> >>> > testcode
>> >>> > are at. They are not very polished.
>> >>> >
>> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >>> >
>> >>> > -John
>> >>> >
>> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Numbers Mike requested for Int types:
>> >>> >>
>> >>> >> only the time/cputime are posted, others are all the same since the
>> >>> >> algorithm is the same.
>> >>> >>
>> >>> >> Lucene 2.9:
>> >>> >> numhits: 10
>> >>> >> time: 14619495
>> >>> >> cpu: 146126
>> >>> >>
>> >>> >> numhits: 20
>> >>> >> time: 14550568
>> >>> >> cpu: 163242
>> >>> >>
>> >>> >> numhits: 100
>> >>> >> time: 16467647
>> >>> >> cpu: 178379
>> >>> >>
>> >>> >>
>> >>> >> my test:
>> >>> >> numHits: 10
>> >>> >> time: 14101094
>> >>> >> cpu: 144715
>> >>> >>
>> >>> >> numHits: 20
>> >>> >> time: 14804821
>> >>> >> cpu: 151305
>> >>> >>
>> >>> >> numHits: 100
>> >>> >> time: 15372157
>> >>> >> cpu time: 158842
>> >>> >>
>> >>> >> Conclusions:
>> >>> >> The are very similar, the differences are all within error bounds,
>> >>> >> especially with lower PQ sizes, which second sort alg again
>> >>> >> slightly
>> >>> >> faster.
>> >>> >>
>> >>> >> Hope this helps.
>> >>> >>
>> >>> >> -John
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >>> >> <yo...@lucidimagination.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >>> >>> <lu...@mikemccandless.com> wrote:
>> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >>> >>> > really was most of the gains here.
>> >>> >>>
>> >>> >>> I had assumed that much of the improvement was due to ditching
>> >>> >>> MultiTermEnum/MultiTermDocs.
>> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
>> >>> >>> helps
>> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >>> >>>
>> >>> >>> -Yonik
>> >>> >>> http://www.lucidimagination.com
>> >>> >>>
>> >>> >>>
>> >>> >>> ---------------------------------------------------------------------
>> >>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>> >>>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>>
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Mike, just a clarification on my first perf report email: the numHits in the
first section is mislabeled; it should be 20 instead of 50. Sorry about the
possible confusion.

Thanks

-John

On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Thanks John; I'll have a look.
>
> Mike
>
> On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com> wrote:
> > Hi Michael:
> >     I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector
> as
> > a more general case. I think keeping the old api for ScoreDocComparator
> and
> > SortComparatorSource would work.
> >   Please take a look.
> > Thanks
> > -John
> >
> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com> wrote:
> >>
> >> Hi Michael:
> >>      It is open, http://code.google.com/p/lucene-book/source/checkout
> >>      I think I sent the https url instead, sorry.
> >>     The multi PQ sorting is fairly self-contained, I have 2 versions, 1
> >> for string and 1 for int, each are Collector impls.
> >>      I shouldn't say the Multi Q is faster on int sort, it is within the
> >> error boundary. The diff is very very small, I would stay they are more
> >> equal.
> >>      If you think it is a good thing to go this way, (if not for the
> perf,
> >> just for the simpler api) I'd be happy to work on a patch.
> >> Thanks
> >> -John
> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> <lu...@mikemccandless.com> wrote:
> >>>
> >>> John, looks like this requires login -- any plans to open that up, or,
> >>> post the code on an issue?
> >>>
> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
> >>> Collector impl that I can test?
> >>>
> >>> Mike
> >>>
> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com>
> wrote:
> >>> > BTW, we are have a little sandbox for these experiments. And all my
> >>> > testcode
> >>> > are at. They are not very polished.
> >>> >
> >>> > https://lucene-book.googlecode.com/svn/trunk
> >>> >
> >>> > -John
> >>> >
> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com>
> wrote:
> >>> >>
> >>> >> Numbers Mike requested for Int types:
> >>> >>
> >>> >> only the time/cputime are posted, others are all the same since the
> >>> >> algorithm is the same.
> >>> >>
> >>> >> Lucene 2.9:
> >>> >> numhits: 10
> >>> >> time: 14619495
> >>> >> cpu: 146126
> >>> >>
> >>> >> numhits: 20
> >>> >> time: 14550568
> >>> >> cpu: 163242
> >>> >>
> >>> >> numhits: 100
> >>> >> time: 16467647
> >>> >> cpu: 178379
> >>> >>
> >>> >>
> >>> >> my test:
> >>> >> numHits: 10
> >>> >> time: 14101094
> >>> >> cpu: 144715
> >>> >>
> >>> >> numHits: 20
> >>> >> time: 14804821
> >>> >> cpu: 151305
> >>> >>
> >>> >> numHits: 100
> >>> >> time: 15372157
> >>> >> cpu time: 158842
> >>> >>
> >>> >> Conclusions:
> >>> >> The are very similar, the differences are all within error bounds,
> >>> >> especially with lower PQ sizes, which second sort alg again slightly
> >>> >> faster.
> >>> >>
> >>> >> Hope this helps.
> >>> >>
> >>> >> -John
> >>> >>
> >>> >>
> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >>> >> <yo...@lucidimagination.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >>> >>> <lu...@mikemccandless.com> wrote:
> >>> >>> > Though it'd be odd if the switch to searching by segment
> >>> >>> > really was most of the gains here.
> >>> >>>
> >>> >>> I had assumed that much of the improvement was due to ditching
> >>> >>> MultiTermEnum/MultiTermDocs.
> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
> >>> >>> with queries that use a TermEnum (range, prefix, etc).
> >>> >>>
> >>> >>> -Yonik
> >>> >>> http://www.lucidimagination.com
> >>> >>>
> >>> >>>
> ---------------------------------------------------------------------
> >>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>> >>>
> >>> >>
> >>> >
> >>> >
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks John; I'll have a look.

Mike

On Fri, Oct 16, 2009 at 12:57 AM, John Wang <jo...@gmail.com> wrote:
> Hi Michael:
>     I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector as
> a more general case. I think keeping the old api for ScoreDocComparator and
> SortComparatorSource would work.
>   Please take a look.
> Thanks
> -John
>
> On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com> wrote:
>>
>> Hi Michael:
>>      It is open, http://code.google.com/p/lucene-book/source/checkout
>>      I think I sent the https url instead, sorry.
>>     The multi PQ sorting is fairly self-contained, I have 2 versions, 1
>> for string and 1 for int, each are Collector impls.
>>      I shouldn't say the Multi Q is faster on int sort, it is within the
>> error boundary. The diff is very very small, I would stay they are more
>> equal.
>>      If you think it is a good thing to go this way, (if not for the perf,
>> just for the simpler api) I'd be happy to work on a patch.
>> Thanks
>> -John
>> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>>
>>> John, looks like this requires login -- any plans to open that up, or,
>>> post the code on an issue?
>>>
>>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>>> Collector impl that I can test?
>>>
>>> Mike
>>>
>>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com> wrote:
>>> > BTW, we are have a little sandbox for these experiments. And all my
>>> > testcode
>>> > are at. They are not very polished.
>>> >
>>> > https://lucene-book.googlecode.com/svn/trunk
>>> >
>>> > -John
>>> >
>>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com> wrote:
>>> >>
>>> >> Numbers Mike requested for Int types:
>>> >>
>>> >> only the time/cputime are posted, others are all the same since the
>>> >> algorithm is the same.
>>> >>
>>> >> Lucene 2.9:
>>> >> numhits: 10
>>> >> time: 14619495
>>> >> cpu: 146126
>>> >>
>>> >> numhits: 20
>>> >> time: 14550568
>>> >> cpu: 163242
>>> >>
>>> >> numhits: 100
>>> >> time: 16467647
>>> >> cpu: 178379
>>> >>
>>> >>
>>> >> my test:
>>> >> numHits: 10
>>> >> time: 14101094
>>> >> cpu: 144715
>>> >>
>>> >> numHits: 20
>>> >> time: 14804821
>>> >> cpu: 151305
>>> >>
>>> >> numHits: 100
>>> >> time: 15372157
>>> >> cpu time: 158842
>>> >>
>>> >> Conclusions:
>>> >> The are very similar, the differences are all within error bounds,
>>> >> especially with lower PQ sizes, which second sort alg again slightly
>>> >> faster.
>>> >>
>>> >> Hope this helps.
>>> >>
>>> >> -John
>>> >>
>>> >>
>>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>>> >> <yo...@lucidimagination.com>
>>> >> wrote:
>>> >>>
>>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>>> >>> <lu...@mikemccandless.com> wrote:
>>> >>> > Though it'd be odd if the switch to searching by segment
>>> >>> > really was most of the gains here.
>>> >>>
>>> >>> I had assumed that much of the improvement was due to ditching
>>> >>> MultiTermEnum/MultiTermDocs.
>>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
>>> >>> with queries that use a TermEnum (range, prefix, etc).
>>> >>>
>>> >>> -Yonik
>>> >>> http://www.lucidimagination.com
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>
>


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Michael:
    I added two classes, ScoreDocComparatorQueue and OneSortNoScoreCollector, as
a more general case. I think keeping the old API for ScoreDocComparator and
SortComparatorSource would work.

  Please take a look.

Thanks

-John

On Thu, Oct 15, 2009 at 6:52 PM, John Wang <jo...@gmail.com> wrote:

> Hi Michael:
>      It is open, http://code.google.com/p/lucene-book/source/checkout
>
>      I think I sent the https url instead, sorry.
>
>     The multi PQ sorting is fairly self-contained, I have 2 versions, 1 for
> string and 1 for int, each are Collector impls.
>
>      I shouldn't say the Multi Q is faster on int sort, it is within the
> error boundary. The diff is very very small, I would stay they are more
> equal.
>
>      If you think it is a good thing to go this way, (if not for the perf,
> just for the simpler api) I'd be happy to work on a patch.
>
> Thanks
>
> -John
>
> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> John, looks like this requires login -- any plans to open that up, or,
>> post the code on an issue?
>>
>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> Collector impl that I can test?
>>
>> Mike
>>
>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com> wrote:
>> > BTW, we are have a little sandbox for these experiments. And all my
>> testcode
>> > are at. They are not very polished.
>> >
>> > https://lucene-book.googlecode.com/svn/trunk
>> >
>> > -John
>> >
>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com> wrote:
>> >>
>> >> Numbers Mike requested for Int types:
>> >>
>> >> only the time/cputime are posted, others are all the same since the
>> >> algorithm is the same.
>> >>
>> >> Lucene 2.9:
>> >> numhits: 10
>> >> time: 14619495
>> >> cpu: 146126
>> >>
>> >> numhits: 20
>> >> time: 14550568
>> >> cpu: 163242
>> >>
>> >> numhits: 100
>> >> time: 16467647
>> >> cpu: 178379
>> >>
>> >>
>> >> my test:
>> >> numHits: 10
>> >> time: 14101094
>> >> cpu: 144715
>> >>
>> >> numHits: 20
>> >> time: 14804821
>> >> cpu: 151305
>> >>
>> >> numHits: 100
>> >> time: 15372157
>> >> cpu time: 158842
>> >>
>> >> Conclusions:
>> >> The are very similar, the differences are all within error bounds,
>> >> especially with lower PQ sizes, which second sort alg again slightly
>> faster.
>> >>
>> >> Hope this helps.
>> >>
>> >> -John
>> >>
>> >>
>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley <
>> yonik@lucidimagination.com>
>> >> wrote:
>> >>>
>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >>> <lu...@mikemccandless.com> wrote:
>> >>> > Though it'd be odd if the switch to searching by segment
>> >>> > really was most of the gains here.
>> >>>
>> >>> I had assumed that much of the improvement was due to ditching
>> >>> MultiTermEnum/MultiTermDocs.
>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >>>
>> >>> -Yonik
>> >>> http://www.lucidimagination.com
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>>
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Hi Michael:
     It is open, http://code.google.com/p/lucene-book/source/checkout

     I think I sent the https url instead, sorry.

    The multi-PQ sorting is fairly self-contained; I have 2 versions, 1 for
string and 1 for int, and each is a Collector impl.

     I shouldn't say the multi-PQ is faster on int sort; it is within the
error bounds. The diff is very small; I would say they are roughly
equal.

     If you think it is a good thing to go this way, (if not for the perf,
just for the simpler api) I'd be happy to work on a patch.

Thanks

-John

On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> John, looks like this requires login -- any plans to open that up, or,
> post the code on an issue?
>
> How self-contained is your Multi PQ sorting?  EG is it a standalone
> Collector impl that I can test?
>
> Mike
>
> On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com> wrote:
> > BTW, we are have a little sandbox for these experiments. And all my
> testcode
> > are at. They are not very polished.
> >
> > https://lucene-book.googlecode.com/svn/trunk
> >
> > -John
> >
> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com> wrote:
> >>
> >> Numbers Mike requested for Int types:
> >>
> >> only the time/cputime are posted, others are all the same since the
> >> algorithm is the same.
> >>
> >> Lucene 2.9:
> >> numhits: 10
> >> time: 14619495
> >> cpu: 146126
> >>
> >> numhits: 20
> >> time: 14550568
> >> cpu: 163242
> >>
> >> numhits: 100
> >> time: 16467647
> >> cpu: 178379
> >>
> >>
> >> my test:
> >> numHits: 10
> >> time: 14101094
> >> cpu: 144715
> >>
> >> numHits: 20
> >> time: 14804821
> >> cpu: 151305
> >>
> >> numHits: 100
> >> time: 15372157
> >> cpu time: 158842
> >>
> >> Conclusions:
> >> The are very similar, the differences are all within error bounds,
> >> especially with lower PQ sizes, which second sort alg again slightly
> faster.
> >>
> >> Hope this helps.
> >>
> >> -John
> >>
> >>
> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley <
> yonik@lucidimagination.com>
> >> wrote:
> >>>
> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >>> <lu...@mikemccandless.com> wrote:
> >>> > Though it'd be odd if the switch to searching by segment
> >>> > really was most of the gains here.
> >>>
> >>> I had assumed that much of the improvement was due to ditching
> >>> MultiTermEnum/MultiTermDocs.
> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
> >>> with queries that use a TermEnum (range, prefix, etc).
> >>>
> >>> -Yonik
> >>> http://www.lucidimagination.com
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
John, looks like this requires login -- any plans to open that up, or,
post the code on an issue?

How self-contained is your Multi PQ sorting?  EG is it a standalone
Collector impl that I can test?

Mike

On Thu, Oct 15, 2009 at 6:33 PM, John Wang <jo...@gmail.com> wrote:
> BTW, we are have a little sandbox for these experiments. And all my testcode
> are at. They are not very polished.
>
> https://lucene-book.googlecode.com/svn/trunk
>
> -John
>
> On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com> wrote:
>>
>> Numbers Mike requested for Int types:
>>
>> only the time/cputime are posted, others are all the same since the
>> algorithm is the same.
>>
>> Lucene 2.9:
>> numhits: 10
>> time: 14619495
>> cpu: 146126
>>
>> numhits: 20
>> time: 14550568
>> cpu: 163242
>>
>> numhits: 100
>> time: 16467647
>> cpu: 178379
>>
>>
>> my test:
>> numHits: 10
>> time: 14101094
>> cpu: 144715
>>
>> numHits: 20
>> time: 14804821
>> cpu: 151305
>>
>> numHits: 100
>> time: 15372157
>> cpu time: 158842
>>
>> Conclusions:
>> The are very similar, the differences are all within error bounds,
>> especially with lower PQ sizes, which second sort alg again slightly faster.
>>
>> Hope this helps.
>>
>> -John
>>
>>
>> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley <yo...@lucidimagination.com>
>> wrote:
>>>
>>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>>> <lu...@mikemccandless.com> wrote:
>>> > Though it'd be odd if the switch to searching by segment
>>> > really was most of the gains here.
>>>
>>> I had assumed that much of the improvement was due to ditching
>>> MultiTermEnum/MultiTermDocs.
>>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
>>> with queries that use a TermEnum (range, prefix, etc).
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>
>


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
BTW, we have a little sandbox for these experiments, and all my test code
is at the URL below. It is not very polished.

https://lucene-book.googlecode.com/svn/trunk

-John

On Thu, Oct 15, 2009 at 3:29 PM, John Wang <jo...@gmail.com> wrote:

> Numbers Mike requested for Int types:
>
> only the time/cputime are posted, others are all the same since the
> algorithm is the same.
>
> Lucene 2.9:
> numhits: 10
> time: 14619495
> cpu: 146126
>
> numhits: 20
> time: 14550568
> cpu: 163242
>
> numhits: 100
> time: 16467647
> cpu: 178379
>
>
> my test:
> numHits: 10
> time: 14101094
> cpu: 144715
>
> numHits: 20
> time: 14804821
> cpu: 151305
>
> numHits: 100
> time: 15372157
> cpu time: 158842
>
> Conclusions:
> The are very similar, the differences are all within error bounds,
> especially with lower PQ sizes, which second sort alg again slightly faster.
>
> Hope this helps.
>
> -John
>
>
>
> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:
>
>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> > Though it'd be odd if the switch to searching by segment
>> > really was most of the gains here.
>>
>> I had assumed that much of the improvement was due to ditching
>> MultiTermEnum/MultiTermDocs.
>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
>> with queries that use a TermEnum (range, prefix, etc).
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK, thanks for running these.  It looks like the gains are holding up
across smaller queue sizes, and for ints.  Though, it's odd that
sorting w/ ints is also faster; I'd expect the single PQ to win there.

Mike

On Thu, Oct 15, 2009 at 6:29 PM, John Wang <jo...@gmail.com> wrote:
> Numbers Mike requested for Int types:
>
> only the time/cputime are posted, others are all the same since the
> algorithm is the same.
>
> Lucene 2.9:
> numhits: 10
> time: 14619495
> cpu: 146126
>
> numhits: 20
> time: 14550568
> cpu: 163242
>
> numhits: 100
> time: 16467647
> cpu: 178379
>
>
> my test:
> numHits: 10
> time: 14101094
> cpu: 144715
>
> numHits: 20
> time: 14804821
> cpu: 151305
>
> numHits: 100
> time: 15372157
> cpu time: 158842
>
> Conclusions:
> The are very similar, the differences are all within error bounds,
> especially with lower PQ sizes, which second sort alg again slightly faster.
>
> Hope this helps.
>
> -John
>
>
> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley <yo...@lucidimagination.com>
> wrote:
>>
>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> > Though it'd be odd if the switch to searching by segment
>> > really was most of the gains here.
>>
>> I had assumed that much of the improvement was due to ditching
>> MultiTermEnum/MultiTermDocs.
>> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
>> with queries that use a TermEnum (range, prefix, etc).
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>


Re: lucene 2.9 sorting algorithm

Posted by John Wang <jo...@gmail.com>.
Numbers Mike requested for Int types:

only the time/cputime are posted, others are all the same since the
algorithm is the same.

Lucene 2.9:
numhits: 10
time: 14619495
cpu: 146126

numhits: 20
time: 14550568
cpu: 163242

numhits: 100
time: 16467647
cpu: 178379


my test:
numHits: 10
time: 14101094
cpu: 144715

numHits: 20
time: 14804821
cpu: 151305

numHits: 100
time: 15372157
cpu time: 158842

Conclusions:
They are very similar; the differences are all within error bounds,
especially with lower PQ sizes, where the second sort algorithm is again slightly faster.

Hope this helps.

-John


On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > Though it'd be odd if the switch to searching by segment
> > really was most of the gains here.
>
> I had assumed that much of the improvement was due to ditching
> MultiTermEnum/MultiTermDocs.
> Note that LUCENE-1483 was before LUCENE-1596... but that only helps
> with queries that use a TermEnum (range, prefix, etc).
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Though it'd be odd if the switch to searching by segment
> really was most of the gains here.

I had assumed that much of the improvement was due to ditching
MultiTermEnum/MultiTermDocs.
Note that LUCENE-1483 was before LUCENE-1596... but that only helps
with queries that use a TermEnum (range, prefix, etc).

-Yonik
http://www.lucidimagination.com


Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 15, 2009 at 3:52 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> If I remembering it right... this (matching MultiSearcher's approach)
>> was nearly the first thing we tried with LUCENE-1483.  But the CPU
>> cost was higher in our tests.  I think we had tested unbalanced and
>> balanced segments, but memory is definitely somewhat hazy at this
>> point...
>
> Yeah, from what I could tell, the way you used the PQ api, comparing
> only with the bottom until you saw it beat the bottom, instead of always
> just addWithOverflow, leads to some significant performance improvement,
> but this can be done with the MultiSearcher / multiple PQ approach as
> well - like I said, I'm pretty sure the overall complexity is the same
> (ie. leading complexity, with the factor of numDocs = N * M out front), and
> we're just talking about sub-leading behavior at this point.

I agree.

>> I suspect even in the balanced case the cost will be higher with the
>> "separate PQs then merge at the end" approach.  It's not only the int
>> ord comparisons (which are cheap) that take CPU.  It's the if
>> statements that are executed as a result of the comparison when
>> re-heapifying the PQ.  Those are far more costly especially if the CPU
>> has trouble predicting which way the if will go, which it does here.
>>
>> Ie, it's the insertions into each PQ that are expensive; by tracking a
>> separate PQ per reader you'll have many more insertions since
>> each PQ must re-converge.
>
> Insertions are expensive - totally agreed, and subtle branching tweaks
> make a big difference at subleading level.  So we need to do a careful
> analysis of the sub-leading behavior (O(K,M) but O(1) in terms of N):
>
> The multi-PQ approach requires more insertions: O(M * K log(K))
> insertions, but each one can be made very fast, with careful branch
> tuning similar to the tweaks currently in the 2.9 code - as all of the
> comparisons are ints.  The final merging requires K log(M) string
> compares as well.

I don't think we do any branch tuning on the PQ insertion -- the ifs
involved in re-heapifying the PQ are simply hard for the CPU to
predict (though, apparently, not as hard as comparing strings ;).

> The single-PQ approach requires only O(K * log(K) * log(M))
> insertions, but each one requires, instead of O(log(K)) int compares,
> O(log(K)) *string* compares, and a segment conversion.

Right, except, for a large segment, if it has enough insertions, as
more entries in the PQ belong to that segment, more of the comparisons
become ints.
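
To make that concrete, here is a minimal sketch of the ord-vs-value fallback
(hypothetical class and field names, not the actual 2.9 FieldComparator code):
entries filled from the current segment compare by cached ordinal, while
anything left over from an earlier segment falls back to a String compare.

```java
// Sketch only: illustrates why within-segment compares are cheap ints
// while cross-segment compares degrade to String.compareTo.
class OrdOrValueComparator {
    final String[] values;  // term value held in each queue slot
    final int[] ords;       // cached ordinal of that term, per slot
    final int[] slotGen;    // which segment generation filled the slot
    int currentGen = 0;

    OrdOrValueComparator(int numSlots) {
        values = new String[numSlots];
        ords = new int[numSlots];
        slotGen = new int[numSlots];
    }

    void set(int slot, int ord, String value) {
        ords[slot] = ord;
        values[slot] = value;
        slotGen[slot] = currentGen;
    }

    // Compare the queue entry in `slot` against a candidate doc from the
    // current segment, given the candidate's ordinal and value.
    int compare(int slot, int candidateOrd, String candidateValue) {
        if (slotGen[slot] == currentGen) {
            return ords[slot] - candidateOrd;           // cheap int compare
        }
        return values[slot].compareTo(candidateValue);  // string compare
    }

    void nextSegment() { currentGen++; }  // invalidates all cached ords
}
```

The more insertions a large segment gets, the more slots carry the current
generation, so more compares stay on the int path.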

> Plugging in some numbers, with K = 100, and M = 20
>
>   multiPQ : 14,000 insertions which have 7 int compares each,
>                   followed by 400 string compares for the merge
> vs
>
>   singlePQ: 2,800 insertions which have maybe 4 string compares
>                    and 3 int compares each
>
> ---------
>
> So yeah, executive summary in theory: *at sub-leading order*, Yonik
> is right, singlePQ has less overall operations (log(M) vs M), but it
> has way more slow string compares (K log(K)^2 log(M)  vs K log(M) ).
>
>>
>> But I agree it's not obvious which way it'll go... so why not simply
>> build a comparator that does this and then let the computer tell us?
>
> I agree completely - this is all talk until I see it run on an actual index.
> I know John was writing up some code to port bobo to 2.9, so perhaps
> you can try and compare these two, with a couple different values of K,
> John?
>
> I guess a bigger question, than just the sub-leading behavior (if we're
> talking 3.5 ms vs. 3.1ms, this is really academic), is if the overall
> performance is basically the same (which I'm guessing is the case),
> then for developers, writing and understanding Comparators for the
> old API is *way* easier than with the current one, and the multi-PQ
> technique makes it totally easy to use the same logic and understanding
> that already exists for MultiSearcher.
>
> Of course, if my calculations above are wrong, or the sub-leading
> behavior is dominated by the branching and string compares are super
> fast, then difficulty of API be damned, we stick with what performs
> best.

I agree, if the gains are minor (or, non-existent!), across varying
sort field types, number of hits, number of segments, etc., then the
old API is MUCH easier to grok than the new one, and we should switch
back.  API simplicity is important.

Looking back through LUCENE-1483, we were seeing much more sizable
gains in the new API vs the old single-PQ-only-int-comparison approach
(which should be faster than both the single and multi PQ approaches
we are discussing here).  BUT: those tests combined searching by
segment AND the new FieldCollector API, plus other more basic code
speedups.  Though it'd be odd if the switch to searching by segment
really was most of the gains here.  I'm confused (for now) on where
the gains we were seeing in that issue came from...

Mike


Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> If I remembering it right... this (matching MultiSearcher's approach)
> was nearly the first thing we tried with LUCENE-1483.  But the CPU
> cost was higher in our tests.  I think we had tested unbalanced and
> balanced segments, but memory is definitely somewhat hazy at this
> point...
>

Yeah, from what I could tell, the way you used the PQ API (comparing
only with the bottom until you saw it beat the bottom, instead of always
just calling addWithOverflow) leads to a significant performance improvement,
but this can be done with the MultiSearcher / multiple-PQ approach as
well. Like I said, I'm pretty sure the overall complexity is the same
(i.e. the leading complexity, with the factor of numDocs = N * M out front), and
we're just talking about sub-leading behavior at this point.


> I suspect even in the balanced case the cost will be higher with the
> "separate PQs then merge at the end" approach.  It's not only the int
> ord comparisons (which are cheap) that take CPU.  It's the if
> statements that are executed as a result of the comparison when
> re-heapifying the PQ.  Those are far more costly especially if the CPU
> has trouble predicting which way the if will go, which it does here.


> Ie, it's the insertions into each PQ that are expensive; by tracking a
> separate PQ per reader you'll have many more insertions since
> each PQ must re-converge.
>

Insertions are expensive - totally agreed, and subtle branching tweaks
make a big difference at the sub-leading level.  So we need to do a careful
analysis of the sub-leading behavior (O(K,M) but O(1) in terms of N):

The multi-PQ approach requires more insertions: O(M * K log(K))
insertions, but each one can be made very fast, with careful branch
tuning similar to the tweaks currently in the 2.9 code - as all of the
comparisons are ints.  The final merging requires K log(M) string
compares as well.

The single-PQ approach requires only O(K * log(K) * log(M))
insertions, but each one requires, instead of O(log(K)) int compares,
O(log(K)) *string* compares, and a segment conversion.

Plugging in some numbers, with K = 100, and M = 20

  multiPQ : 14,000 insertions which have 7 int compares each,
                  followed by 400 string compares for the merge
vs

  singlePQ: 2,800 insertions which have maybe 4 string compares
                   and 3 int compares each
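For concreteness, those estimates can be reproduced with a quick
back-of-the-envelope script (a sketch only: it assumes each queue turns
over ~K log2(K) times, and the real constants depend on the heap
implementation and the data distribution):

```python
import math

K, M = 100, 20  # numHits (queue size) and number of segments

def log2(x):
    return math.log(x, 2)

# Multi-PQ: ~K*log2(K) competitive insertions per segment, each a sift
# of ~log2(K) cheap int-ord compares, then a final merge of M sorted
# lists of K entries costing ~K*log2(M) string compares.
multi_insertions = M * K * log2(K)         # ~13,300 (the "14,000" above)
multi_merge_string_cmps = K * log2(M)      # ~430 (the "400" above)

# Single-PQ: only ~K*log2(K)*log2(M) competitive insertions overall,
# but each sift costs ~log2(K) *string* compares, plus an occasional
# value->ord conversion when the bottom changes segments.
single_insertions = K * log2(K) * log2(M)  # ~2,870 (the "2,800" above)

print(round(multi_insertions), round(multi_merge_string_cmps),
      round(single_insertions))
```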

---------

So yeah, executive summary in theory: *at sub-leading order*, Yonik
is right, singlePQ has less overall operations (log(M) vs M), but it
has way more slow string compares (K log(K)^2 log(M) vs K log(M)).


> But I agree it's not obvious which way it'll go... so why not simply
> build a comparator that does this and then let the computer tell us?
>

I agree completely - this is all talk until I see it run on an actual index.
I know John was writing up some code to port bobo to 2.9, so perhaps
you can try and compare these two, with a couple different values of K,
John?

I guess a bigger question, than just the sub-leading behavior (if we're
talking 3.5 ms vs. 3.1ms, this is really academic), is if the overall
performance is basically the same (which I'm guessing is the case),
then for developers, writing and understanding Comparators for the
old API is *way* easier than with the current one, and the multi-PQ
technique makes it totally easy to use the same logic and understanding
that already exists for MultiSearcher.

Of course, if my calculations above are wrong, or the sub-leading
behavior is dominated by the branching and string compares are super
fast, then difficulty of API be damned, we stick with what performs
best.

  -jake

Re: lucene 2.9 sorting algorithm

Posted by Michael McCandless <lu...@mikemccandless.com>.
If I remembering it right... this (matching MultiSearcher's approach)
was nearly the first thing we tried with LUCENE-1483.  But the CPU
cost was higher in our tests.  I think we had tested unbalanced and
balanced segments, but memory is definitely somewhat hazy at this
point...

I suspect even in the balanced case the cost will be higher with the
"separate PQs then merge at the end" approach.  It's not only the int
ord comparisons (which are cheap) that take CPU.  It's the if
statements that are executed as a result of the comparison when
re-heapifying the PQ.  Those are far more costly especially if the CPU
has trouble predicting which way the if will go, which it does here.

Ie, it's the insertions into each PQ that are expensive; by tracking a
separate PQ per reader you'll have many more insertions since
each PQ must re-converge.

But I agree it's not obvious which way it'll go... so why not simply
build a comparator that does this and then let the computer tell us?

We played with a number of String comparators in LUCENE-1483; it could
also be that one of those might fit better for the balanced case,
too.

Mike

On Thu, Oct 15, 2009 at 4:31 AM, Jake Mannix <ja...@gmail.com> wrote:
> I had to dig through the source code (actually, walk through a unit test,
> because
> that was simpler to see what was going on in the 2.9 sorting), but I think
> John's
> way has slightly lower complexity in the balanced segment size case.
>
> On Wed, Oct 14, 2009 at 8:57 PM, Yonik Seeley <yo...@lucidimagination.com>
> wrote:
>>
>> Interesting idea... though one further piece of info in the mix is
>> that large segments are typically processed first, and tend to fill up
>> the priority queue.
>
> [noted: in the unbalanced case, things work out nicely for the 2.9 case
> compared
> to the balanced case, but M roughly equal balanced segments is a common enough
> use case to consider this non-extreme (c.f. BalancedMergePolicy).
> Of course this means that the sorting is very easy using John's technique as
> well,
> as the number of string compares in the priority queue merge at the end is
> much lower]
>
>>
>> Conversion from one segment to another is only
>> done as needed... only the bottom slot is converted automatically when
>> the segment is switched.
>
> That's not what it looks like, actually: you convert the bottom slot, and
> as soon as you get another hit from the new segment which is
> competitive, it pushes what was the old bottom out, and there's a new
> bottom, which often will be from one of the previous segments, in which
> case *it* must be converted to the new ordering as well.
>
> Additionally, whenever this happens, and a doc from the new segment
> is competitive (which means that it wins in a fast int ordinal compare
> with the bottom), it needs to do O(log(K)) slow *string* compares to find
> out where in the queue this newly competitive doc belongs (where K is
> numHits = size of the queue).
>
>>
>> Filling a complete priority queue for each segment is also more
>> expensive (from the big-O perspective).
>
> Huh?  If you have M roughly balanced segments of N docs each, and
> you want to find the top K (out of all M*N) sorted docs, John is suggesting
> (correct me if I'm wrong here John), that you keep M priority queues of
> size K, filling them in sequence: on average (and when K << N) this takes
> M * N * c_i, where c_i is the cost of an integer compare, and then you
> need to merge M sorted string lists of size K, which takes K log(M) * c_s,
> where c_s is the cost of doing a string compare.  This latter factor
> is sub-leading in N, which is the biggest number around, and the
> M * N * c_i factor is the same as the dominating factor in the current
> 2.9 sorting, because M * N * c_i is minimal cost to compare with the
> bottom of the queue across all M * N docs.
>
> So the leading-in-M*N factors from both approaches are the same, and
> anything without a factor of N is tiny (if N might be 10^6, M is maybe
> 10-20, K is 10-100), but I worked out the sub-leading factors on a
> sheet of paper and I think I've convinced myself that John's form
> actually wins out even there, if you keep track of how often string
> compares and conversions are required.
>
> I can write up that O(K,M) analysis if you'd like, but it should be
> mostly academic - for K, M << N, both John's approach and the
> current 2.9 have the same leading behavior and should perform
> equally well.  After all - the multiple queues approach is what
> lucene has always done for the multi-reader case, why is it that
> multiple SegmentReaders should be any different?
>
>   -jake
>



Re: lucene 2.9 sorting algorithm

Posted by Jake Mannix <ja...@gmail.com>.
I had to dig through the source code (actually, walk through a unit test,
because
that was simpler to see what was going on in the 2.9 sorting), but I think
John's
way has slightly lower complexity in the balanced segment size case.

On Wed, Oct 14, 2009 at 8:57 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> Interesting idea... though one further piece of info in the mix is
> that large segments are typically processed first, and tend to fill up
> the priority queue.


[noted: in the unbalanced case, things work out nicely for the 2.9 case
compared
to the balanced case, but M roughly equal balanced segments is a common enough
use case to consider this non-extreme (c.f. BalancedMergePolicy).
Of course this means that the sorting is very easy using John's technique as
well,
as the number of string compares in the priority queue merge at the end is
much lower]


> Conversion from one segment to another is only
> done as needed... only the bottom slot is converted automatically when
> the segment is switched.
>

That's not what it looks like, actually: you convert the bottom slot, and
as soon as you get another hit from the new segment which is
competitive, it pushes what was the old bottom out, and there's a new
bottom, which often will be from one of the previous segments, in which
case *it* must be converted to the new ordering as well.

Additionally, whenever this happens, and a doc from the new segment
is competitive (which means that it wins in a fast int ordinal compare
with the bottom), it needs to do O(log(K)) slow *string* compares to find
out where in the queue this newly competitive doc belongs (where K is
numHits = size of the queue).
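The conversion in question amounts to a binary search of the old
bottom's value against the new segment's sorted value array, roughly
like this (a sketch only; the real StringOrdValComparator also breaks
ties by value compare when the search lands between two ords):

```python
import bisect

def convert_to_segment_ord(value, seg_vals):
    """Map a value from an earlier segment into the new segment's ord
    space.  seg_vals is the new segment's sorted array of unique values;
    bisect_left returns the matching ord, or the insertion point when
    the value does not occur in this segment."""
    return bisect.bisect_left(seg_vals, value)
```

For example, "cherry" searched against ["apple", "banana", "fig"]
lands at ord 2, ordering it just below "fig".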


> Filling a complete priority queue for each segment is also more
> expensive (from the big-O perspective).


Huh?  If you have M roughly balanced segments of N docs each, and
you want to find the top K (out of all M*N) sorted docs, John is suggesting
(correct me if I'm wrong here John), that you keep M priority queues of
size K, filling them in sequence: on average (and when K << N) this takes
M * N * c_i, where c_i is the cost of an integer compare, and then you
need to merge M sorted string lists of size K, which takes K log(M) * c_s,
where c_s is the cost of doing a string compare.  This latter factor
is sub-leading in N, which is the biggest number around, and the
M * N * c_i factor is the same as the dominating factor in the current
2.9 sorting, because M * N * c_i is minimal cost to compare with the
bottom of the queue across all M * N docs.

So the leading-in-M*N factors from both approaches are the same, and
anything without a factor of N is tiny (if N might be 10^6, M is maybe
10-20, K is 10-100), but I worked out the sub-leading factors on a
sheet of paper and I think I've convinced myself that John's form
actually wins out even there, if you keep track of how often string
compares and conversions are required.

I can write up that O(K,M) analysis if you'd like, but it should be
mostly academic - for K, M << N, both John's approach and the
current 2.9 have the same leading behavior and should perform
equally well.  After all - the multiple queues approach is what
lucene has always done for the multi-reader case, why is it that
multiple SegmentReaders should be any different?
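To make the multi-PQ proposal concrete, here is a minimal sketch
(hypothetical names, with plain Python standing in for the Java; each
segment is modeled as (ords, vals), where vals is the segment's sorted
unique values and ords[doc] indexes into it, as in the FieldCache):

```python
import heapq
import itertools

def top_k_multi_pq(segments, k):
    """Ascending top-k across segments: per-segment queues filled with
    cheap int-ord compares; cross-segment string compares happen only
    in the final k-way merge."""
    per_segment = []
    for ords, vals in segments:
        heap = []  # min-heap of (-ord, doc): root is the worst (largest) ord
        for doc, ord_ in enumerate(ords):
            if len(heap) < k:
                heapq.heappush(heap, (-ord_, doc))
            elif ord_ < -heap[0][0]:  # competitive: beats the bottom
                heapq.heapreplace(heap, (-ord_, doc))
        # Resolve surviving ords to values, sorted ascending per segment.
        per_segment.append(sorted(vals[-neg] for neg, _ in heap))
    # The only cross-segment (string) compares: ~k*log(M) of them.
    return list(itertools.islice(heapq.merge(*per_segment), k))
```

With two toy segments — vals ["apple", "cherry", "pear"] with ords
[2, 0, 1, 0], and vals ["banana", "fig"] with ords [1, 0, 1] — asking
for the top 3 yields ["apple", "apple", "banana"].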

  -jake

Re: lucene 2.9 sorting algorithm

Posted by Yonik Seeley <yo...@lucidimagination.com>.
Interesting idea... though one further piece of info in the mix is
that large segments are typically processed first, and tend to fill up
the priority queue.  Conversion from one segment to another is only
done as needed... only the bottom slot is converted automatically when
the segment is switched.

Filling a complete priority queue for each segment is also more
expensive (from the big-O perspective).  An optimization would be to
keep track of the minimum needed to make it into the queue (a
short-circuit)... and that essentially would be what the current
implementation does today with the single queue.
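That bottom-tracking short-circuit looks roughly like this (a
simplified sketch using plain values rather than the real
FieldComparator/ord machinery; names are hypothetical):

```python
import heapq

class _Rev:
    """Wrapper inverting comparison so heapq's min-heap acts as a
    max-heap: heap[0] is the bottom (currently worst) entry."""
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        return self.value > other.value

def top_k_single_pq(segments, k):
    heap = []
    for vals in segments:  # one doc at a time, segment by segment
        for v in vals:
            if len(heap) < k:
                heapq.heappush(heap, _Rev(v))
            elif v < heap[0].value:  # one compare against the bottom
                heapq.heapreplace(heap, _Rev(v))
    return sorted(entry.value for entry in heap)
```

Most docs fail the single `v < heap[0].value` test and never touch the
heap, which is exactly the minimum-to-make-it-in tracking described
above.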

-Yonik
http://www.lucidimagination.com


On Wed, Oct 14, 2009 at 11:12 PM, John Wang <jo...@gmail.com> wrote:
> Hi guys:
>     Looking at the 2.9 sorting algorithm, and while trying to understand
> FieldComparator class, I was wondering about the following optimization: (I
> am using StringOrdValComparator as an example)
> Currently we have 1 instance of per segment data structure, e.g. (ords,vals
> etc.), and we keep track of the reader context using the readerGen array. We keep track of the top numHits elements by keeping a
> reference to the bottom element. To convert from readerContext1 to
> readerContext2, we need a mapping from value->ord, and the cost is a binary
> search (this can be done N*numhits times). For some instances of custom
> sort extension, this lookup may be expensive.
> How about, instead, we keep N instances of the per-segment data structure; all
> compares within a segment are done with ord compares. We never do a cross-segment
> compare until the end, where a merge is done on the N instances using the
> Comparator. This way, we avoid the lookup from value->ord, and can keep the
> simpler API: ScoreDocComparator, which is much easier to extend for custom
> sorting. It is a higher memory cost, but it is an extra (N-1)*(numhits)
> times the data structure, where N is the number of segments, and is not too
> large.
> I am probably missing some intricacies with the current sorting algorithm.
> If so, please shine some light.
> Thanks
> -John
>
