You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by John Wang <jo...@gmail.com> on 2009/01/07 07:36:06 UTC

DisjunctionScorer performance

Hi guys:

     We have been building a suite of boolean operators DocIdSets (e.g.
AndDocIdSet/Iterator, OrDocIdSet/Iterator, NotDocIdSet/Iterator). We
compared our implementation on the OrDocIdSetIterator (based on
DisjunctionMaxScorer code) with some code tuning, and we see the performance
doubled in our testing. (we haven't done comparisons with ConjuctionScorer
vs. AndDocIdSetIterator, will post numbers when we do)

     We'd be happy to contribute this back to the community. But what is the
best way of going about it?

option 1: merge our change into DisjunctionMax/SumScorers.
option 2: contribute boolean operator sets, and have DisjunctionScorers
derive from OrDocIdSetIterator, ConjunctionScorer derive from
AndDocIdSetIterator etc.

     Option 2 seems to be cleaner. Thoughts?

Thanks

-John

Re: DisjunctionScorer performance

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 07 January 2009 08:57:54 John Wang wrote:
> One more thing I missed. I don't quite get your point about skip() vs
> next().
> 
> With or queries, skipping does not help as much comparing to and queries.
> 

When for example 2 out of 3 iterators/scorers are required, one can
skip 'last' one to the 'middle' one. This is basically the mechanism
used by ConjunctionScorer, but applied in a more limited way.
Currently there is no working implementation for this, only a
remark in the javadocs.

Iirc DisjunctionDISI at LUCENE-1345 is a first attempt to use
skipTo() in this way.

Regards,
Paul Elschot


Re: DisjunctionScorer performance

Posted by John Wang <jo...@gmail.com>.
One more thing I missed. I don't quite get your point about skip() vs
next().

With or queries, skipping does not help as much comparing to and queries.

-John

On Tue, Jan 6, 2009 at 11:55 PM, John Wang <jo...@gmail.com> wrote:

> Paul:
>
>        Our very simple/naive testing methodology for OrDocIdSetIterator:
>
> 5 sub iterators, each subiterators just iterate from 0 to 1,000,000.
>
> The test iterates the OrDocIdSetIterator until next() is false.
>
>       Do you want me to run the same test against DisjunctDisi?
>
> -John
>
>
> On Tue, Jan 6, 2009 at 11:48 PM, Paul Elschot <pa...@xs4all.nl>wrote:
>
>>  On Wednesday 07 January 2009 07:36:06 John Wang wrote:
>>
>> > Hi guys:
>>
>> >
>>
>> > We have been building a suite of boolean operators DocIdSets
>>
>> > (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator,
>>
>> > NotDocIdSet/Iterator). We compared our implementation on the
>>
>> > OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some
>>
>> > code tuning, and we see the performance doubled in our testing.
>>
>> That's good news.
>>
>> What data structure did you use for sorting by doc id?
>>
>> Currently a priority queue is used for that, and normally that is
>>
>> the bottleneck for performance.
>>
>> > (we
>>
>> > haven't done comparisons with ConjuctionScorer vs.
>>
>> > AndDocIdSetIterator, will post numbers when we do)
>>
>> >
>>
>> > We'd be happy to contribute this back to the community. But what
>>
>> > is the best way of going about it?
>>
>> >
>>
>> > option 1: merge our change into DisjunctionMax/SumScorers.
>>
>> > option 2: contribute boolean operator sets, and have
>>
>> > DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer
>>
>> > derive from AndDocIdSetIterator etc.
>>
>> >
>>
>> > Option 2 seems to be cleaner. Thoughts?
>>
>> Some theoretical performance improvement is possible when the
>>
>> minimum number of required scorers/iterators is higher than 1,
>>
>> by using of skipTo() (as much as possible) instead of next() in
>>
>> such cases. For the moment that's theoretical because there
>>
>> is no working implementation of this yet, but have a look at
>>
>> LUCENE-1345 .
>>
>> I'm currently working on a DisjunctionDISI, probably the same function as
>> the OrDocIdSetIterator you mentioned above. In case you have
>>
>> something faster than that, could you post it at LUCENE-1345 or at a
>>
>> new issue?
>>
>> An AndDocIdSetIterator could also be useful for the PhraseScorers and
>>
>> for the SpanNear queries, but that is of later concern.
>>
>> So I'd prefer option 2.
>>
>> Regards,
>>
>> Paul Elschot
>>
>>
>

Re: DisjunctionScorer performance

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 07 January 2009 08:55:50 John Wang wrote:
> Paul:
> 
>        Our very simple/naive testing methodology for OrDocIdSetIterator:
> 
> 5 sub iterators, each subiterators just iterate from 0 to 1,000,000.
> 
> The test iterates the OrDocIdSetIterator until next() is false.

At LUCENE-365 there is an old TestDisjunctionPerf1.java, in which
the various scorers iterate by increasing the document nrs
by a given number. Iirc the tests there use prime numbers for that,
mostly for a somewhat realistic test of the priority queue.
It will probably need some care to make it work again, quite a few
things have changed since then.
The main point to take away from it is to also test the performance
of BooleanScorer, which has the best performance I've seen so far
on disjunctions.

> 
>       Do you want me to run the same test against DisjunctDisi?

No need for that I think, the DisjunctionDISI there is still based
on basically the same priority queue that Disjunction...Scorer uses.

Regards,
Paul Elschot

Re: DisjunctionScorer performance

Posted by John Wang <jo...@gmail.com>.
Paul:

       Our very simple/naive testing methodology for OrDocIdSetIterator:

5 sub iterators, each subiterators just iterate from 0 to 1,000,000.

The test iterates the OrDocIdSetIterator until next() is false.

      Do you want me to run the same test against DisjunctDisi?

-John

On Tue, Jan 6, 2009 at 11:48 PM, Paul Elschot <pa...@xs4all.nl>wrote:

>  On Wednesday 07 January 2009 07:36:06 John Wang wrote:
>
> > Hi guys:
>
> >
>
> > We have been building a suite of boolean operators DocIdSets
>
> > (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator,
>
> > NotDocIdSet/Iterator). We compared our implementation on the
>
> > OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some
>
> > code tuning, and we see the performance doubled in our testing.
>
> That's good news.
>
> What data structure did you use for sorting by doc id?
>
> Currently a priority queue is used for that, and normally that is
>
> the bottleneck for performance.
>
> > (we
>
> > haven't done comparisons with ConjuctionScorer vs.
>
> > AndDocIdSetIterator, will post numbers when we do)
>
> >
>
> > We'd be happy to contribute this back to the community. But what
>
> > is the best way of going about it?
>
> >
>
> > option 1: merge our change into DisjunctionMax/SumScorers.
>
> > option 2: contribute boolean operator sets, and have
>
> > DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer
>
> > derive from AndDocIdSetIterator etc.
>
> >
>
> > Option 2 seems to be cleaner. Thoughts?
>
> Some theoretical performance improvement is possible when the
>
> minimum number of required scorers/iterators is higher than 1,
>
> by using of skipTo() (as much as possible) instead of next() in
>
> such cases. For the moment that's theoretical because there
>
> is no working implementation of this yet, but have a look at
>
> LUCENE-1345 .
>
> I'm currently working on a DisjunctionDISI, probably the same function as
> the OrDocIdSetIterator you mentioned above. In case you have
>
> something faster than that, could you post it at LUCENE-1345 or at a
>
> new issue?
>
> An AndDocIdSetIterator could also be useful for the PhraseScorers and
>
> for the SpanNear queries, but that is of later concern.
>
> So I'd prefer option 2.
>
> Regards,
>
> Paul Elschot
>
>

Re: DisjunctionScorer performance

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 07 January 2009 07:36:06 John Wang wrote:
> Hi guys:
>
>      We have been building a suite of boolean operators DocIdSets
> (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator,
> NotDocIdSet/Iterator). We compared our implementation on the
> OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some
> code tuning, and we see the performance doubled in our testing. 

That's good news.
What data structure did you use for sorting by doc id?
Currently a priority queue is used for that, and normally that is
the bottleneck for performance.

> (we
> haven't done comparisons with ConjuctionScorer vs.
> AndDocIdSetIterator, will post numbers when we do)
>
>      We'd be happy to contribute this back to the community. But what
> is the best way of going about it?
>
> option 1: merge our change into DisjunctionMax/SumScorers.
> option 2: contribute boolean operator sets, and have
> DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer
> derive from AndDocIdSetIterator etc.
>
>      Option 2 seems to be cleaner. Thoughts?

Some theoretical performance improvement is possible when the
minimum number of required scorers/iterators is higher than 1,
by using of skipTo() (as much as possible) instead of next() in
such cases. For the moment that's theoretical because there
is no working implementation of this yet, but have a look at
LUCENE-1345 .

I'm currently working on a DisjunctionDISI, probably the same function 
as the OrDocIdSetIterator you mentioned above. In case you have
something faster than that, could you post it at LUCENE-1345 or at a
new issue?

An AndDocIdSetIterator could also be useful for the PhraseScorers and
for the SpanNear queries, but that is of later concern.

So I'd prefer option 2.

Regards,
Paul Elschot