You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Li Li <fa...@gmail.com> on 2012/04/17 12:37:27 UTC

why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

hi all,
    I am now hacking the BooleanScorer2 to let it keep the docID() of the
leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
Why I want to do this is: When I Collect a doc, I want to know which term
is matched(especially for BooleanClause whose Occur is SHOULD). we have
discussed some solutions, such as adding bit masks in disjunction scorers.
with this method, when we finds a matched doc, we can recursively find
which leaf scorer is matched. But we think it's not very efficient and not
convenient to use(this is my proposal but not agreed by others in our
team). and then we came up with another one: Modifying DisjunctionSumScorer.
   we analysed the codes and found that the only Scorers used by
BooleanScorer2 that will make the children scorers' docID() not equal to
parent is an anonymous class inherited from DisjunctionSumScorer. All other
ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
   The implementation algorithm of DisjunctionSumScorer use a heap to find
the smallest doc. after finding a matched doc, the currentDoc is the
matched doc and all the scorers in the heap will call nextDoc() so all of
the scorers' current docID the nextDoc of currentDoc. if there are N level
DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
of the root of the scorer tree.
   So we modify the DisjuctionSumScorer and let it behavior as we expected.
And then I wrote some TestCase and it works well. And also I wrote some
random generated TermScorer and compared the nextDoc(),score() and
advance(int) method of original DisjunctionSumScorer and modified one.
nextDoc() and score() and exactly the same. But for advance(int target), we
found some interesting and strange things.
   at the beginning, I think if target is less than current docID, it will
just return current docID and do nothing. this assumption let my algorithm
go wrong. Then I read the codes of TermScorer and found each call of
advance(int) of TermScorer will call nextDoc() no matter whether current
docID is larger than target or not.
   So I am confused and then read the javadoc of DocIdSetIterator:
----------------- javadoc of DocIdSetIterator.advance(int
target)-------------

int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
IOException

Advances to the first beyond (see NOTE below) the current whose document
number is greater than or equal
 to target. Returns the current document number or NO_MORE_DOCS if there
are no more docs in the set.
Behaves as if written:
 int advance(int target) {
   int doc;
   while ((doc = nextDoc()) < target) {
   }
   return doc;
 }
 Some implementations are considerably more efficient than that.
NOTE: when target < current implementations may opt not to advance beyond
their current docID().
NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
Scorers. If your
 implementation cannot efficiently determine that it should exhaust, it is
recommended that you check for
 that value in each call to this method.
NOTE: after the iterator has exhausted you should not call this method, as
it may result in unpredicted
 behavior.
--------------------------------------
Then I modified my algorithm again and found that
DisjunctionSumScorer.advance(int target) has some strange behavior. most of
the cases, it will return currentDoc if target < currentDoc. but in some
boundary condition, it will not.
it's not a bug but let me sad. I thought my algorithm has some bug because
it's advance method is not exactly the same as original
DisjunctionSumScorer's.
----codes of DisjunctionSumScorer---
  @Override
  public int advance(int target) throws IOException {
    if (scorerDocQueue.size() < minimumNrMatchers) {
      return currentDoc = NO_MORE_DOCS;
    }
    if (target <= currentDoc) {
      return currentDoc;
    }
   ....
-------------------
for most case if (target <= currentDoc) it will return currentDoc;
but if previous advance will make sub scorers exhausted, then if may return
NO_MORE_DOCS
an example is:
   currentDoc=-1
   minimumNrMatchers=1
   subScorers:
      TermScorer: docIds: [1, 2, 6]
      TermScorer: docIds: [2, 4]
after first call advance(5)
    currentDoc=6
    only first scorer is now in the heap, scorerDocQueue.size()==1
then call advance(6)
    because scorerDocQueue.size() < minimumNrMatchers, it just return
NO_MORE_DOCS

My question is why the advance(int target) method is defined like this? for
the reason of efficient or any other reasons?

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
I got it.
I found my test codes some problem.
the new test result is:
when minShould=1, new algorithm is a little bit faster than old one. when
minShould>1, old algorithm is faster.

On Sun, Apr 22, 2012 at 3:32 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Li,
>
> I can give you my understanding of difference in DisjunctionSumScorer and
> https://issues.apache.org/jira/browse/LUCENE-2686
>
> When you choose steady children approach, even you omit call score() you
> have to enumerate top of child scorers heap twice.
> Check nextDoc() from the patch: first loop pushes top of heap first, then
> the second loop through the top is done by
> countMatches(1); countMatches(2);
>
> Current DisjunctionSumScorer enumerate top of heap once per every doc:
> while it loops through top of heap it counts number of clauses matched;
> accumulate score; and pushes top children one step forward.
>
> what's minimumNrMatchers do you have? can you upload your test github?
> (mail list rips attachments off)
>
> On Thu, Apr 19, 2012 at 7:34 AM, Li Li <fa...@gmail.com> wrote:
>
>> Michael McCandless wrote:
>>
>> So... the good news is I made a new scorer (basically copied
>> DisjunctionMaxScorer and then tweaked from there) that scores the OR-only
>> case. All tests pass w/ this new scorer.
>>
>> And more good news is that if you don't score (I sort by doctitle to do
>> that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
>> expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
>> it to use BS2 not BS – the nocommits in the patch).
>>
>> But the bad news is with scoring on it's 22.7% slower!
>>
>> And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
>> faster for this one query. I think we need to modify the criteria that
>> decides whether to use BS or BS2... maybe when there are lots of
>> lowish-docFreq terms, BS2 is better?
>> ---------------
>> why new algorithm is 22% slower than old one?
>> I read the codes and feel them almost the same.
>>
>> On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
>> mkhludnev@griddynamics.com> wrote:
>>
>>> Hello,
>>>
>>> I can't help with the particular question, but can share some
>>> experience. My task is roughly the same I've found the patch
>>> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
>>> (with one small addition, I'll post it in comments soon). By using it I
>>> have disjunction summing query with steady subscorers.
>>>
>>> Regards
>>>
>>> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>>>
>>>> hi all,
>>>>     I am now hacking the BooleanScorer2 to let it keep the docID() of
>>>> the leaf scorer(mostly possible TermScorer) the same as the top-level
>>>> Scorer. Why I want to do this is: When I Collect a doc, I want to know
>>>> which term is matched(especially for BooleanClause whose Occur is SHOULD).
>>>> we have discussed some solutions, such as adding bit masks in disjunction
>>>> scorers. with this method, when we finds a matched doc, we can recursively
>>>> find which leaf scorer is matched. But we think it's not very efficient and
>>>> not convenient to use(this is my proposal but not agreed by others in our
>>>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>>>    we analysed the codes and found that the only Scorers used by
>>>> BooleanScorer2 that will make the children scorers' docID() not equal to
>>>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>>>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>>>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>>>> find the smallest doc. after finding a matched doc, the currentDoc is the
>>>> matched doc and all the scorers in the heap will call nextDoc() so all of
>>>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>>>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>>>> of the root of the scorer tree.
>>>>    So we modify the DisjuctionSumScorer and let it behavior as we
>>>> expected. And then I wrote some TestCase and it works well. And also I
>>>> wrote some random generated TermScorer and compared the nextDoc(),score()
>>>> and advance(int) method of original DisjunctionSumScorer and modified one.
>>>> nextDoc() and score() and exactly the same. But for advance(int target), we
>>>> found some interesting and strange things.
>>>>    at the beginning, I think if target is less than current docID, it
>>>> will just return current docID and do nothing. this assumption let my
>>>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>>>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>>>> docID is larger than target or not.
>>>>    So I am confused and then read the javadoc of DocIdSetIterator:
>>>> ----------------- javadoc of DocIdSetIterator.advance(int
>>>> target)-------------
>>>>
>>>> int org.apache.lucene.search.DocIdSetIterator.advance(int target)
>>>> throws IOException
>>>>
>>>> Advances to the first beyond (see NOTE below) the current whose
>>>> document number is greater than or equal
>>>>  to target. Returns the current document number or NO_MORE_DOCS if
>>>> there are no more docs in the set.
>>>> Behaves as if written:
>>>>  int advance(int target) {
>>>>    int doc;
>>>>    while ((doc = nextDoc()) < target) {
>>>>    }
>>>>    return doc;
>>>>  }
>>>>  Some implementations are considerably more efficient than that.
>>>> NOTE: when target < current implementations may opt not to advance
>>>> beyond their current docID().
>>>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by
>>>> some Scorers. If your
>>>>  implementation cannot efficiently determine that it should exhaust, it
>>>> is recommended that you check for
>>>>  that value in each call to this method.
>>>> NOTE: after the iterator has exhausted you should not call this method,
>>>> as it may result in unpredicted
>>>>  behavior.
>>>> --------------------------------------
>>>> Then I modified my algorithm again and found that
>>>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>>>> the cases, it will return currentDoc if target < currentDoc. but in some
>>>> boundary condition, it will not.
>>>> it's not a bug but let me sad. I thought my algorithm has some bug
>>>> because it's advance method is not exactly the same as original
>>>> DisjunctionSumScorer's.
>>>> ----codes of DisjunctionSumScorer---
>>>>   @Override
>>>>   public int advance(int target) throws IOException {
>>>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>>>       return currentDoc = NO_MORE_DOCS;
>>>>     }
>>>>     if (target <= currentDoc) {
>>>>       return currentDoc;
>>>>     }
>>>>    ....
>>>> -------------------
>>>> for most case if (target <= currentDoc) it will return currentDoc;
>>>> but if previous advance will make sub scorers exhausted, then if may
>>>> return NO_MORE_DOCS
>>>> an example is:
>>>>    currentDoc=-1
>>>>    minimumNrMatchers=1
>>>>    subScorers:
>>>>       TermScorer: docIds: [1, 2, 6]
>>>>       TermScorer: docIds: [2, 4]
>>>> after first call advance(5)
>>>>     currentDoc=6
>>>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>>>> then call advance(6)
>>>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>>>> NO_MORE_DOCS
>>>>
>>>> My question is why the advance(int target) method is defined like this?
>>>> for the reason of efficient or any other reasons?
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> gedel@yandex.ru
>>>
>>> <http://www.griddynamics.com>
>>>  <mk...@griddynamics.com>
>>>
>>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Li,

I can give you my understanding of difference in DisjunctionSumScorer and
https://issues.apache.org/jira/browse/LUCENE-2686

When you choose steady children approach, even you omit call score() you
have to enumerate top of child scorers heap twice.
Check nextDoc() from the patch: first loop pushes top of heap first, then
the second loop through the top is done by
countMatches(1); countMatches(2);

Current DisjunctionSumScorer enumerate top of heap once per every doc:
while it loops through top of heap it counts number of clauses matched;
accumulate score; and pushes top children one step forward.

what's minimumNrMatchers do you have? can you upload your test github?
(mail list rips attachments off)

On Thu, Apr 19, 2012 at 7:34 AM, Li Li <fa...@gmail.com> wrote:

> Michael McCandless wrote:
>
> So... the good news is I made a new scorer (basically copied
> DisjunctionMaxScorer and then tweaked from there) that scores the OR-only
> case. All tests pass w/ this new scorer.
>
> And more good news is that if you don't score (I sort by doctitle to do
> that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
> expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
> it to use BS2 not BS – the nocommits in the patch).
>
> But the bad news is with scoring on it's 22.7% slower!
>
> And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
> faster for this one query. I think we need to modify the criteria that
> decides whether to use BS or BS2... maybe when there are lots of
> lowish-docFreq terms, BS2 is better?
> ---------------
> why new algorithm is 22% slower than old one?
> I read the codes and feel them almost the same.
>
> On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Hello,
>>
>> I can't help with the particular question, but can share some experience.
>> My task is roughly the same I've found the patch
>> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
>> (with one small addition, I'll post it in comments soon). By using it I
>> have disjunction summing query with steady subscorers.
>>
>> Regards
>>
>> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>>
>>> hi all,
>>>     I am now hacking the BooleanScorer2 to let it keep the docID() of
>>> the leaf scorer(mostly possible TermScorer) the same as the top-level
>>> Scorer. Why I want to do this is: When I Collect a doc, I want to know
>>> which term is matched(especially for BooleanClause whose Occur is SHOULD).
>>> we have discussed some solutions, such as adding bit masks in disjunction
>>> scorers. with this method, when we finds a matched doc, we can recursively
>>> find which leaf scorer is matched. But we think it's not very efficient and
>>> not convenient to use(this is my proposal but not agreed by others in our
>>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>>    we analysed the codes and found that the only Scorers used by
>>> BooleanScorer2 that will make the children scorers' docID() not equal to
>>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>>> find the smallest doc. after finding a matched doc, the currentDoc is the
>>> matched doc and all the scorers in the heap will call nextDoc() so all of
>>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>>> of the root of the scorer tree.
>>>    So we modify the DisjuctionSumScorer and let it behavior as we
>>> expected. And then I wrote some TestCase and it works well. And also I
>>> wrote some random generated TermScorer and compared the nextDoc(),score()
>>> and advance(int) method of original DisjunctionSumScorer and modified one.
>>> nextDoc() and score() and exactly the same. But for advance(int target), we
>>> found some interesting and strange things.
>>>    at the beginning, I think if target is less than current docID, it
>>> will just return current docID and do nothing. this assumption let my
>>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>>> docID is larger than target or not.
>>>    So I am confused and then read the javadoc of DocIdSetIterator:
>>> ----------------- javadoc of DocIdSetIterator.advance(int
>>> target)-------------
>>>
>>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>>> IOException
>>>
>>> Advances to the first beyond (see NOTE below) the current whose document
>>> number is greater than or equal
>>>  to target. Returns the current document number or NO_MORE_DOCS if there
>>> are no more docs in the set.
>>> Behaves as if written:
>>>  int advance(int target) {
>>>    int doc;
>>>    while ((doc = nextDoc()) < target) {
>>>    }
>>>    return doc;
>>>  }
>>>  Some implementations are considerably more efficient than that.
>>> NOTE: when target < current implementations may opt not to advance
>>> beyond their current docID().
>>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>>> Scorers. If your
>>>  implementation cannot efficiently determine that it should exhaust, it
>>> is recommended that you check for
>>>  that value in each call to this method.
>>> NOTE: after the iterator has exhausted you should not call this method,
>>> as it may result in unpredicted
>>>  behavior.
>>> --------------------------------------
>>> Then I modified my algorithm again and found that
>>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>>> the cases, it will return currentDoc if target < currentDoc. but in some
>>> boundary condition, it will not.
>>> it's not a bug but let me sad. I thought my algorithm has some bug
>>> because it's advance method is not exactly the same as original
>>> DisjunctionSumScorer's.
>>> ----codes of DisjunctionSumScorer---
>>>   @Override
>>>   public int advance(int target) throws IOException {
>>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>>       return currentDoc = NO_MORE_DOCS;
>>>     }
>>>     if (target <= currentDoc) {
>>>       return currentDoc;
>>>     }
>>>    ....
>>> -------------------
>>> for most case if (target <= currentDoc) it will return currentDoc;
>>> but if previous advance will make sub scorers exhausted, then if may
>>> return NO_MORE_DOCS
>>> an example is:
>>>    currentDoc=-1
>>>    minimumNrMatchers=1
>>>    subScorers:
>>>       TermScorer: docIds: [1, 2, 6]
>>>       TermScorer: docIds: [2, 4]
>>> after first call advance(5)
>>>     currentDoc=6
>>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>>> then call advance(6)
>>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>>> NO_MORE_DOCS
>>>
>>> My question is why the advance(int target) method is defined like this?
>>> for the reason of efficient or any other reasons?
>>>
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> gedel@yandex.ru
>>
>> <http://www.griddynamics.com>
>>  <mk...@griddynamics.com>
>>
>>
>


-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
Michael McCandless wrote:

So... the good news is I made a new scorer (basically copied
DisjunctionMaxScorer and then tweaked from there) that scores the OR-only
case. All tests pass w/ this new scorer.

And more good news is that if you don't score (I sort by doctitle to do
that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
it to use BS2 not BS – the nocommits in the patch).

But the bad news is with scoring on it's 22.7% slower!

And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
faster for this one query. I think we need to modify the criteria that
decides whether to use BS or BS2... maybe when there are lots of
lowish-docFreq terms, BS2 is better?
---------------
why new algorithm is 22% slower than old one?
I read the codes and feel them almost the same.

On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I can't help with the particular question, but can share some experience.
> My task is roughly the same I've found the patch
> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
> (with one small addition, I'll post it in comments soon). By using it I
> have disjunction summing query with steady subscorers.
>
> Regards
>
> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>
>> hi all,
>>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
>> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
>> Why I want to do this is: When I Collect a doc, I want to know which term
>> is matched(especially for BooleanClause whose Occur is SHOULD). we have
>> discussed some solutions, such as adding bit masks in disjunction scorers.
>> with this method, when we finds a matched doc, we can recursively find
>> which leaf scorer is matched. But we think it's not very efficient and not
>> convenient to use(this is my proposal but not agreed by others in our
>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>    we analysed the codes and found that the only Scorers used by
>> BooleanScorer2 that will make the children scorers' docID() not equal to
>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>> find the smallest doc. after finding a matched doc, the currentDoc is the
>> matched doc and all the scorers in the heap will call nextDoc() so all of
>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>> of the root of the scorer tree.
>>    So we modify the DisjuctionSumScorer and let it behavior as we
>> expected. And then I wrote some TestCase and it works well. And also I
>> wrote some random generated TermScorer and compared the nextDoc(),score()
>> and advance(int) method of original DisjunctionSumScorer and modified one.
>> nextDoc() and score() and exactly the same. But for advance(int target), we
>> found some interesting and strange things.
>>    at the beginning, I think if target is less than current docID, it
>> will just return current docID and do nothing. this assumption let my
>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>> docID is larger than target or not.
>>    So I am confused and then read the javadoc of DocIdSetIterator:
>> ----------------- javadoc of DocIdSetIterator.advance(int
>> target)-------------
>>
>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>> IOException
>>
>> Advances to the first beyond (see NOTE below) the current whose document
>> number is greater than or equal
>>  to target. Returns the current document number or NO_MORE_DOCS if there
>> are no more docs in the set.
>> Behaves as if written:
>>  int advance(int target) {
>>    int doc;
>>    while ((doc = nextDoc()) < target) {
>>    }
>>    return doc;
>>  }
>>  Some implementations are considerably more efficient than that.
>> NOTE: when target < current implementations may opt not to advance beyond
>> their current docID().
>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>> Scorers. If your
>>  implementation cannot efficiently determine that it should exhaust, it
>> is recommended that you check for
>>  that value in each call to this method.
>> NOTE: after the iterator has exhausted you should not call this method,
>> as it may result in unpredicted
>>  behavior.
>> --------------------------------------
>> Then I modified my algorithm again and found that
>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>> the cases, it will return currentDoc if target < currentDoc. but in some
>> boundary condition, it will not.
>> it's not a bug but let me sad. I thought my algorithm has some bug
>> because it's advance method is not exactly the same as original
>> DisjunctionSumScorer's.
>> ----codes of DisjunctionSumScorer---
>>   @Override
>>   public int advance(int target) throws IOException {
>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>       return currentDoc = NO_MORE_DOCS;
>>     }
>>     if (target <= currentDoc) {
>>       return currentDoc;
>>     }
>>    ....
>> -------------------
>> for most case if (target <= currentDoc) it will return currentDoc;
>> but if previous advance will make sub scorers exhausted, then if may
>> return NO_MORE_DOCS
>> an example is:
>>    currentDoc=-1
>>    minimumNrMatchers=1
>>    subScorers:
>>       TermScorer: docIds: [1, 2, 6]
>>       TermScorer: docIds: [2, 4]
>> after first call advance(5)
>>     currentDoc=6
>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>> then call advance(6)
>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>> NO_MORE_DOCS
>>
>> My question is why the advance(int target) method is defined like this?
>> for the reason of efficient or any other reasons?
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
I found the patch has removed the score(Collector) and
score(Collector,int,int) method. this seems to mean that this scorer is not
possibly used as top-level scorer. Why old implementation has this method?
anywhere use DisjunctionSumScorer as top level scorer?

On Wed, Apr 18, 2012 at 12:18 PM, Li Li <fa...@gmail.com> wrote:

> thanks. it's great.
>
>
> On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Hello,
>>
>> I can't help with the particular question, but can share some experience.
>> My task is roughly the same I've found the patch
>> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
>> (with one small addition, I'll post it in comments soon). By using it I
>> have disjunction summing query with steady subscorers.
>>
>> Regards
>>
>> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>>
>>> hi all,
>>>     I am now hacking the BooleanScorer2 to let it keep the docID() of
>>> the leaf scorer(mostly possible TermScorer) the same as the top-level
>>> Scorer. Why I want to do this is: When I Collect a doc, I want to know
>>> which term is matched(especially for BooleanClause whose Occur is SHOULD).
>>> we have discussed some solutions, such as adding bit masks in disjunction
>>> scorers. with this method, when we finds a matched doc, we can recursively
>>> find which leaf scorer is matched. But we think it's not very efficient and
>>> not convenient to use(this is my proposal but not agreed by others in our
>>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>>    we analysed the codes and found that the only Scorers used by
>>> BooleanScorer2 that will make the children scorers' docID() not equal to
>>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>>> find the smallest doc. after finding a matched doc, the currentDoc is the
>>> matched doc and all the scorers in the heap will call nextDoc() so all of
>>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>>> of the root of the scorer tree.
>>>    So we modify the DisjuctionSumScorer and let it behavior as we
>>> expected. And then I wrote some TestCase and it works well. And also I
>>> wrote some random generated TermScorer and compared the nextDoc(),score()
>>> and advance(int) method of original DisjunctionSumScorer and modified one.
>>> nextDoc() and score() and exactly the same. But for advance(int target), we
>>> found some interesting and strange things.
>>>    at the beginning, I think if target is less than current docID, it
>>> will just return current docID and do nothing. this assumption let my
>>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>>> docID is larger than target or not.
>>>    So I am confused and then read the javadoc of DocIdSetIterator:
>>> ----------------- javadoc of DocIdSetIterator.advance(int
>>> target)-------------
>>>
>>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>>> IOException
>>>
>>> Advances to the first beyond (see NOTE below) the current whose document
>>> number is greater than or equal
>>>  to target. Returns the current document number or NO_MORE_DOCS if there
>>> are no more docs in the set.
>>> Behaves as if written:
>>>  int advance(int target) {
>>>    int doc;
>>>    while ((doc = nextDoc()) < target) {
>>>    }
>>>    return doc;
>>>  }
>>>  Some implementations are considerably more efficient than that.
>>> NOTE: when target < current implementations may opt not to advance
>>> beyond their current docID().
>>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>>> Scorers. If your
>>>  implementation cannot efficiently determine that it should exhaust, it
>>> is recommended that you check for
>>>  that value in each call to this method.
>>> NOTE: after the iterator has exhausted you should not call this method,
>>> as it may result in unpredicted
>>>  behavior.
>>> --------------------------------------
>>> Then I modified my algorithm again and found that
>>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>>> the cases, it will return currentDoc if target < currentDoc. but in some
>>> boundary condition, it will not.
>>> it's not a bug but let me sad. I thought my algorithm has some bug
>>> because it's advance method is not exactly the same as original
>>> DisjunctionSumScorer's.
>>> ----codes of DisjunctionSumScorer---
>>>   @Override
>>>   public int advance(int target) throws IOException {
>>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>>       return currentDoc = NO_MORE_DOCS;
>>>     }
>>>     if (target <= currentDoc) {
>>>       return currentDoc;
>>>     }
>>>    ....
>>> -------------------
>>> for most case if (target <= currentDoc) it will return currentDoc;
>>> but if previous advance will make sub scorers exhausted, then if may
>>> return NO_MORE_DOCS
>>> an example is:
>>>    currentDoc=-1
>>>    minimumNrMatchers=1
>>>    subScorers:
>>>       TermScorer: docIds: [1, 2, 6]
>>>       TermScorer: docIds: [2, 4]
>>> after first call advance(5)
>>>     currentDoc=6
>>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>>> then call advance(6)
>>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>>> NO_MORE_DOCS
>>>
>>> My question is why the advance(int target) method is defined like this?
>>> for the reason of efficient or any other reasons?
>>>
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> gedel@yandex.ru
>>
>> <http://www.griddynamics.com>
>>  <mk...@griddynamics.com>
>>
>>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
thanks. it's great.

On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I can't help with the particular question, but can share some experience.
> My task is roughly the same I've found the patch
> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
> (with one small addition, I'll post it in comments soon). By using it I
> have disjunction summing query with steady subscorers.
>
> Regards
>
> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>
>> hi all,
>>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
>> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
>> Why I want to do this is: When I Collect a doc, I want to know which term
>> is matched(especially for BooleanClause whose Occur is SHOULD). we have
>> discussed some solutions, such as adding bit masks in disjunction scorers.
>> with this method, when we finds a matched doc, we can recursively find
>> which leaf scorer is matched. But we think it's not very efficient and not
>> convenient to use(this is my proposal but not agreed by others in our
>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>    we analysed the codes and found that the only Scorers used by
>> BooleanScorer2 that will make the children scorers' docID() not equal to
>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>> find the smallest doc. after finding a matched doc, the currentDoc is the
>> matched doc and all the scorers in the heap will call nextDoc() so all of
>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>> of the root of the scorer tree.
>>    So we modify the DisjuctionSumScorer and let it behavior as we
>> expected. And then I wrote some TestCase and it works well. And also I
>> wrote some random generated TermScorer and compared the nextDoc(),score()
>> and advance(int) method of original DisjunctionSumScorer and modified one.
>> nextDoc() and score() and exactly the same. But for advance(int target), we
>> found some interesting and strange things.
>>    at the beginning, I think if target is less than current docID, it
>> will just return current docID and do nothing. this assumption let my
>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>> docID is larger than target or not.
>>    So I am confused and then read the javadoc of DocIdSetIterator:
>> ----------------- javadoc of DocIdSetIterator.advance(int
>> target)-------------
>>
>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>> IOException
>>
>> Advances to the first beyond (see NOTE below) the current whose document
>> number is greater than or equal
>>  to target. Returns the current document number or NO_MORE_DOCS if there
>> are no more docs in the set.
>> Behaves as if written:
>>  int advance(int target) {
>>    int doc;
>>    while ((doc = nextDoc()) < target) {
>>    }
>>    return doc;
>>  }
>>  Some implementations are considerably more efficient than that.
>> NOTE: when target < current implementations may opt not to advance beyond
>> their current docID().
>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>> Scorers. If your
>>  implementation cannot efficiently determine that it should exhaust, it
>> is recommended that you check for
>>  that value in each call to this method.
>> NOTE: after the iterator has exhausted you should not call this method,
>> as it may result in unpredicted
>>  behavior.
>> --------------------------------------
>> Then I modified my algorithm again and found that
>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>> the cases, it will return currentDoc if target < currentDoc. but in some
>> boundary condition, it will not.
>> it's not a bug but let me sad. I thought my algorithm has some bug
>> because it's advance method is not exactly the same as original
>> DisjunctionSumScorer's.
>> ----codes of DisjunctionSumScorer---
>>   @Override
>>   public int advance(int target) throws IOException {
>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>       return currentDoc = NO_MORE_DOCS;
>>     }
>>     if (target <= currentDoc) {
>>       return currentDoc;
>>     }
>>    ....
>> -------------------
>> for most case if (target <= currentDoc) it will return currentDoc;
>> but if previous advance will make sub scorers exhausted, then if may
>> return NO_MORE_DOCS
>> an example is:
>>    currentDoc=-1
>>    minimumNrMatchers=1
>>    subScorers:
>>       TermScorer: docIds: [1, 2, 6]
>>       TermScorer: docIds: [2, 4]
>> after first call advance(5)
>>     currentDoc=6
>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>> then call advance(6)
>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>> NO_MORE_DOCS
>>
>> My question is why the advance(int target) method is defined like this?
>> for the reason of efficient or any other reasons?
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
thanks. it's great.

On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I can't help with the particular question, but can share some experience.
> My task is roughly the same I've found the patch
> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
> (with one small addition, I'll post it in comments soon). By using it I
> have disjunction summing query with steady subscorers.
>
> Regards
>
> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>
>> hi all,
>>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
>> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
>> Why I want to do this is: When I Collect a doc, I want to know which term
>> is matched(especially for BooleanClause whose Occur is SHOULD). we have
>> discussed some solutions, such as adding bit masks in disjunction scorers.
>> with this method, when we finds a matched doc, we can recursively find
>> which leaf scorer is matched. But we think it's not very efficient and not
>> convenient to use(this is my proposal but not agreed by others in our
>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>    we analysed the codes and found that the only Scorers used by
>> BooleanScorer2 that will make the children scorers' docID() not equal to
>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>> find the smallest doc. after finding a matched doc, the currentDoc is the
>> matched doc and all the scorers in the heap will call nextDoc() so all of
>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>> of the root of the scorer tree.
>>    So we modify the DisjuctionSumScorer and let it behavior as we
>> expected. And then I wrote some TestCase and it works well. And also I
>> wrote some random generated TermScorer and compared the nextDoc(),score()
>> and advance(int) method of original DisjunctionSumScorer and modified one.
>> nextDoc() and score() and exactly the same. But for advance(int target), we
>> found some interesting and strange things.
>>    at the beginning, I think if target is less than current docID, it
>> will just return current docID and do nothing. this assumption let my
>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>> docID is larger than target or not.
>>    So I am confused and then read the javadoc of DocIdSetIterator:
>> ----------------- javadoc of DocIdSetIterator.advance(int
>> target)-------------
>>
>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>> IOException
>>
>> Advances to the first beyond (see NOTE below) the current whose document
>> number is greater than or equal
>>  to target. Returns the current document number or NO_MORE_DOCS if there
>> are no more docs in the set.
>> Behaves as if written:
>>  int advance(int target) {
>>    int doc;
>>    while ((doc = nextDoc()) < target) {
>>    }
>>    return doc;
>>  }
>>  Some implementations are considerably more efficient than that.
>> NOTE: when target < current implementations may opt not to advance beyond
>> their current docID().
>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>> Scorers. If your
>>  implementation cannot efficiently determine that it should exhaust, it
>> is recommended that you check for
>>  that value in each call to this method.
>> NOTE: after the iterator has exhausted you should not call this method,
>> as it may result in unpredicted
>>  behavior.
>> --------------------------------------
>> Then I modified my algorithm again and found that
>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>> the cases, it will return currentDoc if target < currentDoc. but in some
>> boundary condition, it will not.
>> it's not a bug but let me sad. I thought my algorithm has some bug
>> because it's advance method is not exactly the same as original
>> DisjunctionSumScorer's.
>> ----codes of DisjunctionSumScorer---
>>   @Override
>>   public int advance(int target) throws IOException {
>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>       return currentDoc = NO_MORE_DOCS;
>>     }
>>     if (target <= currentDoc) {
>>       return currentDoc;
>>     }
>>    ....
>> -------------------
>> for most case if (target <= currentDoc) it will return currentDoc;
>> but if previous advance will make sub scorers exhausted, then if may
>> return NO_MORE_DOCS
>> an example is:
>>    currentDoc=-1
>>    minimumNrMatchers=1
>>    subScorers:
>>       TermScorer: docIds: [1, 2, 6]
>>       TermScorer: docIds: [2, 4]
>> after first call advance(5)
>>     currentDoc=6
>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>> then call advance(6)
>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>> NO_MORE_DOCS
>>
>> My question is why the advance(int target) method is defined like this?
>> for the reason of efficient or any other reasons?
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
Michael McCandless wrote:

So... the good news is I made a new scorer (basically copied
DisjunctionMaxScorer and then tweaked from there) that scores the OR-only
case. All tests pass w/ this new scorer.

And more good news is that if you don't score (I sort by doctitle to do
that), you get a speedup – 7.7% in my simplistic test (prefix query unit*,
expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force
it to use BS2 not BS – the nocommits in the patch).

But the bad news is with scoring on it's 22.7% slower!

And, the weird news is, I discovered accidentally that BS2 is much (> 2X)
faster for this one query. I think we need to modify the criteria that
decides whether to use BS or BS2... maybe when there are lots of
lowish-docFreq terms, BS2 is better?
---------------
why new algorithm is 22% slower than old one?
I read the codes and feel them almost the same.

On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I can't help with the particular question, but can share some experience.
> My task is roughly the same I've found the patch
> https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
> (with one small addition, I'll post it in comments soon). By using it I
> have disjunction summing query with steady subscorers.
>
> Regards
>
> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:
>
>> hi all,
>>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
>> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
>> Why I want to do this is: When I Collect a doc, I want to know which term
>> is matched(especially for BooleanClause whose Occur is SHOULD). we have
>> discussed some solutions, such as adding bit masks in disjunction scorers.
>> with this method, when we finds a matched doc, we can recursively find
>> which leaf scorer is matched. But we think it's not very efficient and not
>> convenient to use(this is my proposal but not agreed by others in our
>> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>>    we analysed the codes and found that the only Scorers used by
>> BooleanScorer2 that will make the children scorers' docID() not equal to
>> parent is an anonymous class inherited from DisjunctionSumScorer. All other
>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>>    The implementation algorithm of DisjunctionSumScorer use a heap to
>> find the smallest doc. after finding a matched doc, the currentDoc is the
>> matched doc and all the scorers in the heap will call nextDoc() so all of
>> the scorers' current docID the nextDoc of currentDoc. if there are N level
>> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
>> of the root of the scorer tree.
>>    So we modify the DisjuctionSumScorer and let it behavior as we
>> expected. And then I wrote some TestCase and it works well. And also I
>> wrote some random generated TermScorer and compared the nextDoc(),score()
>> and advance(int) method of original DisjunctionSumScorer and modified one.
>> nextDoc() and score() and exactly the same. But for advance(int target), we
>> found some interesting and strange things.
>>    at the beginning, I think if target is less than current docID, it
>> will just return current docID and do nothing. this assumption let my
>> algorithm go wrong. Then I read the codes of TermScorer and found each call
>> of advance(int) of TermScorer will call nextDoc() no matter whether current
>> docID is larger than target or not.
>>    So I am confused and then read the javadoc of DocIdSetIterator:
>> ----------------- javadoc of DocIdSetIterator.advance(int
>> target)-------------
>>
>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
>> IOException
>>
>> Advances to the first beyond (see NOTE below) the current whose document
>> number is greater than or equal
>>  to target. Returns the current document number or NO_MORE_DOCS if there
>> are no more docs in the set.
>> Behaves as if written:
>>  int advance(int target) {
>>    int doc;
>>    while ((doc = nextDoc()) < target) {
>>    }
>>    return doc;
>>  }
>>  Some implementations are considerably more efficient than that.
>> NOTE: when target < current implementations may opt not to advance beyond
>> their current docID().
>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
>> Scorers. If your
>>  implementation cannot efficiently determine that it should exhaust, it
>> is recommended that you check for
>>  that value in each call to this method.
>> NOTE: after the iterator has exhausted you should not call this method,
>> as it may result in unpredicted
>>  behavior.
>> --------------------------------------
>> Then I modified my algorithm again and found that
>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
>> the cases, it will return currentDoc if target < currentDoc. but in some
>> boundary condition, it will not.
>> it's not a bug but let me sad. I thought my algorithm has some bug
>> because it's advance method is not exactly the same as original
>> DisjunctionSumScorer's.
>> ----codes of DisjunctionSumScorer---
>>   @Override
>>   public int advance(int target) throws IOException {
>>     if (scorerDocQueue.size() < minimumNrMatchers) {
>>       return currentDoc = NO_MORE_DOCS;
>>     }
>>     if (target <= currentDoc) {
>>       return currentDoc;
>>     }
>>    ....
>> -------------------
>> for most case if (target <= currentDoc) it will return currentDoc;
>> but if previous advance will make sub scorers exhausted, then if may
>> return NO_MORE_DOCS
>> an example is:
>>    currentDoc=-1
>>    minimumNrMatchers=1
>>    subScorers:
>>       TermScorer: docIds: [1, 2, 6]
>>       TermScorer: docIds: [2, 4]
>> after first call advance(5)
>>     currentDoc=6
>>     only first scorer is now in the heap, scorerDocQueue.size()==1
>> then call advance(6)
>>     because scorerDocQueue.size() < minimumNrMatchers, it just return
>> NO_MORE_DOCS
>>
>> My question is why the advance(int target) method is defined like this?
>> for the reason of efficient or any other reasons?
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

I can't help with the particular question, but can share some experience.
My task is roughly the same I've found the patch
https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
(with one small addition, I'll post it in comments soon). By using it I
have disjunction summing query with steady subscorers.

Regards

On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:

> hi all,
>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
> Why I want to do this is: When I Collect a doc, I want to know which term
> is matched(especially for BooleanClause whose Occur is SHOULD). we have
> discussed some solutions, such as adding bit masks in disjunction scorers.
> with this method, when we finds a matched doc, we can recursively find
> which leaf scorer is matched. But we think it's not very efficient and not
> convenient to use(this is my proposal but not agreed by others in our
> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>    we analysed the codes and found that the only Scorers used by
> BooleanScorer2 that will make the children scorers' docID() not equal to
> parent is an anonymous class inherited from DisjunctionSumScorer. All other
> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>    The implementation algorithm of DisjunctionSumScorer use a heap to find
> the smallest doc. after finding a matched doc, the currentDoc is the
> matched doc and all the scorers in the heap will call nextDoc() so all of
> the scorers' current docID the nextDoc of currentDoc. if there are N level
> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
> of the root of the scorer tree.
>    So we modify the DisjuctionSumScorer and let it behavior as we
> expected. And then I wrote some TestCase and it works well. And also I
> wrote some random generated TermScorer and compared the nextDoc(),score()
> and advance(int) method of original DisjunctionSumScorer and modified one.
> nextDoc() and score() and exactly the same. But for advance(int target), we
> found some interesting and strange things.
>    at the beginning, I think if target is less than current docID, it will
> just return current docID and do nothing. this assumption let my algorithm
> go wrong. Then I read the codes of TermScorer and found each call of
> advance(int) of TermScorer will call nextDoc() no matter whether current
> docID is larger than target or not.
>    So I am confused and then read the javadoc of DocIdSetIterator:
> ----------------- javadoc of DocIdSetIterator.advance(int
> target)-------------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
> IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
>  to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
>  int advance(int target) {
>    int doc;
>    while ((doc = nextDoc()) < target) {
>    }
>    return doc;
>  }
>  Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
>  implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
>  that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
>  behavior.
> --------------------------------------
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
> the cases, it will return currentDoc if target < currentDoc. but in some
> boundary condition, it will not.
> it's not a bug but let me sad. I thought my algorithm has some bug because
> it's advance method is not exactly the same as original
> DisjunctionSumScorer's.
> ----codes of DisjunctionSumScorer---
>   @Override
>   public int advance(int target) throws IOException {
>     if (scorerDocQueue.size() < minimumNrMatchers) {
>       return currentDoc = NO_MORE_DOCS;
>     }
>     if (target <= currentDoc) {
>       return currentDoc;
>     }
>    ....
> -------------------
> for most case if (target <= currentDoc) it will return currentDoc;
> but if previous advance will make sub scorers exhausted, then if may
> return NO_MORE_DOCS
> an example is:
>    currentDoc=-1
>    minimumNrMatchers=1
>    subScorers:
>       TermScorer: docIds: [1, 2, 6]
>       TermScorer: docIds: [2, 4]
> after first call advance(5)
>     currentDoc=6
>     only first scorer is now in the heap, scorerDocQueue.size()==1
> then call advance(6)
>     because scorerDocQueue.size() < minimumNrMatchers, it just return
> NO_MORE_DOCS
>
> My question is why the advance(int target) method is defined like this?
> for the reason of efficient or any other reasons?
>
>



-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

I can't help with the particular question, but can share some experience.
My task is roughly the same I've found the patch
https://issues.apache.org/jira/browse/LUCENE-2686 is absolutely useful
(with one small addition, I'll post it in comments soon). By using it I
have disjunction summing query with steady subscorers.

Regards

On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fa...@gmail.com> wrote:

> hi all,
>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
> Why I want to do this is: When I Collect a doc, I want to know which term
> is matched(especially for BooleanClause whose Occur is SHOULD). we have
> discussed some solutions, such as adding bit masks in disjunction scorers.
> with this method, when we finds a matched doc, we can recursively find
> which leaf scorer is matched. But we think it's not very efficient and not
> convenient to use(this is my proposal but not agreed by others in our
> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>    we analysed the codes and found that the only Scorers used by
> BooleanScorer2 that will make the children scorers' docID() not equal to
> parent is an anonymous class inherited from DisjunctionSumScorer. All other
> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>    The implementation algorithm of DisjunctionSumScorer use a heap to find
> the smallest doc. after finding a matched doc, the currentDoc is the
> matched doc and all the scorers in the heap will call nextDoc() so all of
> the scorers' current docID the nextDoc of currentDoc. if there are N level
> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
> of the root of the scorer tree.
>    So we modify the DisjuctionSumScorer and let it behavior as we
> expected. And then I wrote some TestCase and it works well. And also I
> wrote some random generated TermScorer and compared the nextDoc(),score()
> and advance(int) method of original DisjunctionSumScorer and modified one.
> nextDoc() and score() and exactly the same. But for advance(int target), we
> found some interesting and strange things.
>    at the beginning, I think if target is less than current docID, it will
> just return current docID and do nothing. this assumption let my algorithm
> go wrong. Then I read the codes of TermScorer and found each call of
> advance(int) of TermScorer will call nextDoc() no matter whether current
> docID is larger than target or not.
>    So I am confused and then read the javadoc of DocIdSetIterator:
> ----------------- javadoc of DocIdSetIterator.advance(int
> target)-------------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
> IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
>  to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
>  int advance(int target) {
>    int doc;
>    while ((doc = nextDoc()) < target) {
>    }
>    return doc;
>  }
>  Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
>  implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
>  that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
>  behavior.
> --------------------------------------
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
> the cases, it will return currentDoc if target < currentDoc. but in some
> boundary condition, it will not.
> it's not a bug but let me sad. I thought my algorithm has some bug because
> it's advance method is not exactly the same as original
> DisjunctionSumScorer's.
> ----codes of DisjunctionSumScorer---
>   @Override
>   public int advance(int target) throws IOException {
>     if (scorerDocQueue.size() < minimumNrMatchers) {
>       return currentDoc = NO_MORE_DOCS;
>     }
>     if (target <= currentDoc) {
>       return currentDoc;
>     }
>    ....
> -------------------
> for most case if (target <= currentDoc) it will return currentDoc;
> but if previous advance will make sub scorers exhausted, then if may
> return NO_MORE_DOCS
> an example is:
>    currentDoc=-1
>    minimumNrMatchers=1
>    subScorers:
>       TermScorer: docIds: [1, 2, 6]
>       TermScorer: docIds: [2, 4]
> after first call advance(5)
>     currentDoc=6
>     only first scorer is now in the heap, scorerDocQueue.size()==1
> then call advance(6)
>     because scorerDocQueue.size() < minimumNrMatchers, it just return
> NO_MORE_DOCS
>
> My question is why the advance(int target) method is defined like this?
> for the reason of efficient or any other reasons?
>
>



-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
some mistakes of the example:
after first call advance(5)
    currentDoc=6
    first scorer's nextDoc is called to in advance, the heap is empty now.
then call advance(6)
    because scorerDocQueue.size() < minimumNrMatchers, it just return
NO_MORE_DOCS

On Tue, Apr 17, 2012 at 6:37 PM, Li Li <fa...@gmail.com> wrote:

> hi all,
>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
> Why I want to do this is: When I Collect a doc, I want to know which term
> is matched(especially for BooleanClause whose Occur is SHOULD). we have
> discussed some solutions, such as adding bit masks in disjunction scorers.
> with this method, when we finds a matched doc, we can recursively find
> which leaf scorer is matched. But we think it's not very efficient and not
> convenient to use(this is my proposal but not agreed by others in our
> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>    we analysed the codes and found that the only Scorers used by
> BooleanScorer2 that will make the children scorers' docID() not equal to
> parent is an anonymous class inherited from DisjunctionSumScorer. All other
> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>    The implementation algorithm of DisjunctionSumScorer use a heap to find
> the smallest doc. after finding a matched doc, the currentDoc is the
> matched doc and all the scorers in the heap will call nextDoc() so all of
> the scorers' current docID the nextDoc of currentDoc. if there are N level
> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
> of the root of the scorer tree.
>    So we modify the DisjuctionSumScorer and let it behavior as we
> expected. And then I wrote some TestCase and it works well. And also I
> wrote some random generated TermScorer and compared the nextDoc(),score()
> and advance(int) method of original DisjunctionSumScorer and modified one.
> nextDoc() and score() and exactly the same. But for advance(int target), we
> found some interesting and strange things.
>    at the beginning, I think if target is less than current docID, it will
> just return current docID and do nothing. this assumption let my algorithm
> go wrong. Then I read the codes of TermScorer and found each call of
> advance(int) of TermScorer will call nextDoc() no matter whether current
> docID is larger than target or not.
>    So I am confused and then read the javadoc of DocIdSetIterator:
> ----------------- javadoc of DocIdSetIterator.advance(int
> target)-------------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
> IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
>  to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
>  int advance(int target) {
>    int doc;
>    while ((doc = nextDoc()) < target) {
>    }
>    return doc;
>  }
>  Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
>  implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
>  that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
>  behavior.
> --------------------------------------
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
> the cases, it will return currentDoc if target < currentDoc. but in some
> boundary condition, it will not.
> it's not a bug but let me sad. I thought my algorithm has some bug because
> it's advance method is not exactly the same as original
> DisjunctionSumScorer's.
> ----codes of DisjunctionSumScorer---
>   @Override
>   public int advance(int target) throws IOException {
>     if (scorerDocQueue.size() < minimumNrMatchers) {
>       return currentDoc = NO_MORE_DOCS;
>     }
>     if (target <= currentDoc) {
>       return currentDoc;
>     }
>    ....
> -------------------
> for most case if (target <= currentDoc) it will return currentDoc;
> but if previous advance will make sub scorers exhausted, then if may
> return NO_MORE_DOCS
> an example is:
>    currentDoc=-1
>    minimumNrMatchers=1
>    subScorers:
>       TermScorer: docIds: [1, 2, 6]
>       TermScorer: docIds: [2, 4]
> after first call advance(5)
>     currentDoc=6
>     only first scorer is now in the heap, scorerDocQueue.size()==1
> then call advance(6)
>     because scorerDocQueue.size() < minimumNrMatchers, it just return
> NO_MORE_DOCS
>
> My question is why the advance(int target) method is defined like this?
> for the reason of efficient or any other reasons?
>
>

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?

Posted by Li Li <fa...@gmail.com>.
some mistakes of the example:
after first call advance(5)
    currentDoc=6
    first scorer's nextDoc is called to in advance, the heap is empty now.
then call advance(6)
    because scorerDocQueue.size() < minimumNrMatchers, it just return
NO_MORE_DOCS

On Tue, Apr 17, 2012 at 6:37 PM, Li Li <fa...@gmail.com> wrote:

> hi all,
>     I am now hacking the BooleanScorer2 to let it keep the docID() of the
> leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
> Why I want to do this is: When I Collect a doc, I want to know which term
> is matched(especially for BooleanClause whose Occur is SHOULD). we have
> discussed some solutions, such as adding bit masks in disjunction scorers.
> with this method, when we finds a matched doc, we can recursively find
> which leaf scorer is matched. But we think it's not very efficient and not
> convenient to use(this is my proposal but not agreed by others in our
> team). and then we came up with another one: Modifying DisjunctionSumScorer.
>    we analysed the codes and found that the only Scorers used by
> BooleanScorer2 that will make the children scorers' docID() not equal to
> parent is an anonymous class inherited from DisjunctionSumScorer. All other
> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous),
> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need.
>    The implementation algorithm of DisjunctionSumScorer use a heap to find
> the smallest doc. after finding a matched doc, the currentDoc is the
> matched doc and all the scorers in the heap will call nextDoc() so all of
> the scorers' current docID the nextDoc of currentDoc. if there are N level
> DisjunctionSumScorer, the leaf scorer's current doc is the n-th next docId
> of the root of the scorer tree.
>    So we modify the DisjuctionSumScorer and let it behavior as we
> expected. And then I wrote some TestCase and it works well. And also I
> wrote some random generated TermScorer and compared the nextDoc(),score()
> and advance(int) method of original DisjunctionSumScorer and modified one.
> nextDoc() and score() and exactly the same. But for advance(int target), we
> found some interesting and strange things.
>    at the beginning, I think if target is less than current docID, it will
> just return current docID and do nothing. this assumption let my algorithm
> go wrong. Then I read the codes of TermScorer and found each call of
> advance(int) of TermScorer will call nextDoc() no matter whether current
> docID is larger than target or not.
>    So I am confused and then read the javadoc of DocIdSetIterator:
> ----------------- javadoc of DocIdSetIterator.advance(int
> target)-------------
>
> int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws
> IOException
>
> Advances to the first beyond (see NOTE below) the current whose document
> number is greater than or equal
>  to target. Returns the current document number or NO_MORE_DOCS if there
> are no more docs in the set.
> Behaves as if written:
>  int advance(int target) {
>    int doc;
>    while ((doc = nextDoc()) < target) {
>    }
>    return doc;
>  }
>  Some implementations are considerably more efficient than that.
> NOTE: when target < current implementations may opt not to advance beyond
> their current docID().
> NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
> Scorers. If your
>  implementation cannot efficiently determine that it should exhaust, it is
> recommended that you check for
>  that value in each call to this method.
> NOTE: after the iterator has exhausted you should not call this method, as
> it may result in unpredicted
>  behavior.
> --------------------------------------
> Then I modified my algorithm again and found that
> DisjunctionSumScorer.advance(int target) has some strange behavior. most of
> the cases, it will return currentDoc if target < currentDoc. but in some
> boundary condition, it will not.
> it's not a bug but let me sad. I thought my algorithm has some bug because
> it's advance method is not exactly the same as original
> DisjunctionSumScorer's.
> ----codes of DisjunctionSumScorer---
>   @Override
>   public int advance(int target) throws IOException {
>     if (scorerDocQueue.size() < minimumNrMatchers) {
>       return currentDoc = NO_MORE_DOCS;
>     }
>     if (target <= currentDoc) {
>       return currentDoc;
>     }
>    ....
> -------------------
> for most case if (target <= currentDoc) it will return currentDoc;
> but if previous advance will make sub scorers exhausted, then if may
> return NO_MORE_DOCS
> an example is:
>    currentDoc=-1
>    minimumNrMatchers=1
>    subScorers:
>       TermScorer: docIds: [1, 2, 6]
>       TermScorer: docIds: [2, 4]
> after first call advance(5)
>     currentDoc=6
>     only first scorer is now in the heap, scorerDocQueue.size()==1
> then call advance(6)
>     because scorerDocQueue.size() < minimumNrMatchers, it just return
> NO_MORE_DOCS
>
> My question is why the advance(int target) method is defined like this?
> for the reason of efficient or any other reasons?
>
>