Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2009/01/29 03:56:22 UTC

Matcher

Greets,

Many ideas have come together in a new KS class, Matcher, which I think ought
to replace Scorer in both KS and Lucy.

Conceptually, Matcher began as a simple iterator over a set of doc nums.  It
was born because KS needed an opaque super class to support iterating over
deletions -- which might be backed either by a bit vector or by what we're
calling "tombstones".

In Lucene, that's the role played by "DocIdSetIterator".  The name
"DocIdSetIterator" is precise and descriptive, but since it's a tad verbose I
stumbled around for a bit looking for an alternative... and arrived at
"Matcher".

The idea of a "Matcher" class has been batted around many times before, but
usually in the context of "Scorer minus the scoring".  This "Matcher" would be
a more general class, usable not only for iterating over hits, but for
iterating over any set of doc nums in any context.
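
To make this concrete, here is a rough sketch of the iterator contract in
plain C.  It's illustrative only: the method names match the Next(),
Advance(), and Get_Doc_Num() methods discussed below, but the
struct-of-function-pointers layout is not how KS actually implements its
classes, and the sketch assumes doc nums start at 1 so that 0 can serve as
an "exhausted" sentinel.

    #include <stdint.h>

    typedef struct Matcher Matcher;
    struct Matcher {
        /* Advance to the next matching doc num and return it, or return
         * 0 once the iterator is exhausted. */
        int32_t (*next)(Matcher *self);
        /* Advance to the first match at or after target; return it, or 0. */
        int32_t (*advance)(Matcher *self, int32_t target);
        /* Return the doc num the iterator is currently positioned on. */
        int32_t (*get_doc_num)(Matcher *self);
    };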

Around this time, Nate and I had a private email exchange on the subject of
deletions-as-iterators.  Nate:

    That said, deletions as filters seems like a good way to view things, 
    though.  I'd probably go further, and not have any special handling 
    for them in the base, then subclass Scorer_Collect to 
    Scorer_Collect_Filtered, and allow for stacked filters.

Interesting... So, if you had both a QueryFilter and some deletions, you might
merge them together into a single iterator whose output would OR together
their doc num sets.  You'd probably want to do this with a priority queue...
Hmm, that's the same technique that we'd be using for merging tombstone rows
-- thus a generic priority queue class which unions the output of multiple
Matcher iterators would kill two birds with one stone.

But wait -- we already have such a class!  ScorerDocQueue is used by ORScorer
to OR together the doc num sets of multiple Scorers.  And yet, it doesn't
handle any scoring duties itself -- the only methods it calls are Next(),
Advance() and Get_Doc_Num(), all of which are defined by Scorer's parent class
Matcher.  All we have to do to adapt it is change the queue elements from
Scorers to Matchers.  *Three* birds with one stone!
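
To illustrate the technique ScorerDocQueue embodies, here's a rough union
loop over several Matchers, reusing the hypothetical struct from the sketch
above.  A linear scan stands in for the priority queue, 0 again means
"exhausted", and none of these names are real.

    /* Emit the smallest current doc num across all sub-Matchers, then
     * advance every sub-Matcher sitting on that doc num so duplicates are
     * emitted only once.  current[i] caches kids[i]'s doc num and must be
     * primed by calling next() once on each child before the first call. */
    static int32_t
    union_next(Matcher **kids, uint32_t num_kids, int32_t *current)
    {
        int32_t least = 0;
        for (uint32_t i = 0; i < num_kids; i++) {
            if (current[i] != 0 && (least == 0 || current[i] < least)) {
                least = current[i];
            }
        }
        if (least == 0) { return 0; }      /* all sub-Matchers exhausted */
        for (uint32_t i = 0; i < num_kids; i++) {
            if (current[i] == least) {
                current[i] = kids[i]->next(kids[i]);
            }
        }
        return least;
    }

With a priority queue keyed on the current doc num, those linear scans
become log-time operations -- which is what ScorerDocQueue already provides.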

Can we pursue this still further?

The original concept for "Matcher" was a class that matches but does not
score.  Instead of a HitCollector, one thought went, we might use a
"MatchCollector" which collects only document numbers.  

The KS HitCollector's Collect() method has recently changed to address the
"match-only" case.  MatchCollector's Collect() method would theoretically have
taken only a document number, saving cycles by not requiring the Scorer to
calculate a score:

   HC_Collect(collector, doc_num, Scorer_Tally(scorer));
   /* vs. */
   MatchColl_Collect(match_collector, doc_num);

However, passing arguments is cheap.  How about this: instead of passing
scoring information, we pass the Scorer itself and leave it up to individual
HitCollectors whether to request scoring info?

   HC_Collect(collector, doc_num, scorer);

That way, classes like BitCollector can just accept the doc_num and ignore the
scorer.  Turns out there's no need for a MatchCollector class, after all.
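
For instance -- and these collector bodies are hypothetical, just to show the
shape of the idea, pretending for the sketch that Scorer_Tally() returns a
bare float -- a match-only collector ignores the third argument, while a
ranking collector requests the score itself:

    /* Match-only: record the doc num, never touch the Scorer. */
    void
    BitColl_collect(BitCollector *self, int32_t doc_num, Scorer *scorer)
    {
        (void)scorer;
        BitVec_Set(self->bit_vec, doc_num);
    }

    /* Scoring collector: only this kind pays for a score. */
    void
    TopDocColl_collect(TopDocCollector *self, int32_t doc_num, Scorer *scorer)
    {
        float score = Scorer_Tally(scorer);
        HitQueue_Insert(self->hit_queue, doc_num, score);
    }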

But then, what if we want HC_Collect() to accept a Matcher rather than require
a Scorer?  Logically, we ought to be able to, but Matcher doesn't currently
know how to supply the scoring info that some HitCollectors would need.

Well, we can fix that by moving the Scorer_Tally() method up into Matcher, and
having it supply a default score of 0.0.  Then, *any* Matcher iterating over a
set of doc ids can be used as a source of hit data.
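
Something along these lines, again pretending Tally() returns a bare float:

    /* Hypothetical default implementation, hoisted up into Matcher.
     * Scoring subclasses like TermScorer override it. */
    float
    Matcher_Tally(Matcher *self)
    {
        (void)self;
        return 0.0f;    /* constant "score" for match-only iterators */
    }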

This is another win for code reuse.  Classes which perform sophisticated
scoring powered by Similarity, such as TF/IDF-enabled TermScorer and
PhraseScorer, or compound scorers like
ANDScorer/ORScorer/RequiredOptionalScorer which call Similarity.coord(), can
subclass constant-scoring Matcher classes, e.g. ORMatcher.  However, classes
like ORMatcher can also be used in non-scoring contexts.
    
In addition, we make certain kinds of subclasses simpler to write.  Nate
expressed a desire a while back to subclass the KS boolean query/scorer
hierarchy so that he could take advantage of its logical matching capabilities
yet implement custom scoring algos -- but having to cancel out all the TF/IDF
stuff imposed an annoying complexity cost.  Subclassing boolean Matcher
classes would have made that project simpler.

Now, let's assess where we've ended up.  Matcher was originally supposed to be
a simple iterator class.  It's still quite simple, but with the added methods
it looks almost identical to the old Scorer -- the only difference is the
missing Similarity member var.  That raises the question: why not just use
Scorer everywhere?  Use Scorers to iterate over deletions, Scorers for
filtering, etc.

It's possible.  But I think it makes more sense to give our search engine
library a broader mandate and re-conceptualize its search apparatus as a
matching tool which can be extended with scoring.  We can achieve that by
replacing Scorer with Matcher and by using Matcher in contexts where a
"Scorer" wouldn't have seemed appropriate.

Marvin Humphrey


Re: Matcher

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey wrote:

>> I'm seeing sizable performance gains by "taking advantage" of Matchers
>> that could expose a random access API.  It would be good if we could
>> query a Matcher to see if it supports random access...
>
> I hadn't replied to this before now for the same reason I hadn't posted
> a response to the JIRA issue you opened: I'm not fond of the proposed
> solution, but I don't have any better ideas. :(

Yeah I'm not sure how we can cleanly solve it in Lucene, either.
There's this one too, which we should factor in to Matcher:

     https://issues.apache.org/jira/browse/LUCENE-1252

ie, for "interesting" queries, whose contraints can be broken into
AND'd sub-constraints, you may want to reach in and test for the first
sub-constraint, against other clausess in the main query, and only go
back to the more expensive second sub-constraint if all main
constraints pass.  Perhaps, during rewrite, such queries could return
2 AND'd constraints somehow.
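
Roughly this shape, in made-up pseudo-C -- it glosses over how rewrite would
expose the two halves:

    /* Drive iteration off the cheap sub-constraint, and run the expensive
     * second check only for docs that already satisfy every other
     * required clause in the main query. */
    int32_t doc_num = Matcher_Next(cheap_constraint);
    while (doc_num != 0) {
        if (other_required_clauses_match(doc_num)
            && expensive_constraint_matches(doc_num)) {
            collect_hit(doc_num);
        }
        doc_num = Matcher_Next(cheap_constraint);
    }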

> Reaching into a Matcher and choosing which path to go down depending on
> whether it supports random access or not is going to result in a lot of
> duplicated code.  Everywhere we want to optimize, we need one track for
> the random access method, and a fallback track for the iterator.
> Furthermore, the tracks aren't very similar, so the complexity cost to
> realize the optimization is high.

It's worse: you need the ability to distribute a random-access Matcher
"down" to all the sub-Matchers (ie, to match how deleted docs are
applied in Lucene today).

> There's a classic OO code simplification technique you may be familiar
> with: "replace conditionals with polymorphism".  This is the exact
> opposite of that: "replace polymorphism with conditionals".  Ick.

I fear the high performance solution will simply be messy.  Query
execution/optimization is likely a problem whose best-performing
solution is absurdly hairy.  And, worse, the elegant compact solution
is not just a little bit slower, it seems to be a lot slower.

EG I could imagine a solution where we create dynamic Java code on the
fly, specialized to scoring the exact complex query, compile it, and
run it.  This likely would perform well, but sure is a crazy departure
from what we do today.

BTW I think such query optimization only applies to "hard"
queries.  Simple queries (whose approxCount() seems low) should just
execute the default way.

> At present, I don't have the necessary benchmarking tools at my disposal
> to verify that your benchmarking tests also hold true for C.
> Unfortunately, I suspect they do; results might even be worse if there's
> some automatic bounds checking needed by Java that we can figure out how
> to forego in C.

Yeah I'm guessing in C you'll see it even worse than we see in Java.

> I'm waiting for some intuitive leap to come along and rescue me.

OK :)

>> Another thing you might want to add to Matcher is "approxCount()",
>> which would make a rough guess at the possible number of matches, to
>> help a future "matching optimizer" pick a strategy.
>
> I can see us doing this within the core.  It could be handy when say,
> deciding whether to invert a sparse filter and apply it using "next
> match" instead of "next miss".
>
> However, Matcher is supposed to be a public class and I would be
> reluctant to add approxCount() as a public method.  Since Lucy cannot
> assume that its target platforms support sane deprecation, once we add a
> public method it's there forever.  Therefore, we need to be conservative
> about exposing public methods in general, and extremely conservative
> about exposing methods whose sole purpose is optimization.

Hmm good point.

Mike

Re: Matcher

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Feb 06, 2009 at 05:55:39AM -0500, Michael McCandless wrote:

> One thing I'm struggling with now (in Lucene), that I'm not sure applies to
> KS/Lucy, is how to best take "random access Matchers" into account.  Not all
> Matchers are created equal.  Some really want to do the next()/skipTo()
> iteration (eg a posting list), but others (deleted docs currently, or cached
> filters, in Lucene) are "best" accessed by random access accept(int docID)
> API, 

---->8 SNIP 8<----

> I'm seeing sizable performance gains by "taking advantage" of Matchers that
> could expose a random access API.  It would be good if we could query a
> Matcher to see if it supports random access...

I hadn't replied to this before now for the same reason I hadn't posted a
response to the JIRA issue you opened: I'm not fond of the proposed solution,
but I don't have any better ideas. :(

Reaching into a Matcher and choosing which path to go down depending on
whether it supports random access or not is going to result in a lot of
duplicated code.  Everywhere we want to optimize, we need one track for the
random access method, and a fallback track for the iterator.  Furthermore, the
tracks aren't very similar, so the complexity cost to realize the optimization
is high.  

There's a classic OO code simplification technique you may be familiar with:
"replace conditionals with polymorphism".  This is the exact opposite of that:
"replace polymorphism with conditionals".  Ick.

At present, I don't have the necessary benchmarking tools at my disposal to
verify that your benchmarking tests also hold true for C.  Unfortunately, I
suspect they do; results might even be worse if there's some automatic bounds
checking needed by Java that we can figure out how to forego in C.

I'm waiting for some intuitive leap to come along and rescue me.

> Another thing you might want to add to Matcher is "approxCount()",
> which would make a rough guess at the possible number of matches, to
> help a future "matching optimizer" pick a strategy.

I can see us doing this within the core.  It could be handy when say, deciding
whether to invert a sparse filter and apply it using "next match" instead of
"next miss".

However, Matcher is supposed to be a public class and I would be reluctant to
add approxCount() as a public method.  Since Lucy cannot assume that its
target platforms support sane deprecation, once we add a public method it's
there forever.  Therefore, we need to be conservative about exposing public
methods in general, and extremely conservative about exposing methods whose
sole purpose is optimization.

Marvin Humphrey


Re: Matcher

Posted by Michael McCandless <lu...@mikemccandless.com>.
I like this approach, Marvin, and I appreciate the meandering path you
went through to arrive at it!

I very much like putting the invocation of Matcher.score()
(Scorer_Tally()) into the collector's hands.  In Lucene we always
score prior to collection, but eg if your app will sort by something
other than relevance, and does not intend to present relevance, then
you don't need to compute the score.

One thing I'm struggling with now (in Lucene), that I'm not sure
applies to KS/Lucy, is how to best take "random access Matchers" into
account.  Not all Matchers are created equal.  Some really want to do
the next()/skipTo() iteration (eg a posting list), but others (deleted
docs currently, or cached filters, in Lucene) are "best" accessed by
random access accept(int docID) API, perhaps permuted down to all leaf
TermScorers, though they could also do iteration.  I'm seeing sizable
performance gains by "taking advantage" of Matchers that could expose
a random access API.  It would be good if we could query a Matcher to
see if it supports random access...
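
Very roughly, the shape of what I have in mind -- placeholder names, not a
real API proposal:

    if (Matcher_Supports_Random_Access(filter)) {
        /* random access: test the filter per doc, like deleted docs today */
        if (Matcher_Accept(filter, doc_num)) {
            collect_hit(doc_num);
        }
    }
    else {
        /* fall back to next()/skipTo()-style iteration over the filter */
        if (Matcher_Advance(filter, doc_num) == doc_num) {
            collect_hit(doc_num);
        }
    }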

Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.

Mike
