Posted to dev@lucene.apache.org by Karl Wettin <ka...@gmail.com> on 2009/04/07 02:21:32 UTC

HitCollector#collect(int,float,Collection)

How crazy would it be to refactor HitCollector so that it also accepts the
matching queries?

Let's ignore my use case (I'm not sure it makes sense yet; it's related to
finding a threshold between probably interesting and definitely not
interesting results of huge OR statements, but I really have to try it
out before I can say whether it's any good) and just focus on the speed
impact. If I cleared and reused the Collection passed down to the
HitCollector, it shouldn't really slow things down, right? And if I
reused the collections in my TopDocsCollector as low-scoring results
were pushed down, it shouldn't have to be expensive there either. Or would it?
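
A minimal sketch of the reuse idea, under stated assumptions: the
MatchAwareCollector interface and BestHitCollector class are hypothetical
(they are not Lucene classes), query names stand in for Query objects, and a
single "best hit" slot stands in for a real priority queue. The caller clears
and refills one shared collection per hit, so the collector has to copy it
only when it decides to keep the hit.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical callback; not part of Lucene. Strings stand in for Query objects.
interface MatchAwareCollector {
    void collect(int doc, float score, Collection<String> matchingQueries);
}

// Keeps only the single best hit and copies the matching queries when a hit is retained.
final class BestHitCollector implements MatchAwareCollector {
    int bestDoc = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    final List<String> bestQueries = new ArrayList<>(); // allocated once, then recycled

    @Override
    public void collect(int doc, float score, Collection<String> matchingQueries) {
        if (score > bestScore) {
            bestDoc = doc;
            bestScore = score;
            bestQueries.clear();                 // recycle the displaced hit's storage
            bestQueries.addAll(matchingQueries); // copy: the caller will clear its collection
        }
    }
}

final class ReusedCollectionDemo {
    // Simulates a scorer that clears and refills one shared collection per hit.
    static BestHitCollector run() {
        BestHitCollector collector = new BestHitCollector();
        List<String> shared = new ArrayList<>();
        String[][] matches = { { "a" }, { "a", "b", "c" }, { "b" } };
        float[] scores = { 0.2f, 0.9f, 0.4f };
        for (int doc = 0; doc < matches.length; doc++) {
            shared.clear();
            for (String query : matches[doc]) {
                shared.add(query);
            }
            collector.collect(doc, scores[doc], shared);
        }
        return collector;
    }

    public static void main(String[] args) {
        BestHitCollector c = run();
        System.out.println("best doc " + c.bestDoc + " matched " + c.bestQueries);
    }
}
```

In this sketch the per-hit cost is one clear() plus the adds; the copy happens
only for hits that are actually retained.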


     karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: HitCollector#collect(int,float,Collection)

Posted by Michael McCandless <lu...@mikemccandless.com>.
My guess is such an approach could be made to work...

But I think I'd rather improve the *Scorer classes directly so that they
provide such details (and you pay no performance cost if you don't ask for
them).  Likewise for positional details of matching, which the
highlighter could use.  Then we could absorb the Span* queries back into
their primary counterparts.

Mike

On Tue, Jun 2, 2009 at 8:04 AM, Karl Wettin<ka...@gmail.com> wrote:
> So, I've been sleeping on this for a few weeks. Would it be possible to
> solve this with a decorator? Perhaps a top-level decorator that also
> decorates all subqueries at rewrite time and then keeps the instantiated
> scorers bound to the top-level decorator, i.e. makes the decorated query
> non-reusable.
>
> Query realQuery = ...
> DecoratedQuery dq = new DecoratedQuery(realQuery);
> searcher.search(dq, ..);
> Map<Query, Float> scoringQueries = dq.getScoringQueries();
>
> Not quite sure if this is terrible or elegant.



Re: HitCollector#collect(int,float,Collection)

Posted by Karl Wettin <ka...@gmail.com>.
So, I've been sleeping on this for a few weeks. Would it be possible
to solve this with a decorator? Perhaps a top-level decorator that
also decorates all subqueries at rewrite time and then keeps the
instantiated scorers bound to the top-level decorator, i.e. makes the
decorated query non-reusable.

Query realQuery = ...
DecoratedQuery dq = new DecoratedQuery(realQuery);
searcher.search(dq, ..);
Map<Query, Float> scoringQueries = dq.getScoringQueries();

Not quite sure if this is terrible or elegant.
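
To make the decorator idea a bit more concrete, here is a rough sketch under
stated assumptions. SketchQuery, SketchTermQuery, SketchOrQuery and
DecoratedSketchQuery are hypothetical toy types, not Lucene's Query or Scorer;
a document is just a set of terms, and wrapping happens in the constructor
rather than at rewrite time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy stand-ins for illustration; these are not Lucene's Query or Scorer.
interface SketchQuery {
    float score(Set<String> docTerms); // 0 means "did not match"
}

final class SketchTermQuery implements SketchQuery {
    private final String term;
    SketchTermQuery(String term) { this.term = term; }
    @Override public float score(Set<String> docTerms) { return docTerms.contains(term) ? 1f : 0f; }
}

final class SketchOrQuery implements SketchQuery {
    final List<SketchQuery> clauses;
    SketchOrQuery(List<SketchQuery> clauses) { this.clauses = clauses; }
    @Override public float score(Set<String> docTerms) {
        float sum = 0f;
        for (SketchQuery clause : clauses) sum += clause.score(docTerms);
        return sum;
    }
}

// Wraps a whole query tree once and records which original queries scored the last document.
final class DecoratedSketchQuery implements SketchQuery {
    private final Map<SketchQuery, Float> scoring = new LinkedHashMap<>();
    private final SketchQuery root;

    DecoratedSketchQuery(SketchQuery realQuery) { this.root = decorate(realQuery); }

    // Rebuilds the tree with a recording wrapper around every node.
    private SketchQuery decorate(SketchQuery original) {
        final SketchQuery delegate;
        if (original instanceof SketchOrQuery) {
            List<SketchQuery> wrapped = new ArrayList<>();
            for (SketchQuery clause : ((SketchOrQuery) original).clauses) wrapped.add(decorate(clause));
            delegate = new SketchOrQuery(wrapped);
        } else {
            delegate = original;
        }
        return docTerms -> {
            float s = delegate.score(docTerms);
            if (s > 0f) scoring.put(original, s); // keyed by the caller's own query object
            return s;
        };
    }

    @Override public float score(Set<String> docTerms) {
        scoring.clear(); // mutable per-document state: unsafe to share across searches
        return root.score(docTerms);
    }

    Map<SketchQuery, Float> getScoringQueries() { return scoring; }
}

final class DecoratorDemo {
    public static void main(String[] args) {
        SketchQuery a = new SketchTermQuery("a");
        SketchQuery b = new SketchTermQuery("b");
        SketchQuery realQuery = new SketchOrQuery(Arrays.asList(a, b));
        DecoratedSketchQuery dq = new DecoratedSketchQuery(realQuery);
        dq.score(new HashSet<>(Arrays.asList("a", "x")));
        System.out.println(dq.getScoringQueries().size() + " queries contributed");
    }
}
```

The map only ever describes the most recently scored document, which is the
same state-sharing concern as the "non-reusable" caveat above.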


     karl





Re: HitCollector#collect(int,float,Collection)

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Apr 7, 2009 at 6:13 AM, Karl Wettin <ka...@gmail.com> wrote:
>
> On 7 Apr 2009, at 10:23, Michael McCandless wrote:
>
>> Do you mean tracking the "atomic queries" that caused a given hit to
>> match (where an "atomic query" is a query that actually uses
>> TermDocs/Positions to check matching, vs. other queries like
>> BooleanQuery that "glom together" sub-query matches)?
>>
>> E.g. for a boolean query with N clauses, which of those N clauses matched?
>
> This is exactly what I mean. I do, however, think it makes sense to get
> information about non-atomic queries too: it seems reasonable that knowing
> the first clause (the boolean query '+(a b)') in '+(a b) -(+c +d)' matched
> is more interesting than only knowing that one of the clauses of that
> boolean query matched.

Ahh OK I agree.  So every query in the full tree should be able to
state whether it matched the doc.
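
As a rough illustration of "every query in the tree states whether it
matched", here is a sketch for the '+(a b) -(+c +d)' example. All types
(ToyOccur, ToyNode, ToyTerm, ToyBool) are hypothetical, not Lucene classes,
and the boolean semantics are simplified: a node with only prohibited clauses
is treated as matching when none of them match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical types for illustration only; none of these are Lucene classes.
enum ToyOccur { MUST, SHOULD, MUST_NOT }

abstract class ToyNode {
    final String label;
    boolean matched; // outcome for the most recently evaluated document

    ToyNode(String label) { this.label = label; }

    // Evaluates this node and remembers the outcome so it can be inspected later.
    final boolean eval(Set<String> docTerms) {
        matched = test(docTerms);
        return matched;
    }

    abstract boolean test(Set<String> docTerms);
}

final class ToyTerm extends ToyNode {
    ToyTerm(String term) { super(term); }

    @Override
    boolean test(Set<String> docTerms) { return docTerms.contains(label); }
}

final class ToyBool extends ToyNode {
    private final List<ToyOccur> occurs = new ArrayList<>();
    private final List<ToyNode> clauses = new ArrayList<>();

    ToyBool(String label) { super(label); }

    ToyBool add(ToyOccur occur, ToyNode clause) {
        occurs.add(occur);
        clauses.add(clause);
        return this;
    }

    @Override
    boolean test(Set<String> docTerms) {
        boolean ok = true, hasMust = false, hasShould = false, anyShould = false;
        for (int i = 0; i < clauses.size(); i++) {
            // No short-circuiting: every clause is evaluated so that each records its own state.
            boolean m = clauses.get(i).eval(docTerms);
            switch (occurs.get(i)) {
                case MUST:     hasMust = true; ok &= m; break;
                case MUST_NOT: ok &= !m; break;
                default:       hasShould = true; anyShould |= m; break;
            }
        }
        // Without a MUST clause, at least one SHOULD clause has to match.
        if (!hasMust && hasShould) ok &= anyShould;
        return ok;
    }
}

final class QueryTreeDemo {
    public static void main(String[] args) {
        ToyNode a = new ToyTerm("a"), b = new ToyTerm("b");
        ToyNode c = new ToyTerm("c"), d = new ToyTerm("d");
        ToyNode ab = new ToyBool("(a b)").add(ToyOccur.SHOULD, a).add(ToyOccur.SHOULD, b);
        ToyNode cd = new ToyBool("(+c +d)").add(ToyOccur.MUST, c).add(ToyOccur.MUST, d);
        ToyNode root = new ToyBool("+(a b) -(+c +d)").add(ToyOccur.MUST, ab).add(ToyOccur.MUST_NOT, cd);

        root.eval(new HashSet<>(Arrays.asList("a", "c")));
        for (ToyNode n : Arrays.asList(root, ab, a, b, cd, c, d)) {
            System.out.println(n.label + " matched=" + n.matched);
        }
    }
}
```

Note that recording state for every node means giving up short-circuit
evaluation, which is exactly the cost that should be avoided when nobody asks
for the details.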

>> A natural place to do this is the Scorer API, i.e. extend it with a
>> "getMatchingAtomicQueries" or some such.  Probably, for efficiency,
>> each Query should be pre-assigned an int position, and then the
>> matching is represented as a bit array, reused across matches.  Your
>> collector could then ask the scorer for these bits if it wanted.
>> There should be no performance cost for collectors that don't use this
>> functionality.
>
> I'll look into it.
>
> Thanks for the feedback.
>
>
>     karl


Re: HitCollector#collect(int,float,Collection)

Posted by Karl Wettin <ka...@gmail.com>.
On 7 Apr 2009, at 10:23, Michael McCandless wrote:

> Do you mean tracking the "atomic queries" that caused a given hit to
> match (where an "atomic query" is a query that actually uses
> TermDocs/Positions to check matching, vs. other queries like
> BooleanQuery that "glom together" sub-query matches)?
>
> E.g. for a boolean query with N clauses, which of those N clauses matched?

This is exactly what I mean. I do, however, think it makes sense to get
information about non-atomic queries too: it seems reasonable that knowing
the first clause (the boolean query '+(a b)') in '+(a b) -(+c +d)' matched
is more interesting than only knowing that one of the clauses of that
boolean query matched.

> A natural place to do this is the Scorer API, i.e. extend it with a
> "getMatchingAtomicQueries" or some such.  Probably, for efficiency,
> each Query should be pre-assigned an int position, and then the
> matching is represented as a bit array, reused across matches.  Your
> collector could then ask the scorer for these bits if it wanted.
> There should be no performance cost for collectors that don't use this
> functionality.

I'll look into it.

Thanks for the feedback.


      karl



Re: HitCollector#collect(int,float,Collection)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Do you mean tracking the "atomic queries" that caused a given hit to
match (where an "atomic query" is a query that actually uses
TermDocs/Positions to check matching, vs. other queries like
BooleanQuery that "glom together" sub-query matches)?

E.g. for a boolean query with N clauses, which of those N clauses matched?

This has been discussed/requested several times on java-user, and I
think it makes a lot of sense.

A natural place to do this is the Scorer API, i.e. extend it with a
"getMatchingAtomicQueries" or some such.  Probably, for efficiency,
each Query should be pre-assigned an int position, and then the
matching is represented as a bit array, reused across matches.  Your
collector could then ask the scorer for these bits if it wanted.
There should be no performance cost for collectors that don't use this
functionality.
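
Roughly, the bit-array idea could look like the sketch below. This is only an
illustration under stated assumptions: BitTrackingOrScorer is a hypothetical
class, not an existing Lucene Scorer; a boolean[][] table stands in for
postings; and each clause's array index plays the role of its pre-assigned
int position. Tracking is opt-in, so callers that never enable it keep the
short-circuiting path.

```java
import java.util.BitSet;

// Hypothetical scorer for a disjunction; clause i owns bit position i.
final class BitTrackingOrScorer {
    private final boolean[][] clauseMatchesDoc; // [clause][doc], stands in for postings
    private final BitSet matched;               // one instance, reused for every document
    private boolean tracking = false;           // off unless a collector asks for it

    BitTrackingOrScorer(boolean[][] clauseMatchesDoc) {
        this.clauseMatchesDoc = clauseMatchesDoc;
        this.matched = new BitSet(clauseMatchesDoc.length);
    }

    void enableMatchTracking() { tracking = true; }

    // Returns true if any clause matches doc; records which ones only when tracking is on.
    boolean matches(int doc) {
        if (!tracking) {
            for (boolean[] clause : clauseMatchesDoc) {
                if (clause[doc]) return true; // short-circuit on the first matching clause
            }
            return false;
        }
        matched.clear();
        for (int i = 0; i < clauseMatchesDoc.length; i++) {
            if (clauseMatchesDoc[i][doc]) matched.set(i);
        }
        return !matched.isEmpty();
    }

    // The returned bits are only valid until the next call to matches().
    BitSet matchedClauses() { return matched; }
}

final class BitTrackingDemo {
    public static void main(String[] args) {
        boolean[][] postings = {
            { true,  false }, // clause 0
            { false, false }, // clause 1
            { true,  true  }, // clause 2
        };
        BitTrackingOrScorer scorer = new BitTrackingOrScorer(postings);
        scorer.enableMatchTracking();
        for (int doc = 0; doc < 2; doc++) {
            if (scorer.matches(doc)) {
                System.out.println("doc " + doc + " matched clauses " + scorer.matchedClauses());
            }
        }
    }
}
```

Because the BitSet is reused, a collector that wants to keep the bits for a
retained hit would have to copy them (for example with clone()).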

We've also discussed (under LUCENE-1522) similar extensions to the Scorer
API to get the exact positions contributing to a match, and possibly using
such an API to merge Span{Term,And,Or}Query into their "normal"
counterparts.

But we should do this separately from LUCENE-1575 -- the java ghosts
there are already challenging enough!

Mike

On Mon, Apr 6, 2009 at 11:57 PM, Shai Erera <se...@gmail.com> wrote:
> Hi Karl,
>
> LUCENE-1575 refactors HitCollector by separating the score from document
> collection. So if we were to introduce the kind of method you suggest, it
> would be through a setQueries(Collection<Query>) method.
>
> Maybe you could first verify that your use case makes sense, is efficient,
> etc., before we make this change. Adding setQueries to Collector (the new
> name of HC) shouldn't be a problem, since we can always add an empty-impl
> method without affecting back-compat. However, I wonder where it would be
> called from, and whether it makes sense to create that Collection object
> and pass it around while knowing that most collectors will not use it.
>
> Is it something that you could perhaps implement by extending Collector (and
> some other classes), calling setQueries from your extending code? Today,
> as far as I remember, only Scorer calls collect(), and I'm not sure whether
> Scorer has the information about the matching queries. Even if it does,
> extending it and calling setQueries from the extension seems more reasonable
> than adding such a call to every query execution, which would also mean
> instantiating a new Collection<Query> for every search (unless we provide an
> API on IndexSearcher that allows you to pass in such an object).
>
> What do you think?



Re: HitCollector#collect(int,float,Collection)

Posted by Shai Erera <se...@gmail.com>.
Hi Karl,

LUCENE-1575 refactors HitCollector by separating the score from document
collection. So if we were to introduce the kind of method you suggest, it
would be through a setQueries(Collection<Query>) method.

Maybe you could first verify that your use case makes sense, is efficient,
etc., before we make this change. Adding setQueries to Collector (the new
name of HC) shouldn't be a problem, since we can always add an empty-impl
method without affecting back-compat. However, I wonder where it would be
called from, and whether it makes sense to create that Collection object
and pass it around while knowing that most collectors will not use it.

Is it something that you could perhaps implement by extending Collector (and
some other classes), calling setQueries from your extending code? Today,
as far as I remember, only Scorer calls collect(), and I'm not sure whether
Scorer has the information about the matching queries. Even if it does,
extending it and calling setQueries from the extension seems more reasonable
than adding such a call to every query execution, which would also mean
instantiating a new Collection<Query> for every search (unless we provide an
API on IndexSearcher that allows you to pass in such an object).
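
A small sketch of the empty-impl approach, under stated assumptions:
SketchCollector only loosely resembles the Collector from LUCENE-1575, all
class names here are hypothetical, strings stand in for Query objects, and
setQueries is assumed to be called once per search with a collection that the
caller refills before each collect().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical base class; not the real Lucene Collector.
abstract class SketchCollector {
    abstract void collect(int doc);

    // Empty implementation: existing subclasses keep compiling and behaving as before.
    void setQueries(Collection<String> matchingQueries) { }
}

// An "old" collector that knows nothing about matching queries.
final class CountingCollector extends SketchCollector {
    int hits;

    @Override
    void collect(int doc) { hits++; }
}

// A collector that opts in by overriding setQueries.
final class QueryLoggingCollector extends SketchCollector {
    private Collection<String> current = Collections.emptyList();
    final List<String> log = new ArrayList<>();

    @Override
    void setQueries(Collection<String> matchingQueries) { current = matchingQueries; }

    @Override
    void collect(int doc) { log.add(doc + ":" + String.join(",", current)); }
}

final class SetQueriesDemo {
    // Stands in for an extended scorer: one shared collection per search, refilled per hit.
    static void drive(SketchCollector collector, String[][] matchesPerDoc) {
        List<String> shared = new ArrayList<>();
        collector.setQueries(shared);
        for (int doc = 0; doc < matchesPerDoc.length; doc++) {
            shared.clear();
            shared.addAll(Arrays.asList(matchesPerDoc[doc]));
            collector.collect(doc);
        }
    }

    public static void main(String[] args) {
        String[][] matches = { { "a" }, { "a", "b" } };
        CountingCollector counting = new CountingCollector();
        QueryLoggingCollector logging = new QueryLoggingCollector();
        drive(counting, matches);
        drive(logging, matches);
        System.out.println(counting.hits + " hits, log=" + logging.log);
    }
}
```

The driver still fills the shared collection for CountingCollector, which
ignores it; that wasted work is the concern raised above about passing the
object around when most collectors will not use it.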

What do you think?
