You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Matt Ronge <mr...@theronge.com> on 2008/08/30 03:34:01 UTC

Pre-filtering for expensive query

Hi all,

I am working on implementing a new Query, Weight and Scorer that is  
expensive to run. I'd like to limit the number of documents I run this  
query on by first building a candidate set of documents with a boolean  
query. Once I have that candidate set, I was hoping I could build a  
filter off of it, and issue that along with my expensive query.  
However, after reading the code I see that filtering is done during  
the search, and not before hand. So my initial boolean query won't  
help in limiting the number of documents scored by my expensive query.

  Has anyone done any work into restricting the set of docs that a  
query operates on?
Or should I just implement something myself in a custom scorer?

Thanks in advance,
--
Matt Ronge

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Paul Elschot <pa...@xs4all.nl>.

Op Wednesday 03 September 2008 18:06:57 schreef Matt Ronge:
> On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote:
> > Op Saturday 30 August 2008 18:19:09 schreef Matt Ronge:
> >> On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote:
> >>> Can you tell us a bit more about what you custom query does?
> >>> Perhaps you can build the "candidate filter" and reuse it over
> >>> and over again?
> >>
> >> I cannot reuse it. The candidate filter would be constructed by
> >> first running a boolean query with a number of SHOULD clauses. So
> >> then I know what docs atleast contain the terms I'm looking for.
> >> Once I have this set, I will look at the ordering of the matches
> >> (it's a bit more sophisticated than just a phrase query) and find
> >> the final matches.
> >
> > Sounds like you may want to take a look at SpanNearQuery.
>
> I'm going to take a second look at SpanNearQuery. I need it to
> support optional tokens, so I'm guessing I'll need to create a
> subclass to do that.

SpanNearQuery was not designed for optional tokens.
This can be tricky so make sure your specs are good. I know
only of this article for optional tokens and proximity:
Kunihiko Sadakane and Hiroshi Imai.  Fast algorithms for k -word 
proximity search. IEICE Trans. Fundamentals, E84-A(9), September 2001.

>
> >> Since my boolean clauses are different for each query I can't
> >> reuse the filter.
> >
> > With (a variation of) SpanNearQuery you may end up not needing
> > any filtering at all, because it already uses skipTo() where
> > possible.
> >
> > In case you are looking for documents that contain partial phrases
> > from an input query that has more than 2 words, have a look at
> > Nutch.
>
> I poked around in the Nutch docs and Javadocs, what should I look at
> in Nutch? What does it do exactly, is it the trick that Doug Cutting 
> mentioned where you concat neighboring terms together like "Hello
> world" becomes the token hello.world?

That is an optimization for combinations of high frequency terms,
which is built into nutch iirc. But I don't know the details, so please
ask on a nutch list.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Matt Ronge <mr...@theronge.com>.

On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote:

> Op Saturday 30 August 2008 18:19:09 schreef Matt Ronge:
>> On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote:
>>> Can you tell us a bit more about what you custom query does?
>>> Perhaps you can build the "candidate filter" and reuse it over and
>>> over again?
>>
>> I cannot reuse it. The candidate filter would be constructed by first
>> running a boolean query with a number of SHOULD clauses. So then I
>> know what docs atleast contain the terms I'm looking for. Once I have
>> this set, I will look at the ordering of the matches (it's a bit more
>> sophisticated than just a phrase query) and find the final matches.
>
> Sounds like you may want to take a look at SpanNearQuery.

I'm going to take a second look at SpanNearQuery. I need it to support  
optional tokens, so I'm guessing I'll need to create a subclass to do  
that.

>
>> Since my boolean clauses are different for each query I can't reuse
>> the filter.
>
> With (a variation of) SpanNearQuery you may end up not needing
> any filtering at all, because it already uses skipTo() where possible.
>
> In case you are looking for documents that contain partial phrases
> from an input query that has more than 2 words, have a look at Nutch.

I poked around in the Nutch docs and Javadocs, what should I look at  
in Nutch? What does it do exactly, is it the trick that Doug Cutting  
mentioned where you concat neighboring terms together like "Hello  
world" becomes the token hello.world?

Thanks for everyones help so far, I really appreciate it,
--
Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Paul Elschot <pa...@xs4all.nl>.

Op Saturday 30 August 2008 18:19:09 schreef Matt Ronge:
> On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote:
> > Can you tell us a bit more about what you custom query does?
> > Perhaps you can build the "candidate filter" and reuse it over and
> > over again?
>
> I cannot reuse it. The candidate filter would be constructed by first
> running a boolean query with a number of SHOULD clauses. So then I
> know what docs atleast contain the terms I'm looking for. Once I have
> this set, I will look at the ordering of the matches (it's a bit more
> sophisticated than just a phrase query) and find the final matches.

Sounds like you may want to take a look at SpanNearQuery.

> Since my boolean clauses are different for each query I can't reuse
> the filter.

With (a variation of) SpanNearQuery you may end up not needing
any filtering at all, because it already uses skipTo() where possible.

In case you are looking for documents that contain partial phrases
from an input query that has more than 2 words, have a look at Nutch.

Regards,
Paul Elschot


>
>
> --
> Matt
>
> >> Hi all,
> >>
> >> I am working on implementing a new Query, Weight and Scorer that
> >> is expensive to run. I'd like to limit the number of documents I
> >> run this query on by first building a candidate set of documents
> >> with a boolean query. Once I have that candidate set, I was hoping
> >> I could build a filter off of it, and issue that along with my
> >> expensive query. However, after reading the code I see that
> >> filtering is done during the search, and not before hand. So my
> >> initial boolean query won't help in limiting the number of
> >> documents scored by my expensive query.
> >>
> >> Has anyone done any work into restricting the set of docs that a
> >> query operates on?
> >> Or should I just implement something myself in a custom scorer?
> >>
> >> Thanks in advance,
> >> --
> >> Matt Ronge
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Matt Ronge <mr...@theronge.com>.

On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote:

> Can you tell us a bit more about what you custom query does? Perhaps  
> you can build the "candidate filter" and reuse it over and over again?

I cannot reuse it. The candidate filter would be constructed by first  
running a boolean query with a number of SHOULD clauses. So then I  
know what docs atleast contain the terms I'm looking for. Once I have  
this set, I will look at the ordering of the matches (it's a bit more  
sophisticated than just a phrase query) and find the final matches.  
Since my boolean clauses are different for each query I can't reuse  
the filter.


--
Matt

>
>> Hi all,
>>
>> I am working on implementing a new Query, Weight and Scorer that is  
>> expensive to run. I'd like to limit the number of documents I run  
>> this query on by first building a candidate set of documents with a  
>> boolean query. Once I have that candidate set, I was hoping I could  
>> build a filter off of it, and issue that along with my expensive  
>> query. However, after reading the code I see that filtering is done  
>> during the search, and not before hand. So my initial boolean query  
>> won't help in limiting the number of documents scored by my  
>> expensive query.
>>
>> Has anyone done any work into restricting the set of docs that a  
>> query operates on?
>> Or should I just implement something myself in a custom scorer?
>>
>> Thanks in advance,
>> --
>> Matt Ronge

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Karl Wettin <ka...@gmail.com>.

Can you tell us a bit more about what you custom query does? Perhaps  
you can build the "candidate filter" and reuse it over and over again?


        karl

30 aug 2008 kl. 03.34 skrev Matt Ronge:

> Hi all,
>
> I am working on implementing a new Query, Weight and Scorer that is  
> expensive to run. I'd like to limit the number of documents I run  
> this query on by first building a candidate set of documents with a  
> boolean query. Once I have that candidate set, I was hoping I could  
> build a filter off of it, and issue that along with my expensive  
> query. However, after reading the code I see that filtering is done  
> during the search, and not before hand. So my initial boolean query  
> won't help in limiting the number of documents scored by my  
> expensive query.
>
> Has anyone done any work into restricting the set of docs that a  
> query operates on?
> Or should I just implement something myself in a custom scorer?
>
> Thanks in advance,
> --
> Matt Ronge
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Andrzej Bialecki <ab...@getopt.org>.

Grant Ingersoll wrote:
> 
> On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote:

>>
>> I think you can use a FilteredQuery in a BooleanClause. This may be 
>> faster than the filtering code in the Searcher, because the evaluation 
>> is done during scoring and not afterwards. FilteredQuery internally makes
> 
> 
> FYI, not sure if this is exactly what you are talking about Andrzej, but 
> IndexSearcher no longer filters after scoring.  This was changed in 
> https://issues.apache.org/jira/browse/LUCENE-584

Ah, indeed - I was working with 2.3.0 release ... then there should be 
no visible performance difference if using the trunk version of 
IndexSearcher.

The only difference now between the IndexSearcher method and 
ConjunctionScorer would be when the supplied filter would match many 
documents. IndexSearcher always runs skipTo on the filter first, so 
potentially it would stop at many docIds that aren't matching in the 
scorer - whereas the ConjunctionScorer tries to order sub-scorers so 
that "sparse" scorers are checked first.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Grant Ingersoll <gs...@apache.org>.

On Aug 30, 2008, at 3:14 PM, Andrzej Bialecki wrote:

> Matt Ronge wrote:
>> Hi all,
>> I am working on implementing a new Query, Weight and Scorer that is  
>> expensive to run. I'd like to limit the number of documents I run  
>> this query on by first building a candidate set of documents with a  
>> boolean query. Once I have that candidate set, I was hoping I could  
>> build a filter off of it, and issue that along with my expensive  
>> query. However, after reading the code I see that filtering is done  
>> during the search, and not before hand. So my initial boolean query  
>> won't help in limiting the number of documents scored by my  
>> expensive query.
>> Has anyone done any work into restricting the set of docs that a  
>> query operates on?
>> Or should I just implement something myself in a custom scorer?
>
> I think you can use a FilteredQuery in a BooleanClause. This may be  
> faster than the filtering code in the Searcher, because the  
> evaluation is done during scoring and not afterwards. FilteredQuery  
> internally makes


FYI, not sure if this is exactly what you are talking about Andrzej,  
but IndexSearcher no longer filters after scoring.  This was changed  
in https://issues.apache.org/jira/browse/LUCENE-584

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Andrzej Bialecki <ab...@getopt.org>.

Matt Ronge wrote:
> Hi all,
> 
> I am working on implementing a new Query, Weight and Scorer that is 
> expensive to run. I'd like to limit the number of documents I run this 
> query on by first building a candidate set of documents with a boolean 
> query. Once I have that candidate set, I was hoping I could build a 
> filter off of it, and issue that along with my expensive query. However, 
> after reading the code I see that filtering is done during the search, 
> and not before hand. So my initial boolean query won't help in limiting 
> the number of documents scored by my expensive query.
> 
>  Has anyone done any work into restricting the set of docs that a query 
> operates on?
> Or should I just implement something myself in a custom scorer?

I think you can use a FilteredQuery in a BooleanClause. This may be 
faster than the filtering code in the Searcher, because the evaluation 
is done during scoring and not afterwards. FilteredQuery internally 
makes use of skipTo(), which should help to limit the number of 
evaluated docs.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Paul Elschot <pa...@xs4all.nl>.

Op Saturday 30 August 2008 18:22:50 schreef Matt Ronge:
> On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote:
> > Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge:
> >> Hi all,
> >>
> >> I am working on implementing a new Query, Weight and Scorer that
> >> is expensive to run. I'd like to limit the number of documents I
> >> run this query on by first building a candidate set of documents
> >> with a boolean query. Once I have that candidate set, I was hoping
> >> I could build a filter off of it, and issue that along with my
> >> expensive query. However, after reading the code I see that
> >> filtering is done during the search, and not before hand.
> >
> > Correct. I suppose you mean the filtering code in IndexSearcher?
>
> Yes, that's exactly what I mean.
>
> >> So my initial boolean query
> >> won't help in limiting the number of documents scored by my
> >> expensive query.
> >
> > The trick of filtering is the use of skipTo() on both the filter
> > and the scorer to skip superfluous work as much as possible.
> > So when you make your scorer implement skipTo() efficiently,
> > filtering it should reduce the amount of scoring done.
> >
> > Implementing skipTo() efficiently is normally done by using
> > TermScorer.skipTo() on the leafs of a scorer structure. So,
> > in case you implement your own TermScorer, take a serious
> > look at TermScorer.skipTo().
> >
> > Normally, score value computations are not the bottleneck,
> > but accessing the index is, and this is where skipTo() does
> > the real work. At the moment avoiding score value computations
> > is a nice extra.
>
> I was not aware of this. Where can I find the code that uses the
> filter to determine what values to feed to skipTo (I'm trying to get
> a better understand of the Lucene source)?

It's the same code in IndexSearcher.
ConjunctionScorer.skipTo() does the much the same thing for
any number of scorers.

>
> >> Or should I just implement something myself in a custom scorer?
> >
> > In case you have a better way than skipTo(), or something
> > to improve on this issue to allow a Filter as clause to
> > BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345
> > let us know.
>
> Thanks, if the skipTo approach doesn't work, I'll take a look at
> this.

For the moment, Andrzej's suggestion to use FilteredQuery as a clause 
could well be good enough.
Btw. FilteredQuery also contains a filtering scorer under the hood,
you could take a look there, too.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Matt Ronge <mr...@theronge.com>.

On Sep 3, 2008, at 4:09 PM, Paul Elschot wrote:

> Op Saturday 30 August 2008 18:22:50 schreef Matt Ronge:
>> On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote:
>>> Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge:
>>>> Hi all,
>>>>
>>>> I am working on implementing a new Query, Weight and Scorer that
>>>> is expensive to run. I'd like to limit the number of documents I
>>>> run this query on by first building a candidate set of documents
>>>> with a boolean query. Once I have that candidate set, I was hoping
>>>> I could build a filter off of it, and issue that along with my
>>>> expensive query. However, after reading the code I see that
>>>> filtering is done during the search, and not before hand.
>>>
>>> Correct. I suppose you mean the filtering code in IndexSearcher?
>>
>> Yes, that's exactly what I mean.
>
> As Grant pointed out, this code was recently changed
> by LUCENE-584.
> I was referring to the (current trunk) code including this
> change that uses skipTo() on a DocIdSetIterator obtained
> from the Filter. Sorry for any confusion on this.

That clears things up alot :)
I kept looking at the code trying to figure out how it didn't filter  
after scoring. I'll download the latest code and look.

Thanks,
--
Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Paul Elschot <pa...@xs4all.nl>.

Op Saturday 30 August 2008 18:22:50 schreef Matt Ronge:
> On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote:
> > Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge:
> >> Hi all,
> >>
> >> I am working on implementing a new Query, Weight and Scorer that
> >> is expensive to run. I'd like to limit the number of documents I
> >> run this query on by first building a candidate set of documents
> >> with a boolean query. Once I have that candidate set, I was hoping
> >> I could build a filter off of it, and issue that along with my
> >> expensive query. However, after reading the code I see that
> >> filtering is done during the search, and not before hand.
> >
> > Correct. I suppose you mean the filtering code in IndexSearcher?
>
> Yes, that's exactly what I mean.

As Grant pointed out, this code was recently changed
by LUCENE-584.
I was referring to the (current trunk) code including this
change that uses skipTo() on a DocIdSetIterator obtained
from the Filter. Sorry for any confusion on this.

Regards,
Paul Elschot



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Matt Ronge <mr...@theronge.com>.

On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote:

> Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge:
>> Hi all,
>>
>> I am working on implementing a new Query, Weight and Scorer that is
>> expensive to run. I'd like to limit the number of documents I run
>> this query on by first building a candidate set of documents with a
>> boolean query. Once I have that candidate set, I was hoping I could
>> build a filter off of it, and issue that along with my expensive
>> query. However, after reading the code I see that filtering is done
>> during the search, and not before hand.
>
> Correct. I suppose you mean the filtering code in IndexSearcher?

Yes, that's exactly what I mean.

>
>> So my initial boolean query
>> won't help in limiting the number of documents scored by my expensive
>> query.
>
> The trick of filtering is the use of skipTo() on both the filter and
> the scorer to skip superfluous work as much as possible.
> So when you make your scorer implement skipTo() efficiently,
> filtering it should reduce the amount of scoring done.
>
> Implementing skipTo() efficiently is normally done by using
> TermScorer.skipTo() on the leafs of a scorer structure. So,
> in case you implement your own TermScorer, take a serious
> look at TermScorer.skipTo().
>
> Normally, score value computations are not the bottleneck,
> but accessing the index is, and this is where skipTo() does
> the real work. At the moment avoiding score value computations
> is a nice extra.

I was not aware of this. Where can I find the code that uses the  
filter to determine what values to feed to skipTo (I'm trying to get a  
better understand of the Lucene source)?

>
>
>> Or should I just implement something myself in a custom scorer?
>
> In case you have a better way than skipTo(), or something
> to improve on this issue to allow a Filter as clause to BooleanQuery:
> https://issues.apache.org/jira/browse/LUCENE-1345
> let us know.


Thanks, if the skipTo approach doesn't work, I'll take a look at this.

--
Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Pre-filtering for expensive query

Posted by Paul Elschot <pa...@xs4all.nl>.

Op Saturday 30 August 2008 03:34:01 schreef Matt Ronge:
> Hi all,
>
> I am working on implementing a new Query, Weight and Scorer that is
> expensive to run. I'd like to limit the number of documents I run
> this query on by first building a candidate set of documents with a
> boolean query. Once I have that candidate set, I was hoping I could
> build a filter off of it, and issue that along with my expensive
> query. However, after reading the code I see that filtering is done
> during the search, and not before hand.

Correct. I suppose you mean the filtering code in IndexSearcher?

> So my initial boolean query 
> won't help in limiting the number of documents scored by my expensive
> query.

The trick of filtering is the use of skipTo() on both the filter and
the scorer to skip superfluous work as much as possible.
So when you make your scorer implement skipTo() efficiently,
filtering it should reduce the amount of scoring done.

Implementing skipTo() efficiently is normally done by using
TermScorer.skipTo() on the leafs of a scorer structure. So,
in case you implement your own TermScorer, take a serious
look at TermScorer.skipTo().

Normally, score value computations are not the bottleneck,
but accessing the index is, and this is where skipTo() does
the real work. At the moment avoiding score value computations
is a nice extra.

>
>   Has anyone done any work into restricting the set of docs that a
> query operates on?

Yes, Filters.

> Or should I just implement something myself in a custom scorer?

In case you have a better way than skipTo(), or something
to improve on this issue to allow a Filter as clause to BooleanQuery:
https://issues.apache.org/jira/browse/LUCENE-1345
let us know.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org