Posted to java-user@lucene.apache.org by Shaun Senecal <ss...@gmail.com> on 2009/10/15 10:14:45 UTC

PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

I know this has been discussed at great length, but I still have not found a
satisfactory solution and I am hoping someone on the list has some ideas...

We have a large index (4M+ Documents) with a handful of Fields.  We need to
perform PrefixQueries on multiple fields.  The problem is that when the
Query gets rewritten, certain fields expand to too many terms and we end up
with TooManyClauses (I know, I know, read the FAQ).  The solution so far has
been to extract the bits of the query which cause TooManyClauses to be
thrown and make them filters:

for every field to be searched {
    try {
        PrefixQuery(term).rewrite();

        // important: otherwise 0 results can be returned when >0 should be returned
        if (resulting BooleanQuery contains at least 1 clause)
            add the rewritten query to a BooleanQuery (using SHOULD)
    } catch (TMC) {
        PrefixFilter(term)
        add the filter to a BooleanFilter (using SHOULD)
    }
}
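
For readers who want to experiment with the loop above without an index, here is a minimal sketch in plain Java. It does not use Lucene at all: the term dictionary is a sorted set of strings, and `PrefixSplitSketch`, `expand`, `split` and `TooManyClausesException` are hypothetical names invented for illustration (the last one merely stands in for the TooManyClauses error). It only models the decision of which fields stay on the query side and which fall back to the filter side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class PrefixSplitSketch {
    /** Hypothetical stand-in for Lucene's TooManyClauses; not the real class. */
    static final class TooManyClausesException extends RuntimeException {
        TooManyClausesException(String message) {
            super(message);
        }
    }

    /** Expands a prefix against one field's sorted terms, failing past maxClauses. */
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            if (clauses.size() == maxClauses) {
                throw new TooManyClausesException("more than " + maxClauses + " terms for " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    /** Sorts each field into the query side or the filter side, mirroring the pseudocode. */
    static void split(Map<String, SortedSet<String>> termsByField, String prefix, int maxClauses,
                      List<String> queryFields, List<String> filterFields) {
        for (Map.Entry<String, SortedSet<String>> field : termsByField.entrySet()) {
            try {
                if (!expand(field.getValue(), prefix, maxClauses).isEmpty()) {
                    queryFields.add(field.getKey()); // small expansion: keep as a scored query
                }
            } catch (TooManyClausesException e) {
                filterFields.add(field.getKey()); // too many terms: fall back to a filter
            }
        }
    }

    public static void main(String[] args) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        index.put("title", new TreeSet<>(Arrays.asList("apple", "apply", "banana")));
        index.put("body", new TreeSet<>(Arrays.asList("app", "apple", "applet", "apply", "apt")));
        index.put("author", new TreeSet<>(Arrays.asList("bob")));

        List<String> queryFields = new ArrayList<>();
        List<String> filterFields = new ArrayList<>();
        split(index, "app", 2, queryFields, filterFields);
        System.out.println("query=" + queryFields + " filter=" + filterFields);
    }
}
```

With the toy data and a limit of 2, "title" stays a query, "body" overflows and becomes a filter, and "author" is skipped because the prefix matches no term.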


Up to Lucene 2.4, this has been working out for us.  However, in Lucene 2.9
this breaks since rewrite() now returns a ConstantScoreQuery.  I changed the
code to automatically make the entire query a filter if TooManyClauses is
ever caught, but this had massive performance implications.  It seems to
have doubled our average query execution time!

Is there a solution to this?  Is there a way I can know that a
ConstantScoreQuery will match at least 1 term (if not, I don't want to add it
to the BooleanQuery)?  Does 2.9 support new features that would aid in this
area?

Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Michael McCandless <lu...@mikemccandless.com>.
Super!

Mike

On Fri, Oct 16, 2009 at 4:06 AM, Shaun Senecal <ss...@gmail.com> wrote:
> Thanks Mike.  The queries are now running faster than they ever were before,
> and are returning the expected results!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Shaun Senecal <ss...@gmail.com>.
Thanks Mike.  The queries are now running faster than they ever were before,
and are returning the expected results!



Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Shaun Senecal <ss...@gmail.com>.
Ah!  I thought that the ConstantScoreQuery would also be rewritten into a
BooleanQuery, resulting in the same exception.  If that's the case, then
this should work.  I'll give that a try when I get into the office this
morning.



Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Michael McCandless <lu...@mikemccandless.com>.
Well, you could wrap the C | D filter as a Query (using
ConstantScoreQuery), and then add that as a SHOULD clause on your
toplevel BooleanQuery?
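
A rough way to picture this suggestion, as a toy model in plain Java rather than real Lucene code: each scored SHOULD clause adds its own score to the documents it matches, while the filter wrapped as a constant-score clause adds the same fixed value to every document it accepts. The class and method names below are hypothetical, and real Lucene scoring involves more factors than a simple sum, so treat the numbers as illustrative only.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ShouldClauseSketch {
    /** Adds a scored SHOULD clause: each matching doc accumulates that clause's score. */
    static void addScored(Map<Integer, Float> total, Map<Integer, Float> clause) {
        for (Map.Entry<Integer, Float> hit : clause.entrySet()) {
            total.merge(hit.getKey(), hit.getValue(), Float::sum);
        }
    }

    /** Adds a filter wrapped as a constant-score SHOULD clause: every accepted doc gets the same value. */
    static void addConstant(Map<Integer, Float> total, Set<Integer> filterDocs, float constant) {
        for (int doc : filterDocs) {
            total.merge(doc, constant, Float::sum);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Float> a = new TreeMap<>();
        a.put(1, 0.5f);
        a.put(2, 0.25f);
        Map<Integer, Float> b = new TreeMap<>();
        b.put(2, 0.25f);
        Set<Integer> cOrD = new TreeSet<>(Arrays.asList(2, 7, 8)); // docs accepted by the C | D filter

        Map<Integer, Float> total = new TreeMap<>();
        addScored(total, a);
        addScored(total, b);
        addConstant(total, cOrD, 1.0f);
        System.out.println(total); // docs 7 and 8 now match even though A and B miss them
    }
}
```

The point of the model is that documents matched only by the filter (7 and 8 here) still appear in the result, which is the A | B | C | D behaviour, at the cost of C and D contributing only a flat score.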

Mike



Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Shaun Senecal <ss...@gmail.com>.
At first I thought so, yes, but then I realised that the query I wanted to
execute was A | B | C | D and in reality I was executing (A | B) & (C | D).
I guess my unit tests were missing some cases and don't currently catch
this.




Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Michael McCandless <lu...@mikemccandless.com>.
You should be able to do exactly what you were doing on 2.4, right?
(By setting the rewrite method).

Mike



Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Shaun Senecal <ss...@gmail.com>.
Thanks for the explanation Mike.  It looks like I have no choice but to move
any queries which throw TooManyClauses to be Filters. Sadly, this means a
max query time of 6s under load unless I can find a way to rewrite the query
to span a Query and a Filter.


Thanks again




Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Oct 15, 2009 at 4:57 AM, Shaun Senecal <ss...@gmail.com> wrote:

> Up to Lucene 2.4, this has been working out for us.  However, in
> Lucene 2.9 this breaks since rewrite() now returns a
> ConstantScoreQuery.

You can get back to the 2.4 behavior by calling
prefixQuery.setRewriteMethod(prefixQuery.SCORING_BOOLEAN_QUERY_REWRITE)
before calling rewrite().

> Is there a way I can know that a ConstantScoreQuery will match at
> least 1 term (if not, I don't want to add it to the BooleanQuery)?

There is a new method in 2.9: MultiTermQuery.getTotalNumberOfTerms(),
which returns how many terms were visited during rewrite.  Would that
work?

> My understanding is that Lucene will apply the Filter (C | D) first,
> limiting the result set, then apply the Query (A | B).  Is this
> correct?

Actually the filter & query clauses are AND'd in a sort of leapfrog
fashion, taking turns skipping up to the other's doc ID and only
accepting a doc ID when they both skip to the same point.  But this
(the mechanics of how Lucene takes a filter into account) is an
implementation detail and is likely to change.
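
To make the leapfrog description concrete, here is a small standalone sketch in plain Java. It is not Lucene's implementation (which, as noted, is an internal detail that may change); it simply intersects two ascending arrays of doc IDs by letting whichever side is behind skip up to the other's current doc ID. `LeapfrogSketch`, `advance` and `intersect` are made-up names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LeapfrogSketch {
    /** Returns the first index >= from whose doc ID is >= target, or docs.length if none. */
    static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    /** Intersects two ascending doc ID lists; a doc is accepted only when both sides land on it. */
    static List<Integer> intersect(int[] queryDocs, int[] filterDocs) {
        List<Integer> hits = new ArrayList<>();
        int q = 0;
        int f = 0;
        while (q < queryDocs.length && f < filterDocs.length) {
            if (queryDocs[q] == filterDocs[f]) {
                hits.add(queryDocs[q]);
                q++;
                f++;
            } else if (queryDocs[q] < filterDocs[f]) {
                q = advance(queryDocs, q, filterDocs[f]); // query side skips up to the filter's doc
            } else {
                f = advance(filterDocs, f, queryDocs[q]); // filter side skips up to the query's doc
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] queryDocs = {1, 4, 7, 9, 15};
        int[] filterDocs = {2, 4, 9, 10, 15, 20};
        System.out.println(intersect(queryDocs, filterDocs)); // prints [4, 9, 15]
    }
}
```

The linear scan in `advance` is only for readability; a real posting list would typically use skip data to jump ahead more cheaply.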

> If so, the end result is essentially the query: (A | B) & (C | D)

Except that C, D contribute no scoring information, if scoring
matters.  If scoring doesn't matter at all (even for A, B), you
should use a collector that does not call score() at all to save CPU.

Mike



Re: PrefixQueries on large indexes (4M+ Documents) using a partial Query partial Filter solution

Posted by Shaun Senecal <ss...@gmail.com>.
Sorry for the double post, but I think I can clarify the problem a little
more.

We want to execute:
    query: A | B | C | D
    filter: null

However, C and D cause TooManyClauses, so instead we execute:
    query: A | B
    filter: C | D

My understanding is that Lucene will apply the Filter (C | D) first,
limiting the result set, then apply the Query (A | B).  Is this correct?

If so, the end result is essentially the query: (A | B) & (C | D)

Is there any way I can achieve (A | B | C | D) without putting the entire
query into a filter (which is too slow)?
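
The gap between the two forms is easy to see with plain bit sets standing in for the documents each clause matches. This is only an illustration in standard Java with made-up doc IDs (the class name `QueryVsFilterSketch` and its helpers are hypothetical); nothing here touches Lucene.

```java
import java.util.BitSet;

final class QueryVsFilterSketch {
    /** Builds a set of matching doc IDs. */
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    /** Union of two match sets, leaving the inputs untouched. */
    static BitSet or(BitSet x, BitSet y) {
        BitSet result = (BitSet) x.clone();
        result.or(y);
        return result;
    }

    /** Intersection of two match sets, leaving the inputs untouched. */
    static BitSet and(BitSet x, BitSet y) {
        BitSet result = (BitSet) x.clone();
        result.and(y);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = docs(1, 2);
        BitSet b = docs(2, 3);
        BitSet c = docs(3, 4);
        BitSet d = docs(5);

        BitSet intended = or(or(a, b), or(c, d)); // A | B | C | D
        BitSet executed = and(or(a, b), or(c, d)); // query (A | B) restricted by filter (C | D)

        System.out.println("intended: " + intended); // prints intended: {1, 2, 3, 4, 5}
        System.out.println("executed: " + executed); // prints executed: {3}
    }
}
```

Only doc 3 survives the query-plus-filter form, because it is the only document matched by both halves.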



Shaun

