Posted to solr-user@lucene.apache.org by Chris Dempsey <cd...@gmail.com> on 2020/07/14 12:00:35 UTC

Understanding Negative Filter Queries

I'm trying to understand the difference between something like
fq={!cache=false}(tag:* -tag:email) which is very slow compared to
fq={!cache=false}(*:* -tag:email) on Solr 7.7.1.

I believe in the case of `tag:*` Solr spends some effort to gather all of
the documents that have a value for `tag` and then removes those with
`-tag:email`, while in the `*:*` case Solr simply uses the document set as-is
and then removes those with `-tag:email` (*and I believe Erick mentioned
there were special optimizations for `*:*`*)?

Re: Understanding Negative Filter Queries

Posted by Erick Erickson <er...@gmail.com>.
There’s another possibility if the person who wrote the query (the one
I _should_ shoot) can’t change it: add cost=101 and turn it
into a post-filter. It’s not clear to me how much difference
that’d make, but it might be worth a shot. See:

https://yonik.com/advanced-filter-caching-in-solr-2/

Best,
Erick
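
For illustration, here is a minimal Python sketch of how the suggested cost=101 could be attached to the existing filter through local params. The helper name build_params, the q=*:* main query and rows=0 are assumptions made for the example, not something from the thread. Note that Solr only runs a filter as a true post-filter when cache=false, cost is 100 or more, and the underlying query type implements the PostFilter interface; for other query types the cost value only influences the order in which non-cached filters are evaluated.

```python
from urllib.parse import urlencode


def build_params(excluded_tag, cost=101):
    """Build request parameters for a non-cached negative filter query.

    build_params is a hypothetical helper; only the fq syntax comes from
    the thread. cache=false keeps the filter out of the filterCache, and
    cost is the local param suggested in the message above.
    """
    fq = "{!cache=false cost=%d}(tag:* -tag:%s)" % (cost, excluded_tag)
    # q and rows are placeholder values chosen for this sketch.
    return {"q": "*:*", "fq": fq, "rows": 0}


params = build_params("email")
query_string = urlencode(params)  # URL-encoded form, ready to append to a /select URL

print(params["fq"])
print(query_string)
```

Timing the request with and without the cost local param on a test collection would show whether it makes any difference for this query shape.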



Re: Understanding Negative Filter Queries

Posted by Chris Dempsey <cd...@gmail.com>.
> Well, they’ll be exactly the same if (and only if) every document has a
> tag. Otherwise, the first one will exclude a doc that has no tag and the
> second one will include it.

That's a good point/catch.

> How slow is “very slow”?

Well, in the case I was looking at, it was about 10x slower, with the
caveat that there were 15 or so of these negative fqs, all some
version of `fq={!cache=false}(tag:* -tag:<something>)` (*don't shoot me, I
didn't write it lol*), over 15 million documents. Which to me means that
each fq was doing each step that you described below:

> The second form only has to index into the terms dictionary for the tag
> field value “email”, then zip down the posting list for all the docs that
> have it. The first form has to first identify all the docs that have a
> tag, accumulate that list, _then_ find the “email” value and zip down the
> postings list.

Thanks yet again Erick. That solidified in my mind how this works. Much
appreciated!






Re: Understanding Negative Filter Queries

Posted by Erick Erickson <er...@gmail.com>.
Yeah, there are optimizations there. BTW, these two queries are subtly different.

Well, they’ll be exactly the same if (and only if) every document has a tag. Otherwise, the
first one will exclude a doc that has no tag and the second one will include it.

How slow is “very slow”?

The second form only has to index into the terms dictionary for the tag field
value “email”, then zip down the posting list for all the docs that have it. The
first form has to first identify all the docs that have a tag, accumulate that list,
_then_ find the “email” value and zip down the postings list. 

You could get around this if you require the first form functionality by, say, 
including a boolean field “has_tags”, then the first one would be 

fq=has_tags:true -tag:email

Best,
Erick
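
The difference between the forms discussed above can be sketched with a toy inverted index in Python. This is only a simplified model under stated assumptions: the five sample documents, the function names and the use of Python sets are invented for illustration, and real Lucene walks postings with iterators rather than building sets. It shows both points from the message: the first form does extra work to collect every document that has any tag, and the two forms return different results when some document has no tag.

```python
# Toy index: doc id -> set of tag values. Doc 3 has no tag at all.
docs = {0: {"email", "work"}, 1: {"work"}, 2: {"email"}, 3: set(), 4: {"home"}}

# Postings: tag value -> sorted list of doc ids that carry it.
postings = {}
for doc_id, tags in docs.items():
    for tag in tags:
        postings.setdefault(tag, []).append(doc_id)
for ids in postings.values():
    ids.sort()

all_docs = set(docs)


def match_all_minus(term):
    """Model of '*:* -tag:term': one term lookup, one postings list."""
    return all_docs - set(postings.get(term, []))


def has_tag_minus(term):
    """Model of 'tag:* -tag:term': union every postings list, then subtract."""
    with_tag = set()
    for ids in postings.values():  # touches every distinct tag value
        with_tag.update(ids)
    return with_tag - set(postings.get(term, []))


# Model of an indexed boolean field has_tags:true (a single postings list).
has_tags_true = sorted(d for d, tags in docs.items() if tags)


def flag_minus(term):
    """Model of 'has_tags:true -tag:term': two postings lists, no union step."""
    return set(has_tags_true) - set(postings.get(term, []))


print(sorted(match_all_minus("email")))  # includes doc 3, which has no tag
print(sorted(has_tag_minus("email")))    # excludes doc 3
print(sorted(flag_minus("email")))       # same result as has_tag_minus
```

In this model the has_tags variant returns the same documents as the tag:* form while reading only two postings lists, which is the point of the suggested extra field.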



Re: Understanding Negative Filter Queries

Posted by Emir Arnautović <em...@sematext.com>.
Hi Chris,
tag:* is a wildcard query, while *:* is a match-all query. I believe that adjusting pure negative queries is turned on by default, so you can safely just use -tag:email and it’ll be translated to *:* -tag:email.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
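
The rewrite described above can be pictured with a small Python sketch. It is not Solr's parser: adjust_pure_negative is a hypothetical function that splits on whitespace and only looks at a leading "-", so it ignores nesting, quoting and the NOT keyword. It just shows the idea that a query made up entirely of negative clauses gets a match-all clause to subtract from.

```python
def adjust_pure_negative(query):
    """Prefix a match-all clause when every clause is negative.

    Hypothetical, whitespace-based model of the behaviour described above;
    queries with at least one positive clause are returned unchanged.
    """
    clauses = query.split()
    if clauses and all(clause.startswith("-") for clause in clauses):
        return " ".join(["*:*"] + clauses)
    return query


print(adjust_pure_negative("-tag:email"))
print(adjust_pure_negative("tag:* -tag:email"))
```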


