
Posted to solr-user@lucene.apache.org by Ravi Solr <ra...@gmail.com> on 2012/11/20 21:27:08 UTC

Weird negative query responses

Can somebody kindly clarify how negative queries work? I am having this weird
issue with an analyzed text field. I want to find all docs which don't have
a value in the 'body' field. The field definition and query I am using are
given below. Can somebody tell me what I am doing wrong?

DEFINITIONS
-------------------
<field name="body" type="text" indexed="true" stored="true"/>

    <fieldType name="text" class="solr.TextField" sortMissingLast="true"
omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
ignoreCase="true" expand="true"/>
        <!-- Case insensitive stop word removal.
enablePositionIncrements=true ensures that a 'gap' is left to allow for
accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" generateNumberParts="0" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
protected="protwords.txt"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" generateNumberParts="0" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
protected="protwords.txt"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

QUERY
------------
http://testserver/solr/mycore/select/?q=*:*&start=0&rows=30&fl=systemid,body&fq=contenttype:"Article"
-body:[* TO *] pubdatetime:[2007-01-01T23:59:59Z TO
2010-12-31T00:00:00Z]&debugQuery=on

I get back the following docs, some of which HAVE a body. Why do they match
even though I specifically asked Solr for docs NOT containing a body?

RESPONSE
---------------------
<doc>
    <str name="body">
M1 M2 M3 M4 V1 V2 V3 V4

    </str>
    <str name="systemid">AR2010111900131</str></doc>
<doc>
    <str name="body">
M1 M2 M3 M4 V1 V2 V3 V4

   </str>
   <str name="systemid">AR2010111200081</str>
</doc>
<doc>
   <str name="body">
M1 M2 M3 M4 V1 V2 V3 V4

   </str>
   <str name="systemid">AR2010110408275</str>
</doc>

<doc>
    <str name="systemid">AR2010110406807</str>
</doc>
<doc>
   <str name="systemid">AR2010110303295</str>
</doc>
<doc>
    <str name="systemid">AR2010110105181</str>
</doc>


Thanks

Ravi Kiran Bhaskar

Re: Weird negative query responses

Posted by Chris Hostetter <ho...@fucit.org>.

: Without knowing anything about how Solr is configured, I would guess that it
: is because of a default operator of "OR" making it so that any of those filter
: clauses will match.  Give the following filter a try:

that shouldn't matter -- regardless of the default operator the "-" in 
front of the body:[* TO *] clause should force it to be a negated clause.

: Alternately, you could use three separate filters:
: 
: fq=contenttype:"Article"
: fq=*:* AND -body:[* TO *]
: fq=pubdatetime:[2007-01-01T23:59:59Z TO 2010-12-31T00:00:00Z]

the more significant question there is whether the OP wants to find docs 
that match *all* of the clauses or *any* of the clauses ... if the answer 
is "any" then using separate fq's definitely won't work.
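For reference, a minimal Python sketch of how the three-filter variant quoted above would be sent. Solr intersects separate fq parameters, so this form only makes sense when a doc must satisfy *all* of the clauses. The parameter values are copied from the thread; the variable names are just for illustration.

```python
from urllib.parse import urlencode, parse_qsl

# One list entry per fq parameter; Solr intersects separate fq's,
# so a doc must pass every one of them.
params = {
    "q": "*:*",
    "fl": "systemid,body",
    "fq": [
        'contenttype:"Article"',
        "*:* AND -body:[* TO *]",
        "pubdatetime:[2007-01-01T23:59:59Z TO 2010-12-31T00:00:00Z]",
    ],
}

# doseq=True emits a separate, URL-encoded fq=... pair for each list entry.
query_string = urlencode(params, doseq=True)

# Decode again to check that each filter survived encoding intact.
fq_values = [value for key, value in parse_qsl(query_string) if key == "fq"]
print(query_string)
```

If "any" is what is wanted, the clauses would have to stay inside a single fq instead.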

As far as the original question...

1) when troubleshooting, it definitely makes sense to simplify the 
problem down to rule things out ... if "AR2010111900131" is the id of a 
doc that has a body but mysteriously matches your query anyway, start with 
a very targeted query to sanity check things...

  q=systemid:AR2010111900131&fq=-body:[* TO *]
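A rough sketch of building that sanity-check request in Python, so that the brackets, colons and spaces get URL-encoded properly. The base URL is the one from the original post; `build_select_url` is a hypothetical helper written for this example, not part of any Solr client library.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL taken from the original post; adjust for your own server and core.
BASE_URL = "http://testserver/solr/mycore/select"

def build_select_url(q, fqs=(), **extra):
    """Hypothetical helper: build a /select URL with one fq param per filter."""
    params = [("q", q)]
    params += [("fq", fq) for fq in fqs]
    params += sorted(extra.items())
    return BASE_URL + "?" + urlencode(params)

# Target the one suspicious doc, and apply only the negated body filter.
url = build_select_url(
    "systemid:AR2010111900131",
    fqs=["-body:[* TO *]"],
    fl="systemid,body",
    debugQuery="on",
)
print(url)
```

If that one doc still comes back, the problem is in how the body field was indexed, not in how the other clauses combine.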

2) -body:[* TO *] matches documents that have no *INDEXED* terms in the 
body field -- what you showed in your results is that the documents have a 
*STORED* value in the body field -- are you absolutely certain that your 
analyzer doesn't prune everything down from the original body value so 
that there are no terms left to index?  have you tried pasting "M1 M2 M3 
M4 V1 V2 V3 V4" into the analysis tool for that field type?  

In particular i'm suspicious of what might be in your synonym files, stopword 
files, and protwords.txt ... especially protwords.txt, because that WDF 
config looks really fishy.  if i'm reading those options correctly it's 
not going to index anything for any of those input "words" unless they are 
in protwords.txt
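To illustrate the indexed-vs-stored distinction, here is a toy Python model. It is NOT Solr's real analysis chain: `toy_analyze` is a made-up stand-in that simply assumes the analyzer drops every word not listed in protwords.txt, and `PROTECTED` assumes that file is empty. The point is only that a doc can carry a stored body while contributing no indexed terms.

```python
# Toy model only: this does not reproduce Solr's actual filters.
PROTECTED = set()  # assumption: stands in for an empty protwords.txt

def toy_analyze(text):
    """Simplified analyzer: split on whitespace, lowercase, keep protected words only."""
    return [tok for tok in text.lower().split() if tok in PROTECTED]

docs = {
    "AR2010111900131": "M1 M2 M3 M4 V1 V2 V3 V4",  # stored body from the thread
    "AR2010110406807": None,                       # doc with no body at all
}

stored = {}   # what comes back via fl=body
indexed = {}  # term -> doc ids; what body:[* TO *] actually consults
for doc_id, body in docs.items():
    if body is None:
        continue
    stored[doc_id] = body
    for term in toy_analyze(body):
        indexed.setdefault(term, set()).add(doc_id)

# body:[* TO *] matches docs with at least one indexed term in the field;
# the negated clause matches everything else.
has_indexed_terms = set().union(*indexed.values())
matches_negation = sorted(set(docs) - has_indexed_terms)
print(matches_negation)
```

In this toy setup both docs match the negated clause, even though one of them has a stored body. The analysis tool will show whether the real analyzer behaves the same way for your data.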


-Hoss

Re: Weird negative query responses

Posted by Shawn Heisey <so...@elyograg.org>.

On 11/20/2012 1:27 PM, Ravi Solr wrote:
> Can somebody kindly clarify how negative queries work? I am having this weird
> issue with an analyzed text field. I want to find all docs which don't have
> a value in the 'body' field. The field definition and query I am using are
> given below. Can somebody tell me what I am doing wrong?

<snip>

> QUERY
> ------------
> http://testserver/solr/mycore/select/?q=*:*&start=0&rows=30&fl=systemid,body&fq=contenttype:"Article"
> -body:[* TO *] pubdatetime:[2007-01-01T23:59:59Z TO
> 2010-12-31T00:00:00Z]&debugQuery=on
>
> I get back the following docs, some of which HAVE a body. Why do they match
> even though I specifically asked Solr for docs NOT containing a body?

Without knowing anything about how Solr is configured, I would guess 
that it is because of a default operator of "OR" making it so that any 
of those filter clauses will match.  Give the following filter a try:

fq=contenttype:"Article" AND -body:[* TO *] AND 
pubdatetime:[2007-01-01T23:59:59Z TO 2010-12-31T00:00:00Z]

Alternately, you could use three separate filters:

fq=contenttype:"Article"
fq=*:* AND -body:[* TO *]
fq=pubdatetime:[2007-01-01T23:59:59Z TO 2010-12-31T00:00:00Z]

Thanks,
Shawn