Posted to solr-user@lucene.apache.org by Marian Steinbach <ma...@sendung.de> on 2012/02/03 17:11:17 UTC

Zero Matches Weirdness

Hi!

I am having a weird issue with a search string not producing a match
where it should. I can reproduce it with both 3.4 and 3.5.

"Where it should" means that I am getting a hit in the "Analysis" tool
in the admin panel, but not in a query via /select.

Now when I try

   select?q=Am+Heidstamm&...

I get zero results back. But, when I quote the string

  select?q=%22Am+Heidstamm%22&...

I get several hits.

BTW, the token "am" is filtered out in the field text, since it's in a
stopword list.

Any ideas on how this can be explained?

My defaultSearchField is "text". The field gets its content via
several copyField statements.

The configuration for text is as follows:

   <field name="text" type="text_de" indexed="true" stored="false" multiValued="true" />

The configuration for type text_de is this:

    <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <!-- protect slashes from tokenizer by replacing with something unique -->
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Z]+)/([0-9]+)/([0-9]+)" replacement="$1ḧ$2ḧ$3" />
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([0-9]+)/([0-9]+)" replacement="$1ḧ$2" />
            <!-- protect paragraph symbol from tokenizer -->
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="§\s*([0-9]+)" replacement="ǚ$1" />
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="1" preserveOriginal="1"
                splitOnCaseChange="1" />
            <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_de.txt" enablePositionIncrements="true" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.GermanMinimalStemFilterFactory" />
            <!-- get slashes back in -->
            <filter class="solr.PatternReplaceFilterFactory" pattern="ḧ"
                replacement="/" />
            <!-- get paragraph symbols back in -->
            <filter class="solr.PatternReplaceFilterFactory" pattern="ǚ"
                replacement="§" />
        </analyzer>
    </fieldType>


Log output for the unquoted phrase:

INFO: [] webapp=/solr path=/select
params={facet=true&sort=score+desc&fl=sitzung,gremium,betreff,datum,timestamp,score,aktenzeichen,typ,id,anhang&debugQuery=true&start=0&q=Am+Heidstamm&hl.fl=betreff&wt=json&fq=&hl=true&rows=10}
hits=0 status=0 QTime=29

... and for the quoted one:

INFO: [] webapp=/solr path=/select
params={facet=true&sort=score+desc&fl=sitzung,gremium,betreff,datum,timestamp,score,aktenzeichen,typ,id,anhang&start=0&q="Am+Heidstamm"&hl.fl=betreff&wt=standard&fq=&hl=true&rows=10&version=2.2}
hits=14 status=0 QTime=244


Thanks!

Re: Zero Matches Weirdness

Posted by Dmitry Kan <dm...@gmail.com>.

Ok, thanks, Erik, good to know. Sorry for the confusion.

On Fri, Feb 3, 2012 at 9:42 PM, Erik Hatcher <er...@gmail.com> wrote:

> No, don't do that.  That's definitely not good advice.  If the analysis
> chain is the same for both index and query, just use <analyzer>.
>
> As for Marian's issue... was there literally a + in the query or was that
> urlencoded?   Try debugQuery=true for both queries and see what you get for
> the query parsing output.
>
>        Erik


-- 
Regards,

Dmitry Kan

Re: Zero Matches Weirdness

Posted by Marian Steinbach <ma...@sendung.de>.

I just removed the one field that never matched, "aktenzeichen", from
the qf string. Now it works fine. Solved for now.
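
For anyone who wants to try the same change without editing the
handler defaults, here is a rough sketch of a request that passes the
reduced qf explicitly. The base URL is hypothetical, and it assumes
that defType and qf may be overridden per request, which depends on how
the handler is configured in solrconfig.xml:

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical base URL; adjust host, port and path to the actual Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

params = {
    "q": "Am Heidstamm",
    "defType": "edismax",
    # The qf from the earlier message, with aktenzeichen left out.
    "qf": "betreff^5.0 body^0.2 text^0.1",
    # Ask Solr to include the parsed query in the response.
    "debugQuery": "true",
}

url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```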

Thanks!

Re: Zero Matches Weirdness

Posted by Marian Steinbach <ma...@sendung.de>.

2012/2/3 Erik Hatcher <er...@gmail.com>:
> As for Marian's issue... was there literally a + in the query or was that urlencoded?   Try debugQuery=true for both queries and see what you get for the query parsing output.
>

I tested both + and %20 with and without quotes, it doesn't make a
difference whether I use + or %20.
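
That + and %20 behave the same is expected, since both decode to a
space in a query string. A small sketch using only Python's standard
library (nothing Solr-specific is assumed here) shows the three
encodings involved:

```python
from urllib.parse import parse_qs, quote, urlencode

# Encode the same value with '+' for spaces (the default) and with '%20'.
plus_form = urlencode({"q": "Am Heidstamm"})
pct_form = urlencode({"q": "Am Heidstamm"}, quote_via=quote)
# The quoted variant: the double quotes become %22.
quoted_form = urlencode({"q": '"Am Heidstamm"'})


def decoded_q(query_string):
    """Return the decoded value of the q parameter from a query string."""
    return parse_qs(query_string)["q"][0]


print(plus_form, pct_form, quoted_form)
print(decoded_q(plus_form) == decoded_q(pct_form))
```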

Here is the debug output for the unquoted version (zero hits):

debug: {
   rawquerystring: "Am Heidstamm",
   querystring: "Am Heidstamm",
   parsedquery: "+((DisjunctionMaxQuery((aktenzeichen:Am^10.0))
DisjunctionMaxQuery((text:heidstamm^0.1 | betreff:heidstamm^3.0 |
aktenzeichen:Heidstamm^10.0)))~2)",
   parsedquery_toString: "+(((aktenzeichen:Am^10.0)
(text:heidstamm^0.1 | betreff:heidstamm^3.0 |
aktenzeichen:Heidstamm^10.0))~2)",
   QParser: "ExtendedDismaxQParser",
}


And for the quoted version (with hits):

{
   rawquerystring: ""Am Heidstamm"",
   querystring: ""Am Heidstamm"",
   parsedquery: "+DisjunctionMaxQuery((text:heidstamm^0.1 |
betreff:heidstamm^3.0 | aktenzeichen:Am Heidstamm^10.0))",
   parsedquery_toString: "+(text:heidstamm^0.1 | betreff:heidstamm^3.0
| aktenzeichen:Am Heidstamm^10.0)",
   explain: { },
   QParser: "ExtendedDismaxQParser",
}


It seems to me that the condition "+(((aktenzeichen:Am^10.0)
(text:heidstamm^0.1 | betreff:heidstamm^3.0 |
aktenzeichen:Heidstamm^10.0))~2)" cannot be fulfilled. I have "AND" as
the default operator, and the clause "(aktenzeichen:Am^10.0)" cannot be
satisfied. The question is: why does it even appear there?

This is my current qf:

   betreff^5.0 aktenzeichen^10.0 body^0.2 text^0.1

I have just changed this to only

   text^0.1

for the sake of testing, and then it works.

It seems I haven't quite understood the impact of qf. I thought it
would only let me boost the score based on a string appearing in a
field. I didn't expect it to affect what matches and what doesn't.
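
The parsed query above is consistent with the following toy model,
in which every query term is analyzed separately for each field in qf
and every resulting clause is required. This is only a sketch and not
Solr's implementation. The analyzers are stand-ins inferred from the
debug output (text and betreff drop "am" and lowercase, aktenzeichen
keeps terms unchanged), and the sample document is made up:

```python
# Toy model only; none of this is Solr code.
STOPWORDS = {"am"}  # assumed: the relevant entry of stopwords_de.txt


def analyze_text(term):
    """Stand-in for text_de: lowercase and drop stopwords (no stemming)."""
    token = term.lower()
    return None if token in STOPWORDS else token


def analyze_keep(term):
    """Stand-in for a field type with no stopword filter: keep the term."""
    return term


# Assumed mapping of fields to analyzers, inferred from the debug output.
FIELD_ANALYZERS = {
    "text": analyze_text,
    "betreff": analyze_text,
    "aktenzeichen": analyze_keep,
}


def build_clauses(query, qf):
    """Build one clause per term; each clause lists field:token alternatives."""
    clauses = []
    for term in query.split():
        alternatives = []
        for field in qf:
            token = FIELD_ANALYZERS[field](term)
            if token is not None:
                alternatives.append(f"{field}:{token}")
        if alternatives:  # a term dropped by every field yields no clause
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Every clause is required; any one alternative satisfies a clause."""
    return all(any(alt in doc for alt in clause) for clause in clauses)


# Hypothetical document, written as a set of indexed field:token pairs.
doc = {"text:heidstamm", "betreff:heidstamm"}

all_fields = build_clauses("Am Heidstamm", ["text", "betreff", "aktenzeichen"])
text_only = build_clauses("Am Heidstamm", ["text"])
print(all_fields, matches(doc, all_fields))
print(text_only, matches(doc, text_only))
```

In this model "Am" survives only in aktenzeichen, so it forms a required
clause that nothing in the document can satisfy. With qf reduced to
text, the stopword disappears from the query entirely and the one
remaining clause matches.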

Marian

Re: Zero Matches Weirdness

Posted by Erik Hatcher <er...@gmail.com>.

No, don't do that.  That's definitely not good advice.  If the analysis chain is the same for both index and query, just use <analyzer>.

As for Marian's issue... was there literally a + in the query or was that urlencoded?   Try debugQuery=true for both queries and see what you get for the query parsing output.

	Erik

On Feb 3, 2012, at 14:18 , Dmitry Kan wrote:

> Actually, I wouldn't count on it and just specify index and query sides
> explicitly. Just to play it safe.

Re: Zero Matches Weirdness

Posted by Dmitry Kan <dm...@gmail.com>.

Actually, I wouldn't count on it and just specify index and query sides
explicitly. Just to play it safe.

On Fri, Feb 3, 2012 at 8:34 PM, Marian Steinbach <ma...@sendung.de> wrote:

> 2012/2/3 Dmitry Kan <dm...@gmail.com>:
> > What about <query> side of the field?
> >
>
> It's identical. At least that's what I think, since I didn't specify
> the type="query" or type="index" attribute for the analyzer part.
>
> Marian
>

-- 
Regards,

Dmitry Kan

Re: Zero Matches Weirdness

Posted by Marian Steinbach <ma...@sendung.de>.

2012/2/3 Dmitry Kan <dm...@gmail.com>:
> What about <query> side of the field?
>

It's identical. At least that's what I think, since I didn't specify
the type="query" or type="index" attribute for the analyzer part.

Marian

Re: Zero Matches Weirdness

Posted by Dmitry Kan <dm...@gmail.com>.

What about <query> side of the field?


-- 
Regards,

Dmitry Kan