
Posted to solr-user@lucene.apache.org by jglazner <jg...@coldcrow.com> on 2010/06/11 10:49:53 UTC

False Positives?

OK, let me preface this by saying I'm a noob to Solr/Lucene, so if this
is totally obvious, please forgive me. I've been searching for a while now
and can't seem to figure out what is going on.

So here is the problem: I've got a little dev index built with some songs
in it that I'm developing against. When I search for springsteen through
the Solr admin, my highest score is for a completely unrelated song called
Agnus Dei - by Erna Spoorenberg [Soprano] (Haydn: Harmonienmesse). When I
look closer at the fields stored in that record, the word springsteen is
nowhere to be found in any of them, so I'm totally
confused. When I turn on hit highlighting to find out what it thinks it
matched on, it's highlighting the words "Spoorenberg" and "Soprano"?!?! I
enabled the debug query and I see this toward the bottom, but I'm not sure
what it means:

<str name="rawquerystring">springsteen</str>
<str name="querystring">springsteen</str>
<str name="parsedquery">name_title:SPRN</str>
<str name="parsedquery_toString">name_title:SPRN</str>
<lst name="explain">
<str name="artist.artist.3106">
4.42386 = (MATCH) fieldWeight(name_title:SPRN in 3105), product of:
  1.4142135 = tf(termFreq(name_title:SPRN)=2)
  6.2562833 = idf(docFreq=704, maxDocs=135196)
  0.5 = fieldNorm(field=name_title, doc=3105)
</str>
..........
This goes on for all the results. So as near as I could tell, it took the
term springsteen and truncated it to sprn? But even so, how does sprn match
"Spoorenberg" or "Soprano"?

I'm using Solr 1.4.

Thanks for any input you can give me.

Jed.




-- 
View this message in context: http://lucene.472066.n3.nabble.com/False-Positives-tp888027p888027.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: False Positives?

Posted by Walter Underwood <wu...@wunderwood.org>.

This filter chain takes a word, stems it, then converts the stem to a phonetic representation. 

1. Only do one transformation for each field, like stemming or phonetic.
2. Stemming isn't useful for names.

You are also removing stopwords, which can be a problem for names.

Here is an example of what that chain is doing. You should be able to see this with the analysis page in the admin UI.

"The Cars"
"cars"  (remove stopwords, lower case)
"car" (stem)
"KR" (phonetic)

There are some other problems here, like using synonyms at query time. That results in unexpected scoring, because the synonyms will have different IDFs. The variant that is most rare in the index will win. If they are applied at index time, all variants will have the same IDF.
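
A rough sketch of how these points might look in schema.xml, assuming you want both exact and sounds-like matching on names. The type and field names (text_name, text_name_phonetic, name_title_phonetic) are made up for illustration. The factories and the index_synonyms.txt file are the ones already present in the chain quoted below. There is no stemmer and no stopword filter, synonyms are applied on the index side only, and the phonetic encoding lives in its own field.

```xml
<!-- Hypothetical type for names: tokenize and lowercase only.
     No stemming, no stopword removal, no phonetic encoding. -->
<fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Synonyms expanded at index time, so all variants share the same IDF -->
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type for sounds-like matching: phonetic encoding is the
     only lossy transformation in this chain. -->
<fieldType name="text_name_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
            inject="false"/>
  </analyzer>
</fieldType>

<field name="name_title" type="text_name" indexed="true" stored="true"
       multiValued="false"/>
<!-- Hypothetical second field, filled from name_title by the copyField below -->
<field name="name_title_phonetic" type="text_name_phonetic" indexed="true"
       stored="false" multiValued="false"/>
<copyField source="name_title" dest="name_title_phonetic"/>
```

The phonetic field will still match "springsteen" against "Spoorenberg", since that is what phonetic matching does. The idea is that exact matches in name_title can be weighted higher (for example with field boosts in a dismax query), so soundalikes only show up further down. A schema change like this needs a full re-index.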

wunder

On Jun 11, 2010, at 2:12 AM, jglazner wrote:

> 
> Chantal,
> 
> Thanks for the quick response:
> 
> Here is the field def from the schema for the field and the field type:
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> inject="false"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> inject="false"/>
>      </analyzer>
>    </fieldType>
> 
> <field name="name_title" type="text" indexed="true" stored="true"
> multiValued="false" />
> 
> Jed.
> 

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto

Re: False Positives?

Posted by Ahmet Arslan <io...@yahoo.com>.

solr.PhoneticFilterFactory looks suspicious. Can you verify this on the Solr admin analysis.jsp page? You can debug your analysis chain on that page.
If you paste "springsteen", it will show you the output of each tokenizer/token filter step by step.

Re: False Positives?

Posted by jglazner <jg...@coldcrow.com>.

Chantal,

Thanks for the quick response:

Here is the field def from the schema for the field and the field type:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
inject="false"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
inject="false"/>
      </analyzer>
    </fieldType>

<field name="name_title" type="text" indexed="true" stored="true"
multiValued="false" />

Jed.

On Fri, Jun 11, 2010 at 3:05 AM, Chantal Ackermann [via Lucene] wrote:

> Hi Jed,
>
> please paste the complete field definition of "name_title" from your
> schema.xml.
>
> You are using an analyzer that reduces your text in an undesired way, on
> both index and query side. You probably want "String" for names, or
> similar.
>
> "Spoorenberg" or "Soprano" are analyzed in the same way as
> "springsteen", obviously. And the result is "SPRN" for all of them.
>
> Chantal

Re: False Positives?

Posted by Chantal Ackermann <ch...@btelligent.de>.

Hi Jed,

please paste the complete field definition of "name_title" from your
schema.xml.

You are using an analyzer that reduces your text in an undesired way, on
both the index and the query side. You probably want a "string" field type
for names, or similar.

"Spoorenberg" and "Soprano" are analyzed in the same way as
"springsteen", obviously, and the result is "SPRN" for all of them.

Chantal
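
For reference, the explain block in the question can be read as a simple product. This is a sketch that assumes Lucene's default similarity for this version, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The fieldNorm value is taken as given from the debug output.

```latex
\begin{align*}
\mathrm{tf}    &= \sqrt{\mathrm{termFreq}} = \sqrt{2} \approx 1.4142 \\
\mathrm{idf}   &= 1 + \ln\frac{\mathrm{maxDocs}}{\mathrm{docFreq} + 1}
                = 1 + \ln\frac{135196}{705} \approx 6.2563 \\
\mathrm{score} &= \mathrm{tf} \times \mathrm{idf} \times \mathrm{fieldNorm}
                \approx 1.4142 \times 6.2563 \times 0.5 \approx 4.4239
\end{align*}
```

termFreq=2 is consistent with both "Spoorenberg" and "Soprano" having been indexed as SPRN in that document, and docFreq=704 means 704 documents in the index contain that same code.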

