Posted to solr-user@lucene.apache.org by roySolr <ro...@gmail.com> on 2011/04/26 16:27:44 UTC

WhitespaceTokenizer and scoring(field length)

Hello,

I have a problem with the WhitespaceTokenizer and scoring. An example:

id                     Title
1                      Manchester united
2                      Manchester

With the WhitespaceTokenizer, "Manchester united" is split into
"Manchester" and "united". When I search for "manchester", I get
ids 1 and 2 in my results. What I want is for id 2 to score higher
(because its field is shorter).
How can I fix this?


--
View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: WhitespaceTokenizer and scoring(field length)

Posted by Erick Erickson <er...@gmail.com>.

First, you can give us some more data to work with <G>...

In particular, append &debugQuery=on to your HTTP request and post
the results. That will show how the documents got their score.

Also, show us the <fieldType> definition and <field> definition for the field
in question.
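
For reference, a minimal sketch of building such a request in Python. The host, port, path and the field name `title` are assumptions for illustration; only the `debugQuery=on` parameter comes from the advice above.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and path to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_debug_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr to explain each document's score."""
    params = {
        "q": query,
        "fl": "*,score",      # return stored fields plus the score
        "debugQuery": "on",   # adds the "explain" section to the response
    }
    return base_url + "?" + urlencode(params)


print(build_debug_url("title:manchester"))
```

The interesting part of the response is the "explain" section, which breaks each score down into its tf, idf and fieldNorm factors.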

Best
Erick

Re: WhitespaceTokenizer and scoring(field length)

Posted by Otis Gospodnetic <ot...@yahoo.com>.

In Solr's schema.xml you can use omitNorms="true" to turn norms off on a
field-by-field basis.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: WhitespaceTokenizer and scoring(field length)

Posted by Jonathan Rochkind <ro...@jhu.edu>.

You can turn off norms for the field. It doesn't make any sense to talk
about "changing the length norm". The length norm is based on the size
of the field for the particular document, to implement the TF/IDF-style
relevance algorithm. But you can turn off norms for the field if you
don't want the field length to be taken into account.

I _think_ if you turn off norms, the relevancy will be based purely on
"term count" rather than "term frequency", which is what you're wanting.
But I'm not sure of that; I get confused too thinking about the implications
of all this stuff, but it's something to try/look into. I forget exactly
how you turn off norms, or if there are ways to turn off some kinds of
field norms but not others, but I recall there is definitely a way to do
it on a field-by-field basis (not, I think, on a query-by-query basis).
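
To make the effect concrete, here is a rough sketch of the per-term field weight as it appears in debugQuery output: tf * idf * fieldNorm. It assumes that omitting norms makes fieldNorm a constant 1.0, and the idf value is made up for illustration. The two stored norms are the 1-term and 2-term values from the lengthNorm table posted in this thread.

```python
import math


def field_weight(term_freq, idf, field_norm):
    """tf * idf * fieldNorm, with tf = sqrt(term frequency)."""
    return math.sqrt(term_freq) * idf * field_norm


IDF = 2.0  # made-up value; idf is per term, so it is the same for both documents

# With norms: the shorter title keeps a larger stored norm.
with_norms = {
    "Manchester": field_weight(1, IDF, 1.0),
    "Manchester united": field_weight(1, IDF, 0.625),
}

# Without norms (assumption: fieldNorm is treated as 1.0 for every document).
without_norms = {title: field_weight(1, IDF, 1.0) for title in with_norms}

print(with_norms)
print(without_norms)
```

Under these assumptions, turning norms off removes the length component but leaves tf in place, so the two titles would tie instead of the shorter one scoring higher.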

Re: WhitespaceTokenizer and scoring(field length)

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Yes, it is possible to implement your own Lucene Similarity in which you can 
override the length norm.

Otis

Re: WhitespaceTokenizer and scoring(field length)

Posted by Ahmet Arslan <io...@yahoo.com>.



> Is it possible to change the lengthNorm?

Yes, you can customize it and plug it into Solr. DefaultSimilarity and SweetSpotSimilarity can be a starting point.

http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/search/Similarity.html#lengthNorm%28java.lang.String,%20int%29
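
Before writing a custom Similarity in Java, it may help to check what a candidate formula looks like once the norm has been squeezed into a single byte. The sketch below only approximates that storage step (it keeps two explicit mantissa bits by truncation and ignores the range limits of the real encoding), and `steeper_length_norm` is a hypothetical alternative, not something Lucene ships.

```python
import math


def quantize_norm(value):
    """Approximate the one-byte norm storage: keep two explicit mantissa bits, truncate the rest."""
    if value <= 0.0:
        return 0.0
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent, 0.5 <= mantissa < 1
    return math.ldexp(math.floor(mantissa * 8) / 8, exponent)


def default_length_norm(num_terms):
    """1/sqrt(numTerms), the formula assumed here for DefaultSimilarity."""
    return 1.0 / math.sqrt(num_terms)


def steeper_length_norm(num_terms):
    """Hypothetical alternative that falls off faster for short fields."""
    return 1.0 / num_terms


for n in range(1, 6):
    print(n, quantize_norm(default_length_norm(n)), quantize_norm(steeper_length_norm(n)))
```

With the default formula, 3 and 4 terms collapse to the same stored value; with the steeper one they stay apart, at the cost of penalizing longer fields much more heavily. Any formula still has to survive the same coarse storage, so some field lengths will always share a norm.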


Re: WhitespaceTokenizer and scoring(field length)

Posted by roySolr <ro...@gmail.com>.

Thanks!! It's clear now, sometimes the lengthNorm is the same. See the table
below:

# of terms    lengthNorm
     1        1.0
     2        0.625
     3        0.5
     4        0.5
     5        0.4375
     6        0.375
     7        0.375
     8        0.3125
     9        0.3125
    10        0.3125

Is it possible to change the lengthNorm? 
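
The repeated values in the table are consistent with how the norm is stored rather than with the formula itself. As far as I recall, DefaultSimilarity in Lucene 3.x computes lengthNorm as 1/sqrt(numTerms) and then encodes it into one byte, which leaves very little precision. The sketch below is a Python port, written from memory, of Lucene's SmallFloat `floatToByte315` / `byte315ToFloat` pair; treat it as an illustration under that assumption, not as a reference implementation.

```python
import math
import struct

ZERO_POINT = (63 - 15) << 3  # offset between float32 exponent and the byte's exponent


def float_to_byte315(value):
    """Encode a float into one byte (port of the assumed SmallFloat.floatToByte315)."""
    bits = struct.unpack(">i", struct.pack(">f", value))[0]  # float32 bits as a signed int
    small = bits >> 21  # keep sign, exponent and the top two mantissa bits
    if small <= ZERO_POINT:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> smallest value
    if small >= ZERO_POINT + 0x100:
        return 255                    # overflow -> largest value
    return small - ZERO_POINT


def byte315_to_float(encoded):
    """Decode the byte back into the float that is used at query time."""
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def stored_length_norm(num_terms):
    """lengthNorm = 1/sqrt(numTerms), after a round trip through the byte encoding."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


for n in range(1, 11):
    print(n, stored_length_norm(n))
```

If the port is right, this reproduces the table above: 3 and 4 terms both decode to 0.5, and 8, 9 and 10 terms all decode to 0.3125.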


Re: WhitespaceTokenizer and scoring(field length)

Posted by Ahmet Arslan <io...@yahoo.com>.

Lucene/Solr's length normalization does not discriminate well between very short documents.

See Jay's excellent explanation for more details. http://search-lucene.com/m/uGKRu1pvRjw/




Re: WhitespaceTokenizer and scoring(field length)

Posted by roySolr <ro...@gmail.com>.

I thought it was something simple. Here is my configuration:

<fieldType name="searchType" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>

<field name="searchField" type="searchType" indexed="true" stored="true" multiValued="true"/>

<copyField source="name" dest="searchField" maxChars="500"/>
<copyField source="storechain" dest="searchField" maxChars="500"/>
<copyField source="related_category" dest="searchField" maxChars="500"/>

I search for "supermarket":

<doc>
	<str name="companyid">357</str>
	<str name="name">LIDL Headoffice</str>
	<arr name="related_category">
		<str>Supermarkt</str>
	</arr>
	<str name="storechain">LIDL</str>
	<arr name="searchField">
		<str>LIDL</str>
		<str>LIDL Headoffice</str>
		<str>Supermarket</str>
	</arr>
</doc>

<doc>
	<str name="companyid">719</str>
	<str name="name">LIDL</str>
	<arr name="related_category">
		<str>Supermarket</str>
	</arr>
	<str name="storechain">LIDL</str>
	<arr name="searchField">
		<str>LIDL</str>
		<str>LIDL</str>
		<str>Supermarket</str>
	</arr>
</doc>



debugQuery:
Both documents have the same score, but doc 357 has more characters in the
searchField.

<lst name="explain">
    <str name="357">
        1.4330883 = (MATCH) fieldWeight(searchField:supermarket in 325), product of:
          1.0 = tf(termFreq(searchField:supermarket)=1)
          2.8661766 = idf(docFreq=3194, maxDocs=20651)
          0.5 = fieldNorm(field=searchField, doc=325)
    </str>

    <str name="719">
        1.4330883 = (MATCH) fieldWeight(searchField:supermarket in 678), product of:
          1.0 = tf(termFreq(searchField:supermarket)=1)
          2.8661766 = idf(docFreq=3194, maxDocs=20651)
          0.5 = fieldNorm(field=searchField, doc=678)
    </str>
</lst>
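
The explain output can be checked by hand. The sketch below assumes the classic formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is how I remember DefaultSimilarity defining them; the input numbers are taken straight from the output above.

```python
import math


def tf(term_freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency as assumed for DefaultSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    """Product of the three factors listed in the explain output."""
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


score = field_weight(term_freq=1, doc_freq=3194, max_docs=20651, field_norm=0.5)
print(score)  # should land very close to the 1.4330883 reported by Solr
```

The only factor that could separate the two documents is fieldNorm, and it is 0.5 for both. That fits the lengthNorm table elsewhere in this thread: doc 357's searchField holds four whitespace tokens (LIDL, LIDL, Headoffice, Supermarket) and doc 719's holds three, and both 3 and 4 terms are stored as 0.5.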


Re: WhitespaceTokenizer and scoring(field length)

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,

If you run your query with debugQuery=true you will see the explanation about 
how Lucene/Solr went about scoring your 2 docs.  If you can't figure out what's 
going on from there, send the relevant part to the list, along with the parsed 
query (which you can also see from debugQuery=true output) and maybe we can 
help.

Otis