Posted to solr-user@lucene.apache.org by roySolr <ro...@gmail.com> on 2011/04/26 16:27:44 UTC
WhitespaceTokenizer and scoring(field length)
Hello,
I have a problem with the WhitespaceTokenizer and scoring. An example:
id Titel
1 Manchester united
2 Manchester
With the WhitespaceTokenizer, "Manchester united" is split into
"Manchester" and "united". When
I search for "manchester", I get ids 1 and 2 in my results. What I want is
for id 2 to score higher (because its field is shorter).
How can I fix this?
--
View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: WhitespaceTokenizer and scoring(field length)
Posted by Erick Erickson <er...@gmail.com>.
First, you can give us some more data to work with <G>...
In particular, append &debugQuery=on to your HTTP request and post
the results. That will show how the documents got their score.
Also, show us the <fieldType> definition and <field> definition for the field
in question.
Best
Erick
On Tue, Apr 26, 2011 at 10:27 AM, roySolr <ro...@gmail.com> wrote:
> Hello,
>
> I have a problem with the WhitespaceTokenizer and scoring. An example:
>
> id Titel
> 1 Manchester united
> 2 Manchester
>
> With the WhitespaceTokenizer, "Manchester united" is split into
> "Manchester" and "united". When
> I search for "manchester", I get ids 1 and 2 in my results. What I want is
> for id 2 to score higher (because its field is shorter).
> How can I fix this?
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
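As a rough illustration of the request Erick describes, the sketch below builds a select URL with debugQuery=on appended. The host, port and /solr/select path are assumptions (a stock single-core install), and the class and method names are made up for this example; adjust them to your own setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class DebugQueryUrl {

    // Builds "<baseUrl>?q=<encoded query>&debugQuery=on".
    // baseUrl is an assumption, e.g. http://localhost:8983/solr/select
    static String debugUrl(String baseUrl, String query) {
        try {
            return baseUrl + "?q=" + URLEncoder.encode(query, "UTF-8") + "&debugQuery=on";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(debugUrl("http://localhost:8983/solr/select", "manchester"));
    }
}
```

The "explain" section of the response then lists, per document, the factors that were multiplied together to produce the score.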
Re: WhitespaceTokenizer and scoring(field length)
Posted by Otis Gospodnetic <ot...@yahoo.com>.
In Solr's schema.xml you can use omitNorms="true" to turn norms off on a
field-by-field basis.
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
----- Original Message ----
> From: Jonathan Rochkind <ro...@jhu.edu>
> To: solr-user@lucene.apache.org
> Sent: Wed, April 27, 2011 11:29:01 AM
> Subject: Re: WhitespaceTokenizer and scoring(field length)
>
> You can turn off norms for the field. It doesn't make any sense to talk about
>"changing the length norm". The length norm is based on the size of the field
>for the particular document, to implement the TF/IDF style relevance
>algorithm. But you can turn off norms for the field if you don't want TF to
>be taken into account.
>
> I _think_ if you turn off norms the relevancy will be based purely on "term
>count" rather than "term frequency", which is what you're wanting. But not sure
>of that, I get confused too thinking about the implications of all this stuff,
>but it's something to try/look into. Forget exactly how you turn off norms, or
>if there are ways to turn off some kinds of field norms but not others, but I
>recall there is definitely a way to do it on a field-by-field basis (not I
>think on a query-by-query basis).
>
> On 4/27/2011 8:25 AM, roySolr wrote:
> > Thanks!! It's clear now, sometimes the lengthNorm is the same. See the table
> > below:
> >
> > # of terms lengthNorm
> > 1 1.0
> > 2 .625
> > 3 .5
> > 4 .5
> > 5 .4375
> > 6 .375
> > 7 .375
> > 8 .3125
> > 9 .3125
> > 10 .3125
> >
> > Is it possible to change the lengthNorm?
> >
> > --
> > View this message in context:
>http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2870206.html
>
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
Re: WhitespaceTokenizer and scoring(field length)
Posted by Jonathan Rochkind <ro...@jhu.edu>.
You can turn off norms for the field. It doesn't make any sense to talk
about "changing the length norm". The length norm is based on the size
of the field for the particular document, to implement the TF/IDF style
relevance algorithm. But you can turn off norms for the field if you
don't want TF to be taken into account.
I _think_ if you turn off norms the relevancy will be based purely on
"term count" rather than "term frequency", which is what you're wanting.
But not sure of that, I get confused too thinking about the implications
of all this stuff, but it's something to try/look into. Forget exactly
how you turn off norms, or if there are ways to turn off some kinds of
field norms but not others, but I recall there is definitely a way to do
it on a field-by-field basis (not I think on a query-by-query basis).
On 4/27/2011 8:25 AM, roySolr wrote:
> Thanks!! It's clear now, sometimes the lengthNorm is the same. See the table
> below:
>
> # of terms lengthNorm
> 1 1.0
> 2 .625
> 3 .5
> 4 .5
> 5 .4375
> 6 .375
> 7 .375
> 8 .3125
> 9 .3125
> 10 .3125
>
> Is it possible to change the lengthNorm?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2870206.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
Re: WhitespaceTokenizer and scoring(field length)
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, it is possible to implement your own Lucene Similarity in which you can
override the length norm.
Otis
----- Original Message ----
> From: roySolr <ro...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wed, April 27, 2011 8:25:31 AM
> Subject: Re: WhitespaceTokenizer and scoring(field length)
>
> Thanks!! It's clear now, sometimes the lengthNorm is the same. See the table
> below:
>
> # of terms lengthNorm
> 1 1.0
> 2 .625
> 3 .5
> 4 .5
> 5 .4375
> 6 .375
> 7 .375
> 8 .3125
> 9 .3125
> 10 .3125
>
> Is it possible to change the lengthNorm?
>
> --
> View this message in context:
>http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2870206.html
>
> Sent from the Solr - User mailing list archive at Nabble.com.
>
Re: WhitespaceTokenizer and scoring(field length)
Posted by Ahmet Arslan <io...@yahoo.com>.
> Is it possible to change the lengthNorm?
Yes, you can customize it and plug it into Solr. DefaultSimilarity and SweetSpotSimilarity can be a starting point.
http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/search/Similarity.html#lengthNorm%28java.lang.String,%20int%29
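To get a feel for what a customized length norm could buy, the sketch below compares the default 1/sqrt(numTerms) with a steeper, purely hypothetical 1/numTerms after both have been squeezed into Lucene's one-byte norm. The asStored helper is a shortcut I am assuming is equivalent to the byte encoding for values inside the byte's range (it truncates the float to its exponent plus two mantissa bits); it is not a Lucene API. In a real deployment the function would live in a Similarity subclass registered in schema.xml, and the index would need rebuilding because norms are written at index time.

```java
import java.util.function.IntToDoubleFunction;

final class CustomLengthNormSketch {

    // Approximates what survives the one-byte norm encoding: sign, exponent
    // and the top two mantissa bits are kept, everything below is truncated.
    // Only meaningful for values inside the range the byte can represent.
    static float asStored(float norm) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(norm) & 0xFFE00000);
    }

    // Applies a candidate length-norm function and returns the stored value.
    static float storedNorm(IntToDoubleFunction lengthNorm, int numTerms) {
        return asStored((float) lengthNorm.applyAsDouble(numTerms));
    }

    public static void main(String[] args) {
        IntToDoubleFunction defaultNorm = n -> 1.0 / Math.sqrt(n); // Lucene's default shape
        IntToDoubleFunction steeperNorm = n -> 1.0 / n;            // hypothetical alternative
        System.out.println("terms\tdefault\tsteeper");
        for (int n = 1; n <= 6; n++) {
            System.out.println(n + "\t" + storedNorm(defaultNorm, n)
                    + "\t" + storedNorm(steeperNorm, n));
        }
    }
}
```

With the steeper function, fields of 3 and 4 terms no longer collapse onto the same stored value, which is the case that matters for very short fields. The price is that long fields are penalized much more heavily, so the exact shape is a tuning decision.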
Re: WhitespaceTokenizer and scoring(field length)
Posted by roySolr <ro...@gmail.com>.
Thanks!! It's clear now: sometimes the lengthNorm is the same for
different numbers of terms. See the table below:
# of terms    lengthNorm
1             1.0
2             .625
3             .5
4             .5
5             .4375
6             .375
7             .375
8             .3125
9             .3125
10            .3125
Is it possible to change the lengthNorm?
--
View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2870206.html
Sent from the Solr - User mailing list archive at Nabble.com.
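The table above can be reproduced with a few lines of arithmetic. The sketch assumes the default similarity of that Lucene generation, which computes 1/sqrt(numTerms) and stores the result in a single byte. encodeNorm and decodeNorm are re-implementations modeled on Lucene's SmallFloat.floatToByte315 and byte315ToFloat, written out here so the example runs without Lucene on the classpath; treat them as an approximation of the library code, not the library itself.

```java
final class LengthNormTable {

    // Float -> byte: keep the exponent and the two highest mantissa bits,
    // shifted so that the useful range fits into 8 bits. Lower bits are
    // truncated, not rounded.
    static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1; // overflow maps to the largest value
        }
        return (byte) (small - zero);
    }

    // Byte -> float: the inverse mapping of encodeNorm.
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The value a query actually sees for a field with numTerms tokens.
    static float storedLengthNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return decodeNorm(encodeNorm(exact));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + "\t" + storedLengthNorm(n));
        }
    }
}
```

Because only a couple of mantissa bits survive, neighbouring field lengths such as 3 and 4 terms (or 8, 9 and 10) end up with the same stored norm, which is why those rows in the table are identical.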
Re: WhitespaceTokenizer and scoring(field length)
Posted by Ahmet Arslan <io...@yahoo.com>.
Lucene/Solr's length normalization is not discriminative for very short documents.
See Jay's excellent explanation for more details: http://search-lucene.com/m/uGKRu1pvRjw/
----- Original Message -----
From: roySolr <ro...@gmail.com>
To: solr-user@lucene.apache.org
Cc:
Sent: Wednesday, April 27, 2011 11:28 AM
Subject: Re: WhitespaceTokenizer and scoring(field length)
Re: WhitespaceTokenizer and scoring(field length)
Posted by roySolr <ro...@gmail.com>.
I thought it was something simple. Here is my configuration:
<fieldType name="searchType" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<field name="searchField" type="searchType" indexed="true" stored="true"
       multiValued="true"/>
<copyField source="name" dest="searchField" maxChars="500"/>
<copyField source="storechain" dest="searchField" maxChars="500"/>
<copyField source="related_category" dest="searchField" maxChars="500"/>
I search for "supermarket":
<doc>
  <str name="companyid">357</str>
  <str name="name">LIDL Headoffice</str>
  <arr name="related_category">
    <str>Supermarkt</str>
  </arr>
  <str name="storechain">LIDL</str>
  <arr name="searchField">
    <str>LIDL</str>
    <str>LIDL Headoffice</str>
    <str>Supermarket</str>
  </arr>
</doc>
<doc>
  <str name="companyid">719</str>
  <str name="name">LIDL</str>
  <arr name="related_category">
    <str>Supermarket</str>
  </arr>
  <str name="storechain">LIDL</str>
  <arr name="searchField">
    <str>LIDL</str>
    <str>LIDL</str>
    <str>Supermarket</str>
  </arr>
</doc>
debugQuery:
Both documents have the same score, but doc 357 has more characters in the
searchField.
<lst name="explain">
  <str name="357">
1.4330883 = (MATCH) fieldWeight(searchField:supermarket in 325), product of:
  1.0 = tf(termFreq(searchField:supermarket)=1)
  2.8661766 = idf(docFreq=3194, maxDocs=20651)
  0.5 = fieldNorm(field=searchField, doc=325)
  </str>
  <str name="719">
1.4330883 = (MATCH) fieldWeight(searchField:supermarket in 678), product of:
  1.0 = tf(termFreq(searchField:supermarket)=1)
  2.8661766 = idf(docFreq=3194, maxDocs=20651)
  0.5 = fieldNorm(field=searchField, doc=678)
  </str>
</lst>
--
View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2869546.html
Sent from the Solr - User mailing list archive at Nabble.com.
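The tie in the explain output can be checked by hand. The sketch below multiplies the three factors reported by debugQuery and counts the whitespace tokens that end up in searchField for each document. It assumes that the field length behind the norm is simply the total token count over all values of the multiValued field and that the synonym filter adds no tokens; the class and method names are invented for this example.

```java
final class FieldWeightCheck {

    // fieldWeight as shown in the explain output: tf * idf * fieldNorm.
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    // Total number of whitespace-separated tokens over all field values.
    static int countTokens(String... values) {
        int count = 0;
        for (String value : values) {
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                count += trimmed.split("\\s+").length;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int tokens357 = countTokens("LIDL", "LIDL Headoffice", "Supermarket");
        int tokens719 = countTokens("LIDL", "LIDL", "Supermarket");
        System.out.println("doc 357 tokens: " + tokens357);
        System.out.println("doc 719 tokens: " + tokens719);
        // Same tf, idf and fieldNorm for both documents, hence the same score.
        System.out.println("fieldWeight: " + fieldWeight(1.0f, 2.8661766f, 0.5f));
    }
}
```

Under those assumptions doc 357 has 4 tokens and doc 719 has 3, and both lengths are stored as a fieldNorm of 0.5, so the scores cannot differ. Setting omitNorms="true" would not separate them either: as far as I know the norm factor then becomes 1.0 for every document, so both would still tie, just at a higher score.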
Re: WhitespaceTokenizer and scoring(field length)
Posted by roySolr <ro...@gmail.com>.
I thought it was something simple. Here is my configuration:
<fieldType name="searchType" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<field name="searchField" type="searchType" indexed="true" stored="true"
       multiValued="true"/>
<copyField source="name" dest="searchField" maxChars="500"/>
<copyField source="storechain" dest="searchField" maxChars="500"/>
<copyField source="category" dest="searchField" maxChars="500"/>
<copyField source="related_category" dest="searchField" maxChars="500"/>
<doc>
  <str name="companyid">357</str>
  <str name="name">LIDL Headoffice</str>
  <arr name="related_category">
    <str>Supermarkt</str>
  </arr>
  <str name="storechain">LIDL</str>
  <arr name="searchField">
    <str>LIDL</str>
    <str>LIDL Headoffice</str>
    <str>Supermarket</str>
  </arr>
</doc>
<doc>
  <str name="companyid">719</str>
  <str name="name">LIDL</str>
  <arr name="related_category">
    <str>Supermarket</str>
  </arr>
  <str name="storechain">LIDL</str>
  <arr name="searchField">
    <str>LIDL</str>
    <str>LIDL</str>
    <str>Supermarket</str>
  </arr>
</doc>
<lst name="explain">
  <str name="357">
1.4330883 = (MATCH) fieldWeight(searchField:supermarket in 325), product of:
  1.0 = tf(termFreq(searchField:supermarket)=1)
  2.8661766 = idf(docFreq=3194, maxDocs=20651)
  0.5 = fieldNorm(field=searchField, doc=325)
  </str>
  <str name="719">
1.4330883 = (MATCH) fieldWeight(searchField:supermarket in 678), product of:
  1.0 = tf(termFreq(searchField:supermarket)=1)
  2.8661766 = idf(docFreq=3194, maxDocs=20651)
  0.5 = fieldNorm(field=searchField, doc=678)
  </str>
</lst>
--
View this message in context: http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2869527.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: WhitespaceTokenizer and scoring(field length)
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,
If you run your query with debugQuery=true you will see the explanation about
how Lucene/Solr went about scoring your 2 docs. If you can't figure out what's
going on from there, send the relevant part to the list, along with the parsed
query (which you can also see from debugQuery=true output) and maybe we can
help.
Otis
----- Original Message ----
> From: roySolr <ro...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, April 26, 2011 10:27:44 AM
> Subject: WhitespaceTokenizer and scoring(field length)
>
> Hello,
>
> I have a problem with the WhitespaceTokenizer and scoring. An example:
>
> id Titel
> 1 Manchester united
> 2 Manchester
>
> With the WhitespaceTokenizer, "Manchester united" is split into
> "Manchester" and "united". When
> I search for "manchester", I get ids 1 and 2 in my results. What I want is
> for id 2 to score higher (because its field is shorter).
> How can I fix this?
>
>
> --
> View this message in context:
>http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html
>
> Sent from the Solr - User mailing list archive at Nabble.com.
>