
Posted to solr-user@lucene.apache.org by Erik Fäßler <er...@uni-jena.de> on 2012/03/23 19:40:36 UTC

Field length and scoring

Hello there,

I have a fairly basic question: my Solr is behaving in a way I can't quite explain.

The setup is simple: I have a field "suggestionText" in which single strings are indexed. Schema:

 <field name="suggestionText" type="prefixNGram" indexed="true" stored="true"/>

Since I want this field to serve a suggestion search, the input string is analyzed by an EdgeNGramFilter.

Let's look at two cases:

case1: Input string was 'il2'
case2: Input string was 'il24'

As I can see from the Solr admin analysis page, case1 is analyzed as

i
il
il2

and case2 as

i
il
il2
il24

As you would expect. The point is: when I search for 'il2', I would expect case1 to score higher than case2. I assumed this because I did not omit norms, so the shorter field should get a (slightly) higher score. However, the scores in both cases are identical, so it can happen that 'il24' is suggested ahead of 'il2'.
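
For illustration, the two token lists from the analysis page can be reproduced with a few lines of Python. This is only a sketch: it assumes a minimum gram size of 1 and a maximum gram size at least as long as the input (the thread does not show the "prefixNGram" type definition), and `edge_ngrams` is a hypothetical helper, not a Solr API.

```python
def edge_ngrams(text, min_gram=1, max_gram=None):
    """Return the front-anchored n-grams of `text`, shortest first.

    min_gram / max_gram are assumed settings; the real values live in the
    field type definition, which is not shown in the thread.
    """
    if max_gram is None:
        max_gram = len(text)
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


for value in ("il2", "il24"):
    tokens = edge_ngrams(value)
    # The number of tokens is what the "field length" is based on.
    print(value, tokens, len(tokens))
```

Under these assumptions the two values yield three and four tokens respectively, and both contain the token 'il2' that the query matches.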

Perhaps I have misunderstood norms or the notion of "field length". I would be grateful if you could help me out here and advise me on how to achieve the desired behavior.

Thanks and best regards,

	Erik

Re: Field length and scoring

Posted by Erik Fäßler <er...@uni-jena.de>.

Ahh, that's it - I suspected something like this but couldn't find proper confirmation via Google.

Thank you both for your answers. I guess I will just sort by value length myself.
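
A minimal sketch of such client-side sorting, assuming the results arrive as a list of dictionaries with a "score" and the stored "suggestionText" (the scores below are made up, and `rerank` is a hypothetical helper):

```python
def rerank(suggestions):
    """Order by score (highest first) and break ties by shorter text."""
    return sorted(
        suggestions,
        key=lambda s: (-s["score"], len(s["suggestionText"])),
    )


# Hypothetical hits with identical scores, as observed in the thread.
hits = [
    {"suggestionText": "il24", "score": 1.0},
    {"suggestionText": "il2", "score": 1.0},
]
print([h["suggestionText"] for h in rerank(hits)])
```

Using the length only as a tie-breaker keeps Solr's ranking intact wherever the scores actually differ.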

Only one thing: Erick said my examples would both be one token long. But I rather think they are both one "value" long but three and four tokens long, since the n-gram analyzer splits the values into smaller tokens. And as can be seen from the link given by Ahmet, field lengths of three and four are not distinguished, which is the reason for my observation.

Thanks again and best regards,

Erik

On 24.03.2012, at 00:02, Ahmet Arslan <io...@yahoo.com> wrote:

>> Also, the field length is encoded in a byte (as I remember). So it's
>> quite possible that, even if the lengths of these fields were 3 and 4
>> instead of both being 1, the value stored for the length norms would
>> be the same number.
> 
> Exactly. http://search-lucene.com/m/uGKRu1pvRjw
>

Re: Field length and scoring

Posted by Ahmet Arslan <io...@yahoo.com>.

> Also, the field length is encoded in a byte (as I remember). So it's
> quite possible that, even if the lengths of these fields were 3 and 4
> instead of both being 1, the value stored for the length norms would
> be the same number.

Exactly. http://search-lucene.com/m/uGKRu1pvRjw

Re: Field length and scoring

Posted by Erick Erickson <er...@gmail.com>.

Erik:

The field length is, I believe, based on _tokens_, not characters.
Both of your examples are exactly one token long, so the scores are
probably identical...

Also, the field length is encoded in a byte (as I remember). So it's
quite possible that, even if the lengths of these fields were 3 and 4
instead of both being 1, the value stored for the length norms would
be the same number.
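
The loss of precision can be sketched in Python. This is an illustration under stated assumptions, not Lucene code: it assumes the default length norm of 1/sqrt(number of tokens) with no index-time boost, and a one-byte encoding modeled on Lucene's SmallFloat.floatToByte315. The Python function names are hypothetical.

```python
import math
import struct


def float_to_byte315(f):
    """Compress a float into one byte (modeled on SmallFloat.floatToByte315)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw float32 bits
    small = bits >> 21              # drop the low 21 bits of the float
    fzero = (63 - 15) << 3          # offset of the smallest representable exponent
    if small <= fzero:
        return 0 if bits <= 0 else 1    # zero/negative, or underflow
    if small >= fzero + 0x100:
        return 255                      # overflow: largest byte value
    return small - fzero


def byte315_to_float(b):
    """Expand the byte back into the float that scoring would see."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


for num_tokens in (1, 2, 3, 4):
    norm = 1.0 / math.sqrt(num_tokens)   # assumed default length norm
    encoded = float_to_byte315(norm)
    print(num_tokens, round(norm, 4), encoded, byte315_to_float(encoded))
```

Under these assumptions, field lengths of 3 and 4 tokens map to the same byte and decode to the same norm, so documents that differ only in those lengths cannot be told apart by the length norm.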

HTH
Erick
