Posted to solr-user@lucene.apache.org by ji...@svensktnaringsliv.se on 2016/04/20 16:43:45 UTC

Is it possible to configure a minimum field length for the fieldNorm value?

Hi,

In general I think that the fieldNorm factor in the score calculation is quite good. But when the text is short I think that the effect is too big.

I.e., with two documents that have a short text in the same field, just a few extra characters in one of the documents lower the fieldNorm factor too much. In one test the text in document 1 is 30 characters long and has fieldNorm 0.4375, and in document 2 the text is 37 characters long and has fieldNorm 0.375. That means that the first document gets almost a 20% higher score simply because of the 7 character difference.

What are my options if I want to change this behavior? Can I set a lower character limit, meaning that all fields with a length below this limit get the same fieldNorm value?

I know I can force fieldNorm to be 1 by setting omitNorms="true" for that field, but I would prefer to still have it, just limit its effect on short texts.

Regards
/Jimi
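
A note on the "lower limit" idea above: one way to picture it is a length norm that treats every field shorter than some floor as if it had exactly that many terms. The sketch below is plain Java rather than Lucene API; flooredLengthNorm and minTerms are hypothetical names, and it assumes the 1/sqrt(numTerms) formula of the classic TF-IDF similarity quoted later in this thread, with length counted in terms rather than characters.

```java
final class FlooredLengthNorm {
    /** Classic length norm: 1 / sqrt(number of terms in the field). */
    static float classicLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Hypothetical variant: fields shorter than minTerms are scored as if they had minTerms terms. */
    static float flooredLengthNorm(int numTerms, int minTerms) {
        return classicLengthNorm(Math.max(numTerms, minTerms));
    }

    public static void main(String[] args) {
        int minTerms = 8; // assumed floor, chosen only for illustration
        for (int n = 1; n <= 12; n++) {
            System.out.printf("terms=%2d classic=%.4f floored=%.4f%n",
                    n, classicLengthNorm(n), flooredLengthNorm(n, minTerms));
        }
    }
}
```

With the assumed floor of 8, every field of 1 to 8 terms gets the same raw value, and longer fields fall off as usual. In a real index the value would still pass through the lossy one-byte norm encoding discussed further down.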



RE: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Hi Ahmet,

Yes, I have also come to that conclusion, that I need to do one of those things if I want this function, since Solr/Lucene is lacking in this area. Although after some discussion with my coworkers, we decided to simply disable norms for the title field, and not do anything more, for now. Hopefully all the other boosting logic we use will give a reasonable user experience even without a length norm for the title.

Thanks for your help. :)

/Jimi

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com] 
Sent: Thursday, April 21, 2016 7:10 PM
To: solr-user@lucene.apache.org; Hullegård, Jimi <ji...@svensktnaringsliv.se>
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

Hi Jimi,

Please do either:

1) Write your own similarity that saves the document length (docValues) in a lossless way and implement whatever punishment/algorithm you want.

or

2) Disable norms altogether, add an integer field (title_length), and populate it (outside Solr) with the number of words in the title field. Then use some function query to influence the score, e.g. q=something&boost=someFunctionQuery(title_length)
https://cwiki.apache.org/confluence/display/solr/Function+Queries

Ahmet
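
As a rough illustration of option 2, Solr's recip(x,m,a,b) function computes a/(m*x+b), which falls off gently as x grows. The sketch below only reproduces that arithmetic in plain Java so the shape of the curve can be inspected; the field name title_length, the constants 1,10,10 and the request shown in the comment are assumptions chosen for illustration, not settings taken from this thread.

```java
final class TitleLengthBoostSketch {
    // Hypothetical request this arithmetic mirrors (edismax multiplicative boost):
    //   q=title:john&defType=edismax&boost=recip(title_length,1,10,10)

    /** Same arithmetic as Solr's recip(x,m,a,b): a / (m*x + b). */
    static float recip(float x, float m, float a, float b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        int[] titleLengths = {2, 3, 4, 10, 26}; // word counts, illustrative only
        for (int len : titleLengths) {
            System.out.printf("title_length=%2d boost=%.4f%n", len, recip(len, 1f, 10f, 10f));
        }
    }
}
```

Because the boost is computed from an exact integer field, a one-word difference between two short titles changes the multiplier only slightly, and the constants can be tuned without touching the norms.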



On Thursday, April 21, 2016 9:37 AM, "jimi.hullegard@svensktnaringsliv.se" <ji...@svensktnaringsliv.se> wrote:
Yes, it definitely seems to be the main problem for us. I did some simple tests of the encoding and decoding calculations in DefaultSimilarity, and my findings are:

* For input between 1.0 and 0.5, a difference of 0.01 in the input causes the output to change by a value of 0 or 0.125, depending on whether it is an edge case or not
* For input between 0.5 and 0.25, a difference of 0.01 in the input causes the output to change by a value of 0 or 0.0625
* For input between 0.25 and 0.125, a difference of 0.01 in the input causes the output to change by a value of 0 or 0.03125
* And so on, with smaller and smaller differences in the output value for edge cases
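
The step sizes listed above can be reproduced with a small stand-alone model of the one-byte norm encoding. The sketch below is written from memory of Lucene's SmallFloat.floatToByte315 / byte315ToFloat and is meant as an approximation for experimenting, not as the library's code; the class name NormEncodingSketch and the helper storedNorm are made up for this example, and the raw norm is assumed to be 1/sqrt(numTerms) with no index-time boost.

```java
final class NormEncodingSketch {
    /** Lossy encoding of a float into one byte (assumed to mirror SmallFloat.floatToByte315). */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3); // keeps the exponent and the top two fraction bits
        if (small <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative, or underflow
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    /** Inverse of encode for the values that survive the truncation. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Raw 1/sqrt(numTerms) norm after an encode/decode round trip. */
    static float storedNorm(int numTerms) {
        return decode(encode((float) (1.0 / Math.sqrt(numTerms))));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.printf("terms=%2d raw=%.4f stored=%s%n",
                    n, 1.0 / Math.sqrt(n), storedNorm(n));
        }
    }
}
```

Under these assumptions, 5 terms decode to 0.4375 and 6 terms to 0.375, the two values reported at the start of the thread; their ratio is about 1.17.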

I would say that the main problem is for input values between 1.0 and 0.5. So if one could tweak the SweetSpotSimilarity to start its "raw" (i.e. not encoded) lengthNorm values at 0.5 instead of 1.0, it would solve my problem for the title field. This would of course worsen the precision for longer text values, but since this is a title field that is not a problem.

So, is there a way to configure SweetSpotSimilarity to use 0.5 as its highest lengthNorm value, instead of 1.0?

/Jimi
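
To see what a lower ceiling would do to the raw values, the sweet-spot curve can be modelled outside Lucene. The formula below follows SweetSpotSimilarity's length norm as I understand it, 1/sqrt(steepness * (|n - min| + |n - max| - (max - min)) + 1); the ceiling multiplier is hypothetical and is not an existing SweetSpotSimilarity setting as far as I know. The values min=1, max=2, steepness=0.1 are the ones mentioned elsewhere in this thread.

```java
final class SweetSpotCurveSketch {
    /**
     * Raw (not yet encoded) length norm on a sweet-spot curve, scaled by a
     * hypothetical ceiling. Inside [min, max] the penalty is zero, so the
     * result equals the ceiling; outside it decays with the distance.
     */
    static float lengthNorm(int numTerms, int min, int max, float steepness, float ceiling) {
        double penalty = steepness
                * (Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min));
        return (float) (ceiling / Math.sqrt(penalty + 1.0));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.printf("terms=%2d ceiling1.0=%.4f ceiling0.5=%.4f%n",
                    n,
                    lengthNorm(n, 1, 2, 0.1f, 1.0f),
                    lengthNorm(n, 1, 2, 0.1f, 0.5f));
        }
    }
}
```

Scaling by a constant leaves the ratio between two documents' raw values unchanged, so whether the stored values end up closer together depends on how the one-byte encoding rounds them; it is worth checking the encoded values before relying on this.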

________________________________________

From: Ahmet Arslan <io...@yahoo.com.INVALID>
Sent: Thursday, April 21, 2016 2:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

Hi Jimi,

The fieldNorm encode/decode step causes some precision loss.
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.

ahmet



On Thursday, April 21, 2016 3:10 AM, "jimi.hullegard@svensktnaringsliv.se" <ji...@svensktnaringsliv.se> wrote:
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like to have the length of the title field influence the score, so that matching documents with shorter titles get a higher score than documents with longer titles, all else being equal.

So, when a user searches for "John", I would like the results to be pretty much in the order presented above. Though, it is not crucial that for example document 1 comes before document 2. But I would surely want documents 1-3 to come before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. In practice, the encoding of the fieldNorm seems to make this function much less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a general boost on documents with short titles, I only want to boost them if the title field actually matched the query.
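
For the five example titles, the raw (unencoded) length norm already gives the wanted order. The sketch below assumes a plain whitespace split as the tokenizer and the classic 1/sqrt(numTerms) formula; a real analyzer chain (stop words, punctuation handling) may count terms differently, and the class and method names are made up for this illustration.

```java
final class TitleNormSketch {
    /** Number of whitespace-separated tokens (assumed stand-in for the analyzer's term count). */
    static int countTerms(String title) {
        String trimmed = title.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Raw length norm 1 / sqrt(numTerms); 0 for an empty title. */
    static float rawLengthNorm(String title) {
        int n = countTerms(title);
        return n == 0 ? 0.0f : (float) (1.0 / Math.sqrt(n));
    }

    public static void main(String[] args) {
        String[] titles = {
            "John Doe",
            "John Doe Jr.",
            "John Lennon: The Life",
            "John Thompson's Modern Course for the Piano: First Grade Book",
            "I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of "
                + "Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt"
        };
        for (String title : titles) {
            System.out.printf("terms=%2d rawNorm=%.4f%n", countTerms(title), rawLengthNorm(title));
        }
    }
}
```

The raw values separate documents 1-3 clearly from documents 4 and 5; the question raised in this thread is how much of that separation survives once the norm is stored in a single byte.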

/Jimi

________________________________________

From: Jack Krupansky <ja...@gmail.com>
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I mean, traditionally length normalization has simply tried to distinguish a title field (rarely more than a dozen words) from a full body of text, or maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a feature-length article, paper, or even book. IOW, traditionally it was more of a boolean than a broad range of values. Sure, yes, you absolutely can define a custom similarity with a custom norm that supports a wide range of lengths, but you'll have to decide what you really want to achieve in order to tune it.

Maybe you could give a couple examples of field values that you feel should be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, <ji...@svensktnaringsliv.se>
wrote:

> I am talking about the title field. And for the title field, a 
> sweetspot interval of 1 to 50 makes very little sense. I want to have 
> a fieldNorm value that differentiates between for example 2, 3, 4 and 
> 5 terms in the title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than
> the other title. And one fieldNorm value was 20% larger than the other
> as a result of that. And since we use a multiplicative scoring
> calculation, a 20% increase in the fieldNorm results in a 20% increase
> in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that
> this minor change in the text data (adding or removing one single
> word) causes the score to change by almost 20%. I noted this when I
> renamed a document, removing a word from the title, and that single
> change caused the document to move up several positions in the result
> list. We don't want such minor modifications to have such a big impact
> on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it impact our results
> in such a big way? And as I said, we don't want to disable it
> completely, we just want it to have a much lesser effect, even on
> really short texts.
>
> /Jimi
>
> ________________________________________
> From: Ahmet Arslan <io...@yahoo.com.INVALID>
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for 
> the fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way, you need to reindex every time you change something.
>
> Regarding the 20% score change, I am not sure how you calculated that
> number, and I assume it is correct.
> What really matters is the relative order of documents. It doesn't
> mean anything that the addition of a word decreases the initial score
> by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that the
> addition of a non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short
> documents too much. But folks blend the score with other structural
> fields (popularity), or even completely bypass the relevancy score and
> order by price, production date etc. I mean there are many use cases
> where the effect of the document length normalization factor is minimal.
>
> Lucene/Solr is highly pluggable, very easy to customize.
>
> Ahmet
>
>
> On Wednesday, April 20, 2016 11:05 PM, "
> jimi.hullegard@svensktnaringsliv.se" 
> <ji...@svensktnaringsliv.se>
> wrote:
> Hi Ahmet,
>
> SweetSpotSimilarity seems quite nice. Some simple testing by throwing
> some different values at the class gives quite good results. Setting
> ln_min=1, ln_max=2, steepness=0.1 and discountOverlaps=true should
> give me more or less what I want. At least for the title field. I'm
> not sure what the actual effect of those settings would be on longer
> text fields, so maybe I will use the SweetSpotSimilarity only for the
> title field to start with.
>
> Of course I understand that there are many things that can be
> considered domain specific requirements, like whether to favor/punish
> short/medium/long texts, and how. I was just wondering how many actual
> use cases there are where one wants a ~20% difference in score
> between two documents, where the only difference is that one of the
> documents has one extra word in one field. (And now I'm talking about
> an extra word that doesn't affect anything else except the fieldNorm
> value). I for one find it hard to find such a use case, and would
> consider it a very special use case, and would consider a more lenient
> calculation a better fit for most use cases (and therefore most
> domains). :)
>
> /Jimi
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID]
> Sent: Wednesday, April 20, 2016 8:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for 
> the fieldNorm value?
>
> Hi Jimi,
>
> SweetSpotSimilarity allows you to define a document length range, so
> that all documents in that range will get the same fieldNorm value.
> In your case, you can say that from 1 word up to 100 words do not
> employ document length punishment. If a document is longer than 100
> words, do some punishment.
>
> By the way, favoring/punishing short, middle, or long documents is
> a domain-specific thing. You are free to decide what to do.
>
> Ahmet
>
>
>
> On Wednesday, April 20, 2016 7:46 PM, "jimi.hullegard@svensktnaringsliv.se"
> <ji...@svensktnaringsliv.se> wrote:
> OK. Well, still, the fact that the score changes by almost 20% because
> of just one extra term in the field is not really reasonable if you ask me.
> But you seem to say that this is expected, reasonable and wanted
> behavior for most use cases?
>
> I'm not sure that I feel comfortable replacing the default Similarity 
> implementation with a custom one. That would just increase the 
> complexity of our setup and would make future upgrades harder (we 
> would for example have to remember to check if the default similarity 
> configuration or implementation changes).
>
> No, if it really is the case that most people like and want this, and 
> there is no way to configure Solr/Lucene to calculate fieldNorm in a 
> more reasonable way (in my book) for short field values, then I just 
> think we are forced to set omitNorms="true", maybe in combination with 
> a simple field boost for shorter fields.
>
> /Jimi
>
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupansky@gmail.com]
> Sent: Wednesday, April 20, 2016 5:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for 
> the fieldNorm value?
>
> FWIW, length for normalization is measured in terms (tokens), not 
> characters.
>
> With TF-IDF similarity (the default before 6.0), the normalization is
> based on the inverse square root of the number of terms in the field:
>
> return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
>
> That code is in ClassicSimilarity:
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
>
> You can always write your own custom Similarity class to override that 
> calculation.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 10:43 AM, 
> <ji...@svensktnaringsliv.se>
> wrote:
>
> > Hi,
> >
> > In general I think that the fieldNorm factor in the score
> > calculation is quite good. But when the text is short I think that
> > the effect is too big.
> >
> > I.e., with two documents that have a short text in the same field,
> > just a few extra characters in one of the documents lower the
> > fieldNorm factor too much.
> > In one test the text in document 1 is 30 characters long and has
> > fieldNorm 0.4375, and in document 2 the text is 37 characters long
> > and has fieldNorm 0.375. That means that the first document gets
> > almost a 20% higher score simply because of the 7 character difference.
> >
> > What are my options if I want to change this behavior? Can I set a
> > lower character limit, meaning that all fields with a length below
> > this limit get the same fieldNorm value?
> >
> > I know I can force fieldNorm to be 1 by setting omitNorms="true" for 
> > that field, but I would prefer to still have it, just limit its 
> > effect on short texts.
> >
> > Regards
> > /Jimi
> >
> >
> >
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Jimi,

Please do either:

1) Write your own similarity that saves the document length (docValues) in a lossless way and implement whatever punishment/algorithm you want.

or

2) Disable norms altogether, add an integer field (title_length), and populate it (outside Solr) with the number of words in the title field. Then use some function query to influence the score, e.g. q=something&boost=someFunctionQuery(title_length)
https://cwiki.apache.org/confluence/display/solr/Function+Queries

Ahmet




Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Yes, it definitely seems to be the main problem for us. I did some simple tests of the encoding and decoding calculations in DefaultSimilarity, and my findings are:

* For input between 1.0 and 0.5, a difference of 0.01 in the input causes the output to change by a value of 0 or 0.125, depending on whether it is an edge case or not
* For input between 0.5 and 0.25, a difference of 0.01 in the input causes the output to change by a value of 0 or 0.0625
* For input between 0.25 and 0.125, a difference of 0.01 in the input causes the output to change by a value of 0 or 0.03125
* And so on, with smaller and smaller differences in the output value for edge cases

I would say that the main problem is for input values between 1.0 and 0.5. So if one could tweak the SweetSpotSimilarity to start its "raw" (i.e. not encoded) lengthNorm values at 0.5 instead of 1.0, it would solve my problem for the title field. This would of course worsen the precision for longer text values, but since this is a title field that is not a problem.

So, is there a way to configure SweetSpotSimilarity to use 0.5 as its highest lengthNorm value, instead of 1.0?

/Jimi

________________________________________
From: Ahmet Arslan <io...@yahoo.com.INVALID>
Sent: Thursday, April 21, 2016 2:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

Hi Jim,

fieldNorm encode/decode thing cause some precision loss.
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.

ahmet



On Thursday, April 21, 2016 3:10 AM, "jimi.hullegard@svensktnaringsliv.se" <ji...@svensktnaringsliv.se> wrote:
Ok sure, I can try and give some examples :)

Lets say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like to have the length of the title field influence the score, so that matching documents with shorter title get a higher score than documents with longer title, all else considered equal.

So, when a user searches for "John", I would like the results to be pretty much in the order presented above. Though, it is not crucial that for example document 1 comes before document 2. But I would surely want document 1-3 to come before document 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. In practice, the encoding of the fieldNorm seems to make this function much less useful for this use case. Unless I have missed something.

Is there another way to achive something like this? Note that I don't want a general boost on documents with short titles, I only want to boost them if the title field actually matched the query.

/Jimi

________________________________________

From: Jack Krupansky <ja...@gmail.com>
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want  to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, <ji...@svensktnaringsliv.se>
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger then the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title, and that single change caused the
> document to move up several positions in the result list. We don't want
> such minor modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it impact our result in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> ________________________________________
> From: Ahmet Arslan <io...@yahoo.com.INVALID>
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way you need to reindex every time you change something.
>
> Regarding 20% score change, I am not sure how you calculated that number
> and I assume it is correct.
> What really matters is the relative order of documents. It doesn't mean
> anything that the addition of a word decreases the initial score by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that addition of a
> non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents
> too much. But folks blend the score with other structural fields (popularity),
> or even completely bypass the relevancy score and order by price, production
> date etc. I mean there are many use cases where the effect of the document
> length normalization factor is minimal.
>
> Lucene/Solr is highly pluggable, very easy to customize.
>
> Ahmet
>
>
> On Wednesday, April 20, 2016 11:05 PM, "
> jimi.hullegard@svensktnaringsliv.se" <ji...@svensktnaringsliv.se>
> wrote:
> Hi Ahmet,
>
> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some
> different values at the class gives quite good results. Setting ln_min=1,
> ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or
> less what I want. At least for the title field. I'm not sure what the
> actual effect of those settings would be on longer text fields, so maybe I
> will use the SweetSpotSimilarity only for the title field to start with.
>
> Of course I understand that there are many things that can be considered
> domain specific requirements, like if to favor/punish short/medium/long
> texts, and how. I was just wondering how many actual use cases there are
> where one wants a ~20% difference in score between two documents, where
> the only difference is that one of the documents has one extra word in one
> field. (And now I'm talking about an extra word that doesn't affect
> anything else except the fieldNorm value). I for one find it hard to find
> such a use case, and would consider it a very special use case, and would
> consider a more lenient calculation a better fit for most use cases (and
> therefore most domains). :)
>
> /Jimi
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID]
> Sent: Wednesday, April 20, 2016 8:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> SweetSpotSimilarity allows you to define a document length range, so that all
> documents in that range will get the same fieldNorm value.
> In your case, you can say that from 1 word up to 100 words do not employ
> document length punishment. If a document is longer than 100 do some
> punishment.
>
> By the way, favoring/punishing short, middle, or long documents is a
> domain-specific thing. You are free to decide what to do.
>
> Ahmet
>
>
>
> On Wednesday, April 20, 2016 7:46 PM, "jimi.hullegard@svensktnaringsliv.se"
> <ji...@svensktnaringsliv.se> wrote:
> OK. Well, still, the fact that the score increases almost 20% because of
> just one extra term in the field, is not really reasonable if you ask me.
> But you seem to say that this is expected, reasonable and wanted behavior
> for most use cases?
>
> I'm not sure that I feel comfortable replacing the default Similarity
> implementation with a custom one. That would just increase the complexity
> of our setup and would make future upgrades harder (we would for example
> have to remember to check if the default similarity configuration or
> implementation changes).
>
> No, if it really is the case that most people like and want this, and
> there is no way to configure Solr/Lucene to calculate fieldNorm in a more
> reasonable way (in my book) for short field values, then I just think we
> are forced to set omitNorms="true", maybe in combination with a simple
> field boost for shorter fields.
>
> /Jimi
>
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupansky@gmail.com]
> Sent: Wednesday, April 20, 2016 5:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> FWIW, length for normalization is measured in terms (tokens), not
> characters.
>
> With TF-IDF similarity (the default before 6.0), the normalization is based
> on the inverse square root of the number of terms in the field:
>
> return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
>
> That code is in ClassicSimilarity:
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
>
> You can always write your own custom Similarity class to override that
> calculation.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 10:43 AM, <ji...@svensktnaringsliv.se>
> wrote:
>
> > Hi,
> >
> > In general I think that the fieldNorm factor in the score calculation
> > is quite good. But when the text is short I think that the effect is too
> > big.
> >
> > I.e. with two documents that have a short text in the same field, just a
> > few characters extra in one of the documents lower the fieldNorm factor
> > too much. In one test the text in document 1 is 30 characters long and
> > has fieldNorm 0.4375, and in document 2 the text is 37 characters long
> > and has fieldNorm 0.375. That means that the first document gets almost
> > a 20% higher score simply because of the 7 character difference.
> >
> > What are my options if I want to change this behavior? Can I set a
> > lower character limit, meaning that all fields with a length below
> > this limit get the same fieldNorm value?
> >
> > I know I can force fieldNorm to be 1 by setting omitNorms="true" for
> > that field, but I would prefer to still have it, just limit its effect
> > on short texts.
> >
> > Regards
> > /Jimi
> >
> >
> >
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Jimi,

The fieldNorm encode/decode step (the norm is stored in a single byte) causes some precision loss.
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.

ahmet
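
Editor's sketch of the precision loss mentioned above. The code below re-implements, from memory, the single-byte float encoding that Lucene 5.x applies to norms (SmallFloat.floatToByte315 and byte315ToFloat); the class and method names here are made up for illustration and the bit manipulation should be treated as an assumption to be checked against the Lucene source. It applies that round trip to the 1/sqrt(numTerms) norm quoted in this thread, without any boost.

```java
// Illustrative re-implementation; names are hypothetical, not Lucene's API.
public final class NormQuantizationSketch {
    private static final int MANTISSA_BITS = 3;
    private static final int ZERO_EXPONENT = 15;
    // Offset between the float exponent origin and the byte exponent origin.
    private static final int FZERO = (63 - ZERO_EXPONENT) << MANTISSA_BITS;

    private NormQuantizationSketch() {}

    /** Squeezes a float into one byte, truncating most of the mantissa. */
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= FZERO) {
            // Zero and negatives map to 0; underflow maps to the smallest value.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1; // overflow maps to the largest value
        }
        return (byte) (small - FZERO);
    }

    /** Expands the byte back into a float; the truncated bits are gone. */
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - ZERO_EXPONENT) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** The norm as it would come back after the encode/decode round trip. */
    public static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(exact));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("terms=%2d  exact=%.4f  stored=%.4f%n",
                    n, 1.0 / Math.sqrt(n), storedNorm(n));
        }
    }
}
```

Under these assumptions a five-term field comes back as 0.4375 and six- or seven-term fields come back as 0.375, which matches the two fieldNorm values reported in the original post: the coarse steps, not the square root itself, turn one extra word into a jump of roughly 17%.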



On Thursday, April 21, 2016 3:10 AM, "jimi.hullegard@svensktnaringsliv.se" <ji...@svensktnaringsliv.se> wrote:
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like to have the length of the title field influence the score, so that matching documents with shorter titles get a higher score than documents with longer titles, all else considered equal.

So, when a user searches for "John", I would like the results to be pretty much in the order presented above. Though, it is not crucial that for example document 1 comes before document 2. But I would surely want documents 1-3 to come before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. In practice, the encoding of the fieldNorm seems to make this function much less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a general boost on documents with short titles; I only want to boost them if the title field actually matched the query.

/Jimi


Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Yes, we use edismax per-field boosting, with explicit boosting of the title field. So it sure makes length normalization less relevant. But not *completely* irrelevant, which is why I still want to have it as part of the scoring, just with much less impact than it currently has.

/Jimi
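
Editor's sketch of a length norm "with much less impact", using the SweetSpotSimilarity settings mentioned earlier in the thread (ln_min=1, ln_max=2, steepness=0.1). The formula is written from memory of Lucene's SweetSpotSimilarity.computeLengthNorm and should be treated as an assumption; the class and method names below are hypothetical, and the values shown are before any norm encoding.

```java
// Illustrative only: mirrors the sweet-spot formula, not the Lucene class itself.
public final class SweetSpotNormSketch {
    private SweetSpotNormSketch() {}

    /**
     * Returns 1.0 for term counts inside [min, max] and falls off outside
     * that plateau at a rate controlled by steepness.
     */
    public static float lengthNorm(int numTerms, int min, int max, float steepness) {
        // Distance from the plateau: 0 inside [min, max], growing outside it.
        int distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("terms=%2d  classic=%.4f  sweetspot=%.4f%n",
                    n, 1.0 / Math.sqrt(n), lengthNorm(n, 1, 2, 0.1f));
        }
    }
}
```

With these parameters the norm stays at 1.0 for one and two terms and only drops to roughly 0.91 for three, compared with about 0.58 for the classic formula, so short titles still differ but far less sharply.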
________________________________________
From: Jack Krupansky <ja...@gmail.com>
Sent: Thursday, April 21, 2016 4:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

Or should this be rated higher for NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky <ja...@gmail.com> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for
> > "John", any of those titles would be treated as anything other than
> > equals - namely, that they are all about John. Maybe the issue is that
> > this seems like a contrived example, and I'm asking for a realistic
> > example. Or, maybe you have some rule of relevance that you haven't yet
> > shared - and I mean a rule that a user would comprehend and consider
> > valuable, not simply a mechanical rule.
> >
> >
> >
> > -- Jack Krupansky
> >
>
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Jack Krupansky <ja...@gmail.com>.
Or should this be rated higher for NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner <
> http://dvd.netflix.com/Search?v1=blade+runner>
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Walter Underwood <wu...@wunderwood.org>.
Sure, here are some real world examples from my time at Netflix.

Is this movie twice as much about “new york”?

* New York, New York

Which one of these is the best match for “blade runner”:

* Blade Runner: The Final Cut
* Blade Runner: Theatrical & Director’s Cut
* Blade Runner: Workprint

http://dvd.netflix.com/Search?v1=blade+runner <http://dvd.netflix.com/Search?v1=blade+runner>

At Netflix (when I was there), those were shown in popularity order with a boost function.

And for stemming, should the movie “Saw” match “see”? Maybe not.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 20, 2016, at 5:28 PM, Jack Krupansky <ja...@gmail.com> wrote:
> 
> Maybe it's a cultural difference, but I can't imagine why on a query for
> "John", any of those titles would be treated as anything other than equals
> - namely, that they are all about John. Maybe the issue is that this seems
> like a contrived example, and I'm asking for a realistic example. Or, maybe
> you have some rule of relevance that you haven't yet shared - and I mean
> rule that a user would comprehend and consider valuable, not simply a
> mechanical rule.
> 
> 
> 
> -- Jack Krupansky


Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Yes, the example was contrived. Partly because our documents are mostly in Swedish, but mainly because I thought the example should be simple enough to focus on the thing being discussed (even though I simplified it to such a degree that I left out the current main problem with the fieldNorm: the values are too coarse when encoded). And we do have titles whose lengths vary from 2 words to about 30 words.

For me it makes perfect sense to have the shorter titles come up first in this example. It is basically the tf–idf principle. It is more likely that the document titled "John Doe" focuses on "John" than it is for the document titled "I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt".
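
To put rough numbers on that, here is a minimal sketch that applies the 1/sqrt(numTerms) length norm quoted earlier in this thread to the shortest and the longest example title. The class and method names are made up for illustration, the boost is assumed to be 1, and splitting on whitespace is only a stand-in for whatever analyzer the field really uses, so the actual term counts may differ.

```java
// Illustration only: names are hypothetical and whitespace splitting
// stands in for a real Solr/Lucene analyzer.
final class TitleLengthNormSketch {

    // Naive term count: number of whitespace-separated tokens.
    static int countTerms(String title) {
        String trimmed = title.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Unencoded length norm 1 / sqrt(numTerms), with the field boost taken as 1.
    // Only meaningful for numTerms >= 1.
    static float rawLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        String[] titles = {
            "John Doe",
            "I Rode With Stonewall: Being Chiefly The War Experiences of the "
                + "Youngest Member of Jackson's Staff from John Brown's Raid "
                + "to the Hanging of Mrs. Surratt"
        };
        for (String title : titles) {
            int n = countTerms(title);
            System.out.println(n + " terms -> raw norm " + rawLengthNorm(n));
        }
    }
}
```

With this naive tokenization the short title has 2 terms and the long one 26, so the raw norms are roughly 0.71 and 0.20 before any encoding is applied.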

Now, having said that, I never said that the title length should have a *big* impact on the score. In fact, this is the main problem I'm trying to solve. I want the impact to be very, very small. Basically I want this factor to only *nudge* the document score. I want it to work in such a way that, if one first considers the score without this factor, only when two documents have scores quite close to each other should this factor have any real effect on the resulting order in the search results. That could be achieved if the fieldNorm only changed, for example, from 0.79 to 0.74, like the resulting values from SweetSpotSimilarity for two example documents I tested. But when these values are encoded and decoded, they become 0.75 and 0.625, causing a much bigger impact on the final score.
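
The coarseness can be reproduced outside Solr. The sketch below is an assumption-laden illustration, not the library code: the class and method names are made up, `sweetSpotNorm` follows the length-norm formula as I understand it from the SweetSpotSimilarity documentation, and `encode`/`decode` re-implement the single-byte norm encoding (SmallFloat's "315" format) as I read it in the 5.x sources. Please check both against the Lucene version you actually run. The term counts 5 and 6 are also a guess that happens to reproduce the 0.79 and 0.74 mentioned above with ln_min=1, ln_max=2, steepness=0.1.

```java
// Illustration only: hypothetical names, formulas re-implemented from memory
// of the Lucene 5.x documentation and sources.
final class NormEncodingSketch {

    // Assumed SweetSpotSimilarity length norm:
    // 1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1)
    static float sweetSpotNorm(int numTerms, int min, int max, float steepness) {
        double penalty = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * penalty + 1.0));
    }

    // Assumed single-byte encoding of a norm: keeps only the high-order
    // bits of the float and truncates the rest.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1; // overflow: largest representable value
        }
        return (byte) (small - zero);
    }

    // Inverse of encode(): expands the byte back into a float.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // What a norm looks like after being stored in the index and read back.
    static float roundTrip(float f) {
        return decode(encode(f));
    }

    public static void main(String[] args) {
        for (int numTerms = 5; numTerms <= 6; numTerms++) {
            float raw = sweetSpotNorm(numTerms, 1, 2, 0.1f);
            System.out.println(numTerms + " terms: raw " + raw
                + ", stored " + roundTrip(raw));
        }
    }
}
```

Under these assumptions the encoding leaves only four representable values between 0.5 and 1.0 (0.5, 0.625, 0.75, 0.875), so a raw difference of about 6% (0.79 vs 0.74) turns into a 20% difference (0.75 vs 0.625) once the norm has been stored. The same truncation would explain the 0.4375 and 0.375 from the original post, which are what 1/sqrt(5) and 1/sqrt(7) round down to.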

/Jimi
________________________________________
From: Jack Krupansky <ja...@gmail.com>
Sent: Thursday, April 21, 2016 2:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.



-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, <ji...@svensktnaringsliv.se>
wrote:

> Ok sure, I can try and give some examples :)
>
> Let's say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like to have
> the length of the title field influence the score, so that matching
> documents with shorter title get a higher score than documents with longer
> title, all else considered equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. Though, it is not crucial that for
> example document 1 comes before document 2. But I would surely want
> document 1-3 to come before document 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles, I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>
> ________________________________________
> From: Jack Krupansky <ja...@gmail.com>
> Sent: Thursday, April 21, 2016 1:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> I'm not sure I fully follow what distinction you're trying to focus on. I
> mean, traditionally length normalization has simply tried to distinguish a
> title field (rarely more than a dozen words) from a full body of text, or
> maybe an abstract, not things like exactly how many words were in a title.
> Or, as another example, a short newswire article of a few paragraphs vs. a
> feature-length article, paper, or even book. IOW, traditionally it was more
> of a boolean than a broad range of values. Sure, yes, you absolutely can
> define a custom similarity with a custom norm that supports a wide range of
> lengths, but you'll have to decide what you really want  to achieve to tune
> it.
>
> Maybe you could give a couple examples of field values that you feel should
> be scored differently based on length.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 7:17 PM, <ji...@svensktnaringsliv.se>
> wrote:
>
> > I am talking about the title field. And for the title field, a sweetspot
> > interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> > value that differentiates between for example 2, 3, 4 and 5 terms in the
> > title, but only very little.
> >
> > The 20% number I got by simply calculating the difference in the title
> > fieldNorm of two documents, where one title was one word longer than the
> > other title. And one fieldNorm value was 20% larger than the other as a
> > result of that. And since we use multiplicative scoring calculation, a
> 20%
> > increase in the fieldNorm results in a 20% increase in the final score.
> >
> > I'm not talking about "scores as percentages". I'm simply noting that
> this
> > minor change in the text data (adding or removing one single word) causes
> > the score to change by almost 20%. I noted this when I renamed a
> > document, removing a word from the title, and that single change caused
> the
> > document to move up several positions in the result list. We don't want
> > such minor modifications to have such a big impact on the resulting score.
> >
> > I'm not sure I can agree with you that "the effect of document length
> > normalization factor is minimal". Then why does it impact our result in
> > such a big way? And as I said, we don't want to disable it completely, we
> > just want it to have a much lesser effect, even on really short texts.
> >
> > /Jimi
> >
> > ________________________________________
> > From: Ahmet Arslan <io...@yahoo.com.INVALID>
> > Sent: Thursday, April 21, 2016 12:10 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > Hi Jimi,
> >
> > Please define a meaningful document-length range like min=1 max=50.
> > By the way you need to reindex every time you change something.
> >
> > Regarding 20% score change, I am not sure how you calculated that number
> > and I assume it is correct.
> > What really matters is the relative order of documents. It doesn't mean
> > anything addition of a word decreases the initial score by x%. Please
> see :
> > https://wiki.apache.org/lucene-java/ScoresAsPercentages
> >
> > There is an information retrieval heuristic which says that addition of a
> > non-query term should decrease the score.
> >
> > Lucene's default document length normalization may favor short document
> > too much. But folks blend score with other structural fields
> (popularity),
> > even completely bypass relevancy score and order by price, production
> date
> > etc. I mean there are many use cases, the effect of document length
> > normalization factor is minimal.
> >
> > Lucene/Solr is highly pluggable, very easy to customize.
> >
> > Ahmet
> >
> >
> > On Wednesday, April 20, 2016 11:05 PM, "
> > jimi.hullegard@svensktnaringsliv.se" <
> jimi.hullegard@svensktnaringsliv.se>
> > wrote:
> > Hi Ahmet,
> >
> > SweetSpotSimilarity seems quite nice. Some simple testing by throwing
> some
> > different values at the class gives quite good results. Setting ln_min=1,
> > ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or
> > less what I want. At least for the title field. I'm not sure what the
> > actual effect of those settings would be on longer text fields, so maybe
> I
> > will use the SweetSpotSimilarity only for the title field to start with.
> >
> > Of course I understand that there are many things that can be considered
> > domain specific requirements, like if to favor/punish short/medium/long
> > texts, and how. I was just wondering how many actual use cases there are
> > where one want's a ~20% difference in score between two documents, where
> > the only difference is that one of the documents has one extra word in
> one
> > field. (And now I'm talking about an extra word that doesn't affect
> > anything else except the fieldNorm value). I for one find it hard to find
> > such a use case, and would consider it a very special use case, and would
> > consider a more lenient calculation a better fit for most use cases (and
> > therefore most domains). :)
> >
> > /Jimi
> >
> >
> > -----Original Message-----
> > From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID]
> > Sent: Wednesday, April 20, 2016 8:14 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > Hi Jimi,
> >
> > SweetSpotSimilarity allows you define a document length range, so that
> all
> > documents in that range will get same fieldNorm value.
> > In your case, you can say that from 1 word up to 100 words do not employ
> > document length punishment. If a document is longer than 100 do some
> > punishment.
> >
> > By the way; favoring/punishing  short, middle, or long documents is
> domain
> > specific thing. You are free to decide what to do.
> >
> > Ahmet
> >
> >
> >
> > On Wednesday, April 20, 2016 7:46 PM, "
> jimi.hullegard@svensktnaringsliv.se"
> > <ji...@svensktnaringsliv.se> wrote:
> > OK. Well, still, the fact that the score increases almost 20% because of
> > just one extra term in the field, is not really reasonable if you ask me.
> > But you seem to say that this is expected, reasonable and wanted behavior
> > for most use case?
> >
> > I'm not sure that I feel comfortable replacing the default Similarity
> > implementation with a custom one. That would just increase the complexity
> > of our setup and would make future upgrades harder (we would for example
> > have to remember to check if the default similarity configuration or
> > implementation changes).
> >
> > No, if it really is the case that most people like and want this, and
> > there is no way to configure Solr/Lucene to calculate fieldNorm in a more
> > reasonable way (in my book) for short field values, then I just think we
> > are forced to set omitNorms="true", maybe in combination with a simple
> > field boost for shorter fields.
> >
> > /Jimi
> >
> >
> >
> > -----Original Message-----
> > From: Jack Krupansky [mailto:jack.krupansky@gmail.com]
> > Sent: Wednesday, April 20, 2016 5:18 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > FWIW, length for normalization is measured in terms (tokens), not
> > characters.
> >
> > With TDIFS similarity (the default before 6.0), the normalization is
> based
> > on the inverse square root of the number of terms in the field:
> >
> > return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
> >
> > That code is in ClassicSimilarity:
> >
> >
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
> >
> > You can always write your own custom Similarity class to override that
> > calculation.
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 10:43 AM, <ji...@svensktnaringsliv.se>
> > wrote:
> >
> > > Hi,
> > >
> > > In general I think that the fieldNorm factor in the score calculation
> > > is quite good. But when the text is short I think that the effect is
> two
> > big.
> > >
> > > Ie with two documents that have a short text in the same field, just a
> > > few characters extra in of the documents lower the fieldNorm factor too
> > much.
> > > In one test the text in document 1 is 30 characters long and has
> > > fieldNorm 0.4375, and in document 2 the text is 37 characters long and
> > > has fieldNorm 0.375. That means that the first document gets almost a
> > > 20% higher score simply because of the 7 character difference.
> > >
> > > What are my options if I want to change this behavior? Can I set a
> > > lower character limit, meaning that all fields with a length below
> > > this limit gets the same fieldNorm value?
> > >
> > > I know I can force fieldNorm to be 1 by setting omitNorms="true" for
> > > that field, but I would prefer to still have it, just limit its effect
> > > on short texts.
> > >
> > > Regards
> > > /Jimi
> > >
> > >
> > >
> >
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Jack Krupansky <ja...@gmail.com>.
Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.



-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, <ji...@svensktnaringsliv.se>
wrote:

> Ok sure, I can try and give some examples :)
>
> Let's say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like to have
> the length of the title field influence the score, so that matching
> documents with shorter titles get a higher score than documents with longer
> titles, all else being equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. It is not crucial that, for example,
> document 1 comes before document 2, but I would certainly want documents
> 1-3 to come before documents 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles, I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like to have the length of the title field influence the score, so that matching documents with shorter titles get a higher score than documents with longer titles, all else being equal.

So, when a user searches for "John", I would like the results to be pretty much in the order presented above. It is not crucial that, for example, document 1 comes before document 2, but I would certainly want documents 1-3 to come before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. In practice, the encoding of the fieldNorm seems to make this function much less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a general boost on documents with short titles; I only want to boost them if the title field actually matched the query.

/Jimi
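
To make the encoding concern concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class FieldNormSketch and its methods are hypothetical, terms are counted by splitting on whitespace (a real analyzer may tokenize differently), and quantize() is only a simplified model of the one-byte norm storage, assuming the stored value is truncated to four steps per power of two. Under that assumption, 5 and 6 terms come out as 0.4375 and 0.375, the two values reported at the start of this thread.

```java
// Simplified, hypothetical model of length normalization for short titles.
// Not Lucene code: tokenization is plain whitespace splitting, and the
// quantization only approximates a one-byte norm encoding (assumption).
class FieldNormSketch {

    // Raw length norm: inverse square root of the number of terms.
    static double rawNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Truncate a positive value down to four steps per power of two,
    // which is the precision assumed here for the stored norm.
    static double quantize(double value) {
        int exponent = Math.getExponent(value);     // value = m * 2^exponent, 1 <= m < 2
        double scale = Math.scalb(1.0, exponent);   // 2^exponent
        double mantissa = value / scale;
        return Math.floor(mantissa * 4.0) / 4.0 * scale;
    }

    // Count terms by splitting on whitespace (stand-in for a real analyzer).
    static int countTerms(String title) {
        return title.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        String[] titles = {
            "John Doe",
            "John Doe Jr.",
            "John Lennon: The Life",
            "John Thompson's Modern Course for the Piano: First Grade Book",
            "I Rode With Stonewall: Being Chiefly The War Experiences of the "
                + "Youngest Member of Jackson's Staff from John Brown's Raid "
                + "to the Hanging of Mrs. Surratt"
        };
        for (String title : titles) {
            int n = countTerms(title);
            double raw = rawNorm(n);
            System.out.printf("%2d terms  raw=%.4f  stored=%.4f  %s%n",
                    n, raw, quantize(raw), title);
        }
    }
}
```

Under this model the five titles keep the wanted order between documents 1-3 and documents 4-5, but titles 2 and 3 (3 and 4 terms) end up with the same stored value, while one extra term at 5 terms moves the stored value a full step down.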

________________________________________
From: Jack Krupansky <ja...@gmail.com>
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky


Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Jack Krupansky <ja...@gmail.com>.
I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky
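
As an illustration of such a custom norm, the sketch below evaluates a plateau-style length norm with the settings mentioned in this thread (min=1, max=2, steepness=0.1). The formula is the one I recall from the SweetSpotSimilarity javadoc, so treat it as an assumption and check it against your Lucene version. The class PlateauNormSketch is hypothetical and does not use Lucene at all; it only prints raw values, before any norm encoding is applied.

```java
// Hypothetical, standalone sketch of a plateau-style length norm.
// Assumed formula (recalled from the SweetSpotSimilarity javadoc):
//   1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1)
class PlateauNormSketch {

    static double lengthNorm(int numTerms, int min, int max, double steepness) {
        // Zero inside the [min, max] plateau, growing linearly outside it.
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            double classic = 1.0 / Math.sqrt(n);          // inverse square root norm
            double plateau = lengthNorm(n, 1, 2, 0.1);    // settings from this thread
            System.out.printf("%d terms  classic=%.4f  plateau=%.4f%n", n, classic, plateau);
        }
    }
}
```

With these settings, titles of 1 and 2 terms get the same raw norm of 1.0, and the raw drop from 2 to 3 terms is under 9%, compared with roughly 18% for 1/sqrt(numTerms).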

On Wed, Apr 20, 2016 at 7:17 PM, <ji...@svensktnaringsliv.se>
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. One fieldNorm value was 20% larger than the other as a
> result of that. And since we use a multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title, and that single change caused the
> document to move up several positions in the result list. We don't want
> such minor modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it impact our results in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> ________________________________________
> From: Ahmet Arslan <io...@yahoo.com.INVALID>
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way you need to reindex every time you change something.
>
> Regarding the 20% score change, I am not sure how you calculated that number,
> but I assume it is correct.
> What really matters is the relative order of documents. It doesn't mean
> anything that the addition of a word decreases the initial score by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that addition of a
> non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents
> too much. But folks blend the score with other structural fields (popularity),
> or even completely bypass the relevancy score and order by price, production date
> etc. I mean there are many use cases where the effect of the document length
> normalization factor is minimal.
>
> Lucene/Solr is highly pluggable, very easy to customize.
>
> Ahmet
>
>
> On Wednesday, April 20, 2016 11:05 PM, "
> jimi.hullegard@svensktnaringsliv.se" <ji...@svensktnaringsliv.se>
> wrote:
> Hi Ahmet,
>
> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some
> different values at the class gives quite good results. Setting ln_min=1,
> ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or
> less what I want. At least for the title field. I'm not sure what the
> actual effect of those settings would be on longer text fields, so maybe I
> will use the SweetSpotSimilarity only for the title field to start with.
>
> Of course I understand that there are many things that can be considered
> domain-specific requirements, such as whether to favor or punish short, medium
> or long texts, and how. I was just wondering how many actual use cases there are
> where one wants a ~20% difference in score between two documents, where
> the only difference is that one of the documents has one extra word in one
> field. (And now I'm talking about an extra word that doesn't affect
> anything else except the fieldNorm value). I for one find it hard to find
> such a use case, and would consider it a very special use case, and would
> consider a more lenient calculation a better fit for most use cases (and
> therefore most domains). :)
>
> /Jimi
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID]
> Sent: Wednesday, April 20, 2016 8:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> SweetSpotSimilarity allows you to define a document length range, so that
> all documents in that range will get the same fieldNorm value.
> In your case, you can say that documents from 1 word up to 100 words get
> no document length punishment, and that a document longer than 100 words
> gets some punishment.
>
> By the way, favoring/punishing short, middle, or long documents is a
> domain-specific thing. You are free to decide what to do.
>
> Ahmet
>
>
>
> On Wednesday, April 20, 2016 7:46 PM, "jimi.hullegard@svensktnaringsliv.se"
> <ji...@svensktnaringsliv.se> wrote:
> OK. Well, still, the fact that the score changes by almost 20% because of
> just one extra term in the field is not really reasonable if you ask me.
> But you seem to say that this is expected, reasonable and wanted behavior
> for most use cases?
>
> I'm not sure that I feel comfortable replacing the default Similarity
> implementation with a custom one. That would just increase the complexity
> of our setup and would make future upgrades harder (we would for example
> have to remember to check if the default similarity configuration or
> implementation changes).
>
> No, if it really is the case that most people like and want this, and
> there is no way to configure Solr/Lucene to calculate fieldNorm in a more
> reasonable way (in my book) for short field values, then I just think we
> are forced to set omitNorms="true", maybe in combination with a simple
> field boost for shorter fields.
>
> /Jimi
>
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupansky@gmail.com]
> Sent: Wednesday, April 20, 2016 5:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> FWIW, length for normalization is measured in terms (tokens), not
> characters.
>
> With TF-IDF similarity (the default before 6.0), the normalization is based
> on the inverse square root of the number of terms in the field:
>
> return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
>
> That code is in ClassicSimilarity:
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
>
> You can always write your own custom Similarity class to override that
> calculation.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 10:43 AM, <ji...@svensktnaringsliv.se>
> wrote:
>
> > Hi,
> >
> > In general I think that the fieldNorm factor in the score calculation
> > is quite good. But when the text is short I think that the effect is
> > too big.
> >
> > I.e. with two documents that have a short text in the same field, just a
> > few characters extra in one of the documents lowers the fieldNorm factor
> > too much.
> > In one test the text in document 1 is 30 characters long and has
> > fieldNorm 0.4375, and in document 2 the text is 37 characters long and
> > has fieldNorm 0.375. That means that the first document gets almost a
> > 20% higher score simply because of the 7 character difference.
> >
> > What are my options if I want to change this behavior? Can I set a
> > lower character limit, meaning that all fields with a length below
> > this limit gets the same fieldNorm value?
> >
> > I know I can force fieldNorm to be 1 by setting omitNorms="true" for
> > that field, but I would prefer to still have it, just limit its effect
> > on short texts.
> >
> > Regards
> > /Jimi
> >
> >
> >
>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
I am talking about the title field. And for the title field, a sweetspot interval of 1 to 50 makes very little sense. I want to have a fieldNorm value that differentiates between for example 2, 3, 4 and 5 terms in the title, but only very little.

I got the 20% number by simply calculating the difference in the title fieldNorm of two documents, where one title was one word longer than the other. One fieldNorm value was almost 20% larger than the other as a result. And since we use a multiplicative scoring calculation, a 20% increase in the fieldNorm results in a 20% increase in the final score.

I'm not talking about "scores as percentages". I'm simply noting that this minor change in the text data (adding or removing one single word) causes the score to change by almost 20%. I noticed this when I renamed a document, removing a word from the title, and that single change caused the document to move up several positions in the result list. We don't want such minor modifications to have such a big impact on the resulting score.

I'm not sure I can agree with you that "the effect of document length normalization factor is minimal". Then why does it impact our results in such a big way? And as I said, we don't want to disable it completely; we just want it to have a much smaller effect, even on really short texts.

/Jimi

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Jimi,

Please define a meaningful document-length range like min=1 max=50.
By the way, you need to reindex every time you change something.

Regarding the 20% score change, I am not sure how you calculated that number, but I assume it is correct.
What really matters is the relative order of documents. It doesn't mean anything that the addition of a word decreases the initial score by x%. Please see:
https://wiki.apache.org/lucene-java/ScoresAsPercentages

There is an information retrieval heuristic which says that the addition of a non-query term should decrease the score.

Lucene's default document length normalization may favor short documents too much. But folks blend the score with other structural fields (popularity), or even completely bypass the relevancy score and order by price, production date, etc. I mean there are many use cases where the effect of the document length normalization factor is minimal.

Lucene/Solr is highly pluggable and very easy to customize.

Ahmet


RE: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Hang on... It didn't work out as I wanted. But the problem seems to be in the encoding of the fieldNorm value. The encoded value is so coarse that two values that were quite close to each other originally can end up quite far apart after encoding and decoding.

For example, when testing this with two documents, the calculated fieldNorm values for the title field are 0.7905694 and 0.745356 respectively, i.e. the difference is only about 0.05. But the encoded values become 122 and 121 respectively, and when these values are decoded, they become 0.75 and 0.625. The difference is now 0.125. That is quite a big step, if you ask me. In fact, it is so big that it more or less makes this whole thing with SweetSpotSimilarity useless for me.
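
The rounding can be reproduced outside Solr with a small standalone sketch. It is a port, written from memory, of the one-byte float encoding that Lucene 5.x applies to norms (SmallFloat.floatToByte315 and byte315ToFloat, as far as I can tell); the class name NormQuantizationSketch and the method names encode/decode are made up for this illustration, so please check the Lucene source for your version before relying on the details.

```java
// Sketch of the lossy one-byte norm encoding. Assumption: this mirrors
// Lucene 5.x SmallFloat.floatToByte315 / byte315ToFloat; verify against the
// Lucene source before drawing conclusions for a specific version.
final class NormQuantizationSketch {

    // Offset that re-centres the float exponent for the one-byte format.
    private static final int ZERO_POINT = (63 - 15) << 3;

    // Encode a float into a single byte; most mantissa bits are dropped here.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_POINT) {
            // Zero and negative values map to 0, tiny positive values to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return -1; // too large for one byte, saturate at the maximum
        }
        return (byte) (small - ZERO_POINT);
    }

    // Decode the byte back into a float; the dropped bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // The two fieldNorm values from the test described above.
        float[] norms = {0.7905694f, 0.745356f};
        for (float norm : norms) {
            byte encoded = encode(norm);
            System.out.println(norm + " -> byte " + (encoded & 0xff)
                    + " -> " + decode(encoded));
        }
    }
}
```

Between 0.5 and 1.0 this format can only represent 0.5, 0.625, 0.75 and 0.875, so any two norms that fall on different sides of one of those steps end up at least 0.125 apart after a round trip, no matter how close they were before encoding.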

Am I missing something here? Is it really the case that one can have a really great similarity implementation that spits out great values, only to have them butchered because of the way Lucene stores the data? Can I do something to remedy this?

/Jimi

RE: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by ji...@svensktnaringsliv.se.
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some different values at the class gives quite good results. Setting ln_min=1, ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less what I want. At least for the title field. I'm not sure what the actual effect of those settings would be on longer text fields, so maybe I will use the SweetSpotSimilarity only for the title field to start with.
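
For reference, the kind of quick test mentioned above can be approximated without Lucene on the classpath. The sketch below is based on my reading of the length-norm formula used by SweetSpotSimilarity, 1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1); the class and method names are invented for the illustration, the boost is assumed to be 1, and discountOverlaps is not modelled at all.

```java
// Standalone approximation of a sweet-spot length norm. Assumption: the
// formula matches SweetSpotSimilarity's length norm; the boost factor and
// overlap discounting are deliberately left out.
final class SweetSpotNormSketch {

    // Norm is 1.0 inside [min, max] and falls off gradually outside it.
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        // Distance from the plateau; 0 for any length inside [min, max].
        int distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        // The settings discussed for the title field.
        int min = 1;
        int max = 2;
        float steepness = 0.1f;
        for (int numTerms = 1; numTerms <= 8; numTerms++) {
            System.out.println(numTerms + " terms -> "
                    + lengthNorm(numTerms, min, max, steepness));
        }
    }
}
```

With these settings the computed norm stays at 1.0 for one and two terms and then drops by well under 10% per extra term, which is the gentle slope I am after for titles.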

Of course I understand that many things can be considered domain-specific requirements, like whether to favor/punish short/medium/long texts, and how. I was just wondering how many actual use cases there are where one wants a ~20% difference in score between two documents, where the only difference is that one of the documents has one extra word in one field. (And now I'm talking about an extra word that doesn't affect anything except the fieldNorm value.) I for one find it hard to think of such a use case, and would consider it a very special one; a more lenient calculation seems a better fit for most use cases (and therefore most domains). :)

/Jimi

Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Jimi,

SweetSpotSimilarity allows you to define a document length range, so that all documents in that range will get the same fieldNorm value.
In your case, you can say that documents from 1 word up to 100 words get no document length punishment, and that a document longer than 100 words gets some punishment.

By the way, favoring/punishing short, middle, or long documents is a domain-specific thing. You are free to decide what to do.

Ahmet



Re: Is it possible to configure a minimum field length for the fieldNorm value?

Posted by Jack Krupansky <ja...@gmail.com>.
FWIW, length for normalization is measured in terms (tokens), not
characters.

With TF-IDF similarity (the default before 6.0), the normalization is based
on the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115
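
The 0.4375 and 0.375 values from the original question are consistent with
that formula once the norm is squeezed into a single byte at index time. A
small standalone sketch of the arithmetic, with no Lucene dependency: the
class and method names are made up for illustration, an index-time boost of
1.0 is assumed, and the one-byte encoding is reproduced from memory of
Lucene's SmallFloat.floatToByte315, so check it against the linked source
before relying on it.

```java
// Standalone sketch; names are illustrative and not part of Lucene.
class NormQuantizationSketch {

    // Lossy one-byte float encoding (assumed to mirror SmallFloat.floatToByte315).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            // zero or negative maps to 0; tiny positive values underflow to 1
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Inverse of the encoding above.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // fieldNorm as it would be stored: 1/sqrt(numTerms), then encode and decode.
    static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(exact));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " terms -> fieldNorm " + storedNorm(n));
        }
    }
}
```

Under those assumptions, 5 terms give 0.4375 and 6 or 7 terms give 0.375, so
one extra term at that length can move the stored norm a full step. That would
explain the jump seen in the two test documents.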

You can always write your own custom Similarity class to override that
calculation.
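
A minimal sketch of what such an override could compute, written as plain
Java so it runs without Lucene on the classpath. MIN_TERMS and the class name
are hypothetical, and the floor of 10 terms is only an example value. In a
real setup the same logic would presumably live in a subclass of
ClassicSimilarity that overrides lengthNorm(FieldInvertState), and since norms
are written at index time the documents would need to be reindexed afterwards.

```java
// Standalone sketch; in Lucene this logic would sit inside a custom Similarity.
class MinLengthNormSketch {

    // Hypothetical floor: fields shorter than this are treated as this long.
    static final int MIN_TERMS = 10;

    // Same formula as quoted above, but with the term count clamped from below.
    static float lengthNorm(int numTerms, float boost) {
        int effectiveTerms = Math.max(numTerms, MIN_TERMS);
        return boost * ((float) (1.0 / Math.sqrt(effectiveTerms)));
    }

    public static void main(String[] args) {
        // Short fields all share one norm; longer fields are still penalized.
        for (int n : new int[] {3, 5, 6, 10, 20, 50}) {
            System.out.println(n + " terms -> lengthNorm " + lengthNorm(n, 1.0f));
        }
    }
}
```

With this, a 5-term and a 6-term title get the same norm, while a 50-term
field still scores lower than a 20-term one.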

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, <ji...@svensktnaringsliv.se>
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation is
> quite good. But when the text is short I think that the effect is too big.
>
> I.e., with two documents that have a short text in the same field, just a few
> extra characters in one of the documents lower the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has fieldNorm
> 0.4375, and in document 2 the text is 37 characters long and has fieldNorm
> 0.375. That means that the first document gets almost a 20% higher score
> simply because of the 7-character difference.
>
> What are my options if I want to change this behavior? Can I set a lower
> character limit, meaning that all fields with a length below this limit
> get the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for that
> field, but I would prefer to still have it, just limit its effect on short
> texts.
>
> Regards
> /Jimi
>
>
>