Posted to dev@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2005/07/07 22:39:05 UTC

Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

Our client, Rojo, is considering overriding the default implementation of
lengthNorm to fix the bias towards extremely short RSS documents.

The general idea put forth by Doug was that longer documents tend to have
more instances of matching words simply because they are longer, whereas
shorter documents tend to be more precise and should therefore be considered
more authoritative.

While we generally agree with this idea, it seems to break down for
extremely short documents.  For example, one- and two-word documents tend
to be test messages, error messages, or simple answers with no
accompanying context.

I've seen discussions of this before from Doug, Chuck, Kevin and Sanji;
likely others have posted as well.  We'd like to get your feedback on our
current idea for a new implementation, and perhaps eventually see about
getting the default Lucene formula changed.

Pictures speak louder than words.  I've attached a graph of what I'm about
to talk about, and if the attachment is not visible, I've also posted it
online at:
http://ideaeng.com/customers/rojo/lucene-doclength-normalization.gif

Looking at the graph, the default Lucene implementation is represented by
the dashed dark-purple line.  As you can see, it gives the highest scores
to documents with fewer than 5 words, with the maximum score going to
single-word documents.  Doug's quick fix of clipping the score for
documents with fewer than 100 terms is shown in light purple.
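
To make the clip fix concrete, here is a minimal sketch of it as a custom
Similarity (assuming, from my reading of the light-purple curve, that the
fix simply floors the term count at 100):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch only: fields with fewer than 100 terms are normalized as
    // if they contained exactly 100 terms, removing the score spike
    // for very short documents.
    public class ClippedSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(Math.max(numTerms, 100)));
        }
    }

It would be installed with IndexWriter.setSimilarity() and
Searcher.setSimilarity() so that indexing and searching agree.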

Rojo's idea was to target documents of a particular length (we've chosen 50
for this graph), and then have a smooth curve that slopes away from there
for larger and smaller documents.  The red, green and blue curves are some
experiments I did trying to stretch out the standard "bell curve" (see
http://en.wikipedia.org/wiki/Normal_distribution).

The "flat" and "stretch" factors are specific to my formula.  I've tried
playing around with how gradual the curve slopes away for smaller and larger
documents; for example, the red curve really "punishes" documents with less
than 5 words.
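
Since the exact formula isn't spelled out here, the following is only an
illustrative sketch of the bell-curve idea; the TARGET, STRETCH and FLAT
names and values are mine, not the ones behind the graph:

    import org.apache.lucene.search.DefaultSimilarity;

    // Illustrative sketch: the norm peaks at TARGET terms and falls off
    // smoothly on both sides.  STRETCH widens the curve, and FLAT
    // controls how gently it leaves the peak (FLAT = 2 is an ordinary
    // Gaussian; larger values flatten the top and punish the tails).
    public class BellCurveSimilarity extends DefaultSimilarity {
        private static final double TARGET  = 50.0;
        private static final double STRETCH = 75.0;
        private static final double FLAT    = 2.0;

        public float lengthNorm(String fieldName, int numTerms) {
            double dist = Math.abs(numTerms - TARGET) / STRETCH;
            return (float) Math.exp(-Math.pow(dist, FLAT));
        }
    }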

We'd really appreciate your feedback on this, as we do plan to do
"something".  After figuring out what the curve "should be", the next items
on our end are implementation and fixing our existing indices, which I'll
save for a later post.

Thanks in advance for your feedback,
Mark Bennett
mbennett@ideaeng.com
(on behalf of rojo.com)






Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 7, 2005, at 3:16 PM, Mark Bennett wrote:
> Scanning their paper very quickly, I didn't see a specific mention (though
> I might have missed it) of extremely short documents (< 5 words).

The study does not concern itself with different document lengths.   
They chose 6 different collections, but it appears that they were  
looking for a diversity of authorship and subject matter.

> Was there something specific about 1- and 2-word documents you had in
> mind?

Could you use a negative document boost on 1- and 2-word docs to solve
your particular problem?
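
Something along these lines, perhaps -- interpreting "negative boost" as a
boost below 1.0, with the 2-word threshold and the 0.1 penalty as
arbitrary placeholders:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ShortDocPenalty {
        // Sketch: apply a sub-1.0 document boost at index time to very
        // short documents so they stop floating to the top.
        static void addDoc(IndexWriter writer, String bodyText)
                throws IOException {
            Document doc = new Document();
            doc.add(Field.Text("main", bodyText));
            int words = bodyText.trim().split("\\s+").length;
            if (words <= 2) {
                doc.setBoost(0.1f);  // boosts below 1.0 act as penalties
            }
            writer.addDocument(doc);
        }
    }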

After pondering the clip method a little more, I've become wary of  
its effect on title fields.  It would work very well on what you  
refer to as "main" and I generally call "bodytext", but if it were  
set as a default, it would become necessary to weight "title" fields  
or short "keywords" fields more heavily.

I think it would be possible, even desirable, to turn on clipping for  
bodytext while turning it off for title/keywords.  That would require  
the implementor to be familiar with scoring formula theory, though.
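
Since lengthNorm already receives the field name, that switch could look
something like the sketch below (the field names are examples only):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: clip only the bodytext field, leaving title and keywords
    // fields on the default 1/sqrt(numTerms) curve.
    public class PerFieldClipSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            if ("bodytext".equals(fieldName)) {
                return (float) (1.0 / Math.sqrt(Math.max(numTerms, 100)));
            }
            return super.lengthNorm(fieldName, numTerms);
        }
    }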

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




RE: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

Posted by Mark Bennett <mb...@ideaeng.com>.
Hello Marvin,

Thanks for the reply.

Scanning their paper very quickly, I didn't see a specific mention (though I
might have missed it) of extremely short documents (< 5 words).  Was there
something specific about 1- and 2-word documents you had in mind?

Good point on which field.  I was thinking of the "main" field, the body of
the message.  Certainly titles would be expected to be shorter.

Mark

-----Original Message-----
From: Marvin Humphrey [mailto:marvin@rectangular.com] 
Sent: Thursday, July 07, 2005 2:39 PM
To: java-dev@lucene.apache.org
Cc: Mark Bennett
Subject: Re: Proposal for change to DefaultSimilarity's lengthNorm to fix
"short document" problem

On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:
> Our client, Rojo, is considering overriding the default implementation of
> lengthNorm to fix the bias towards extremely short RSS documents.

Different normalization schemes are given a thorough examination in  
this 1997 paper:

http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Here is what they have to say about the ideal case, "full  
normalization":

[begin excerpt]

... a document containing {x, y, z} will have exactly the same score as
another document containing {x, x, y, y, z, z} because these two document
vectors have the same unit vector. We can debate whether this is
reasonable or not, but when document lengths vary greatly, it makes sense
to take them into account.

[end excerpt]

Their experimental results indicate that the Lucene default --
1/sqrt(num_terms) -- is quite effective.  The effect upon precision of
the various normalization schemes is specific to the characteristics of
the document collection, though.  Extremely short RSS documents would
seem to be an outlying case.  Anything short of (prohibitively expensive)
full normalization requires a bias towards one length of document.  If
you assign maximum weight to the 50-term documents, you've probably
penalized dictionary definitions.  FWIW (this is my second Lucene post --
I'm not involved with the project), I would lean towards the clip method
as a default, but it's certainly justifiable to tweak a normalization
scheme to suit your needs.

> The "flat" and "stretch" factors are specific to my formula.  I've  
> tried
> playing around with how gradual the curve slopes away for smaller  
> and larger
> documents; for example, the red curve really "punishes" documents  
> with less
> than 5 words.

Please correct me if I'm wrong, but isn't num_terms in Lucene's
1/sqrt(num_terms) the number of terms in the field, rather than the
number of terms in the document?  If that's true, then how would adopting
a different curve as default affect the relative weight of a "title"
field?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:
> Our client, Rojo, is considering overriding the default implementation of
> lengthNorm to fix the bias towards extremely short RSS documents.

Different normalization schemes are given a thorough examination in  
this 1997 paper:

http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Here is what they have to say about the ideal case, "full  
normalization":

[begin excerpt]

... a document containing {x, y, z} will have exactly the same score as
another document containing {x, x, y, y, z, z} because these two document
vectors have the same unit vector. We can debate whether this is
reasonable or not, but when document lengths vary greatly, it makes sense
to take them into account.

[end excerpt]

Their experimental results indicate that the Lucene default --
1/sqrt(num_terms) -- is quite effective.  The effect upon precision of
the various normalization schemes is specific to the characteristics of
the document collection, though.  Extremely short RSS documents would
seem to be an outlying case.  Anything short of (prohibitively expensive)
full normalization requires a bias towards one length of document.  If
you assign maximum weight to the 50-term documents, you've probably
penalized dictionary definitions.  FWIW (this is my second Lucene post --
I'm not involved with the project), I would lean towards the clip method
as a default, but it's certainly justifiable to tweak a normalization
scheme to suit your needs.
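
For concreteness: 1/sqrt(num_terms) gives a one-term field a norm of 1.0,
a 25-term field 0.2, and a 100-term field 0.1, so a single-word document
starts out with ten times the length norm of a 100-word one.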

> The "flat" and "stretch" factors are specific to my formula.  I've  
> tried
> playing around with how gradual the curve slopes away for smaller  
> and larger
> documents; for example, the red curve really "punishes" documents  
> with less
> than 5 words.

Please correct me if I'm wrong, but isn't num_terms in Lucene's
1/sqrt(num_terms) the number of terms in the field, rather than the
number of terms in the document?  If that's true, then how would adopting
a different curve as default affect the relative weight of a "title"
field?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

Posted by DM Smith <dm...@gmail.com>.
At crosswire.org we are using Lucene to index Bibles: each Bible has its
own index, and each verse in the Bible is a document in that index. So
each document is short. Length depends upon the language of translation,
but the lengths range from 2 to just under 100 terms.

In our case the existing bias seems appropriate, and it does not appear
to break down for extremely short documents.

I would suggest that if the bias is changed, it be based upon the length
distribution of documents in the index, or be driven by
programmer-supplied parameters; a rough sketch of the latter idea follows.
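
A minimal sketch (the constructor parameter and its use are illustrative,
not a proposed API; the floor might instead be derived from the index's
own length distribution):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: let the programmer supply the clipping floor instead of
    // hard-coding it.  A Bible index might pass 1 (no clipping at all),
    // while an RSS index might pass 100.
    public class ParameterizedSimilarity extends DefaultSimilarity {
        private final int floor;

        public ParameterizedSimilarity(int floor) {
            this.floor = floor;
        }

        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(Math.max(numTerms, floor)));
        }
    }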

Mark Bennett wrote:

>Our client, Rojo, is considering overriding the default implementation of
>lengthNorm to fix the bias towards extremely short RSS documents.
>
>The general idea put forth by Doug was that longer documents tend to have
>more instances of matching words simply because they are longer, whereas
>shorter documents tend to be more precise and should therefore be considered
>more authoritative.
>
>While we generally agree with this idea, it seems to break down for
>extremely short documents.  For example, one- and two-word documents tend
>to be test messages, error messages, or simple answers with no
>accompanying context.
>
>I've seen discussions of this before from Doug, Chuck, Kevin and Sanji;
>likely others have posted as well.  We'd like to get your feedback on our
>current idea for a new implementation, and perhaps eventually see about
>getting the default Lucene formula changed.
>
>Pictures speak louder than words.  I've attached a graph of what I'm about
>to talk about, and if the attachment is not visible, I've also posted it
>online at:
>http://ideaeng.com/customers/rojo/lucene-doclength-normalization.gif
>
>Looking at the graph, the default Lucene implementation is represented by
>the dashed dark-purple line.  As you can see, it gives the highest scores
>to documents with fewer than 5 words, with the maximum score going to
>single-word documents.  Doug's quick fix of clipping the score for
>documents with fewer than 100 terms is shown in light purple.
>
>Rojo's idea was to target documents of a particular length (we've chosen 50
>for this graph), and then have a smooth curve that slopes away from there
>for larger and smaller documents.  The red, green and blue curves are some
>experiments I did trying to stretch out the standard "bell curve" (see
>http://en.wikipedia.org/wiki/Normal_distribution).
>
>The "flat" and "stretch" factors are specific to my formula.  I've tried
>playing around with how gradual the curve slopes away for smaller and larger
>documents; for example, the red curve really "punishes" documents with less
>than 5 words.
>
>We'd really appreciate your feedback on this, as we do plan to do
>"something".  After figuring out what the curve "should be", the next items
>on our end are implementation and fixing our existing indices, which I'll
>save for a later post.
>
>Thanks in advance for your feedback,
>Mark Bennett
>mbennett@ideaeng.com
>(on behalf of rojo.com)
