Posted to solr-user@lucene.apache.org by "scott chu (朱炎詹)" <sc...@udngroup.com> on 2010/08/24 06:00:01 UTC

Why it's boosted up?

In Lucene's web page, there's a paragraph:

"Indexing time boosts are preprocessed for storage efficiency and written to 
the directory (when writing the document) in a single byte (!) as follows: 
For each field of a document, all boosts of that field (i.e. all boosts 
under the same field name in that doc) are multiplied. The result is 
multiplied by the boost of the document, and also multiplied by a "field 
length norm" value that represents the length of that field in that doc (so 
shorter fields are automatically boosted up). "

I thought the greater the value, the stronger the boost. Then why are short
fields boosted up? Isn't the norm value for short fields smaller?
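
For a concrete picture of the product that paragraph describes, here is a
plain-Java sketch (not actual Lucene API calls; the 1/sqrt(numTerms) length
norm is the classic DefaultSimilarity formula, and the boost values are made
up for illustration):

    // Sketch of the norm product described in the quoted paragraph.
    public class NormSketch {
        public static void main(String[] args) {
            float docBoost   = 1.0f;            // boost set on the whole document
            float fieldBoost = 2.0f * 1.5f;     // all boosts on the same field name, multiplied
            int   numTerms   = 10;              // tokens indexed into that field
            float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));  // ~0.316
            float norm       = docBoost * fieldBoost * lengthNorm;   // ~0.949
            // This single product is what gets squeezed into one byte per field per doc.
            System.out.println("stored norm ~= " + norm);
        }
    }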


Re: Why it's boosted up?

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.
Thanks for your clear explanation! I got it :)
----- Original Message ----- 
From: "MitchK" <mi...@web.de>
To: <so...@lucene.apache.org>
Sent: Tuesday, August 24, 2010 3:37 PM
Subject: Re: Why it's boosted up?


Re: Why it's boosted up?

Posted by MitchK <mi...@web.de>.
Hi Scott,



> (so  shorter fields are automatically boosted up). " 
> 
The theory behind that is the following (in simple words):
Let's say you have two documents, and each doc contains only one field
(as in my example). Additionally, we have a query that contains two words.
Let's say doc1 contains 10 words and doc2 contains 20 words, and the query
matches both docs with both words.
The idea of boosting shorter fields more strongly than longer fields is this:
In doc1, 2/10 = 0.2 => 20% of the words match your query.
In doc2, 2/20 = 0.1 => 10% of the words match your query.

So doc1 should get a better score, because its ratio of matching words to
total words is greater than doc2's. This is the idea behind using norms as
an index-time boosting factor. NOTE: this does not mean that doc1 gets
boosted by 20% and doc2 by 10%! It only illustrates the idea behind such
norms.
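
To put numbers on it with the classic formula (a sketch assuming
DefaultSimilarity's 1/sqrt(numTerms) length norm, not the real Similarity
API):

    // The classic length norm for the 10-word and the 20-word doc above.
    public class LengthNormSketch {
        public static void main(String[] args) {
            double norm10 = 1.0 / Math.sqrt(10);   // ~0.316 for the 10-word field
            double norm20 = 1.0 / Math.sqrt(20);   // ~0.224 for the 20-word field
            // The shorter field gets the LARGER norm, so its matches score higher;
            // that is what "shorter fields are automatically boosted up" means.
            System.out.printf("norm(10)=%.3f  norm(20)=%.3f%n", norm10, norm20);
        }
    }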

From the Similarity class's documentation of lengthNorm():



> Matches in longer fields are less precise, so implementations of this
> method usually return smaller values when numTokens is large, and larger
> values when numTokens is small.
> 

However, you, as the search-application developer, have to decide whether
this theory applies to your application or not. In some cases using norms
makes no sense, in others it does.
If norms do apply to your project, omitting them is not a good way to save
disk space.
Furthermore, if the theory does apply to the business needs of your
application but its impact is currently too heavy, you can have a look at
SweetSpotSimilarity in Lucene.
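
The idea of SweetSpotSimilarity is that the length norm stays flat at 1.0
for any field length inside a "sweet spot" you define and only decays
outside of it. A rough sketch of that curve, based on the formula described
in its documentation (the min/max/steepness values here are made up, and
this is an illustration of the idea, not the real class):

    // Sketch of a sweet-spot style length norm: flat between min and max,
    // gently decaying outside. Parameter values are invented for illustration.
    public class SweetSpotSketch {
        static double lengthNorm(int numTerms, int min, int max, double steepness) {
            double outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
            return 1.0 / Math.sqrt(steepness * outside + 1.0);
        }
        public static void main(String[] args) {
            // Fields of 3 to 50 terms all get norm 1.0; much shorter or longer ones decay.
            for (int n : new int[] {1, 3, 10, 50, 200}) {
                System.out.printf("numTerms=%3d  norm=%.3f%n", n, lengthNorm(n, 3, 50, 0.5));
            }
        }
    }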



> The request is from our business team; they wish users of our product
> could type in a partial string of a word that exists in the title or body
> field.
> 
You mean something like typing "note" and also getting results like
"notebook"?
The right approach for that is not ShingleFilter but n-grams or edge
n-grams.
Shingles do something like this:
"This is my shingle sentence" -> "This is, is my, my shingle, shingle
sentence" -> the sentence is broken up into overlapping word groups. The
benefit of doing so is that, if a query matches one of these shingles, you
have found a short phrase without using the more expensive PhraseQuery
feature.
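
To make the difference concrete, here is a plain string-handling
illustration of the two approaches (a sketch of the concept only, not the
actual Lucene ShingleFilter or edge n-gram filter classes):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrates what shingles and edge n-grams produce, using plain strings.
    public class ShingleVsEdgeNGram {
        // Word shingles of size 2: adjacent word pairs.
        static List<String> shingles(String text) {
            String[] words = text.split("\\s+");
            List<String> out = new ArrayList<>();
            for (int i = 0; i + 1 < words.length; i++) {
                out.add(words[i] + " " + words[i + 1]);
            }
            return out;
        }
        // Edge n-grams: every prefix of a word from minGram up to its full length.
        static List<String> edgeNGrams(String word, int minGram) {
            List<String> out = new ArrayList<>();
            for (int len = minGram; len <= word.length(); len++) {
                out.add(word.substring(0, len));
            }
            return out;
        }
        public static void main(String[] args) {
            System.out.println(shingles("This is my shingle sentence"));
            // -> [This is, is my, my shingle, shingle sentence]
            System.out.println(edgeNGrams("notebook", 2));
            // -> [no, not, note, noteb, notebo, noteboo, notebook]
            // Indexing edge n-grams is what lets a query for "note" match "notebook".
        }
    }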

Kind regards,
- Mitch


scott chu wrote:
> 
> In Lucene's web page, there's a paragraph:
> 
> "Indexing time boosts are preprocessed for storage efficiency and written
> to 
> the directory (when writing the document) in a single byte (!) as follows: 
> For each field of a document, all boosts of that field (i.e. all boosts 
> under the same field name in that doc) are multiplied. The result is 
> multiplied by the boost of the document, and also multiplied by a "field 
> length norm" value that represents the length of that field in that doc
> (so 
> shorter fields are automatically boosted up). "
> 
> I thought the greater the value, the stronger the boost. Then why are
> short fields boosted up? Isn't the norm value for short fields smaller?
> 
> 
> 
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Why it's boosted up?

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.
Thanks! That makes sense :)

----- Original Message ----- 
From: "Ahmet Arslan" <io...@yahoo.com>
To: <so...@lucene.apache.org>
Sent: Tuesday, August 24, 2010 4:30 PM
Subject: Re: Why it's boosted up?


Re: Why it's boosted up?

Posted by Ahmet Arslan <io...@yahoo.com>.
> Then why are short fields boosted up?

In other words, longer documents are penalized, because they possibly contain many terms/words. If this mechanism did not exist, longer documents would take over and usually pop up on the first page.