
Posted to java-user@lucene.apache.org by daniel rosher <da...@hotonline.com> on 2008/02/07 11:15:28 UTC

Lucene scoring and short fields

Hi All,

Lucene scoring can favour shorter fields in documents, so in the
past we've had to pad out 'unreasonably' short fields to a set minimum
length (with basically nonsense words). I'm wondering how others might
have dealt with this issue.

Another option is to have a custom Similarity class with an altered
lengthNorm method?
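
To make the second option concrete, here is a minimal sketch of an altered length norm that treats every field shorter than some minimum as if it had exactly that many tokens, which has the same effect as padding but without the nonsense words. It is plain Java with no Lucene dependency; the class name and MIN_TOKENS value are made up for illustration. In Lucene 2.x I believe the place to put this logic is an override of DefaultSimilarity.lengthNorm(String fieldName, int numTokens), with the custom Similarity set on the IndexWriter, since norms are computed at index time.

```java
public class FlooredLengthNorm {
    // Hypothetical floor: fields shorter than this are scored as if
    // they contained exactly this many tokens.
    static final int MIN_TOKENS = 20;

    // Same 1/sqrt(n) shape as the default, but n is never below MIN_TOKENS.
    static float lengthNorm(int numTokens) {
        int effective = Math.max(numTokens, MIN_TOKENS);
        return (float) (1.0 / Math.sqrt(effective));
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 5, 20, 100}) {
            System.out.println(n + " tokens -> lengthNorm " + lengthNorm(n));
        }
    }
}
```

With this floor, a 1-token field and a 20-token field get the same norm, while longer fields are still penalised as before.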

Cheers,
Dan

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

From:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html
score(q,d)   =  coord(q,d)  ·  queryNorm(q)  ·  SUM(  tf(t in d)  ·
idf(t)^2  ·  t.getBoost() ·  norm(t,d)  )

Given one term query, and the term found in two documents doc{a},
doc{b}(with no boost on field, doc or query term)

score(q,d)   =~  SUM (  tf(t in d)  ·  norm(t,d)  )

and for one term:
score(q,d)   =~   tf(t in d)  ·  norm(t,d)  

also:
norm(t,d) =~ lengthNorm(field) 

lengthNorm(field) :
computed when the document is added to the index in accordance with the
number of tokens of this field in the document, so that shorter fields
contribute more to the score

in DefaultSimilarity.java

lengthNorm(field)  = 1/sqrt(num_terms_in_field)

doc{a} field{a} num_terms_in_field = 100, term appears 10 times in
field{a},doc{a}
score =~ 10/sqrt(100) = 1
doc{b} field{a} num_terms_in_field = 300, term appears 10 times in
field{a},doc{b}
score =~ 10/sqrt(300) = 0.577350269
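
The arithmetic above can be reproduced with a few lines of plain Java. This is only a sketch of the simplified formula used in this message (raw term count times 1/sqrt(num_terms_in_field)); the class and method names are made up. As far as I know the real DefaultSimilarity also takes the square root of the term frequency and stores norms with reduced precision, so actual scores will differ, but the ratio between the two documents is what matters here.

```java
public class ShortFieldScore {
    // lengthNorm(field) = 1/sqrt(num_terms_in_field), as in the message.
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    // Simplified score =~ tf(t in d) * norm(t,d), with tf taken as the raw count.
    static double approxScore(int termFreq, int numTermsInField) {
        return termFreq * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        double a = approxScore(10, 100); // doc{a}: 100 terms in the field
        double b = approxScore(10, 300); // doc{b}: 300 terms in the field
        System.out.println("doc{a} score =~ " + a);
        System.out.println("doc{b} score =~ " + b);
        System.out.println("ratio a/b    =~ " + (a / b));
    }
}
```

The ratio comes out at sqrt(3), roughly 1.73, so the same 10 occurrences count for noticeably more in the shorter field.
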
Daniel Rosher
Developer



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene scoring and short fields

Posted by Chris Hostetter <ho...@fucit.org>.

: (with basically nonsense words), I'm wondering how others might have
: dealt with this issue.
: 
: Another option is to have a custom Similarity class with an altered
: lengthNorm method?

that is what i would recommend ... it's exactly what SweetSpotSimilarity 
does (you define a "plateau" of lengths that are all treated equally, and 
anything shorter or longer than that drops off with lower lengthNorm 
values).
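
a rough sketch of the plateau idea in plain Java, no Lucene on the classpath needed. the formula is modelled on what i understand the SweetSpotSimilarity javadoc to describe, and the class name and the min/max/steepness values are just made up for illustration, so check the real class before relying on the numbers:

```java
public class PlateauLengthNorm {
    // Lengths between min and max (inclusive) all get a norm of 1.0;
    // lengths outside that range fall off, faster for larger steepness.
    static float lengthNorm(int numTokens, int min, int max, float steepness) {
        // Zero inside [min, max]; twice the distance to the nearest edge outside it.
        double penalty = Math.abs(numTokens - min) + Math.abs(numTokens - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * penalty + 1.0));
    }

    public static void main(String[] args) {
        int min = 10, max = 100;   // hypothetical plateau bounds
        float steepness = 0.5f;    // hypothetical fall-off rate
        for (int n : new int[] {2, 10, 50, 100, 150}) {
            System.out.println(n + " tokens -> lengthNorm " + lengthNorm(n, min, max, steepness));
        }
    }
}
```

the point is that very short fields no longer get the biggest boost; they are penalised just like very long ones.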


-Hoss

