Posted to solr-user@lucene.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/10/31 13:48:42 UTC

Scoring algorithm?

Am I right in thinking that a document whose sortable field is only
two sentences long and contains the search term once will score higher
than one whose field is 50 sentences long and contains the search term
4 times?  Is there a way to change it to score higher based only on
the number of hits?

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin

Re: Scoring algorithm?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sat, Oct 31, 2009 at 10:22 AM, Paul Tomblin <pt...@xcski.com> wrote:
> If I change the schema this way, do I need to re-submit all the
> documents to Solr?

Yep.  And you should delete the index before doing so. Some field
properties are contagious: merging a segment without norms and a
segment with norms will result in a single segment with norms.

>  And if I have them all sitting on disk as XML
> files that look like
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <doc>
> <field name="...">...</field>
> <field name="...">...</field>
> </doc>
> is there a quick way to submit them all to Solr?

The easiest way is to just use something like post.sh *.xml
That's slow performance-wise, but not a big deal if you don't have too
many docs.
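
For anyone who would rather script the re-submission than call post.sh,
here is a minimal Python sketch of the same idea. It assumes the update
handler lives at http://localhost:8983/solr/update (adjust for your
install) and that each file holds a bare <doc> element, which it wraps in
the <add> element that Solr's XML update format expects. The names
SOLR_UPDATE_URL, wrap_doc, post_xml and reindex are made up for this
example. It also sends a delete-by-query for *:* first, as one possible
way to clear the old documents; whether that is enough to get rid of
existing norms on your version is an assumption worth checking.

```python
import glob
import re
import urllib.request

# Assumption: default URL of the example Solr install; change as needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def wrap_doc(xml_text):
    """Drop a leading XML declaration and wrap a bare <doc> in <add>."""
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", xml_text).strip()
    if body.startswith("<add"):
        return body  # already a complete update message
    return "<add>" + body + "</add>"


def post_xml(payload, url=SOLR_UPDATE_URL):
    """POST one XML update message and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def reindex(pattern="*.xml", url=SOLR_UPDATE_URL):
    """Clear the index, re-submit every matching file, then commit."""
    post_xml("<delete><query>*:*</query></delete>", url)
    post_xml("<commit/>", url)
    count = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            post_xml(wrap_doc(handle.read()), url)
        count += 1
    post_xml("<commit/>", url)  # one commit at the end, not one per file
    return count
```

Calling reindex("*.xml") from the directory holding the files would send
one request per file, so like post.sh it is only reasonable for a modest
number of documents.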

-Yonik
http://www.lucidimagination.com



Re: Scoring algorithm?

Posted by Paul Tomblin <pt...@xcski.com>.
If I change the schema this way, do I need to re-submit all the
documents to Solr?  And if I have them all sitting on disk as XML
files that look like
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<doc>
<field name="...">...</field>
<field name="...">...</field>
</doc>
is there a quick way to submit them all to Solr?


-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin

Re: Scoring algorithm?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sat, Oct 31, 2009 at 8:48 AM, Paul Tomblin <pt...@xcski.com> wrote:
> Am I right in thinking that a document that the sortable field is only
> two sentences long and contains the search term once will score higher
> than one that is 50 sentences long that contains the search term 4
> times?

Yep.  Assuming 15 tokens per sentence, doc1 will have
lengthNorm = 1/(2*15)**.5 or 0.18 with tf = 1**.5 or 1
doc2 will have
lengthNorm = 1/(50*15)**.5 or 0.04 with tf = 4**.5 or 2
so tf*lengthNorm is about 0.18 for doc1 versus about 0.07 for doc2.

Or if you don't want length normalization at all, simply use
omitNorms=true in the schema for this field.
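
The arithmetic above can be checked with a few lines of Python. This is
only a sketch of the two factors discussed here (tf and lengthNorm); it
ignores idf, boosts and every other part of the real score, and the
helper names are invented for the example. With use_norms=False the
lengthNorm is treated as 1, which is roughly the effect of omitNorms=true
on this comparison.

```python
import math

TOKENS_PER_SENTENCE = 15  # same assumption as in the reply above


def length_norm(num_tokens):
    """Default-style length normalization: 1 / sqrt(number of tokens)."""
    return 1.0 / math.sqrt(num_tokens)


def tf(freq):
    """Default-style term frequency factor: sqrt(number of hits)."""
    return math.sqrt(freq)


def factor(doc, use_norms=True):
    """tf * lengthNorm for one document; lengthNorm is 1 without norms."""
    tokens = doc["sentences"] * TOKENS_PER_SENTENCE
    norm = length_norm(tokens) if use_norms else 1.0
    return tf(doc["hits"]) * norm


doc1 = {"sentences": 2, "hits": 1}
doc2 = {"sentences": 50, "hits": 4}

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    print(name, round(factor(doc), 3), round(factor(doc, use_norms=False), 3))
```

With norms the short document comes out ahead; without them the
document with more hits does.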

>  Is there a way to change it to score higher based only on
> number of hits?

Yes, simply use omitNorms=true in the schema.xml for this field.

If you still wanted a lengthNorm, you could change the balance by
creating a custom similarity and overriding either lengthNorm() or
tf().
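
To get a feel for what "changing the balance" could mean before writing
a custom similarity, here is a small Python model. It is not Lucene code:
a real custom similarity would be a Java class, and the exponent
parameters below are hypothetical knobs invented for this sketch. It
simply replaces the two square roots with adjustable exponents and
compares the same two documents (1 hit in 30 tokens versus 4 hits in 750
tokens).

```python
def score_factor(freq, num_tokens, tf_exponent=0.5, length_exponent=0.5):
    """Model of tf * lengthNorm with adjustable exponents.

    The defaults (0.5 and 0.5) reproduce sqrt(freq) / sqrt(num_tokens).
    """
    return (freq ** tf_exponent) * (num_tokens ** -length_exponent)


doc1 = (1, 2 * 15)    # 1 hit, 2 sentences of 15 tokens
doc2 = (4, 50 * 15)   # 4 hits, 50 sentences of 15 tokens

for length_exponent in (0.5, 0.25, 0.2):
    s1 = score_factor(*doc1, length_exponent=length_exponent)
    s2 = score_factor(*doc2, length_exponent=length_exponent)
    winner = "doc1" if s1 > s2 else "doc2"
    print(length_exponent, round(s1, 3), round(s2, 3), winner)
```

In this model the longer document only overtakes the shorter one once
the length exponent is softened quite a lot, and making tf linear
(tf_exponent=1.0) is not enough on its own for these two documents.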

-Yonik
http://www.lucidimagination.com