Posted to solr-user@lucene.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/12/15 09:09:39 UTC

Omitting tf but not positions

Hi,

I have a case where I use the DisMax "pf" parameter to boost on phrase matches in a field. I set omitNorms=true to keep length normalization from messing with my scores.

However, in some documents the phrase "foo bar" occurs more than once in the same field, and such a document gets an unintended TF boost:

    1.4142135 = tf(phraseFreq=2.0)
vs
    1.0 = tf(phraseFreq=1.0)
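Both values follow from the default tf formula, which takes the square root of the frequency (here, the phrase frequency). A minimal sketch of that arithmetic in plain Java, assuming the default square-root behaviour of DefaultSimilarity; the class name TfDemo is made up for illustration and is not part of Lucene:

```java
// Reproduces the arithmetic of the default tf formula: sqrt(freq).
// TfDemo is a hypothetical stand-alone class with no Lucene dependency.
class TfDemo {
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        System.out.println(tf(1.0f)); // phrase occurs once in the field
        System.out.println(tf(2.0f)); // phrase occurs twice in the field
    }
}
```

So a second occurrence of the phrase multiplies that part of the score by roughly 1.41, which is the difference shown in the two explain lines.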

I could use omitTermFreqAndPositions but that would disable phrase search ability, wouldn't it?
Any way to disable TF/IDF normalization without also disabling positions?

Solr 1.4.1

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

Re: Omitting tf but not positions

Posted by Robert Zotter <ro...@gmail.com>.

Jan,

You are correct, you'll need your own Similarity class.

Have a look at SweetSpotSimilarity 
(http://lucene.apache.org/java/3_0_3/api/contrib-misc/org/apache/lucene/misc/SweetSpotSimilarity.html)

On 2/25/11 10:57 AM, Jan Høydahl wrote:
> I also have a case (yellow-page) where IDF comes in and destroys the rank.
> A company listing with a word which occurs in few other listings is not necessarily better than others just because of that. When it gets to the extreme value of IDF=1, we get an artificially high IDF boost.
>
> It is not killed by omitNorms, nor by omitTermFreqAndPositions. Any per-field way to get rid of the IDF effect?
> Or should I override idf() in Similarity?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 15. des. 2010, at 13.27, Robert Muir wrote:
>
>> On Wed, Dec 15, 2010 at 3:09 AM, Jan Høydahl / Cominvent
>> <ja...@cominvent.com>  wrote:
>>> Any way to disable TF/IDF normalization without also disabling positions?
>>>
>> see Similarity.tf(float) and Similarity.tf(int)
>>
>> if you want to change this for both terms and phrases just override
>> Similarity.tf(float), since by default Similarity.tf(int) delegates to
>> that.
>> otherwise, override both.
>>
>> of course the big limitation being you can't customize Similarity per-field yet.

Re: Omitting tf but not positions

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Feb 25, 2011 at 1:57 PM, Jan Høydahl <ja...@cominvent.com> wrote:
> I also have a case (yellow-page) where IDF comes in and destroys the rank.
> A company listing with a word which occurs in few other listings is not necessarily better than others just because of that. When it gets to the extreme value of IDF=1, we get an artificially high IDF boost.
>
> It is not killed by omitNorms, nor by omitTermFreqAndPositions. Any per-field way to get rid of the IDF effect?
> Or should I override idf() in Similarity?
>

Hi Jan, my reply was back in December. These days in Lucene/Solr
trunk, you can customize Similarity on a per-field basis.
So your yellow-page field can have a completely different similarity
(tf, idf, lengthnorm, etc).

For that field you can disable things like TF and IDF entirely, e.g.
just set them to a constant such as 1, or if you think that's too risky,
consider an alternative ranking scheme that doesn't use IDF at all,
such as the example in
https://issues.apache.org/jira/browse/LUCENE-2864

For now, you have to implement SimilarityProvider in a Java class
(with something like a HashMap returning different similarities for
different fields), and set this up with the similarity hook in
schema.xml, but there is an issue open to make this easier:
https://issues.apache.org/jira/browse/SOLR-2338
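
The HashMap idea could look roughly like the sketch below. It is deliberately written against small stand-in types so it compiles without Lucene: FieldSimilarity, DefaultLikeSimilarity, ConstantSimilarity, PerFieldProvider and the field name "yellow_page" are all hypothetical, and the real SimilarityProvider in trunk has more methods and may differ in its signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parts of Lucene's Similarity used here.
interface FieldSimilarity {
    float tf(float freq);
    float idf(int docFreq, int numDocs);
}

// Same arithmetic as the default formulas: sqrt tf, log-based idf.
class DefaultLikeSimilarity implements FieldSimilarity {
    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}

// TF and IDF both flattened to a constant, as suggested above.
class ConstantSimilarity implements FieldSimilarity {
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }
}

// Hypothetical provider: looks up a similarity by field name and
// falls back to the default-like one for unlisted fields.
class PerFieldProvider {
    private final Map<String, FieldSimilarity> byField = new HashMap<>();
    private final FieldSimilarity fallback = new DefaultLikeSimilarity();

    PerFieldProvider() {
        byField.put("yellow_page", new ConstantSimilarity());
    }

    FieldSimilarity get(String field) {
        FieldSimilarity s = byField.get(field);
        return s != null ? s : fallback;
    }
}
```

With this shape, only the listed field loses its TF and IDF weighting; every other field keeps the default scoring.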

Re: Omitting tf but not positions

Posted by Jan Høydahl <ja...@cominvent.com>.

I also have a case (yellow-page) where IDF comes in and destroys the rank.
A company listing with a word which occurs in few other listings is not necessarily better than others just because of that. When it gets to the extreme value of IDF=1, we get an artificially high IDF boost.

It is not killed by omitNorms, nor by omitTermFreqAndPositions. Any per-field way to get rid of the IDF effect?
Or should I override idf() in Similarity?
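
To make the effect concrete, here is a small sketch of the default idf arithmetic next to a flat replacement. It assumes the default formula log(numDocs / (docFreq + 1)) + 1; IdfDemo and flatIdf are made-up names, and the document counts are invented for illustration.

```java
// Hypothetical stand-alone demo, no Lucene dependency.
class IdfDemo {
    // Same arithmetic as the default idf(docFreq, numDocs).
    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // What an overridden idf() could return: the same weight for every term.
    static float flatIdf(int docFreq, int numDocs) {
        return 1.0f;
    }

    public static void main(String[] args) {
        int numDocs = 1000; // invented index size
        // A word found in a single listing versus one found in half of them.
        System.out.println(defaultIdf(1, numDocs));
        System.out.println(defaultIdf(500, numDocs));
        System.out.println(flatIdf(1, numDocs));
    }
}
```

The rarer the word, the larger the default value, which is the boost described above; the flat version removes that difference.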

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 15. des. 2010, at 13.27, Robert Muir wrote:

> On Wed, Dec 15, 2010 at 3:09 AM, Jan Høydahl / Cominvent
> <ja...@cominvent.com> wrote:
>> Any way to disable TF/IDF normalization without also disabling positions?
>> 
> 
> see Similarity.tf(float) and Similarity.tf(int)
> 
> if you want to change this for both terms and phrases just override
> Similarity.tf(float), since by default Similarity.tf(int) delegates to
> that.
> otherwise, override both.
> 
> of course the big limitation being you can't customize Similarity per-field yet.

Re: Omitting tf but not positions

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Dec 15, 2010 at 3:09 AM, Jan Høydahl / Cominvent
<ja...@cominvent.com> wrote:
> Any way to disable TF/IDF normalization without also disabling positions?
>

see Similarity.tf(float) and Similarity.tf(int)

if you want to change this for both terms and phrases just override
Similarity.tf(float), since by default Similarity.tf(int) delegates to
that.
otherwise, override both.

of course the big limitation being you can't customize Similarity per-field yet.
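
A minimal sketch of that override, for illustration only. BaseSimilarity is a stand-in written so the example compiles without Lucene on the classpath; in a real setup the subclass would extend Lucene's DefaultSimilarity instead. FlatTfSimilarity is a hypothetical name.

```java
// Stand-in for Lucene's DefaultSimilarity (assumption: sqrt tf, and
// tf(int) delegating to tf(float), as described above).
class BaseSimilarity {
    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public float tf(int freq) {
        return tf((float) freq);
    }
}

// Scores any match as 1, no matter how many times the term or phrase occurs.
class FlatTfSimilarity extends BaseSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
```

Because tf(int) delegates to tf(float), the single override covers both term and phrase frequencies. In Solr 1.4.1 such a class would be registered through the global similarity element in schema.xml, so it would affect every field.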