Posted to solr-user@lucene.apache.org by Dotan Cohen <do...@gmail.com> on 2012/11/07 10:15:44 UTC
Exponential omitNorms
Hi all! One area where I am applying Solr deals with variable-length
posts by users: anything from one-word posts ("Cool!" with an
attached photo) to blog-post length (500-1000 words). Due to Field
Normalization, the short posts get the highest Solr score, while the
long, informative posts are pushed to the end of the results.
Therefore I am moving to remove Field Normalization with
omitNorms=true. However, there do exist crafty users who stuff
hundreds of irrelevant words together to get noticed. Therefore, I
would like to try some sort of exponential (or even linear) Field
Normalization where no normalization is performed on one-word
documents but the longest documents (over a few hundred words) get
some penalty.
Are there any facilities in Solr for performing this? I would of
course prefer query-time computation, as that lets me do other things
with the data, but if only index-time computation is possible then I
can accept that.
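For concreteness, the default length normalization behind this behavior is 1/sqrt(numTerms), as in Lucene's DefaultSimilarity (and it is further squashed into a single byte in the index, so precision is coarse). A standalone sketch of the resulting factors; the class and method names below are just for illustration, not the Lucene API:

```java
// Default Lucene-style length normalization: 1/sqrt(numTerms).
// Shown standalone; in Lucene this lives in DefaultSimilarity and the
// result is additionally encoded into one byte per field per document.
public class DefaultNormDemo {

    // Norm factor applied to a field containing the given number of terms.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1));   // one-word "Cool!" post: 1.0
        System.out.println(lengthNorm(500)); // 500-word post: ~0.045
    }
}
```

So under the default norm a matching term in a one-word post is weighted roughly 22x higher than the same term in a 500-word post, which is exactly the skew described above.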
Thank you!
--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Exponential omitNorms
Posted by Dotan Cohen <do...@gmail.com>.
On Wed, Nov 7, 2012 at 5:16 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> You are probably thinking of SweetSpotSimilarity. You might also want to look at pivoted document normalization.
>
Thanks, I'll take a look at that.
--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Exponential omitNorms
Posted by Walter Underwood <wu...@wunderwood.org>.
You are probably thinking of SweetSpotSimilarity. You might also want to look at pivoted document normalization.
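For reference, SweetSpotSimilarity gives a flat norm of 1.0 to any document whose term count falls inside a configurable [min, max] "sweet spot" and penalizes lengths outside it, with a steepness knob. A standalone sketch of the formula from its documentation; the class name below is illustrative, not the Lucene API:

```java
// Sweet-spot length norm, per the SweetSpotSimilarity docs:
//   1 / sqrt( steepness * (|l - min| + |l - max| - (max - min)) + 1 )
// Inside [min, max] the parenthesized term is 0, so the norm is exactly 1.0;
// outside it, the penalty grows with distance from the plateau.
public class SweetSpotDemo {

    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        return (float) (1.0 / Math.sqrt(steepness
                * (Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min))
                + 1.0));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1, 1, 200, 0.5f));    // one-word post: 1.0
        System.out.println(lengthNorm(50, 1, 200, 0.5f));   // inside plateau: 1.0
        System.out.println(lengthNorm(1000, 1, 200, 0.5f)); // stuffed post: heavily penalized
    }
}
```

With min=1 this matches the requirement above: one-word posts are untouched, and only documents past the plateau get penalized. The same parameters map onto SweetSpotSimilarity's length-norm settings; check the javadocs of your Lucene/Solr version for the exact setter or factory parameter names.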
wunder
On Nov 7, 2012, at 4:22 AM, Dotan Cohen wrote:
> On Wed, Nov 7, 2012 at 2:01 PM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
>> The name escapes me but Lucene used to have a special Similarity impl for
>> this sort of stuff. I think it's still there.
>>
>> We implemented a slightly better Similarity that used Gaussian distribution
>> and was thus smoother. Try doing that.
>>
>> Otis
>
> Thanks, Otis. I'll start googling for Solr and Lucene Similarity.
>
> --
> Dotan Cohen
>
> http://gibberish.co.il
> http://what-is-what.com
Re: Exponential omitNorms
Posted by Dotan Cohen <do...@gmail.com>.
On Wed, Nov 7, 2012 at 2:01 PM, Otis Gospodnetic
<ot...@gmail.com> wrote:
> The name escapes me but Lucene used to have a special Similarity impl for
> this sort of stuff. I think it's still there.
>
> We implemented a slightly better Similarity that used Gaussian distribution
> and was thus smoother. Try doing that.
>
> Otis
Thanks, Otis. I'll start googling for Solr and Lucene Similarity.
--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Exponential omitNorms
Posted by Otis Gospodnetic <ot...@gmail.com>.
The name escapes me but Lucene used to have a special Similarity impl for
this sort of stuff. I think it's still there.
We implemented a slightly better Similarity that used Gaussian distribution
and was thus smoother. Try doing that.
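The thread does not say exactly what curve was used; one plausible reading is a one-sided Gaussian falloff, which is smoother than the sweet spot's piecewise-linear penalty. In the sketch below, mu (ideal length) and sigma (falloff width) are illustrative values, not anything from the thread:

```java
// Hypothetical smooth length norm: full weight up to an "ideal" length mu,
// then a Gaussian falloff controlled by sigma. This is an interpretation of
// the Gaussian approach mentioned above, not a known implementation.
public class GaussianNormDemo {

    static float lengthNorm(int numTerms, double mu, double sigma) {
        if (numTerms <= mu) {
            return 1.0f; // short posts keep full weight
        }
        double d = numTerms - mu;
        return (float) Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1, 100.0, 300.0));    // one-word post: 1.0
        System.out.println(lengthNorm(200, 100.0, 300.0));  // mild penalty
        System.out.println(lengthNorm(1000, 100.0, 300.0)); // keyword-stuffed: near zero
    }
}
```

Unlike the sweet spot's linear ramp, the penalty here starts gently just past mu and only becomes severe for extreme lengths, which is presumably what "smoother" refers to. In Lucene this would be wired in as the length-norm component of a custom Similarity subclass.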
Otis
--
Performance Monitoring - http://sematext.com/spm
On Nov 7, 2012 4:16 AM, "Dotan Cohen" <do...@gmail.com> wrote:
> Hi all! One area where I am applying Solr deals with variable-length
> posts by users: anything from one-word posts ("Cool!" with an
> attached photo) to blog-post length (500-1000 words). Due to Field
> Normalization, the short posts get the highest Solr score, while the
> long, informative posts are pushed to the end of the results.
> Therefore I am moving to remove Field Normalization with
> omitNorms=true. However, there do exist crafty users who stuff
> hundreds of irrelevant words together to get noticed. Therefore, I
> would like to try some sort of exponential (or even linear) Field
> Normalization where no normalization is performed on one-word
> documents but the longest documents (over a few hundred words) get
> some penalty.
>
> Are there any facilities in Solr for performing this? I would of
> course prefer query-time computation, as that lets me do other things
> with the data, but if only index-time computation is possible then I
> can accept that.
>
> Thank you!
>
> --
> Dotan Cohen
>
> http://gibberish.co.il
> http://what-is-what.com
>