Posted to solr-user@lucene.apache.org by Dotan Cohen <do...@gmail.com> on 2012/11/07 10:15:44 UTC
Exponential omitNorms
Hi all! One area where I am applying Solr deals with variable-length
posts by users: anything from one-word posts ("Cool!" with an
attached photo) to blog-post length (500-1000 words). Due to Field
Normalization, the short posts get the highest Solr score, while the
long, informative posts are pushed to the end of the results.
Therefore I am moving to remove Field Normalization with
omitNorms=true. However, there do exist crafty users who stuff
hundreds of irrelevant words together to get noticed. Therefore, I
would like to try some sort of exponential (or even linear) Field
Normalization where no normalization is performed on one-word
documents but the longest documents (over a few hundred words) get
some penalty.
Are there any facilities in Solr for performing this? I would of
course prefer query-time computation, as that lets me do other things
with the data, but if only index-time computation is possible then I
can accept that.
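For concreteness, the default length normalization behind this behavior is 1/sqrt(numTerms), as in Lucene's DefaultSimilarity (and it is further squashed into a single byte in the index, so precision is coarse). A standalone sketch of the resulting factors; the class and method names below are just for illustration, not the Lucene API:

```java
// Default Lucene-style length normalization: 1/sqrt(numTerms).
// Shown standalone; in Lucene this lives in DefaultSimilarity and the
// result is additionally encoded into one byte per field per document.
public class DefaultNormDemo {

    // Norm factor applied to a field containing the given number of terms.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1));   // one-word "Cool!" post: 1.0
        System.out.println(lengthNorm(500)); // 500-word post: ~0.045
    }
}
```

So under the default norm a matching term in a one-word post is weighted roughly 22x higher than the same term in a 500-word post, which is exactly the skew described above.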
Thank you!
--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Exponential omitNorms
Posted by Dotan Cohen <do...@gmail.com>.
On Wed, Nov 7, 2012 at 5:16 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> You are probably thinking of SweetSpotSimilarity. You might also want to look at pivoted document normalization.
>
Thanks, I'll take a look at that.
--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Exponential omitNorms
Posted by Walter Underwood <wu...@wunderwood.org>.
You are probably thinking of SweetSpotSimilarity. You might also want to look at pivoted document normalization.
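For reference, SweetSpotSimilarity gives a flat norm of 1.0 to any document whose term count falls inside a configurable [min, max] "sweet spot" and penalizes lengths outside it, with a steepness knob. A standalone sketch of the formula from its documentation; the class name below is illustrative, not the Lucene API:

```java
// Sweet-spot length norm, per the SweetSpotSimilarity docs:
//   1 / sqrt( steepness * (|l - min| + |l - max| - (max - min)) + 1 )
// Inside [min, max] the parenthesized term is 0, so the norm is exactly 1.0;
// outside it, the penalty grows with distance from the plateau.
public class SweetSpotDemo {

    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        return (float) (1.0 / Math.sqrt(steepness
                * (Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min))
                + 1.0));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1, 1, 200, 0.5f));    // one-word post: 1.0
        System.out.println(lengthNorm(50, 1, 200, 0.5f));   // inside plateau: 1.0
        System.out.println(lengthNorm(1000, 1, 200, 0.5f)); // stuffed post: heavily penalized
    }
}
```

With min=1 this matches the requirement above: one-word posts are untouched, and only documents past the plateau get penalized. The same parameters map onto SweetSpotSimilarity's length-norm settings; check the javadocs of your Lucene/Solr version for the exact setter or factory parameter names.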
wunder
On Nov 7, 2012, at 4:22 AM, Dotan Cohen wrote:
> On Wed, Nov 7, 2012 at 2:01 PM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
>> The name escapes me but Lucene used to have a special Similarity impl for
>> this sort of stuff. I think it's still there.
>>
>> We implemented a slightly better Similarity that used Gaussian distribution
>> and was thus smoother. Try doing that.
>>
>> Otis
>
> Thanks, Otis. I'll start googling for Solr and Lucene Similarity.
>
> --
> Dotan Cohen
>
> http://gibberish.co.il
> http://what-is-what.com
Re: Exponential omitNorms
Posted by Dotan Cohen <do...@gmail.com>.
On Wed, Nov 7, 2012 at 2:01 PM, Otis Gospodnetic
<ot...@gmail.com> wrote:
> The name escapes me but Lucene used to have a special Similarity impl for
> this sort of stuff. I think it's still there.
>
> We implemented a slightly better Similarity that used Gaussian distribution
> and was thus smoother. Try doing that.
>
> Otis
Thanks, Otis. I'll start googling for Solr and Lucene Similarity.
--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Exponential omitNorms
Posted by Otis Gospodnetic <ot...@gmail.com>.
The name escapes me but Lucene used to have a special Similarity impl for
this sort of stuff. I think it's still there.
We implemented a slightly better Similarity that used Gaussian distribution
and was thus smoother. Try doing that.
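The thread does not say exactly what curve was used; one plausible reading is a one-sided Gaussian falloff, which is smoother than the sweet spot's piecewise-linear penalty. In the sketch below, mu (ideal length) and sigma (falloff width) are illustrative values, not anything from the thread:

```java
// Hypothetical smooth length norm: full weight up to an "ideal" length mu,
// then a Gaussian falloff controlled by sigma. This is an interpretation of
// the Gaussian approach mentioned above, not a known implementation.
public class GaussianNormDemo {

    static float lengthNorm(int numTerms, double mu, double sigma) {
        if (numTerms <= mu) {
            return 1.0f; // short posts keep full weight
        }
        double d = numTerms - mu;
        return (float) Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1, 100.0, 300.0));    // one-word post: 1.0
        System.out.println(lengthNorm(200, 100.0, 300.0));  // mild penalty
        System.out.println(lengthNorm(1000, 100.0, 300.0)); // keyword-stuffed: near zero
    }
}
```

Unlike the sweet spot's linear ramp, the penalty here starts gently just past mu and only becomes severe for extreme lengths, which is presumably what "smoother" refers to. In Lucene this would be wired in as the length-norm component of a custom Similarity subclass.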
Otis
--
Performance Monitoring - http://sematext.com/spm
On Nov 7, 2012 4:16 AM, "Dotan Cohen" <do...@gmail.com> wrote:
> Hi all! One area where I am applying Solr deals with variable-length
> posts by users: anything from one-word posts ("Cool!" with an
> attached photo) to blog-post length (500-1000 words). Due to Field
> Normalization, the short posts get the highest Solr score, while the
> long, informative posts are pushed to the end of the results.
> Therefore I am moving to remove Field Normalization with
> omitNorms=true. However, there do exist crafty users who stuff
> hundreds of irrelevant words together to get noticed. Therefore, I
> would like to try some sort of exponential (or even linear) Field
> Normalization where no normalization is performed on one-word
> documents but the longest documents (over a few hundred words) get
> some penalty.
>
> Are there any facilities in Solr for performing this? I would of
> course prefer query-time computation, as that lets me do other things
> with the data, but if only index-time computation is possible then I
> can accept that.
>
> Thank you!
>
> --
> Dotan Cohen
>
> http://gibberish.co.il
> http://what-is-what.com
>