Posted to solr-user@lucene.apache.org by Mihran Shahinian <sl...@gmail.com> on 2015/03/16 16:26:40 UTC

Relevancy : Keyword stuffing

Hi all,
I have a use case where the data is generated by SEO-minded authors who,
more often than not, correctly guess the synonym expansions for the
document titles, skewing results in their favor.
At the moment I don't have an offline processing infrastructure to detect
these documents (and I can't punish them either; I just have to level the
playing field).
I am experimenting with taking the max of the term scores, cutting off
scores after a certain number of terms, etc., but would appreciate any
hints if anyone has experience dealing with a similar use case in Solr.

Much appreciated,
Mihran
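
The two experiments mentioned above (taking the max of the term scores, and cutting off scores after a certain number of terms) can be sketched outside Solr roughly as follows. This is only an illustration in Python, not Solr code: the function names and the per-term scores are hypothetical, and a real implementation would live in a custom query or similarity.

```python
from typing import Sequence


def sum_score(term_scores: Sequence[float]) -> float:
    """Baseline: every matching term (including each synonym) adds to the score."""
    return sum(term_scores)


def max_score(term_scores: Sequence[float]) -> float:
    """Only the best-matching term counts; extra synonym matches add nothing."""
    return max(term_scores, default=0.0)


def top_k_score(term_scores: Sequence[float], k: int) -> float:
    """Sum only the k highest term scores; matches beyond k are ignored."""
    return sum(sorted(term_scores, reverse=True)[:k])


# Hypothetical per-term scores for a query expanded into four synonyms.
honest_title = [2.0]                   # title matches one variant
stuffed_title = [2.0, 1.8, 1.7, 1.5]   # title matches every variant

for name, scores in (("honest", honest_title), ("stuffed", stuffed_title)):
    print(name, sum_score(scores), max_score(scores), top_k_score(scores, 2))
```

With the plain sum the stuffed title wins by a wide margin; with the max both titles tie, and the top-k variant sits in between.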

Re: Relevancy : Keyword stuffing

Posted by Mihran Shahinian <sl...@gmail.com>.
Thank you, Markus and Chris, for the pointers.
For SweetSpotSimilarity, I am thinking that a set of closed ranges exposed
via the similarity config might be easier to maintain as the data changes
than adjusting parameters to fit a function. Another piece of information
that would have been handy is the average position, plus the positions of
the first few occurrences, for each term. This would perhaps allow higher
boosting for term occurrences earlier in the doc. In my case the extra
keywords are towards the end of the doc, but that information does not seem
to be propagated into the scorer.
Thanks again,
Mihran
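
A rough sketch of the two ideas in this message, assuming nothing about Solr's actual similarity API: a tf factor looked up from closed ranges instead of a fitted curve, and a position-weighted tf in which early occurrences count more. The range table, the half-life parameter, and both function names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

# Hypothetical closed ranges: (first tf in the range, tf factor for that range).
TF_RANGES = [(1, 1.0), (3, 1.5), (6, 1.75)]  # tf >= 6 never earns more than 1.75


def ranged_tf(freq: int, ranges: Sequence[Tuple[int, float]] = TF_RANGES) -> float:
    """Step function: return the factor of the range that contains freq."""
    if freq <= 0:
        return 0.0
    starts = [start for start, _ in ranges]
    idx = bisect_right(starts, freq) - 1
    if idx < 0:  # freq lies below the first configured range
        return 0.0
    return ranges[idx][1]


def position_weighted_tf(positions: Sequence[int], half_life: float = 50.0) -> float:
    """Each occurrence counts 0.5 ** (position / half_life): early hits weigh more."""
    return sum(0.5 ** (pos / half_life) for pos in positions)


print(ranged_tf(2), ranged_tf(4), ranged_tf(40))
print(position_weighted_tf([0, 1, 2]), position_weighted_tf([200, 201, 202]))
```

The step table is easy to edit as the data changes, at the cost of jumps at the range boundaries; the position weighting needs per-occurrence positions, which, as noted above, do not seem to reach the scorer.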



On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter <ho...@fucit.org>
wrote:

>
> You should start by checking out the "SweetSpotSimilarity" .. it was
> heavily designed around the idea of dealing with things like excessively
> verbose titles, and keyword stuffing in summary text ... so you can
> configure your expectation for what a "normal" length doc is, and docs
> will be penalized for being longer than that.  Similarly you can say what
> a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
> (which in conjunction with the lengthNorm penalty penalizes docs that
> stuff keywords)
>
>
> https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
>
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
>
>
> -Hoss
> http://www.lucidworks.com/
>

Re: Relevancy : Keyword stuffing

Posted by Chris Hostetter <ho...@fucit.org>.
You should start by checking out the "SweetSpotSimilarity" .. it was
heavily designed around the idea of dealing with things like excessively
verbose titles, and keyword stuffing in summary text ... so you can
configure your expectation for what a "normal" length doc is, and docs
will be penalized for being longer than that.  Similarly you can say what
a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
(which in conjunction with the lengthNorm penalty penalizes docs that
stuff keywords)

https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html

https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg


-Hoss
http://www.lucidworks.com/
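
To make the shape of the two curves linked above concrete, here is a small Python sketch modelled on the length-norm and hyperbolic-tf formulas as I understand them from the Lucene javadoc. It is an approximation for experimenting with parameter values, not the library code; the parameter names and the example values are assumptions.

```python
import math


def length_norm(num_terms: int, lo: int, hi: int, steepness: float) -> float:
    """Plateau of 1.0 for lo <= num_terms <= hi, falling off on either side."""
    distance = abs(num_terms - lo) + abs(num_terms - hi) - (hi - lo)
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def hyperbolic_tf(freq: float, tf_min: float, tf_max: float,
                  base: float, x_offset: float) -> float:
    """S-shaped tf factor that levels off at tf_max."""
    if freq == 0:
        return 0.0
    # (base**x - base**-x) / (base**x + base**-x) equals tanh(x * ln(base)).
    x = freq - x_offset
    return tf_min + (tf_max - tf_min) / 2.0 * (math.tanh(x * math.log(base)) + 1.0)


# Assumed "normal" title length of 3 to 10 terms.
for n in (5, 12, 30):
    print("length", n, round(length_norm(n, lo=3, hi=10, steepness=0.5), 3))

# Assumed tf curve: the factor saturates near 2.0 however often a term repeats.
for tf in (1, 10, 20, 100):
    print("tf", tf, round(hyperbolic_tf(tf, 0.0, 2.0, base=1.3, x_offset=10.0), 3))
```

Docs inside the length plateau keep a norm of 1.0, longer ones are scaled down, and repeating a term past the saturation point earns essentially no extra tf credit.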

RE: Relevancy : Keyword stuffing

Posted by Markus Jelsma <ma...@openindex.io>.
Hello - setting (e)dismax's tie breaker to 0, or much lower than the default, would `solve` this for now.
Markus
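
For reference, the effect of the tie breaker can be sketched as below: a disjunction-max score is the best subscore plus the tie value times the sum of the remaining subscores. This is a plain Python illustration, not Solr code; the function name and the per-field scores are hypothetical.

```python
from typing import Sequence


def dismax_score(field_scores: Sequence[float], tie: float) -> float:
    """Best field score plus tie times the sum of the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical per-field scores for two documents.
plain_doc = [2.0, 0.0, 0.0]    # matches in one field only
stuffed_doc = [2.0, 1.5, 1.5]  # matches repeated across several fields

for tie in (0.0, 0.1, 0.3):
    print(tie, dismax_score(plain_doc, tie), dismax_score(stuffed_doc, tie))
```

With a tie of 0 only the best field counts, so the two documents score the same; any larger value lets the extra matches pull the stuffed document ahead.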