Posted to solr-user@lucene.apache.org by Mihran Shahinian <sl...@gmail.com> on 2015/03/16 16:26:40 UTC
Relevancy : Keyword stuffing
Hi all,
I have a use case where the data is generated by SEO-minded authors, and
more often than not they perfectly guess the synonym expansions for the
document titles, skewing results in their favor.
At the moment I don't have an offline processing infrastructure to detect
these (I can't punish these docs either... I just have to level the playing
field).
I am experimenting with taking the max of the term scores, cutting off
scores after a certain number of terms, etc., but would appreciate any hints
if anyone has experience dealing with a similar use case in Solr.
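As a rough illustration of the two experiments mentioned above (taking the max of the per-term scores, and cutting off after a fixed number of terms), here is a minimal Python sketch. The function names and the scores are made up for the example; this is not Solr code, only the arithmetic behind the idea.

```python
def sum_score(term_scores):
    """Baseline: every matching term (including synonym expansions) adds to the score."""
    return sum(term_scores)


def max_score(term_scores):
    """Only the best-matching term counts; extra synonym hits add nothing."""
    return max(term_scores, default=0.0)


def capped_score(term_scores, max_terms):
    """Sum only the max_terms highest-scoring terms and ignore the rest."""
    return sum(sorted(term_scores, reverse=True)[:max_terms])


# Hypothetical per-term scores for two titles matched by an expanded query.
stuffed = [2.0, 1.5, 1.5, 1.0, 1.0]  # title that hits every synonym
plain = [2.0, 1.5]                   # title that hits only two terms

print(sum_score(stuffed), sum_score(plain))
print(max_score(stuffed), max_score(plain))
print(capped_score(stuffed, 2), capped_score(plain, 2))
```

With plain summation the stuffed title wins purely on the number of matching terms; with the max or the capped variant the two titles tie, which is the "level the playing field" effect without explicitly punishing anyone.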
Much appreciated,
Mihran
Re: Relevancy : Keyword stuffing
Posted by Mihran Shahinian <sl...@gmail.com>.
Thank you, Markus and Chris, for the pointers.
For SweetSpotSimilarity, I am thinking that a set of closed ranges exposed
via the similarity config may be easier to maintain as the data changes than
adjusting parameters to fit a function. Another piece of info that would
have been handy is the average position, plus the positions of the first few
occurrences, for each term. This would perhaps allow higher boosting for
term occurrences earlier in the doc. In my case the extra keywords are
towards the end of the doc, but that info does not seem to be propagated
into the scorer.
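To make the position idea concrete, here is a purely hypothetical Python sketch of a position-weighted term frequency, where each occurrence counts for less the later it appears. The function name, the half_life parameter, and the decay shape are all assumptions for illustration; nothing here corresponds to an existing Solr or Lucene scorer hook.

```python
def position_weighted_tf(positions, half_life=50.0):
    """Sum one weight per occurrence; the weight halves every half_life token positions."""
    return sum(0.5 ** (pos / half_life) for pos in positions)


# Hypothetical token positions of one term in two documents.
early = [0, 3]            # term appears at the start of the doc
late = [200, 210, 220]    # stuffed keywords near the end of the doc

print(position_weighted_tf(early))
print(position_weighted_tf(late))
```

Under this weighting, two early occurrences outweigh three late ones, so keywords appended at the end of a doc contribute very little.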
Thanks again,
Mihran
On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter <ho...@fucit.org>
wrote:
> [...]
Re: Relevancy : Keyword stuffing
Posted by Chris Hostetter <ho...@fucit.org>.
You should start by checking out the "SweetSpotSimilarity" ... it was
heavily designed around the idea of dealing with things like excessively
verbose titles and keyword stuffing in summary text ... so you can
configure your expectation for what a "normal" length doc is, and docs
will be penalized for being longer than that. Similarly, you can say what
a "reasonable" tf is, and docs that exceed that won't get any added boost
(which, in conjunction with the lengthNorm penalty, penalizes docs that
stuff keywords).
https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
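For readers who want to see the shape of the two curves linked above without opening the plots, here is a small Python sketch of the length norm and the hyperbolic tf as I understand them from the Lucene SweetSpotSimilarity source. Treat it as an approximation: the Python function names are mine, the parameter values in the example are illustrative rather than recommended, and the Javadoc remains the reference.

```python
import math


def length_norm(num_terms, ln_min, ln_max, steepness):
    """1.0 inside [ln_min, ln_max], decaying the further num_terms falls outside."""
    distance = abs(num_terms - ln_min) + abs(num_terms - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def hyperbolic_tf(freq, tf_min, tf_max, base, xoffset):
    """S-shaped tf: starts near tf_min, hits the midpoint at xoffset, levels off at tf_max."""
    if freq == 0:
        return 0.0
    x = freq - xoffset
    # (base**x - base**-x) / (base**x + base**-x) equals tanh(x * ln(base));
    # tanh is used here to avoid overflow for large x.
    return tf_min + (tf_max - tf_min) / 2.0 * (math.tanh(x * math.log(base)) + 1.0)


# Illustrative settings: titles of 1 to 10 terms are "normal".
for n in (5, 10, 20, 40):
    print(n, round(length_norm(n, 1, 10, 0.5), 3))

# Illustrative settings: tf levels off towards 2.0 once freq is well past 10.
for f in (1, 10, 20, 100):
    print(f, round(hyperbolic_tf(f, 0.0, 2.0, 1.3, 10.0), 3))
```

The flat region of the length norm is the "sweet spot": docs inside it are treated equally, and docs outside it are scaled down. The plateau of the hyperbolic tf is what stops a term repeated 100 times from scoring meaningfully higher than one repeated 50 times.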
-Hoss
http://www.lucidworks.com/
RE: Relevancy : Keyword stuffing
Posted by Markus Jelsma <ma...@openindex.io>.
Hello - setting (e)dismax's tie breaker to 0, or much lower than the default, would `solve` this for now.
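A quick sketch of why the tie breaker matters, assuming the usual disjunction-max combination (best sub-score plus tie times the sum of the other sub-scores). The function name and the scores below are invented for the example; the real computation happens inside Lucene's DisjunctionMaxQuery.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie times the sum of the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical per-field scores for one query term.
stuffed = [2.0, 1.5, 1.0]  # term repeated across several fields
plain = [2.0, 0.0, 0.0]    # term present in a single field

for tie in (0.0, 0.5, 1.0):
    print(tie, dismax_score(stuffed, tie), dismax_score(plain, tie))
```

With tie set to 0 both docs score 2.0, since only the best field counts; as tie grows towards 1 the score approaches a plain sum and the doc matching in more fields pulls ahead.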
Markus
-----Original message-----
> From:Mihran Shahinian <sl...@gmail.com>
> Sent: Monday 16th March 2015 16:29
> To: solr-user@lucene.apache.org
> Subject: Relevancy : Keyword stuffing
> [...]