Posted to solr-user@lucene.apache.org by Pranav Prakash <pr...@gmail.com> on 2011/07/27 15:45:51 UTC

Dealing with keyword stuffing

I guess most of you have already handled keyword stuffing, and many of you
might still be handling it. Here is my scenario. We have a huge index
containing about 6m docs (not sure if that is huge :-)). Every document
contains title, description, tags, and content (textual data). People have
been doing keyword stuffing on the documents, so when someone searches for a
"query term", the first results are always the ones that are optimized.

So, instead of getting relevant results, people get spam content
(highly optimized, keyword-stuffed content) as the first few results. I have
tried a couple of things, like providing different boosts to different
fields, but almost everything seems to fail.

I'd like to know how you guys fixed this.

*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>

Re: Dealing with keyword stuffing

Posted by Pranav Prakash <pr...@gmail.com>.
Cool. I used SweetSpotSimilarity with default params and I see some
improvements. However, I can still see some of the 'stuffed' documents
coming up in the results, so I feel that SweetSpotSimilarity alone is not
enough. Going through
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figured out
that there are other things, pivoted length normalization and term
frequency normalization, that need fine tuning too.
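
For readers new to the term, here is a minimal sketch of pivoted length
normalization in its textbook form, written in plain Python rather than as a
Lucene Similarity. It is not necessarily the variant used in the linked
paper. The function names and the slope value of 0.25 are assumptions chosen
for illustration only.

```python
def pivoted_length_norm(doc_len: float, avg_len: float, slope: float = 0.25) -> float:
    """Textbook pivoted normalization factor.

    Equals 1.0 when doc_len == avg_len (the pivot) and grows linearly,
    at a rate set by `slope`, as the document gets longer.
    The default slope is an illustrative assumption, not a tuned value.
    """
    return (1.0 - slope) + slope * (doc_len / avg_len)


def normalized_tf(freq: float, doc_len: float, avg_len: float, slope: float = 0.25) -> float:
    """Term frequency divided by the pivoted length factor (hypothetical helper)."""
    return freq / pivoted_length_norm(doc_len, avg_len, slope)


if __name__ == "__main__":
    # Same raw term count, but the second document is four times the average length.
    print(normalized_tf(10, doc_len=100, avg_len=100))
    print(normalized_tf(10, doc_len=400, avg_len=100))
```

With the same raw count, the longer document ends up with the smaller
normalized frequency, which is the effect that matters when a stuffed
document is also unusually long.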

Should I create a custom Similarity class that overrides all the default
behavior? I guess that should help me get more relevant results. Where
should I start with it? Please do not assume that I know the less obvious
things, I am still learning! :-)

*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Thu, Jul 28, 2011 at 17:03, Gora Mohanty <go...@mimirtech.com> wrote:

> On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash <pr...@gmail.com> wrote:
> [...]
> > I am not sure how to use SweetSpotSimilarity. I am googling on this, but
> > any useful insights are so much appreciated.
>
> Replace the existing DefaultSimilarity class in schema.xml (look towards
> the bottom of the file) with the SweetSpotSimilarity class, e.g., have a
> line like:
>  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>
>
> Regards,
> Gora
>

Re: Dealing with keyword stuffing

Posted by Gora Mohanty <go...@mimirtech.com>.
On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash <pr...@gmail.com> wrote:
[...]
> I am not sure how to use SweetSpotSimilarity. I am googling on this, but
> any useful insights are so much appreciated.

Replace the existing DefaultSimilarity class in schema.xml (look towards
the bottom of the file) with the SweetSpotSimilarity class, which lives in
the org.apache.lucene.misc package, e.g., have a line like:
  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>

Regards,
Gora

Re: Dealing with keyword stuffing

Posted by Pranav Prakash <pr...@gmail.com>.
On Thu, Jul 28, 2011 at 08:31, Chris Hostetter <ho...@fucit.org> wrote:

>
> : Presumably, they are doing this by increasing tf (term frequency),
> : i.e., by repeating keywords multiple times. If so, you can use a custom
> : similarity class that caps term frequency, and/or ensures that the
> : scoring increases less than linearly with tf. Please see
>

In some cases, yes, they are repeating keywords multiple times. They are
also stuffing different combinations: Solr, Solr Lucene, Solr Search, Solr
Apache, Solr Guide.


>
> in particular, using something like SweetSpotSimilarity tuned to know what
> values make sense for "good" content in your domain can be useful because
> it can actually penalize documents that are too short/long or have term
> freqs that are outside of a reasonable expected range.
>

I am not a Solr expert, but I was thinking in this direction. The ratio of
tokens/total_length would be nearer to 1 for a stuffed document, while it
would be nearer to 0 for a bogus document. Somewhere between the two lie the
documents that are more likely to be meaningful. I am not sure how to use
SweetSpotSimilarity. I am googling this, but any useful insights would be
much appreciated.
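
As a rough illustration of that ratio, the sketch below computes a keyword
density in plain Python. It assumes "tokens/total_length" means the number of
occurrences of one keyword divided by the total number of tokens, which is
only one possible reading. The whitespace tokenizer and the function name are
hypothetical and stand in for whatever analyzer the index really uses.

```python
def keyword_density(text: str, keyword: str) -> float:
    """Fraction of whitespace-separated tokens equal to `keyword` (case-insensitive).

    Assumption: this stands in for the "tokens/total_length" ratio; a real
    setup would reuse the field's analyzer instead of str.split().
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


if __name__ == "__main__":
    stuffed = "solr solr solr solr guide"
    normal = "a short guide to configuring relevance in solr"
    print(keyword_density(stuffed, "Solr"))
    print(keyword_density(normal, "Solr"))
```

A density computed this way at index time could be stored in a field and
inspected offline to decide what range counts as "meaningful" for a given
collection.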

Re: Dealing with keyword stuffing

Posted by Chris Hostetter <ho...@fucit.org>.
: Presumably, they are doing this by increasing tf (term frequency),
: i.e., by repeating keywords multiple times. If so, you can use a custom
: similarity class that caps term frequency, and/or ensures that the scoring
: increases less than linearly with tf. Please see

in particular, using something like SweetSpotSimilarity tuned to know what
values make sense for "good" content in your domain can be useful because
it can actually penalize documents that are too short/long or have term
freqs that are outside of a reasonable expected range.
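
To make the "sweet spot" idea concrete, here is a small Python sketch of a
plateau-shaped length norm and a baseline tf. Both are modeled on my
understanding of the formulas in SweetSpotSimilarity; treat them as an
approximation and check the Javadoc for the authoritative definitions. The
parameter values used below (100, 1000, 0.5, 1.5, 5) are made up for
illustration and are not recommendations.

```python
import math


def sweet_spot_length_norm(num_terms: int, lo: int, hi: int, steepness: float) -> float:
    """Length norm that is 1.0 for lo <= num_terms <= hi and decays outside that range."""
    # Distance is 0 inside the plateau and twice the gap to the nearest edge outside it.
    distance = abs(num_terms - lo) + abs(num_terms - hi) - (hi - lo)
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def baseline_tf(freq: float, base: float, min_freq: float) -> float:
    """tf that stays flat at `base` up to `min_freq`, then grows like a shifted sqrt."""
    if freq == 0:
        return 0.0
    if freq <= min_freq:
        return base
    return math.sqrt(freq + base * base - min_freq)


if __name__ == "__main__":
    for n in (10, 100, 500, 1000, 5000):
        print(n, sweet_spot_length_norm(n, lo=100, hi=1000, steepness=0.5))
    for f in (0, 3, 9, 100):
        print(f, baseline_tf(f, base=1.5, min_freq=5))
```

Documents whose length falls inside the plateau are not penalized at all,
while very short or very long ones get a smaller norm. Where the plateau
should sit depends entirely on what "good" content looks like in the data.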

FWIW though: that's really just a generic answer to a generic question.
the better you understand your data, the better you can configure Solr for
it -- and that goes equally for the advice people can give you about how
to configure Solr.  you haven't given any information about the nature of
your data: the types of documents, the authoritative source, the fields
involved, where/how/when people edit this data, who is keyword spamming,
etc.; or how you want to use it: what types of queries you need to
support, what your users' objectives are, etc.  That makes it impossible
for anyone to suggest anything but the most general answer: "customize
your Similarity".

-Hoss

Re: Dealing with keyword stuffing

Posted by Gora Mohanty <go...@mimirtech.com>.
On Wed, Jul 27, 2011 at 7:15 PM, Pranav Prakash <pr...@gmail.com> wrote:
> I guess most of you have already handled keyword stuffing, and many of you
> might still be handling it. Here is my scenario. We have a huge index
> containing about 6m docs (not sure if that is huge :-)). Every document
> contains title, description, tags, and content (textual data). People have
> been doing keyword stuffing on the documents, so when someone searches for a
> "query term", the first results are always the ones that are optimized.
>
> So, instead of getting relevant results, people get spam content
> (highly optimized, keyword-stuffed content) as the first few results. I have
> tried a couple of things, like providing different boosts to different
> fields, but almost everything seems to fail.
[...]

Presumably, they are doing this by increasing tf (term frequency),
i.e., by repeating keywords multiple times. If so, you can use a custom
similarity class that caps term frequency, and/or ensures that the scoring
increases less than linearly with tf. Please see
http://wiki.apache.org/solr/SchemaXml#Similarity , and/or do a web
search for more details.
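
To show what "caps term frequency" and "less than linearly" can look like,
here is a minimal Python sketch of three tf curves. The square-root curve
matches what Lucene's DefaultSimilarity uses for tf. The capped and
logarithmic variants, their names, and the cap of 5 are hypothetical choices
for illustration, not something taken from Solr or Lucene.

```python
import math


def default_tf(freq: float) -> float:
    """Square-root damping, the shape of DefaultSimilarity's tf."""
    return math.sqrt(freq)


def capped_tf(freq: float, cap: float = 5.0) -> float:
    """Hypothetical variant: occurrences beyond `cap` add nothing to the score."""
    return math.sqrt(min(freq, cap))


def log_tf(freq: float) -> float:
    """Hypothetical variant: logarithmic growth, much flatter than sqrt for large freq."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


if __name__ == "__main__":
    for freq in (1, 5, 50, 500):
        print(freq, default_tf(freq), capped_tf(freq), log_tf(freq))
```

In Lucene itself, the equivalent logic would go into a Similarity subclass
that overrides the tf method, and that class would then be named in the
similarity element of schema.xml.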

Regards,
Gora