Posted to solr-user@lucene.apache.org by GW <th...@gmail.com> on 2016/04/23 16:02:12 UTC

need help with keyword spamming

Hey all,

I'm just finishing up a project and I'm hoping for some direction on
dealing with keyword spamming.

I don't have any urgent issues, but I can foresee some bumps in the road.

I'm using a custom spider that pulls inventory data from several dozen
sources into a single doc schema. 1 record per item per location.

Data from several sources have an existing keyword field. Some records
coming in have empty or null data for keywords.

I concatenated my category and keyword data into the keyword field so I
would not have any empty keyword data to satisfy a query builder.

I have a recommended keyword list I could use to count hits before I index.
It's a painful thought.
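
For illustration, a pre-index check of this kind might look like the
sketch below. It is only a rough outline: the function names, the
comma/whitespace splitting, and the two thresholds are assumptions made
up for the example, not part of the spider or of Solr.

```python
def keyword_report(keywords, recommended):
    """Return (hits, total, ratio) for one record's keyword field.

    keywords:    raw keyword string from a source feed (assumed to be
                 comma and/or whitespace separated)
    recommended: set of lowercase terms from the recommended keyword list
    """
    # Normalise: split on commas and whitespace, lowercase, drop duplicates.
    terms = {t.lower() for t in keywords.replace(",", " ").split()}
    hits = sum(1 for t in terms if t in recommended)
    total = len(terms)
    ratio = hits / total if total else 0.0
    return hits, total, ratio


def looks_stuffed(keywords, recommended, max_terms=25, min_ratio=0.5):
    """Flag a record with many distinct keywords, few of them recommended.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    hits, total, ratio = keyword_report(keywords, recommended)
    return total > max_terms and ratio < min_ratio
```

Flagged records could be held back for review rather than rejected
outright, since a low ratio may also just mean the recommended list is
incomplete.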

I want to be able to detect people that are trying to do keyword spamming.

So my question is: Is there some kind of FM that I'm not aware of?

Thanks in advance,

GW

Re: need help with keyword spamming

Posted by Erick Erickson <er...@gmail.com>.
The problem here is defining "irrelevant". There's nothing in Solr
that can magically determine "this term is irrelevant in this doc, but
this other one isn't".

Best,
Erick

Re: need help with keyword spamming

Posted by GW <th...@gmail.com>.
No. My project is retail based. I mean people putting in a slew of
irrelevant keywords in addition to relevant keywords in an attempt to get
hits on searches and hits outside of context.

I used a filter factory to remove duplicates.

Re: need help with keyword spamming

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
By keyword spamming, do you mean stuffing the same term over and over to
game term frequency?

If so, you might want to try tuning BM25 similarity for your needs. It has a
saturation point for term frequency.

http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
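
To make the saturation point concrete, here is a small sketch of the
term-frequency part of the BM25 formula for a document of average
length. The function name is made up for the example; k1=1.2 is the
default in Lucene's BM25Similarity. The score can never exceed k1 + 1,
so repeating a term 100 times buys very little over repeating it 5
times, and lowering k1 makes the curve flatten sooner.

```python
def bm25_tf(tf, k1=1.2):
    """Term-frequency component of BM25, assuming an average-length document."""
    return tf * (k1 + 1) / (tf + k1)


# Print how the contribution flattens as the raw frequency grows.
for tf in (1, 2, 5, 20, 100):
    print(tf, round(bm25_tf(tf), 3))
```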

You can also write your own similarity that sets a max for term frequency.
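
The idea, sketched in Python rather than as an actual Lucene Similarity
subclass: Lucene's classic TF-IDF similarity uses the square root of the
raw frequency as its tf factor, and a custom similarity could clamp the
frequency before doing so. The function name and the cap of 3 are
arbitrary choices for illustration.

```python
import math


def capped_tf(freq, max_freq=3):
    """Classic-style tf (square root of frequency) with a hard ceiling.

    Any occurrence beyond max_freq adds nothing to the score.
    """
    return math.sqrt(min(freq, max_freq))
```

With this cap, a field that repeats a term 50 times scores the same for
that term as one that repeats it 3 times.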

I'd also consider figuring out if you can build a PageRank-like measure
that can signal content trustworthiness. Spammer sites won't be linked to
very heavily by trusted sites.
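
As a rough sketch of what such a measure could look like, the following
is a plain power-iteration PageRank over a small link graph. It assumes
you have some notion of "source A links to or vouches for source B",
which may or may not exist for retail inventory feeds; the graph shape,
the damping factor of 0.85, and the iteration count are all assumptions
for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping a node to the list of nodes it links to.
    Returns a dict of node -> rank; the ranks sum to 1.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for m in nodes:
                    new[m] += share
        rank = new
    return rank
```

A node that links out but is linked to by nobody ends up with the
minimum rank, which is the property that makes the measure useful as a
trust signal.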

If you just mean spamming like lots of unique keywords, length
normalization was built just for this reason: to bias relevance toward less
verbose and more specific matches.
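
A small sketch of how that plays out in BM25, where the b parameter
controls how strongly field length is penalised. The function name is
hypothetical; k1=1.2 and b=0.75 are the defaults in Lucene's
BM25Similarity. The same single match scores lower in a keyword field
that is ten times longer than average than in an average-length one.

```python
def bm25_tf_norm(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component including length normalization.

    doc_len: number of terms in this document's field
    avg_len: average number of terms in that field across the index
    b=0 switches length normalization off, b=1 applies it fully.
    """
    norm = 1 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1) / (tf + k1 * norm)


focused = bm25_tf_norm(1, doc_len=20, avg_len=20)   # average-length field
stuffed = bm25_tf_norm(1, doc_len=200, avg_len=20)  # ten times longer
print(focused, stuffed)
```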

Hope that helps

Doug