Posted to java-user@lucene.apache.org by ms...@aol.com on 2005/10/23 16:47:09 UTC

Classifier4J and Lucene

Hey-
 
I have an indexer at my company that I wrote a while back that indexes database content (users and their profiles)...one of the next requirements of the project is to avoid 'spam' in hits. For example, if I do a search for oracle, and oracle is in 25 places in someone's bio field...and another person has it in one place in his company field, the profile with 25 places will of course score higher. Unfortunately, people who know the system know that the more often you have certain keywords in your user profile, the higher you will be on the list. I was thinking I can do one of two things:
 
1. Work with the Lucene scoring algorithm to lower scores in certain fields (and boost others)...this would work, but the boost seems to have such a small effect on scoring that in some cases it won't matter (if I boost company to 2.0 and bio to 1.0, a profile with xxx hits in bio can still come first in score).
 
2. Use Classifier4J (http://classifier4j.sourceforge.net/)...I can use the same idea as a mail filter and use the Bayesian classifier to train it that certain words would be spam...then just index the summary. Throwing this out there...not even sure that it will work...
 
Not sure if this makes sense...but curious if anyone has ideas, or has done something like this.
 
Regards!
-Joe
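
A back-of-the-envelope sketch of the effect described in option 1 above. It assumes Lucene's default term-frequency factor of sqrt(freq) and ignores every other part of the score (idf, length normalization, query normalization), all of which also matter in a real index. The class and method names are made up for illustration.

```java
public class BoostVersusFrequency {

    // Simplified per-field contribution: boost * sqrt(freq).
    // Assumption: only the boost and the default tf factor are modeled here;
    // a real Lucene score has more terms than this.
    static double simplifiedScore(double boost, int freq) {
        return boost * Math.sqrt(freq);
    }

    public static void main(String[] args) {
        double company = simplifiedScore(2.0, 1);   // one hit in the boosted company field
        double bio = simplifiedScore(1.0, 25);      // 25 hits in the unboosted bio field
        System.out.println("company=" + company + " bio=" + bio);
        // prints "company=2.0 bio=5.0"
    }
}
```

Under these assumptions the company boost would have to exceed 5.0 before the single company hit outranks the 25 bio hits, which is one way a 2.0 boost can appear to have almost no effect.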

Re: Classifier4J and Lucene

Posted by ms...@aol.com.

Interesting information you have here...I will look into this and let you know what I come up with.

Thanks! 

-----Original Message-----
From: Chris Hostetter <ho...@fucit.org>
To: java-user@lucene.apache.org
Sent: Sun, 23 Oct 2005 10:14:13 -0700 (PDT)
Subject: Re: Classifier4J and Lucene

: Not sure if this makes sense...but curious if anyone has ideas, or has
: done something like this.

I have a few ideas, none of which are mutually exclusive...

1) look at the Explain output for the various queries you are generating
to help you understand why your boosts aren't having as much of an effect
as you want.

2) subclass DefaultSimilarity and override the lengthNorm method with a
new one, which *heavily* penalizes really long field values.  this method
gets the name of the field when asked to perform a calculation, so you can
use this special behavior just on fields that users have the ability to
keyword SPAM if you want.

3) subclass DefaultSimilarity and override the tf(float) method ... this
allows you to specify how much of an impact the frequency of any Term has
on the overall score.  Usually, high frequency items are given a high score
... but if you are dealing with records which are typically very small,
you may want to penalize docs with a high frequency.  at the very least,
you might want to flatten the curve.  if you really want to flatten it so
that spamming does no good at all, you can use something like this...

        // any match counts exactly once, no matter how often the term occurs
        public float tf(float freq) {
            if (freq > 0.0f) return 1.0f;
            else return 0.0f;
        }

...but that may be overkill, experimentation should help you find a
happy medium.

-Hoss
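
A standalone sketch of the idea in point 2, written without any Lucene imports so it can be run on its own. It assumes the default length norm is 1/sqrt(numTerms); the "bio" field name, the 50-term threshold and the 1/numTerms falloff are hypothetical choices for illustration, not anything prescribed above. In a real index the same logic would go in a DefaultSimilarity subclass that overrides lengthNorm.

```java
public class LengthNormSketch {

    // Hypothetical: the one field users are able to stuff with keywords.
    static final String SPAMMABLE_FIELD = "bio";

    // Hypothetical: field lengths up to this many terms get the default norm.
    static final int FREE_TERMS = 50;

    // Same shape as a lengthNorm(fieldName, numTerms) override would have.
    static float lengthNorm(String fieldName, int numTerms) {
        // Assumed default behavior: 1 / sqrt(numTerms).
        float defaultNorm = (float) (1.0 / Math.sqrt(numTerms));
        if (!SPAMMABLE_FIELD.equals(fieldName) || numTerms <= FREE_TERMS) {
            return defaultNorm;
        }
        // Past the threshold, fall off as 1/numTerms instead of 1/sqrt(numTerms).
        // The sqrt(FREE_TERMS) factor keeps the curve continuous at the threshold.
        return (float) (Math.sqrt(FREE_TERMS) / numTerms);
    }

    public static void main(String[] args) {
        System.out.println("company, 200 terms: " + lengthNorm("company", 200));
        System.out.println("bio,     200 terms: " + lengthNorm("bio", 200));
    }
}
```

Length norms are computed when documents are indexed, so a change like this only takes effect after reindexing with the custom Similarity set on the writer.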

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
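
One possible middle ground between the default tf curve and the all-or-nothing version in point 3, again as a standalone sketch with no Lucene imports. It assumes the default tf is sqrt(freq); the cap of 4 counted occurrences is a hypothetical value to tune by experiment. In a real index this would be the body of a tf(float) override in a DefaultSimilarity subclass.

```java
public class TfDampingSketch {

    // Hypothetical cap: occurrences beyond this add nothing to the score.
    static final float MAX_COUNTED = 4.0f;

    // Same shape as a tf(float freq) override would have.
    static float tf(float freq) {
        if (freq <= 0.0f) {
            return 0.0f;
        }
        // Assumed default curve sqrt(freq), applied to the capped frequency.
        return (float) Math.sqrt(Math.min(freq, MAX_COUNTED));
    }

    public static void main(String[] args) {
        System.out.println("tf(1)="  + tf(1.0f));   // 1.0
        System.out.println("tf(4)="  + tf(4.0f));   // 2.0
        System.out.println("tf(25)=" + tf(25.0f));  // 2.0, same as 4 occurrences
    }
}
```

With this curve a few genuine repeats still help a little, but repeating a keyword 25 times scores no better than repeating it 4 times.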

Re: Classifier4J and Lucene

Posted by Jeff Rodenburg <je...@gmail.com>.

Sounds like you might have to consider both, if the first one doesn't solve
your issue. A company field sounds like it's a single entry, i.e. one that
can't be "spammed up" with repeated terms, e.g. "Oracle Oracle Oracle". It
also sounds as if you're searching multiple fields, and that some fields are
more important than others.

It sounds like there are expectations about what documents rise to the top
for a given search, so I would suggest starting by getting your boost
prioritization in order by working with a "clean" or non-spammed index.
After that, bring in the spammed index and go from there. You're right, you
won't be able to boost away the spammers.

I don't have much background with Classifier4J, but it seems that words
would need to be considered spam differently across different fields, if I
understand your indexing/querying structure. I like the approach of indexing
a boiled-down summary, though I'm not sure whether Classifier4J would have
you doing a lot of work.

Hope this helps.

-- jr
