You are viewing a plain text version of this content. The canonical link for it is here.

Posted to server-dev@james.apache.org by Vincenzo Gianferrari Pini <vi...@praxis.it> on 2006/09/17 13:44:19 UTC

Implementing a "chi-square-based spam filter" - asking Gary Robinson's advice

Dear Gary,

at the Apache James Server Project (http://james.apache.org) we have 
implemented a (java based) bayesian anti-spam filter following Paul 
Graham's approach (both the original one - 
http://paulgraham.com/spam.html - and the enhanced one - 
http://paulgraham.com/better.html). It is available in the new 2.3.0 
release that we are releasing these days.

We would like, for the next release, to implement the "chi-square-based 
spam filter" approach described in your "Handling Redundancy in Email 
Token Probabilities" paper 
(http://garyrob.blogs.com//handlingtokenredundancy94.pdf). But for doing 
that we need to understand a few points: can you help and advice us?

I'm CCing this email to the server-dev@james.apache.org list: can you 
reply to it in your answer?

Here follow our questions. I will explicitly refer to the terminology 
and formula numbers used by you in your above mentioned paper.

   1. Based on Paul Graham's approach, in computing b(w) and g(w) we use
      in the numerator of the formulas "the total count of occurrences
      of word w in the spam (ham) e-mails" instead of "the number of
      spam (ham) e-mails containing the word w" as you do. Paul's
      counters are persisted on disk, and there are already some users
      that have extensively trained their systems building their own
      "corpuses". It would be a pity not to be able to use such
      collected data when using your approach (we would like to simply
      add a configuration switch to our filters that optionally - or by
      default - activates your approach).
      In your blog
      (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
      I found the following comment from you:

          "Note 2: In calculating p(w) Graham counts every instance of
          word w in every email in which it appears. Obviously, if a
          word appears once in an email, there is a greater probability
          that it will appear again in that same email than if it hadn't
          already appeared at all. So, the random variable is not really
          independent under Graham's technique which is one reason why,
          in the description above, we only count the first occurrence.
          However, we are pragmatic and whatever works best in practice
          is what we should do. There is some evidence at this point
          that using the Graham counting technique leads to slightly
          better results than using the "pure" technique above. This may
          because it is not ignoring any of the data. So, p(w) and n
          should simply be computed the way that gives the best results."

      Looks quite clear to me; can you then confirm us that we can use
      Paul's counters in computing b(w) and g(w), and that you "endorse"
      it as leading "to slightly better results" than the technique
      mentioned in your paper?

   2. In computing f(w) in formula (1), what do you suggest to use for
      the values of "s" and "x"? We will let them be configuration
      parameters as others, but we should use a sound default.

   3. In computing "H" and "S" in formulas (2) and (3), which of the two
      definitions of the "inverse chi-square function" and related cdf
      (invC( ) below) should we use among the two definitions that I
      found for example in
      http://en.wikipedia.org/wiki/Inverse-chi-square_distribution?

   4. Still in computing "H" and "S", how many degrees of freedom "n"
      should we use? I would assume *one*, being two values (ham and
      spam) minus one constraint, as their probabilities must sum up to
      one (in this case question 3 above would be pointless). But I'n
      not sure at all: can you either confirm or give me a hint?

   5. As we have already available a java routine that computes the
      chi-square cdf C(X,n), I found out that to compute the invC(X,n)
      we could use one of the following formulas (depending on the
      outcome of question 3 above). Would you be able to help me
      confirming it?:
      invC(X,n) = 1 - C(1/X,n)
      invC(X,n) = 1 - C(n/X,n)

   6. What do you suggest for a practical default "cutoff" value for "I"
      and as default ESF value "y" in formula (6) for "H", and which "y"
      for "S"? And which default for the "exclusion radius"?

I hope that you do not find those questions as being too many :-) .

I'm looking forward for your answer .

If you are interested we will keep you informed about our future progress.

Thank you.

Vincenzo Gianferrari Pini


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: Implementing a "chi-square-based spam filter" - asking Gary Robinson's advice

Posted by Vincenzo Gianferrari Pini <vi...@praxis.it>.

Hi Gary,

thank you for your prompt response.

The James mail server *has already* many more anti-spam related features 
other than the statistical filter: blacklisting, whitelisting, 
server-side SMIME signature support and already working but yet to be 
released (in the next release) greylisting, spf, surbl and others. The 
just released Bayesian statistical filter that has full "Paul Graham" 
support, and that we want to extend to the "chi technique"  was written 
by me  more than 3 years ago and has been used successfully by several 
people .

Moreover, anti-spam suport is very central to our goals, and we are 
convinced that we can do good things in this area within James, as it is 
a *very* flexible and extensible system.

The willingness to implement the "chi technique" came after discussing 
about interfacing with external filters, and we decided to implement our 
own solution, as it gives us more future control and flexibility.
See: http://issues.apache.org/jira/browse/JAMES-514
and the following thread:
http://www.mail-archive.com/server-dev@james.apache.org/msg09608.html

That's why we will keep waiting for your help, when you will have time 
to :-) .

In the meantime thank you for your links, that I will read right away.

Vincenzo

Gary Robinson wrote:

>Hi Vincenzo -- 
>
>Let me ask you something. Most top spam filters have more stuff  in 
>them the the statistical filter. Would it be practical for you to just 
>copy the code of another open-source filter, either translating it into 
>Java or using directly? But SpamBayes is excellent and it's written in 
>Python -- technically, you should be able to run its engine directly in 
>Java using Jython, which translates python to Java bytecodes on the 
>fly. Bogofilter is also excellent -- it's written in C so you should be 
>able use its engine from Java. SpamAssassin is also excellent, though 
>I'm not sure what language  it's written in. They've all done very well 
>in head-to-head competitions and/or won awards. (It also happens that 
>they all use the chi technique.)
>
>Bogofilter is the only one currently using the "handling redundancy" 
>extension to the original technique. (See 
>http://www.bgl.nu/bogofilter/esf.html for more info.) I think it's only 
>used optionally though; I think the default is the original technique 
>described here in my  Linux Journal article: 
>http://www.linuxjournal.com/article/6467. It increases filtering 
>accuracy very significantly as you can see from that bogofilter link, 
>but does take a fair amount of cpu time to find the optimal parms, and 
>spending that time is crucial to make it work.
>
>I've been thinking about ways  to speed it up, but I haven't had time 
>to implement or test them. The person who implemented the approach 
>described in the paper, Greg Louis, is unable to spend time on 
>Bogofilter any more. If you're interested in going a bit farther than 
>other filters have with this, let me know, and I'd be happy to write it 
>up for you, although I won't have time for a couple of weeks.
>
>If you don't want to do that, which would be perfectly understandable, 
>I don't think you could do wrong by linking to the bogofilter C code, 
>which is fast, of use SpamBayes' Python code via Jython. (One caveat is 
>that I personally can't advise on whether there are difficulties in 
>breaking the filtering engine out from the interface code in those 
>projects.) I can understand, of course, that there may be a need for 
>pure Java source, in which case you'd have to translate or, as you're 
>already planning, write your own. 
>
>One advantage of using a pre-existing project is that most projects 
>evolve over time to match the evolution of spammer tactics, which do 
>change significantly over time. I can imagine that it might be too 
>distracting for a project that isn't focused on spam filtering to 
>really try to keep up, and thus it may be hard to stay competitive with 
>the dedicated projects.
>
>Anyway, those are thoughts that come to mind. I'll respond to your 
>specific question as soon as I can -- I've got a heavy meeting and 
>deadline schedule next week. Also, I'm not sure if you saw the Linux 
>Journal article mentioned above -- if you haven't you might want to 
>look at it. If so let me know if it resolves any of your questions.
>
>Gary
>
>
>
>
> 
>
>Gary Robinson
>CTO
>Emergent Music, LLC
>grobinson@goombah.com
>207-942-3463
>Company: http://www.goombah.com
>Blog:    http://www.garyrobinson.net
>
>  
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: Implementing a "chi-square-based spam filter" - asking Gary Robinson's advice

Posted by Vincenzo Gianferrari Pini <vi...@praxis.it>.

Hi Gary,

I had a look at your suggested links, that answered to some of my 
questions: see below.

Gary Robinson wrote:

>Hi Vincenzo -- 
>
>  
>
<snip/>

>Bogofilter is the only one currently using the "handling redundancy" 
>extension to the original technique. (See 
>http://www.bgl.nu/bogofilter/esf.html for more info.) I think it's only 
>used optionally though; I think the default is the original technique 
>described here in my  Linux Journal article: 
>http://www.linuxjournal.com/article/6467. It increases filtering 
>accuracy very significantly as you can see from that bogofilter link, 
>but does take a fair amount of cpu time to find the optimal parms, and 
>spending that time is crucial to make it work.
>
>  
>
<snip/>

>Anyway, those are thoughts that come to mind. I'll respond to your 
>specific question as soon as I can -- I've got a heavy meeting and 
>deadline schedule next week. Also, I'm not sure if you saw the Linux 
>Journal article mentioned above -- if you haven't you might want to 
>look at it. If so let me know if it resolves any of your questions.
>
>Gary
>
>
>
>
> 
>
>Gary Robinson
>CTO
>Emergent Music, LLC
>grobinson@goombah.com
>207-942-3463
>Company: http://www.goombah.com
>Blog:    http://www.garyrobinson.net
>
>On Sun, 17 Sep 2006 13:44:19 +0200, Vincenzo Gianferrari Pini wrote:
>  
>
>>Dear Gary,
>>
>>at the Apache James Server Project (http://james.apache.org) we have 
>>implemented a (java based) bayesian anti-spam filter following Paul 
>>Graham's approach (both the original one - 
>>http://paulgraham.com/spam.html - and the enhanced one - 
>>http://paulgraham.com/better.html). It is available in the new 2.3.0 
>>release that we are releasing these days.
>>
>>We would like, for the next release, to implement the 
>>"chi-square-based spam filter" approach described in your "Handling 
>>Redundancy in Email Token Probabilities" paper 
>>(http://garyrob.blogs.com//handlingtokenredundancy94.pdf). But for 
>>doing that we need to understand a few points: can you help and 
>>advice us?
>>
>>I'm CCing this email to the server-dev@james.apache.org list: can you 
>>reply to it in your answer?
>>
>>Here follow our questions. I will explicitly refer to the terminology 
>>and formula numbers used by you in your above mentioned paper.
>>
>>  1. Based on Paul Graham's approach, in computing b(w) and g(w) we use
>>     in the numerator of the formulas "the total count of occurrences
>>     of word w in the spam (ham) e-mails" instead of "the number of
>>     spam (ham) e-mails containing the word w" as you do. Paul's
>>     counters are persisted on disk, and there are already some users
>>     that have extensively trained their systems building their own
>>     "corpuses". It would be a pity not to be able to use such
>>     collected data when using your approach (we would like to simply
>>     add a configuration switch to our filters that optionally - or by
>>     default - activates your approach).
>>     In your blog
>>     
>>(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
>>     I found the following comment from you:
>>
>>         "Note 2: In calculating p(w) Graham counts every instance of
>>         word w in every email in which it appears. Obviously, if a
>>         word appears once in an email, there is a greater probability
>>         that it will appear again in that same email than if it hadn't
>>         already appeared at all. So, the random variable is not really
>>         independent under Graham's technique which is one reason why,
>>         in the description above, we only count the first occurrence.
>>         However, we are pragmatic and whatever works best in practice
>>         is what we should do. There is some evidence at this point
>>         that using the Graham counting technique leads to slightly
>>         better results than using the "pure" technique above. This may
>>         because it is not ignoring any of the data. So, p(w) and n
>>         should simply be computed the way that gives the best results."
>>
>>     Looks quite clear to me; can you then confirm us that we can use
>>     Paul's counters in computing b(w) and g(w), and that you "endorse"
>>     it as leading "to slightly better results" than the technique
>>     mentioned in your paper?
>>
>>  2. In computing f(w) in formula (1), what do you suggest to use for
>>     the values of "s" and "x"? We will let them be configuration
>>     parameters as others, but we should use a sound default.
>>    
>>
For "x" a good starting point might be 0.5, or the average p(w) for 
words in the training set that have counts of 10 or more.  For "s" see 
question 6 below.

>>  3. In computing "H" and "S" in formulas (2) and (3), which of the two
>>     definitions of the "inverse chi-square function" and related cdf
>>     (invC( ) below) should we use among the two definitions that I
>>     found for example in
>>     http://en.wikipedia.org/wiki/Inverse-chi-square_distribution?
>>    
>>
In your "A Python Implementation of the Inverse Chi-Square Function" 
(http://www.linuxjournal.com/articles/lj/0107/6467/6467s2.html) you show 
for such "inverse" an algorithm that is simply equivalent to "one minus 
the traditional chi-square cumulative distribution function":
 invC(X,n) = 1.0 - C(X,n)

that simply is P(X < Chi2) = 1.0 - P(Chi2 <= X)

If this is the case, and I understood well, IMHO you should not use in 
your papers the term "Inverse Chi-Square Function", but simply use the 
term "Chi-Square Function" defining it as "P(Chi2 <= X)", and use "1.0 - 
C(...)" in your formulas, as the word "inverse" has different meanings 
and can be misleading (at least it was for me).
For example see the wikipedia definition above, and notice that in 
Microsoft Excel the function CHIDIST(X,n)  is equivalent to our 
invC(X,n) above, and Excel's CHIINV(P,n) is quite different, meaning
X = CHIINV(CHIDIST(X,n),n)

>>  4. Still in computing "H" and "S", how many degrees of freedom "n"
>>     should we use? I would assume *one*, being two values (ham and
>>     spam) minus one constraint, as their probabilities must sum up to
>>     one (in this case question 3 above would be pointless). But I'n
>>     not sure at all: can you either confirm or give me a hint?
>>    
>>
"n" is "the number of unique tokens in the message" (I would add "that 
fall outside the exclusion radius (called 'mindev' by Greg)").

>>  5. As we have already available a java routine that computes the
>>     chi-square cdf C(X,n), I found out that to compute the invC(X,n)
>>     we could use one of the following formulas (depending on the
>>     outcome of question 3 above). Would you be able to help me
>>     confirming it?:
>>     invC(X,n) = 1 - C(1/X,n)
>>     invC(X,n) = 1 - C(n/X,n)
>>    
>>
This question disappears after the answer to question 3 above.

>>  6. What do you suggest for a practical default "cutoff" value for "I"
>>     and as default ESF value "y" in formula (6) for "H", and which "y"
>>     for "S"? And which default for the "exclusion radius"?
>>    
>>
Greg comes out with some good suggested values (including for "s"), 
though they show that they depend quite a lot on the training set used, 
and this disturbs me a little bit. I would have expected to find some 
invariance based on a common perception of "spamminess".

In addition to that, IMHO we should use an "S < 0.001" or less 
constraint to minimize as much as possible false positives. I think that 
the possibility  to assess the risk of false positives is the most 
important thing in all your chi-square approach.

>>I hope that you do not find those questions as being too many :-) .
>>
>>I'm looking forward for your answer .
>>
>>If you are interested we will keep you informed about our future progress.
>>
>>Thank you.
>>
>>Vincenzo Gianferrari Pini
>>
>>    
>>
What do you think about the above considerations?

Thanks,

Vincenzo


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: Implementing a "chi-square-based spam filter" - asking Gary Robinson's advice

Posted by Gary Robinson <gr...@goombah.com>.

Hi Vincenzo -- 

Let me ask you something. Most top spam filters have more stuff  in 
them the the statistical filter. Would it be practical for you to just 
copy the code of another open-source filter, either translating it into 
Java or using directly? But SpamBayes is excellent and it's written in 
Python -- technically, you should be able to run its engine directly in 
Java using Jython, which translates python to Java bytecodes on the 
fly. Bogofilter is also excellent -- it's written in C so you should be 
able use its engine from Java. SpamAssassin is also excellent, though 
I'm not sure what language  it's written in. They've all done very well 
in head-to-head competitions and/or won awards. (It also happens that 
they all use the chi technique.)

Bogofilter is the only one currently using the "handling redundancy" 
extension to the original technique. (See 
http://www.bgl.nu/bogofilter/esf.html for more info.) I think it's only 
used optionally though; I think the default is the original technique 
described here in my  Linux Journal article: 
http://www.linuxjournal.com/article/6467. It increases filtering 
accuracy very significantly as you can see from that bogofilter link, 
but does take a fair amount of cpu time to find the optimal parms, and 
spending that time is crucial to make it work.

I've been thinking about ways  to speed it up, but I haven't had time 
to implement or test them. The person who implemented the approach 
described in the paper, Greg Louis, is unable to spend time on 
Bogofilter any more. If you're interested in going a bit farther than 
other filters have with this, let me know, and I'd be happy to write it 
up for you, although I won't have time for a couple of weeks.

If you don't want to do that, which would be perfectly understandable, 
I don't think you could do wrong by linking to the bogofilter C code, 
which is fast, of use SpamBayes' Python code via Jython. (One caveat is 
that I personally can't advise on whether there are difficulties in 
breaking the filtering engine out from the interface code in those 
projects.) I can understand, of course, that there may be a need for 
pure Java source, in which case you'd have to translate or, as you're 
already planning, write your own. 

One advantage of using a pre-existing project is that most projects 
evolve over time to match the evolution of spammer tactics, which do 
change significantly over time. I can imagine that it might be too 
distracting for a project that isn't focused on spam filtering to 
really try to keep up, and thus it may be hard to stay competitive with 
the dedicated projects.

Anyway, those are thoughts that come to mind. I'll respond to your 
specific question as soon as I can -- I've got a heavy meeting and 
deadline schedule next week. Also, I'm not sure if you saw the Linux 
Journal article mentioned above -- if you haven't you might want to 
look at it. If so let me know if it resolves any of your questions.

Gary

Gary Robinson
CTO
Emergent Music, LLC
grobinson@goombah.com
207-942-3463
Company: http://www.goombah.com
Blog:    http://www.garyrobinson.net

On Sun, 17 Sep 2006 13:44:19 +0200, Vincenzo Gianferrari Pini wrote:
> Dear Gary,
> 
> at the Apache James Server Project (http://james.apache.org) we have 
> implemented a (java based) bayesian anti-spam filter following Paul 
> Graham's approach (both the original one - 
> http://paulgraham.com/spam.html - and the enhanced one - 
> http://paulgraham.com/better.html). It is available in the new 2.3.0 
> release that we are releasing these days.
> 
> We would like, for the next release, to implement the 
> "chi-square-based spam filter" approach described in your "Handling 
> Redundancy in Email Token Probabilities" paper 
> (http://garyrob.blogs.com//handlingtokenredundancy94.pdf). But for 
> doing that we need to understand a few points: can you help and 
> advice us?
> 
> I'm CCing this email to the server-dev@james.apache.org list: can you 
> reply to it in your answer?
> 
> Here follow our questions. I will explicitly refer to the terminology 
> and formula numbers used by you in your above mentioned paper.
> 
>   1. Based on Paul Graham's approach, in computing b(w) and g(w) we use
>      in the numerator of the formulas "the total count of occurrences
>      of word w in the spam (ham) e-mails" instead of "the number of
>      spam (ham) e-mails containing the word w" as you do. Paul's
>      counters are persisted on disk, and there are already some users
>      that have extensively trained their systems building their own
>      "corpuses". It would be a pity not to be able to use such
>      collected data when using your approach (we would like to simply
>      add a configuration switch to our filters that optionally - or by
>      default - activates your approach).
>      In your blog
>      
> (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
>      I found the following comment from you:
> 
>          "Note 2: In calculating p(w) Graham counts every instance of
>          word w in every email in which it appears. Obviously, if a
>          word appears once in an email, there is a greater probability
>          that it will appear again in that same email than if it hadn't
>          already appeared at all. So, the random variable is not really
>          independent under Graham's technique which is one reason why,
>          in the description above, we only count the first occurrence.
>          However, we are pragmatic and whatever works best in practice
>          is what we should do. There is some evidence at this point
>          that using the Graham counting technique leads to slightly
>          better results than using the "pure" technique above. This may
>          because it is not ignoring any of the data. So, p(w) and n
>          should simply be computed the way that gives the best results."
> 
>      Looks quite clear to me; can you then confirm us that we can use
>      Paul's counters in computing b(w) and g(w), and that you "endorse"
>      it as leading "to slightly better results" than the technique
>      mentioned in your paper?
> 
>   2. In computing f(w) in formula (1), what do you suggest to use for
>      the values of "s" and "x"? We will let them be configuration
>      parameters as others, but we should use a sound default.
> 
>   3. In computing "H" and "S" in formulas (2) and (3), which of the two
>      definitions of the "inverse chi-square function" and related cdf
>      (invC( ) below) should we use among the two definitions that I
>      found for example in
>      http://en.wikipedia.org/wiki/Inverse-chi-square_distribution?
> 
>   4. Still in computing "H" and "S", how many degrees of freedom "n"
>      should we use? I would assume *one*, being two values (ham and
>      spam) minus one constraint, as their probabilities must sum up to
>      one (in this case question 3 above would be pointless). But I'n
>      not sure at all: can you either confirm or give me a hint?
> 
>   5. As we have already available a java routine that computes the
>      chi-square cdf C(X,n), I found out that to compute the invC(X,n)
>      we could use one of the following formulas (depending on the
>      outcome of question 3 above). Would you be able to help me
>      confirming it?:
>      invC(X,n) = 1 - C(1/X,n)
>      invC(X,n) = 1 - C(n/X,n)
> 
>   6. What do you suggest for a practical default "cutoff" value for "I"
>      and as default ESF value "y" in formula (6) for "H", and which "y"
>      for "S"? And which default for the "exclusion radius"?
> 
> I hope that you do not find those questions as being too many :-) .
> 
> I'm looking forward for your answer .
> 
> If you are interested we will keep you informed about our future progress.
> 
> Thank you.
> 
> Vincenzo Gianferrari Pini
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org