Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/07/02 00:41:17 UTC

Re: Spam research

hi Gordon!

Gordon V. Cormack writes:
> Justin,
> 
> I'm interested in pursuing the possibility of doing
> some investigations using your masscheck data.  I
> don't have a really concrete proposal yet, but I
> can outline a few ideas.
> 
> Is SpamAssassin your day job?  Are you the right
> person to talk to?  I thought I'd avoid spamming
> your dev list pending the formulation of a more
> concrete strategy.

SpamAssassin isn't my day job any more -- believe it or not, it paid
better and was less hassle to keep SpamAssassin as a hobby and do
something totally unrelated as the day job ;)   Unfortunately that means
no more trips to CEAS, as you've probably noticed.

Anyway, yep, feel free to talk directly to me.  Your best bet, though, is
generally to CC the dev list (which I've now done).  There won't be any
problem with it being seen as spammy -- it already gets copies of all bug
traffic, so that's pretty high-volume anyway.  You may find that someone
there is already working on something similar, too...

> In the "small incremental change" department, I have
> always wondered about the choice of GA and/or
> perceptron for the masscheck rule calculation
> (pardon me if I mangle your vernacular).  I guess
> Stern worked on this but now he's not involved?
> 
> Anyway, you seem to have reverted back to GA.  I'd
> be interested in knowing the details.

OK, what happened was this:

- SpamAssassin 3.1.0 -- we used the perceptron, thanks to Henry, and it
  worked well, although I think it required a lot of careful tuning

- SpamAssassin 3.2.0 -- we tried to use it again, but Henry was
  unfortunately a bit too busy to help.  It came out with some pretty
  horrendous results, despite a fair bit of attempted tuning.
  
  So I tried out the GA we'd used for SpamAssassin 3.0.0, and it produced
  *much* better results; we couldn't get the perceptron to produce results
  of similar quality.

  Henry theorised at the time that it was due to too many rule
  combinations being seen in both ham and spam.  In other words, there
  was no clear "dividing line" that the perceptron could discover using
  gradient descent, if I recall correctly.  The GA, being a GA, could just
  semi-randomly plunk around until it found a sufficiently good result,
  even if a large portion of the logs were the same for ham and spam, or
  whatever the problem was.
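For concreteness, here's a minimal mutation-only sketch of that style of GA
in Python.  The rule names, toy mass-check log, and GA parameters are all
invented for illustration; the real rescorer works on far larger logs and
is considerably more involved:

```python
import random

random.seed(1)  # deterministic toy run

# Hypothetical mass-check log: each entry is (rules hit, is_spam).
RULES = ["RULE_A", "RULE_B", "RULE_C"]
LOGS = [
    ({"RULE_A", "RULE_B"}, True),
    ({"RULE_A"}, True),
    ({"RULE_C"}, False),
    ({"RULE_B", "RULE_C"}, False),
]
THRESHOLD = 5.0  # SpamAssassin's default spam threshold

def errors(scores):
    """Fitness: how many messages this candidate score set misclassifies."""
    bad = 0
    for hits, is_spam in LOGS:
        total = sum(scores[r] for r in hits)
        if (total >= THRESHOLD) != is_spam:
            bad += 1
    return bad

def mutate(scores):
    """Semi-randomly 'plunk' one rule's score up or down."""
    child = dict(scores)
    child[random.choice(RULES)] += random.uniform(-2.0, 2.0)
    return child

def evolve(generations=300, pop_size=20):
    pop = [{r: random.uniform(0.0, 3.0) for r in RULES}
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=errors)               # elitism: keep the best half...
        keep = pop[: pop_size // 2]
        pop = keep + [mutate(random.choice(keep))  # ...refill with mutants
                      for _ in range(pop_size - len(keep))]
    return min(pop, key=errors)

best = evolve()
```

Because the fitness function only counts misclassifications, nothing stops
the search from wandering through "flat" regions where gradient descent
would stall -- which is roughly Henry's theory about why it coped better.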

> In any event, it seems to me that perceptron is
> close but no cigar.  Why not logistic regression
> or SVM?  It might be that traditional implementations
> like you find in stats packages sink under the weight
> of a million examples, but there are gradient methods
> akin to the perceptron that are quite simple and
> should work, viz:
> 
> http://www.ceas.cc/2006/listabs.html#22.pdf
> http://www.eecs.tufts.edu/~dsculley/papers/emailAndWebSpamSIGIR.pdf

Actually, we have recently had one of the SpamAssassin developers
implement LR as a replacement for the GA/perceptron:

  http://people.apache.org/~duncf/FindlayBirkThesis.pdf

I don't know what the status of that is at the moment though...
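For reference, the gradient-style LR those papers describe really does boil
down to a few lines.  This is a toy sketch with made-up rule-hit vectors,
not the implementation from the thesis:

```python
import math

# Hypothetical rule-hit vectors (1 = rule fired) with spam labels;
# data and dimensions are illustrative only.
X = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]
Y = [1, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(epochs=1000, lr=0.5):
    """Stochastic gradient descent on the logistic (log-loss) objective."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - p  # negative gradient of the log-loss
            b += lr * err
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w, b

w, b = train()
```

The per-example update is almost identical to the perceptron's; the
difference is that the logistic loss gives a smooth, probabilistic error
term rather than an all-or-nothing one.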

> More generally, you have access to a million example
> corpus!  Why not use it for investigating general
> learning algorithms.  An obvious first cut would
> be to see how well the above methods work on the
> raw messages, as opposed to the rules' output.
> I also have a pet algorithm that I'd like to try.
> 
> http://jmlr.csail.mit.edu/papers/v7/bratko06a.html
> 
> I don't know how closely you've followed the TREC
> efforts, but we have a toolkit that allows for
> unattended operation of filters.  So one could
> build an adapter jig to allow any TREC-configured
> filter to be run on your corpora while maintaining
> the same level of privacy you have right now.
> This brings to mind another obvious choice of
> algorithm to test:  osbf-lua.

I know!  We have a bug open asking for people to implement it as
a plugin ;)

Our learning subsystem is nearly pluggable these days, which would allow
people to replace the "Bayes" code with an alternative plugin.  (Not quite
sure how complete that is, though.)

> (Your Wiki suggests that you also have some
> contributed corpora that you can experiment
> on without involving the contributors?)
> 
> Now it may not be obvious how immediately to
> translate results of these experiments into simple
> statistics that can be "shipped" with spamassassin,
> but it might be possible to devise such an approach.
> Or simply to come up with a set of "better bayes"
> rules.
> 
> Please feel free to respond to anything or everything
> I mention, and share this with whomever you like.

Thanks -- I've cc'd the dev list.

We do indeed allow people to get the mass-check logs, which are an
anonymised form of the data we feed to the perceptron or GA.  We also
publish the FP%/FN% rates we *currently* get -- as a target to beat ;)

Unfortunately, there are a few additional constraints on how we generate
scores; e.g. we have a general system of "score ranges", which establishes
the allowed range for a rule's score based on the rule's Bayesian
probability and hit-rate:

- we don't allow rules intended to hit spam to get massive negative
  points, since that allows spammers to exploit them.  This kind of thing
  can have big effects on how effective a score set will be in tests vs.
  the adversarial situation in the real world.

- Similarly, low-hit-rate rules are generally not allowed to get massive
  scores.

- we try to keep the scores for most rules below 4.5 or so.
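As an illustration only (the numbers below are invented; the real limits
live in the rescoring tools), those constraints amount to clamping each
candidate score into a per-rule range:

```python
def allowed_range(intended_for_spam, hit_rate):
    """Illustrative score-range policy; limits here are made up."""
    # Spam rules may not go deeply negative, or spammers could exploit them.
    lo = -0.1 if intended_for_spam else -10.0
    hi = 4.5                # keep most rule scores below ~4.5
    if hit_rate < 0.01:     # low-hit-rate rules get a tighter cap
        hi = min(hi, 1.0)
    return lo, hi

def clamp(score, intended_for_spam, hit_rate):
    lo, hi = allowed_range(intended_for_spam, hit_rate)
    return max(lo, min(hi, score))
```

Any algorithm plugged into the rescoring process has to respect limits of
this shape, which is part of why raw test-set accuracy doesn't translate
directly into shippable scores.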

So one thing I really want to get around to ;) is to build a "SpamAssassin
Challenge", a la the Netflix Challenge:

  http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

Basically, define a clear spec for how our rescoring system works, with
all those little details defined.  In conjunction with test data sets
containing the "real" mass-check logs from the 3.2.0 rescoring run, for
example, that would then allow third parties to measure algorithms in a
way that would directly apply to how we use them.  There's still a
lot of leeway for different algorithms, of course.

--j.