Posted to dev@spamassassin.apache.org by da...@chaosreigns.com on 2012/11/09 18:48:11 UTC

Future of SA's bayes implementation

I haven't done as much testing on this as I'd like, but I've gotten away
from it, and wanted to get my thoughts in here before I forget them.

I have a strong suspicion that SA's bayes implementation sucks.

The two major problems, as I see them:
1) Lack of learn-on-fail.
2) Lack of multi-word tokens.

In the process I discovered that 9 years ago I did some testing that showed
multi-word tokens work better than single-word tokens:
http://www.chaosreigns.com/adventures/entry.php?date=2003-10-06&num=01

It really blows my mind that we don't have these two features.

Learn-on-fail means, when you train an email as spam or ham, it
first checks the email to see if it would already have been classified
correctly, and then only does any training if it would've gotten it wrong.
So it doesn't modify the database unless there's actually evidence that
it would be beneficial (reducing non-beneficial modifications).  It was
implemented for auto-learning in 2010, but not for manual training:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6447
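
Roughly, the logic I have in mind is something like this (a toy sketch in
Perl, not SA's actual Bayes API; the in-memory token database and the
scoring stand-in are made up for illustration):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Toy train-on-error ("learn-on-fail") sketch: only modify the token
  # database when the current classifier would have gotten it wrong.
  my %db = ( spam => {}, ham => {} );    # class => { token => count }

  sub classify_is_spam {                 # crude stand-in for a real Bayes score
      my (@tokens) = @_;
      my ($spam, $ham) = (0, 0);
      for my $t (@tokens) {
          $spam += $db{spam}{$t} // 0;
          $ham  += $db{ham}{$t}  // 0;
      }
      return $spam > $ham ? 1 : 0;
  }

  sub learn_on_fail {
      my ($is_spam, @tokens) = @_;
      # Already classified correctly?  Then don't touch the database.
      return 0 if classify_is_spam(@tokens) == ($is_spam ? 1 : 0);
      my $class = $is_spam ? 'spam' : 'ham';
      $db{$class}{$_}++ for @tokens;     # only train on the failures
      return 1;
  }

  print learn_on_fail(1, qw(cheap pills now)), "\n";  # 1: empty DB got it wrong, so train
  print learn_on_fail(1, qw(cheap pills now)), "\n";  # 0: now classified as spam, skip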

Multi-word tokens are probably obvious?  Currently, SA's bayes tokens
are single words.  And there are better results when you also have two
word tokens.  In 2003-2004, I wrote a bayesian filter from scratch,
and thought it was pretty neat how much you can get out of some
tweaks to tokenization (do you only split on white space?  What about
non-alphanumeric characters?).
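
For illustration, generating the two-word tokens is only a few lines; this
is just a sketch of the idea, not what SA's tokenizer actually does:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Emit single-word tokens plus adjacent two-word pairs.  Splitting on
  # anything non-alphanumeric is one of the tokenization tweaks mentioned
  # above; a real implementation would want to be smarter about this.
  sub tokenize {
      my ($text) = @_;
      my @words  = grep { length } split /[^A-Za-z0-9]+/, lc $text;
      my @tokens = @words;                            # single-word tokens
      for my $i (0 .. $#words - 1) {
          push @tokens, "$words[$i] $words[$i+1]";    # two-word tokens
      }
      return @tokens;
  }

  print "$_\n" for tokenize("Buy CHEAP pills now!");
  # buy / cheap / pills / now / "buy cheap" / "cheap pills" / "pills now"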

The two word token thing was mentioned on
http://wiki.apache.org/spamassassin/WeLoveVolunteers since 2004-02-24.

One of my questions is, does it make sense to continue to maintain bayesian
stuff within SA at all?  Or should we drop it, and encourage people to run
a pure bayesian classifier before SA (like spamprobe), then have rules that
read the headers from those classifiers?  Are there options better than
spamprobe?

On one hand, spamprobe has been around forever and almost certainly does
bayes at least as well as SA is ever likely to; it's pretty easy to run it
before SA, and creating the rules to read those scores would be easy.

On the other hand, keeping the bayes functionality within SA provides a
tidier package, a little easier to set up, one fewer process spawned per
email, and adding these two features really shouldn't be hard.  Without
even breaking any backward compatibility (with the existing database
format).  I hope.
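
As for the rules that read those scores, they could be as simple as
something like this, assuming whatever invokes spamprobe (procmail, a
milter, etc.) has already put its verdict into an X-SpamProbe header; the
header name, rule names and scores here are just placeholders, not an
existing ruleset:

  # Hypothetical local.cf rules reading a spamprobe verdict header.
  header   SPAMPROBE_SPAM  X-SpamProbe =~ /^SPAM/
  describe SPAMPROBE_SPAM  spamprobe thinks this message is spam
  score    SPAMPROBE_SPAM  3.0

  header   SPAMPROBE_HAM   X-SpamProbe =~ /^GOOD/
  describe SPAMPROBE_HAM   spamprobe thinks this message is ham
  score    SPAMPROBE_HAM   -1.0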


The reason I'm playing with bayes is my interest in the possible usefulness
of shared bayes data.

I want to do more testing of using other people's bayes data on
my corpora.  My assumption is that most end users don't do their own
training.  So I haven't been using bayes, for some time, in an attempt
to better see what typical end users see.  But I suspect that taking
multiple other people's bayes databases, merging them, and using them
on my corpora, could be very useful.  And if I can prove that, then we /
I could distribute it to more people.

I tried it with patdk-wk's (from IRC) data, and 1.18% of my ham hit
BAYES_99, which I call terrible.  But my hope is to see better results with
data merged from multiple people.  


So, please send me your bayes data.  Mailing me the output of 
"sa-learn --backup | gzip >> sa-learn.backup.yourname.gz" off list should do.
Please let me know how much it's hand verified vs. auto-trained.  And let
me know if you're comfortable with it being distributed to others.

Mine is:  http://www.chaosreigns.com/tmp/sa-learn.backup.darxus.gz
No auto-training.


There was strong concern expressed about the idea of merging bayes DBs
eight years ago:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200412.mbox/%3C20041204160616.GA2707@mail.herk.net%3E
I don't share that concern, but I also plan to find evidence that it's
useful before suggesting anybody else try it.


To test bayes, I grepped BAYES from the default rule set into
~/sa/bayesonly/local.cf, then copied /etc/spamassassin/*.pre to
~/sa/bayesonly/, symlinked ~/.spamassassin/bayes* into masses/spamassassin,
and ran: 
  ./mass-check --bayes -c ~/sa/bayesonly/ --progress ham:dir:/home/darxus/masscheckwork/ham/ spam:dir:/home/darxus/masscheckwork/spam/

It's annoying that it doesn't seem possible to just run spamassassin with
only the bayes rules, instead of via mass-check.  It gives an error about not
having any rules defined.

-- 
"I offer the modest proposal that our Universe is simply one of those
things which happen from time to time."
- Is the Universe a Vacuum Fluctuation?
http://www.ChaosReigns.com

Re: Future of SA's bayes implementation

Posted by Axb <ax...@gmail.com>.
On 11/09/2012 06:48 PM, darxus@chaosreigns.com wrote:
...


I don't think the way SA & Bayes work now should be changed or Bayes 
removed or touched in any way.

New methods can/should be pluginized, using a different nomenclature, giving
people an option while these new methods mature.

Bayes data is not something which can be widely shared with any expectation
of reliable results.  Nobody's traffic is equal to somebody else's, nor are
the feeding methods/thresholds/local rules, etc.
There are people who maintain it can be done, yet the spam hits reflect
overlap with hashing systems and published rules, and ham never gets a
negative score because your ham will never, ever look like mine.

If you've come up with new methods, please make them a separate plugin,
with its own methods, which can be run side by side with SA's Bayes as it
is now, and let it be tested & compared.
Something like Pyzor/iXhash using bayes tokens instead of fixed hashing 
methods could work, if there are enough reliable sources feeding it.
Finding people to help run such a setup is not trivial either.


Re: Future of SA's bayes implementation

Posted by RW <rw...@googlemail.com>.
On Fri, 9 Nov 2012 12:48:11 -0500
darxus@chaosreigns.com wrote:

> I haven't done as much testing on this as I'd like, but I've gotten
> away from it, and wanted to get my thoughts in here before I forget
> them.
> 
> I have a strong suspicion that SA's bayes implementation sucks.
> 
> The two major problems, as I see them:
> 1) Lack of learn-on-fail.
> 2) Lack of multi-word tokens.
> 
> In the process I discovered that 9 years ago I did some testing that
> showed multi-word tokens work better than single-word tokens:
> http://www.chaosreigns.com/adventures/entry.php?date=2003-10-06&num=01
> 
> It really blows my mind that we don't have these two features.
> 
> Learn-on-fail means, when you train an email as spam or ham, it
> first checks the email to see if it would already have been classified
> correctly, and then only does any training if it would've gotten it
> wrong.

It wouldn't hurt to have the option, but I think a lot of people are
already doing this simply by being selective about what they learn.

One problem with it is that you get a lot of unnecessary failures
before the accuracy levels out. DSPAM's TOE mode only switches on when
there are 2500 ham messages in the database. I think this is sensible,
particularly for per-user databases.

> So it doesn't modify the database unless there's actually
> evidence that it would be beneficial (reducing non-beneficial
> modifications).

I've never really found that argument particularly compelling.
Correctly identified mails are often rich in useful tokens whereas
errors often occur because there's not much to go on. 


> The two word token thing was mentioned on
> http://wiki.apache.org/spamassassin/WeLoveVolunteers since 2004-02-24.
> 
> One of my questions is, does it make sense to continue to maintain
> bayesian stuff within SA at all?  Or should we drop it, and encourage
> people to run a pure bayesian classifier before SA (like spamprobe),
> then have rules that read the headers from those classifiers?  

One advantage is access to metadata and an interface that allows
plugins to contribute. I think there is probably scope for a lot more
to be done with Bayes in this area.

Maybe it would also be useful if plugins could get back the ham/spam
counts for tokens they contribute. 


> The reason I'm playing with bayes is my interest in the possible
> usefulness of shared bayes data.
> 
> I want to do more testing of using other people's bayes data on
> my corpora.  My assumption is that most end users don't do their own
> training.  So I haven't been using bayes, for some time, in an attempt
> to better see what typical end users see.  But I suspect that taking
> multiple other people's bayes databases, merging them, and using them
> on my corpora, could be very useful.  And if I can prove that, then
> we / I could distribute it to more people.

I think merging needs to be done per token so the global database
contributes most strongly on local low-count tokens. 
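
Something along these lines, as a very rough sketch (this assumes the
databases have already been parsed into token => [spam, ham] count hashes,
e.g. out of sa-learn --backup dumps; the weighting formula is just an
illustration):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Per-token merge: the global/shared counts are weighted down on tokens
  # the local database already has plenty of evidence for, and count almost
  # fully on tokens the local database has rarely seen.
  sub merge_per_token {
      my ($local, $global) = @_;
      my %merged;
      for my $tok (keys %$local, keys %$global) {
          next if exists $merged{$tok};
          my ($ls, $lh) = @{ $local->{$tok}  // [0, 0] };
          my ($gs, $gh) = @{ $global->{$tok} // [0, 0] };
          my $w = 1 / (1 + $ls + $lh);   # less global influence as local evidence grows
          $merged{$tok} = [ $ls + $w * $gs, $lh + $w * $gh ];
      }
      return \%merged;
  }

  my $local  = { 'viagra' => [ 50, 0 ], 'invoice' => [ 1, 1 ] };
  my $global = { 'viagra' => [ 900, 2 ], 'invoice' => [ 40, 600 ] };
  my $merged = merge_per_token($local, $global);
  printf "%-8s spam=%.1f ham=%.1f\n", $_, @{ $merged->{$_} } for sort keys %$merged;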

Re: Future of SA's bayes implementation

Posted by Patrick Ben Koetter <p...@state-of-mind.de>.
* darxus@chaosreigns.com <da...@chaosreigns.com>:

...

> One of my questions is, does it make sense to continue to maintain bayesian
> stuff within SA at all?  Or should we drop it, and encourage people to run
> a pure bayesian classifier before SA (like spamprobe), then have rules that
> read the headers from those classifiers?  Are there options better than
> spamprobe?
>
> On one hand, spamprobe has been around forever, and almost certainly does
> bayes at least as well as SA is ever likely to, it's pretty easy to run it
> before SA, and creating the rules to read those scores would be easy.

From a short glance it doesn't look as if spamprobe could be run in a
pre-queue setup. In Germany, for legal reasons, pre-queue filtering is a
must. Can it be?

> On the other hand, keeping the bayes functionality within SA provides a
> tidier package, a little easier to set up, one fewer process spawned per
> email, and adding these two features really shouldn't be hard.  Without
> even breaking any backward compatibility (with the existing database
> format).  I hope.

Personally, as a sysadmin, I'd prefer to get most of what I need from one
tool. I'd favour enhancing SA.

> The reason I'm playing with bayes is my interest in the possible usefulness
> of shared bayes data.
> 
> I want to do more testing of using other people's bayes data on
> my corpora.  My assumption is that most end users don't do their own
> training.  So I haven't been using bayes, for some time, in an attempt

Most users use their computers to get something else done. They don't want to
deal with computers. Training personal bayes filters is undesirable to them.

p@rick


-- 
state of mind ()
 
http://www.state-of-mind.de
 
Franziskanerstraße 15      Telefon +49 89 3090 4664
81669 München              Telefax +49 89 3090 4666
 
Amtsgericht München        Partnerschaftsregister PR 563