Posted to users@spamassassin.apache.org by Arthur Dent <mi...@blueyonder.co.uk> on 2012/10/06 12:03:18 UTC

BAYES_00

Hello all,

Following a hard drive crash I am rebuilding my small home server on a
Fedora17 platform.

One of the casualties of the HD crash was my spam corpus. I had a (very
old) backup which happened to include a previous spam corpus so I used
that to sa-learn.

All my messages hit BAYES_00. 

I don't have many "fresh" spams. I do not run a SMTP server, I simply
collect mail for my family and myself from my ISP and other sources
using fetchmail. My ISP seem to filter most of the really bad stuff so I
get just a trickle of spams (about 1 per day - if that) but even those
hit BAYES_00 despite sometimes being identical to a previous FN that had
already been learned with sa-learn.

Here is my --dump magic:
================================8<=========================================
$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       4551          0  non-token data: nspam
0.000          0       3054          0  non-token data: nham
0.000          0     198095          0  non-token data: ntokens
0.000          0 1346143801          0  non-token data: oldest atime
0.000          0 1349506984          0  non-token data: newest atime
0.000          0 1349493620          0  non-token data: last journal sync atime
0.000          0 1349476411          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0     171403          0  non-token data: last expire reduction count
================================8<=========================================

What - if anything - can I do to improve bayes performance?

Thanks in advance

Mark



Re: BAYES_00

Posted by Jeff Mincy <je...@delphioutpost.com>.
   From: Arthur Dent <mi...@blueyonder.co.uk>
   Date: Sat, 06 Oct 2012 11:03:18 +0100
   
   Hello all,
   
   Following a hard drive crash I am rebuilding my small home server on a
   Fedora17 platform.
   
   One of the casualties of the HD crash was my spam corpus. I had a (very
   old) backup which happened to include a previous spam corpus so I used
   that to sa-learn.
   
   All my messages hit BAYES_00. 
   
   I don't have many "fresh" spams. I do not run a SMTP server, I simply
   collect mail for my family and myself from my ISP and other sources
   using fetchmail. My ISP seem to filter most of the really bad stuff so I
   get just a trickle of spams (about 1 per day - if that) but even those
   hit BAYES_00 despite sometimes being identical to a previous FN that had
   already been learned with sa-learn.
   
   Here is my --dump magic: ...
   
   What - if anything - can I do to improve bayes performance?

Get more spam?  Bayes really isn't going to do well with a limited
amount of spam.  It does great when correctly trained using lots of
spam.  But with limited data, not so much.

You could try starting over.  It will take 6 months or so to get to
200 spam messages if you are really getting about 1 per day.  Or you
could just turn Bayes off.  I'm almost at the same point with my home
email, for the same reason.

-jeff

Re: BAYES_00

Posted by RW <rw...@googlemail.com>.
On Sat, 06 Oct 2012 11:03:18 +0100
Arthur Dent wrote:

> Hello all,
> 
> Following a hard drive crash I am rebuilding my small home server on a
> Fedora17 platform.
> 
> One of the casualties of the HD crash was my spam corpus. I had a
> (very old) backup which happened to include a previous spam corpus so
> I used that to sa-learn.
> 
> All my messages hit BAYES_00. 
> 
> I don't have many "fresh" spams. I do not run a SMTP server, I simply
> collect mail for my family and myself from my ISP and other sources
> using fetchmail. My ISP seem to filter most of the really bad stuff
> so I get just a trickle of spams (about 1 per day - if that) but even
> those hit BAYES_00 despite sometimes being identical to a previous FN
> that had already been learned with sa-learn.
> 
> ...
> What - if anything - can I do to improve bayes performance?

I don't know if anyone got my previous reply to this, it just seemed to
disappear into gmail.

What I suggested is that you retrain from the corpora without
allowing any expiry because the spammy tokens may be preferentially
discarded.
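
One way to rule out expiry while retraining, assuming a stock setup where Bayes options live in local.cf, is to switch off automatic expiry. This is only a sketch of that idea, not necessarily the exact procedure meant above:

```
# local.cf: do not expire Bayes tokens automatically
bayes_auto_expire 0
```

With this set, tokens should only be expired when sa-learn is run with --force-expire, so a retrain from an old corpus is not trimmed behind your back.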

In general the expiry algorithm may not work well if you have fewer
than a few hams or a few spams a day because not enough tokens are
having their atimes updated by classification.
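
The expiry figures in the --dump magic output posted at the top of the thread can be put into more readable units. The sketch below copies three lines from that dump and summarises them with awk; the function name summarise_magic is made up for illustration, and on a live system you would pipe `sa-learn --dump magic` into it instead of the pasted lines.

```shell
#!/bin/sh
# Summarise the expiry-related lines of `sa-learn --dump magic`.
# The value of interest is the third whitespace-separated field on each line.
summarise_magic() {
    awk '
        /non-token data: ntokens/                     { ntokens = $3 }
        /non-token data: last expire atime delta/     { delta = $3 }
        /non-token data: last expire reduction count/ { removed = $3 }
        END {
            # the atime delta is in seconds; 86400 seconds per day
            printf "expiry window: %d days\n", delta / 86400
            printf "tokens removed by last expiry: %d\n", removed
            printf "tokens in database now: %d\n", ntokens
        }'
}

# Lines copied from the dump posted in this thread.
summarise_magic <<'EOF'
0.000          0     198095          0  non-token data: ntokens
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0     171403          0  non-token data: last expire reduction count
EOF
```

For the posted dump this reports a 16 day window and 171403 tokens removed by the last expiry, against 198095 tokens currently held. That is a large reduction, which fits the concern about expiry, although the dump alone does not show which tokens were dropped.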

Re: BAYES_00

Posted by John Hardin <jh...@impsec.org>.
On Sat, 6 Oct 2012, Arthur Dent wrote:

> On Sat, 2012-10-06 at 12:36 -0700, John Hardin wrote:
>>
>> Well, you're probably going to have to re-train from scratch.
>
> Awwww...

That's not a big deal if you've kept your corpora...

>> Review every message in your training corpora to ensure they are 
>> properly classified.

This is important. Automatic training without manual review is hazardous.

>> Add a bunch of new ham and, if you have any, new spam.
>
> Well I have a bash script that runs every night. It copies mail from all
> the folders I have in which I have ham into a temporary folder and then
> learns them as ham (and deletes the temporary folder).

Are you sure there are no FNs (spam) in those folders?

> I have two other folders, one for spam caught by SA or manually put
> there by me, and another for "virus" infected emails caught by clamav
> (which, because I am using the Sanesecurity additional rules, are
> actually phishes, scams and good old spam). The script does a similar
> thing with these 2 folders and learns them as spam.

Are you sure there are no FPs (ham) in those folders?

> The same emails will get learned over and over again - but I believe
> this is OK?

Yes. It keeps track of messages it has already learned from. That's why 
you'll see things like "learned 0 (scanned 1000)" in the log.
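
Since already-learned messages are skipped, a nightly job can point sa-learn straight at the folders without copying them to a temporary folder first. A minimal sketch, assuming each folder is a directory with one message per file (Maildir style); the function name learn_folders and the SA_LEARN override are made up for illustration:

```shell
#!/bin/sh
# Nightly training sketch: learn from each folder, skipping missing ones.
# SA_LEARN can be overridden for a dry run, e.g. SA_LEARN=echo.
SA_LEARN="${SA_LEARN:-sa-learn}"

learn_folders() {
    # $1 = ham|spam; remaining arguments = directories of messages
    kind="$1"; shift
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # --no-sync defers the journal sync so it can be done once at the end
            "$SA_LEARN" --no-sync "--$kind" "$dir"
        else
            echo "skipping missing folder: $dir" >&2
        fi
    done
}
```

The cron script would then call learn_folders once with ham and the ham folders, once with spam and the two spam folders, and finish with a single `sa-learn --sync`. The folder paths are whatever your own mail layout uses.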

>> Very old spam (say, >5 years) may not be too useful, and probably should
>> be omitted, unless you have a very small spam corpus.
>
> The backup I used was from ... ahem... 2008

That's not too bad. Older spams won't hurt but given how fluid spammer 
tactics are they might not help much.

>> Turn off autolearn. I'm in a similar situation and hand-training on the
>> rare misses works great for me.
>>
>> Also, given your low volume, I would recommend quarantining all spam, and
>> not having a discard threshold score over which spams are thrown out
>> unseen. Any that do get delivered can be reviewed and added to your
>> spam training corpus.
>>
>> Zap your Bayes database, re-train and see how it goes.
>
> I only have about 20 "fresh" spams in those two folders. Will bayes be
> deactivated until I get back to 200 spams?

Yep. It would also be okay to keep the newest 200 spams from your old 
corpus.
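
Picking out the newest 200 can be scripted. A rough sketch, assuming the old corpus is a flat directory with one message per file and that file modification times roughly track message age (which may not hold after a restore from backup); the function name newest_messages is made up for illustration:

```shell
#!/bin/sh
# Copy the N most recently modified messages from an old spam folder
# into a fresh training folder.
newest_messages() {
    # $1 = source directory, $2 = destination directory, $3 = how many to keep
    src="$1"; dest="$2"; count="$3"
    mkdir -p "$dest" || return 1
    # ls -t lists newest first; assumes file names contain no newlines
    ls -t "$src" | head -n "$count" | while IFS= read -r name; do
        cp -p "$src/$name" "$dest/$name"
    done
}
```

Calling it with the old spam folder, a new empty folder and 200 leaves a folder that can be reviewed by hand and then passed to `sa-learn --spam`.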

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   USMC Rules of Gunfighting #9: Accuracy is relative: most combat
   shooting standards will be more dependent on "pucker factor" than
   the inherent accuracy of the gun.
-----------------------------------------------------------------------
  Tomorrow: the first private ISS resupply mission (SpaceX/Dragon)

Re: BAYES_00

Posted by Axb <ax...@gmail.com>.
On 10/06/2012 10:41 PM, Arthur Dent wrote:
>> Zap your Bayes database, re-train and see how it goes.
> I only have about 20 "fresh" spams in those two folders. Will bayes be
> deactivated until I get back to 200 spams?

If you want to override the default minimum of 200 spam / 200 ham,
add to local.cf:

# use whatever number you require
bayes_min_ham_num  50
bayes_min_spam_num 50


h2h



Re: BAYES_00

Posted by Arthur Dent <mi...@blueyonder.co.uk>.
On Sat, 2012-10-06 at 12:36 -0700, John Hardin wrote:
> On Sat, 6 Oct 2012, Arthur Dent wrote:
> 
> > Following a hard drive crash I am rebuilding my small home server on a
> > Fedora17 platform.
> >
> > One of the casualties of the HD crash was my spam corpus. I had a (very
> > old) backup which happened to include a previous spam corpus so I used
> > that to sa-learn.
> >
> > All my messages hit BAYES_00.
> 
> Well, you're probably going to have to re-train from scratch.

Awwww... 

> Review every message in your training corpora to ensure they are properly 
> classified.
> 
> Add a bunch of new ham and, if you have any, new spam.

Well I have a bash script that runs every night. It copies mail from all
the folders I have in which I have ham into a temporary folder and then
learns them as ham (and deletes the temporary folder).

I have two other folders, one for spam caught by SA or manually put
there by me, and another for "virus" infected emails caught by clamav
(which, because I am using the Sanesecurity additional rules, are
actually phishes, scams and good old spam). The script does a similar
thing with these 2 folders and learns them as spam.

The same emails will get learned over and over again - but I believe
this is OK? 

> Very old spam (say, >5 years) may not be too useful, and probably should 
> be omitted, unless you have a very small spam corpus.

The backup I used was from ... ahem... 2008

> Turn off autolearn. I'm in a similar situation and hand-training on the 
> rare misses works great for me.
> 
> Also, given your low volume, I would recommend quarantining all spam, and 
> not having a discard threshold score over which spams are thrown out 
> unseen. Any that do get delivered can be reviewed and added to your 
> spam training corpus.
> 
> Zap your Bayes database, re-train and see how it goes.

I only have about 20 "fresh" spams in those two folders. Will bayes be
deactivated until I get back to 200 spams?

Thanks (yet) again...

Mark
 


Re: BAYES_00

Posted by John Hardin <jh...@impsec.org>.
On Sat, 6 Oct 2012, Arthur Dent wrote:

> Following a hard drive crash I am rebuilding my small home server on a
> Fedora17 platform.
>
> One of the casualties of the HD crash was my spam corpus. I had a (very
> old) backup which happened to include a previous spam corpus so I used
> that to sa-learn.
>
> All my messages hit BAYES_00.

Well, you're probably going to have to re-train from scratch.

Review every message in your training corpora to ensure they are properly 
classified.

Add a bunch of new ham and, if you have any, new spam.

Very old spam (say, >5 years) may not be too useful, and probably should 
be omitted, unless you have a very small spam corpus.

Turn off autolearn. I'm in a similar situation and hand-training on the 
rare misses works great for me.
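
Turning off autolearn is a single setting. A sketch, assuming it goes in local.cf next to any other Bayes options:

```
# local.cf: do not train Bayes automatically from SpamAssassin's own verdicts
bayes_auto_learn 0
```

Training then happens only through explicit learning, such as running sa-learn on messages you have reviewed.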

Also, given your low volume, I would recommend quarantining all spam, and 
not having a discard threshold score over which spams are thrown out 
unseen. Any that do get delivered can be reviewed and added to your 
spam training corpus.

Zap your Bayes database, re-train and see how it goes.
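
The zap-and-retrain step might look roughly like the sketch below. It assumes the reviewed ham and spam are each in a directory with one message per file; the function name retrain_bayes, its arguments and the SA_LEARN override are made up for illustration, while the sa-learn options themselves are standard ones.

```shell
#!/bin/sh
# Sketch: back up, wipe, and rebuild the Bayes database from reviewed corpora.
# SA_LEARN can be overridden for a dry run, e.g. SA_LEARN=echo.
SA_LEARN="${SA_LEARN:-sa-learn}"

retrain_bayes() {
    # $1 = reviewed ham directory, $2 = reviewed spam directory,
    # $3 = file to receive a backup of the current database
    ham_dir="$1"; spam_dir="$2"; backup="$3"
    "$SA_LEARN" --backup > "$backup" || return 1        # keep a way back
    "$SA_LEARN" --clear || return 1                     # zap the database
    "$SA_LEARN" --no-sync --ham "$ham_dir" || return 1
    "$SA_LEARN" --no-sync --spam "$spam_dir" || return 1
    "$SA_LEARN" --sync || return 1                      # sync the journal once
    "$SA_LEARN" --dump magic                            # check nspam and nham
}
```

The final --dump magic shows whether nspam and nham have reached the minimum counts Bayes needs before it starts scoring. If the result is worse than before, `sa-learn --restore` with the backup file should bring the old database back.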

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...wind turbines are not meant to actually be an efficient way to
   supply the power grid, rather they're prayer wheels for New Age
   iBuddhists, their whirring blades drawing white guilt from the
   atmosphere and pumping it safely underground.                -- Tam
-----------------------------------------------------------------------
  Tomorrow: the first private ISS resupply mission (SpaceX/Dragon)