Posted to users@spamassassin.apache.org by Ryan Coleman <ry...@cwis.biz> on 2015/10/17 03:59:52 UTC

Checking if sa-learn is actually learning

How do I go about checking that my automated scripts that handle spam learning are actually learning? I have literally hundreds of emails a day that go into the “new” folder I have set up and it does not seem to be learning from them. 

OS: Ubuntu 14.04.3 LTS
MTA: Postfix 2.11.0-1ubuntu1
postgrey 1.34-12
spamassassin/spamc 3.4.0-1ubuntu2.1


sa-learn commands:
[scans domains for specified folders and scans them]
> /usr/bin/find /var/mail/vhosts/ -name '*.Spam.New*' -type d -exec /usr/bin/sa-learn --no-sync --spam --progress {}* \;
> /usr/bin/find /var/mail/vhosts/ -name '*.Spam.Suspected*' -type d -exec /usr/bin/sa-learn --no-sync --spam --progress {}* \;

I swear I had issues in the past without having --no-sync, but is that causing it?



Re: Checking if sa-learn is actually learning

Posted by RW <rw...@googlemail.com>.
On Fri, 16 Oct 2015 20:59:52 -0500
Ryan Coleman wrote:

> How do I go about checking that my automated scripts that handle spam
> learning are actually learning? I have literally hundreds of emails a
> day that go into the "new" folder I have set up and it does not seem
> to be learning from them. 
> ...
> 
> sa-learn commands:
> [scans domains for specified folders and scans them]
> > /usr/bin/find /var/mail/vhosts/ -name '*.Spam.New*' -type d -exec /usr/bin/sa-learn --no-sync --spam --progress {}* \;
> > /usr/bin/find /var/mail/vhosts/ -name '*.Spam.Suspected*' -type d -exec /usr/bin/sa-learn --no-sync --spam --progress {}* \;

There are a few things wrong with this.

The * in {}* is at best superfluous and may be causing a variety of
problems. It wouldn't work at all with a POSIX-compliant shell.

Also, for a maildir folder foo you are running sa-learn separately on
foo/, foo/cur, foo/new and foo/tmp. sa-learn understands maildir, so
training on new and cur separately involves unnecessary parsing and extra
invocations of sa-learn. You shouldn't be training on tmp at all because
you might get an incomplete email.

Also I don't see anything about learning ham.
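
For illustration, the find fix described above might look something like
the sketch below. It hands each matching maildir folder to sa-learn as a
whole ({} with no trailing *), so cur and new are handled by sa-learn
itself and tmp is never named on the command line. The folder patterns
and sa-learn options are the ones from the original post; the function
name and the SA_LEARN override (for dry runs) are made up for this
example. Ham training is not covered because the thread does not say
where ham is kept.

```sh
#!/bin/sh
# Sketch only. train_spam_folders and SA_LEARN are hypothetical names;
# the patterns and sa-learn options come from the original post.

train_spam_folders() {
    mail_root=$1
    sa_learn=${SA_LEARN:-/usr/bin/sa-learn}

    for pattern in '*.Spam.New*' '*.Spam.Suspected*'; do
        # One sa-learn run per matching folder; the folder path is passed
        # as-is, with no glob character appended.
        find "$mail_root" -type d -name "$pattern" \
            -exec "$sa_learn" --no-sync --spam --progress {} \;
    done
}
```

Calling train_spam_folders /var/mail/vhosts would train on the real
folders; setting SA_LEARN=echo first prints the commands instead of
running them.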


Once you've fixed your script, append the following:

   sa-learn -D bayes --dump magic >> /var/tmp/sa-debug 2>&1

and then let the script run as it normally would, from cron or
whatever.

When you look at the output file, check nspam is increasing as new spam
is trained and that nspam and nham are both over 200. 
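
As a rough sketch, the nspam/nham check could be scripted along these
lines. It assumes the usual "sa-learn --dump magic" layout, where the
count sits in the third column of the "non-token data" lines, and it
treats 200 as the threshold because that is the default for
bayes_min_spam_num and bayes_min_ham_num. The function name is made up
for this example.

```sh
#!/bin/sh
# Sketch only. check_bayes_counts is a hypothetical helper; it reads
# "sa-learn --dump magic" output on stdin.

check_bayes_counts() {
    awk '
        /non-token data: nspam$/ { nspam = $3 }
        /non-token data: nham$/  { nham = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # Exit 0 only when both counts have reached 200.
            exit !(nspam >= 200 && nham >= 200)
        }'
}
```

Typical use would be "sa-learn --dump magic | check_bayes_counts", run
before and after a training pass to see whether nspam moved.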

Then check that delivery and training are using the same database. Look
at the location of the bayes files in the debug. Take a look at the
mtime of the bayes journal file in the same directory, and check that
it's updated during a mail delivery scan.

Re: Checking if sa-learn is actually learning

Posted by Ian Zimmerman <it...@buug.org>.
On 2015-10-16 20:59 -0500, Ryan Coleman wrote:

> sa-learn commands:
> [scans domains for specified folders and scans them]
> > /usr/bin/find /var/mail/vhosts/ -name '*.Spam.New*' -type d -exec /usr/bin/sa-learn --no-sync --spam --progress {}* \;
> > /usr/bin/find /var/mail/vhosts/ -name '*.Spam.Suspected*' -type d -exec /usr/bin/sa-learn --no-sync --spam --progress {}* \;
> 
> I swear I had issues in the past without having --no-sync, but is that causing it?

If you do the routine learning with --no-sync, you must have one run with
--sync as well, maybe in a cron job.  Or just run with --sync once at
the end of this same script.  That much is straightforward, and should
be clear from the man/pod pages.
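
A minimal sketch of the "sync once at the end" variant, assuming the
folders to learn from are passed as arguments. The function name and
the SA_LEARN override (for dry runs) are made up for this example; only
--no-sync, --spam and --sync are real sa-learn options.

```sh
#!/bin/sh
# Sketch only. learn_then_sync and SA_LEARN are hypothetical names.

learn_then_sync() {
    sa_learn=${SA_LEARN:-/usr/bin/sa-learn}

    # Each run skips the journal/database synchronization.
    for folder in "$@"; do
        "$sa_learn" --no-sync --spam "$folder"
    done

    # A single synchronization after all folders have been learned.
    "$sa_learn" --sync
}
```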

The part that caused me some trouble, and is somewhat underdocumented
IMO, is the interaction of --sync with --force-expire.  I'm afraid I
can't help you with that because I took the extreme step of disabling
expiration, and instead re-creating a fresh database monthly from the
recent corpus which I keep around.

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.