Posted to users@spamassassin.apache.org by Kareem Dana <ka...@gmail.com> on 2013/07/24 17:48:47 UTC

Piping to sa-learn

I am using SpamAssassin 3.3.2 on FreeBSD 9.1. I'd just like to confirm that
I can pipe messages to sa-learn. The following commands should do the same
thing, correct?

# cat spammail | sa-learn --spam

# sa-learn --spam spammail

I have tested and they appear to be identical, but ultimately I will be
invoking sa-learn through the dovecot antispam plugin
(http://wiki2.dovecot.org/Plugins/Antispam) and their webpage has a whole
section about how sa-learn does not support piped input and I need a
wrapper script. I believe that is outdated, but I want to be absolutely
sure that I can pipe mail to sa-learn and it will work properly.

Thanks.

Re: Piping to sa-learn

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2013-07-25 at 01:10 +0200, Mark Martinec wrote:
> SA 3.3.2 and the current 3.4.0 both contain code that copies stdin
> to a temporary file in order to make the ArchiveIterator happy,
> which only accepts files or directories.
> 
> So the only current advantage of passing a message on stdin is a
> more comfortable use, but there is no speed or disk I/O advantage.
> 
> I'm not sure when this feature was introduced but left undocumented.

Dunno either, but I'd guess around the time AI was introduced and
sa-learn --dir and --file options got deprecated.


> I very much doubt it will ever go away, so you can use it, unless
> you want to comply with the current official documentation,
> which only mentions files.

Only files, in a rather fuzzy way. The -f and (ignored) --dir options
indicate support for directories, without the docs mentioning it.
Another one to document properly.

I too very much doubt it will ever go away. The simple 'spamassassin'
front-end accepts STDIN, as documented in the man-page and in probably
every single example for debugging. And both share the same code that
supports STDIN for SA in the first place...


  # ArchiveIterator doesn't really like STDIN, so if "-" is specified
  # as a target, make it a temp file instead.

There's quite a blob duplicated in sa-learn.raw and spamassassin.raw.
Candidate for moving to AI?


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Piping to sa-learn

Posted by Mark Martinec <Ma...@ijs.si>.
On Wednesday 24 July 2013 17:48:47 Kareem Dana wrote:
> I am using SpamAssassin 3.3.2 on FreeBSD 9.1. I'd just like to confirm that
> I can pipe messages to sa-learn. The following commands should do the same
> thing, correct?
> 
> # cat spammail | sa-learn --spam
> # sa-learn --spam spammail

Yes, it happens to work the same,
although it is not documented in the sa-learn man page / usage page.

> I have tested and they appear to be identical, but ultimately I will be
> invoking sa-learn through the dovecot antispam plugin (
> http://wiki2.dovecot.org/Plugins/Antispam) and their webpage has a whole
> section about how sa-learn does not support piped input and I need a
> wrapper script. I believe that is outdated, but I want to be absolutely
> sure that I can pipe mail to sa-learn and it will work properly.

SA 3.3.2 and the current 3.4.0 both contain code that copies stdin
to a temporary file in order to make the ArchiveIterator happy,
which only accepts files or directories.

So the only current advantage of passing a message on stdin is a
more comfortable use, but there is no speed or disk I/O advantage.

I'm not sure when this feature was introduced but left undocumented.
I very much doubt it will ever go away, so you can use it, unless
you want to comply with the current official documentation,
which only mentions files.

Feel free to open a documentation update request in bugzilla.

See sa-learn code, search for:
  # Deal with the target listing, and STDIN -> tempfile


Mark

Re: Piping to sa-learn

Posted by Kareem Dana <ka...@gmail.com>.
Thank you both for those replies. That confirmed exactly what I was looking
for. Very helpful.

Karsten, concerning your note about sa-learn in different environments, I
did a few more tests and it looks like the dovecot antispam plugin does not
make any changes to the e-mail message and sa-learn treats the message the
same as if it was run from the command line. But in my environment all
learning will be routed through the plugin anyway.

Thanks,
Kareem



On Wed, Jul 24, 2013 at 6:31 PM, Karsten Bräckelmann <guenther@rudersport.de> wrote:

> On Wed, 2013-07-24 at 10:48 -0500, Kareem Dana wrote:
> > I am using SpamAssassin 3.3.2 on FreeBSD 9.1. I'd just like to confirm
> > that I can pipe messages to sa-learn. The following commands should do
> > the same thing, correct?
> >
> > # cat spammail | sa-learn --spam
> > # sa-learn --spam spammail
>
> Correct.
>
> $ formail -1 -s  < mbox  > msg
> $ sa-learn --spam msg
> Learned tokens from 1 message(s) (1 message(s) examined)
>
> $ formail -1 -s  < mbox  | sa-learn --spam
> Learned tokens from 0 message(s) (1 message(s) examined)
>
> That confirms sa-learn accepts mail being piped in, and that there is
> absolutely no difference to using a file as intermediate storage.
>
>
> NOTE: Be careful of using sa-learn in different environments or ways in
> parallel. For example via the dovecot anti-spam plugin, from a cron job
> harvesting mbox files, maildir, processed through formail or even worse
> an MUA...
>
> Slight modifications to the MIME structure can result in SA treating
> them as different messages, learning twice. In fact, a mere (additional)
> trailing newline at the end of the MIME message suffices.
>
>
> > I have tested and they appear to be identical, but ultimately I will
> > be invoking sa-learn through the dovecot antispam plugin
> > (http://wiki2.dovecot.org/Plugins/Antispam) and their webpage has a
> > whole section about how sa-learn does not support piped input and I
> > need a wrapper script. I believe that is outdated, but I want to be
> > absolutely sure that I can pipe mail to sa-learn and it will work
> > properly.
>
> That claim appears to be based on reading the sa-learn documentation,
> rather than actually trying it. No, wait, I do not mean to imply it is
> bad to read the docs. Not at all. It is, maybe, just slightly inferior
> to verifying something not mentioned equals not supported...
>
> I can confirm this works at least since SA 3.2, which IIRC even predates
> the dovecot anti-spam plugin, let alone its documentation. ;)
>
>

Re: Piping to sa-learn

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2013-07-25 at 09:39 +0100, James Griffin wrote:
> Thu 25.Jul'13 at  1:31:16 +0200, Karsten Bräckelmann

> > NOTE: Be careful of using sa-learn in different environments or ways in
> > parallel. For example via the dovecot anti-spam plugin, from a cron job
> > harvesting mbox files, maildir, processed through formail or even worse
> > an MUA...
>  
> I'm new to this list but have been using SA for a number of years.
> Having read your note above, I thought I'd ask for a little more info,
> in particular piping a message from mutt, my MUA, to mark a single mail
> as spam and move it to an appropriate mailbox.

> You say this is a bad idea - so I'm wondering if it's best I no longer do
> that, and why?
> 
> I'm using SA 3.3.2. My mail is also scanned using procmail prior to
> being filtered into MH mailboxes.

That's how almost everyone does it. ;)  Auto-learn prior to delivery
(procmail calling SA and later delivering in your case) and manually
training hand-classified mail.


The important part here is mixing different ways to feed mail for Bayes
training. I have observed trailing newline issues between

(a) the dovecot anti-spam plugin's output,  (b) 'formail' splitting out
single messages from mbox files, and  (c) running 'sa-learn --mbox'.
Over the years, and IIRC only. Might even have been specific to a
system.

The point is, differences in trailing newlines, slightly altered MIME
structures or headers will be invisible to Bayes as far as the tokens
(the content) are concerned. The internal hash identifying a given
message as already seen by Bayes will differ, though.

As a result, messages could be learned twice, or not be forgotten.
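
To illustrate the effect, here is a small sketch using stand-ins: a plain
cksum over the raw bytes plays the role of the per-message hash, and a
sorted word list plays the role of the tokens. Neither is what SA actually
computes internally; the function names are made up for this example.

```shell
#!/bin/sh
# msg_fingerprint (hypothetical): stand-in for a per-message identifier.
# Uses POSIX cksum (CRC plus byte count) over the raw bytes on stdin.
# This is NOT SpamAssassin's real algorithm, only an illustration.
msg_fingerprint() {
    cksum | awk '{print $1 "-" $2}'
}

# msg_words (hypothetical): crude stand-in for tokens, i.e. the sorted
# unique whitespace-separated words of the input.
msg_words() {
    tr -s '[:space:]' '[\n*]' | sort -u
}

msg='Subject: test

Buy cheap pills now'

# Same message, once with one and once with two trailing newlines:
printf '%s\n'   "$msg" | msg_fingerprint
printf '%s\n\n' "$msg" | msg_fingerprint
```

The two fingerprints differ, while msg_words yields the same list for both
variants. That is the situation in which a message can end up learned twice.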


Simple test if you're safe and everything works as expected:

Identify a message M that has been learned already, e.g. via the dovecot
anti-spam plugin, or SA auto-learning. Then apply your usual other
method of training, like sa-learn'ing the whole mbox or maildir storage
containing the message M, or running the mutt macro in your case.

If the message M has been learned *again*, it has been altered by one of
the methods. Which is bad, obviously. If Bayes identifies M to have been
seen before and refuses re-training, you're good.
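
A rough way to script that check is to look at the summary line sa-learn
prints ("Learned tokens from N message(s) ..."). The helper names below
are made up for this sketch, and it assumes the summary format shown
elsewhere in this thread.

```shell
#!/bin/sh
# learned_count (hypothetical): read sa-learn's summary line on stdin and
# print the number of messages that were actually learned.
learned_count() {
    sed -n 's/^Learned tokens from \([0-9][0-9]*\) message(s).*/\1/p'
}

# already_known (hypothetical): succeed if the summary on stdin says that
# nothing new was learned, i.e. Bayes had seen the message before.
already_known() {
    [ "$(learned_count)" = "0" ]
}

# Example (assumes sa-learn is installed and M was learned as spam before;
# not run here):
#   sa-learn --spam M | already_known && echo "good: M was recognized"
```

Keep in mind that running this on an altered copy of M really does learn
it again, so try it on a message where that does no harm.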




Re: Piping to sa-learn

Posted by James Griffin <jm...@kontrol.kode5.net>.
On Thu 25.Jul'13 at  1:31:16 +0200, Karsten Bräckelmann wrote:
 
[ ... ]


> NOTE: Be careful of using sa-learn in different environments or ways in
> parallel. For example via the dovecot anti-spam plugin, from a cron job
> harvesting mbox files, maildir, processed through formail or even worse
> an MUA...
 
I'm new to this list but have been using SA for a number of years.
Having read your note above, I thought I'd ask for a little more info,
in particular piping a message from mutt, my MUA, to mark a single mail
as spam and move it to an appropriate mailbox.

My mutt macro rule also has an option to blacklist the address, if I
want.

You say this is a bad idea - so I'm wondering if it's best I no longer do
that, and why?

I'm using SA 3.3.2. My mail is also scanned using procmail prior to
being filtered into MH mailboxes.

Best wishes, Jamie

Re: Piping to sa-learn

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2013-07-24 at 10:48 -0500, Kareem Dana wrote:
> I am using SpamAssassin 3.3.2 on FreeBSD 9.1. I'd just like to confirm
> that I can pipe messages to sa-learn. The following commands should do
> the same thing, correct?
> 
> # cat spammail | sa-learn --spam
> # sa-learn --spam spammail

Correct.

$ formail -1 -s  < mbox  > msg
$ sa-learn --spam msg
Learned tokens from 1 message(s) (1 message(s) examined)

$ formail -1 -s  < mbox  | sa-learn --spam
Learned tokens from 0 message(s) (1 message(s) examined)

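The second run learns from 0 messages because Bayes recognizes the piped
message as the one already learned from the file. That confirms sa-learn
accepts mail being piped in, and that there is absolutely no difference
to using a file as intermediate storage.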


NOTE: Be careful of using sa-learn in different environments or ways in
parallel. For example via the dovecot anti-spam plugin, from a cron job
harvesting mbox files, maildir, processed through formail or even worse
an MUA...

Slight modifications to the MIME structure can result in SA treating
them as different messages, learning twice. In fact, a mere (additional)
trailing newline at the end of the MIME message suffices.


> I have tested and they appear to be identical, but ultimately I will
> be invoking sa-learn through the dovecot antispam plugin
> (http://wiki2.dovecot.org/Plugins/Antispam) and their webpage has a
> whole section about how sa-learn does not support piped input and I
> need a wrapper script. I believe that is outdated, but I want to be
> absolutely sure that I can pipe mail to sa-learn and it will work
> properly.

That claim appears to be based on reading the sa-learn documentation,
rather than actually trying it. No, wait, I do not mean to imply it is
bad to read the docs. Not at all. It is, maybe, just slightly inferior
to verifying whether "not mentioned" really equals "not supported"...

I can confirm this works at least since SA 3.2, which IIRC even predates
the dovecot anti-spam plugin, let alone its documentation. ;)

