Posted to users@spamassassin.apache.org by Scott <te...@msxc.com> on 2017/08/15 03:08:46 UTC

message/rfc822 to mbox script for use with sa-learn workflow

I have a script that can take spam/ham messages forwarded as attachments from
Outlook and turn them into rfc822 individual files.  It allows external
users to send me Outlook spam/ham for review.  I will in turn feed sa-learn
with those messages once vetted.  That part of the process is getting me the
messages intact, as far as I can tell, as the user received them.  I could
pipe those messages to sa-learn directly; that's what the script is designed
to do.  But I don't trust the user's submissions, and prefer to review
first.  FYI, the script that handles the separation of the attachments is
from here:
http://www.localside.net/sal-wrapper/

I would like to turn around and put those individual messages back into mbox
format, again, without changing their original headers.  Anyone have a
script or a method which will accomplish that?  I tried to figure out how to
do it but was unsuccessful.










--
View this message in context: http://spamassassin.1065346.n5.nabble.com/message-rfc822-to-mbox-script-for-use-with-sa-learn-workflow-tp138362.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.

Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Scott <te...@msxc.com>.
Maybe not rfc822 format.  This is a sample extracted single file:
https://pastebin.com/S9W4Z64N






Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Scott <te...@msxc.com>.
>It should be OK, but it wouldn't be ideal to combine it with
>autotraining because the manual training won't be able to counter any
>mistraining of the tokens from the stripped headers.

> It would probably be a good idea to use a comprehensive ignoreheader
> list. You could start with AXB's:

> https://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf?view=co

Seems like a good place to be able to use a wildcard "X-*"

To be sure I understand....  without the ignores, when a message arrives
that meets the auto-learn thresholds and is auto-learned, it would be 
learned WITH the X- headers, right?  And the issue might be that I would
not be able to override those with one of my manually taught messages
that had those X- headers removed.  Right?

What about the DKIM signatures and Yahoo's X-YMail-OSG signatures?  I 
would not be able to counter those either, should they be ignored as 
well?  






Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by RW <rw...@googlemail.com>.
On Tue, 15 Aug 2017 07:55:39 -0700 (MST)
Scott wrote:

> Hmmm.  Doesn't sound good.  I sent a simple text message through a
> large ISP, to my server, arrived in a mbox.  Compared that message to
> the one that was POPed, then sent back as an attachment and stripped
> out via the existing script.
> 
> These sanitized messages are pretty short but I put in pastebin:
> https://pastebin.com/b38RXHgx
> 
> When looking in Outlook the headers all appear intact, but forwarding
> as an attachment appears to strip these:
> Delivered-To:
> All X- headers added by my SA
> All X- headers added by sending ISP (X-Yahoo*)
> Authentication results and DKIM signature
> Status: R
> 
> Otherwise the rest of the headers were unaffected.
> 
> I'm not sure how badly that stripping of X- headers, DKIM, etc. screws up
> bayes learning.  Doesn't SEEM that bad,

It should be OK, but it wouldn't be ideal to combine it with
autotraining because the manual training won't be able to counter any
mistraining of the tokens from the stripped headers.

It would probably be a good idea to use a comprehensive ignoreheader
list. You could start with AXB's: 

https://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf?view=co
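
To make the ignore-list idea concrete, here is a minimal shell sketch that
prints one bayes_ignore_header line per header name, for pasting into a local
SpamAssassin config file.  The function name ignore_lines is made up for this
example, and the header names in the usage note are only the ones mentioned in
this thread; whether each of them should really be ignored is a judgment
call, and AXB's list linked above is the more complete starting point.

```shell
#!/bin/sh
# Hypothetical helper: print a bayes_ignore_header line for each header name
# given as an argument. It only generates text; it does not edit any config.
ignore_lines() {
  for h in "$@"; do
    printf 'bayes_ignore_header %s\n' "$h"
  done
}
```

Something like "ignore_lines X-YMail-OSG Authentication-Results DKIM-Signature"
would print three lines; after appending them to a local .cf file it is worth
running "spamassassin --lint" to check the config still parses.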



Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Jesse Norell <je...@kci.net>.
On Tue, 2017-08-15 at 16:48 -0700, Scott wrote:
> 
> >An idea for an alternate collection method:  run an imap server on your
> >sa-learn training box, setup a second email account in Outlook for the
> >users who are training, and have them just drag the ham/spam to training
> >folders.  I don't know if it's "better," but I'd prefer it myself to
> >(re)training users to forward as attachment, then piecing things back
> >together.
>
> >If that's an option you'll pursue and you can use dovecot as your imap
> >server, check out https://github.com/jnorell/train-spam-scanner as a
> >training script.  It's designed for exactly the goals you have in mind,
> >ie. users supplying training messages which can be moderated and built
> >into a corpus.
>
> Thanks Jesse.  I read through the doc, I gathered it expects all the
> users' imap accounts and the respective spam/ham folders to be per
> user.  These users don't have any logons on the sa-learn box.

If you want to allow training from all your users, then yes, they would
all need to have imap accounts on the sa-learn box.  Their primary mail
wouldn't be to that box, just a second imap connection for training in
your scenario; it would probably be more tenable with a limited set of
trusted users performing training.  You may be able to set up LDAP
authentication in dovecot and manage the actual account login/password
from your existing network, but the second connection would be required
unless you could find a way to keep your users' training folders in
Exchange, then sync them all to the sa-learn box.

It may not be the best solution for you, but you've now at least
considered it...  :)



> There are other logistic hurdles in getting additional mail accounts
> setup on corporate PC's and firewall hurdles.   But I appreciate the
> suggestion and will look at it harder. 
> 
> Also figure I could 
> a) just let them send as attachment, pop then handle their attachments
> in windows, then auto-feed them myself (as attachment again).
> b) let them send as attachment to an imap account, where I can
> "unattach", inspect, and put in ham/spam imap folders, then train from
> those.
> 


-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107  -  www.kci.net


RE: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Scott <te...@msxc.com>.
>An idea for an alternate collection method:  run an imap server on your 
>sa-learn training box, setup a second email account in Outlook for the 
>users who are training, and have them just drag the ham/spam to training 
>folders.  I don't know if it's "better," but I'd prefer it myself to 
>(re)training users to forward as attachment, then piecing things back 
>together. 

>If that's an option you'll pursue and you can use dovecot as your imap 
>server, check out https://github.com/jnorell/train-spam-scanner as a 
>training script.  It's designed for exactly the goals you have in mind, 
>ie. users supplying training messages which can be moderated and built 
>into a corpus. 

Thanks Jesse.  I read through the doc; I gathered it expects all the users' imap accounts and the respective spam/ham folders to be per user.  These users don't have any logons on the sa-learn box.

There are other logistical hurdles in getting additional mail accounts set up on corporate PCs, plus firewall hurdles.  But I appreciate the suggestion and will look at it harder.

Also figure I could
a) just let them send as attachment, pop then handle their attachments in windows, then auto-feed them myself (as attachment again).
b) let them send as attachment to an imap account, where I can "unattach", inspect, and put in ham/spam imap folders, then train from those.








 






Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Jesse Norell <je...@kci.net>.
On Tue, 2017-08-15 at 07:55 -0700, Scott wrote:
> I need a way to go from Outlook to train SA if I'm to train at all.  For
> most of my users the inbound mail is handed off to a 3rd party Exchange
> server that I don't have access to.  So setting up a public IMAP folder on
> the exchange server type solution is probably not possible.  And I presume
> that process messes with the messages too anyway.  I can't cc the users'
> mail on my server for later review, there would be too many.
>
> If I'm forwarded spam as an attachment for learning, I would require ham
> from the same method.
>
> My plan wasn't to make this a daily routine.  Only to help a few users who
> say they are getting too much spam slipping through all the other checks
> untagged.  To help train bayes to assist on those problem users.  Old email
> accounts that can't be changed and are on the golden spam lists.
>
> The reason to "reassemble" the extracted attachments was just to make it
> easier for me to access the messages and review them.  Too tedious at the
> console.  Don't know how to use formail to do it, and won't it add some
> more headers to the mess too?
>
> FWIW, I did try sa-learn on a sample of extracted attachments in their raw
> form.  It was happy with them:
> [root@tn3 msg-1502747659-31280-0]# sa-learn --spam *
> Learned tokens from 97 message(s) (97 message(s) examined)
>
> But picking through them to vet them would be too tedious at the console.
> They get random number type filenames as part of the extraction.
>
> My constraints are:
> - messages are sent to 3rd party exchange server
> - exchange server access does not exist at this time
> - users use Outlook client at least v2003
> - I use site wide bayes
> - I don't trust the users to feed bayes.
> - I can't cc their Email on my server for later feeding.
> - I want to use this process for corpus building, not daily maintenance.
>
> My plan was:
> - receive spam and ham (separately) "as attachments" from Outlook
> - extract attachments
> - review attachments
> - feed attachments to sa-learn
>
> Open for a better method..


An idea for an alternate collection method:  run an imap server on your
sa-learn training box, set up a second email account in Outlook for the
users who are training, and have them just drag the ham/spam to training
folders.  I don't know if it's "better," but I'd prefer it myself to
(re)training users to forward as attachment, then piecing things back
together.

If that's an option you'll pursue and you can use dovecot as your imap
server, check out https://github.com/jnorell/train-spam-scanner as a
training script.  It's designed for exactly the goals you have in mind,
ie. users supplying training messages which can be moderated and built
into a corpus.

-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107  -  www.kci.net


Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Scott <te...@msxc.com>.
Hmmm.  Doesn't sound good.  I sent a simple text message through a large ISP,
to my server, arrived in a mbox.  Compared that message to the one that was
POPed, then sent back as an attachment and stripped out via the existing
script.

These sanitized messages are pretty short but I put in pastebin:
https://pastebin.com/b38RXHgx

When looking in Outlook the headers all appear intact, but forwarding as an
attachment appears to strip these:
Delivered-To:
All X- headers added by my SA
All X- headers added by sending ISP (X-Yahoo*)
Authentication results and DKIM signature
Status: R

Otherwise the rest of the headers were unaffected.
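
A comparison like the one above can be repeated mechanically.  The sketch
below is only an illustration under a few assumptions: the function names
header_names and missing_headers are made up, each file holds exactly one
message, and the header block ends at the first blank line.  It lists header
field names that appear in the first file but not in the second, so running it
on the mbox copy and the extracted attachment should show what forwarding
stripped.  Names are compared case-sensitively.

```shell
#!/bin/sh
# Print the distinct header field names of one message file, one per line.
# Folded continuation lines start with whitespace and an mbox "From " line
# has no colon after the name, so neither matches the pattern below.
header_names() {
  tr -d '\r' < "$1" |
    sed -e '/^$/q' |
    sed -n 's/^\([A-Za-z0-9-]\{1,\}\):.*/\1/p' |
    LC_ALL=C sort -u
}

# Print header names present in the first file but absent from the second.
missing_headers() {
  a=$(mktemp) || return 1
  b=$(mktemp) || return 1
  header_names "$1" > "$a"
  header_names "$2" > "$b"
  # comm needs sorted input; both lists were sorted in the C locale above
  LC_ALL=C comm -23 "$a" "$b"
  rm -f "$a" "$b"
}
```

For example, "missing_headers original.eml extracted.eml" (both file names are
placeholders) prints one stripped header name per line.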

I'm not sure how badly that stripping of X- headers, DKIM, etc. screws up bayes
learning.  Doesn't SEEM that bad, but it's out of my skillset.  Nor do I know
how badly it munges other stuff that SA needs to see in a more complex message
that some of you mentioned.

I need a way to go from Outlook to train SA if I'm to train at all.  For
most of my users the inbound mail is handed off to a 3rd party Exchange
server that I don't have access to.  So setting up a public IMAP folder on
the exchange server type solution is probably not possible.  And I presume
that process messes with the messages too anyway.  I can't cc the users' mail
on my server for later review, there would be too many.

If I'm forwarded spam as an attachment for learning, I would require ham
from the same method.

My plan wasn't to make this a daily routine.  Only to help a few users who
say they are getting too much spam slipping through all the other checks
untagged.  To help train bayes to assist on those problem users.  Old email
accounts that can't be changed and are on the golden spam lists.

The reason to "reassemble" the extracted attachments was just to make it
easier for me to access the messages and review them.  Too tedious at the
console.  Don't know how to use formail to do it, and won't it add some more
headers to the mess too?

FWIW, I did try sa-learn on a sample of extracted attachments in their raw
form.  It was happy with them:
[root@tn3 msg-1502747659-31280-0]# sa-learn --spam *
Learned tokens from 97 message(s) (97 message(s) examined)

But picking through them to vet them would be too tedious at the console. 
They get random number type filenames as part of the extraction.
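
If the main obstacle is that the random file names say nothing about the
content, a short listing of a few headers per file may already make vetting
at the console bearable.  This is a minimal sketch, not part of the extraction
script: the function name summarize_msgs is made up, it assumes one message per
file, and it only matches the exact capitalisation From:, Subject: and Date:.

```shell
#!/bin/sh
# Print a short review summary (file name plus From/Subject/Date headers)
# for every regular file in the directory given as $1.
summarize_msgs() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    printf '== %s\n' "$f"
    # Stop at the first blank line so body text is never mistaken for a header.
    tr -d '\r' < "$f" |
      sed -n -e '/^$/q' -e '/^From:/p' -e '/^Subject:/p' -e '/^Date:/p'
  done
}
```

For the directory shown above that would be
"summarize_msgs /home/mail/msg-1502747659-31280-0 | less".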

My constraints are:
- messages are sent to 3rd party exchange server
- exchange server access does not exist at this time
- users use Outlook client at least v2003
- I use site wide bayes
- I don't trust the users to feed bayes. 
- I can't cc their Email on my server for later feeding.
- I want to use this process for corpus building, not daily maintenance.

My plan was:
- receive spam and ham (separately) "as attachments" from Outlook
- extract attachments
- review attachments
- feed attachments to sa-learn

Open for a better method..

Grateful for help with a formail command to assemble and try out if someone
is a guru.  To get it into mboxcl2 format that my Dovecot uses and SA would
be happy with (https://wiki2.dovecot.org/MailboxFormat/mbox)

Thanks 























Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by RW <rw...@googlemail.com>.
On Tue, 15 Aug 2017 14:11:14 +0200
Matus UHLAR - fantomas wrote:

> On 14.08.17 21:34, Ian Zimmerman wrote:
> >On 2017-08-14 20:08, Scott wrote:
> >  
> >> I would like to turn around and put those individual messages back
> >> into mbox format, again, without changing their original headers.  
> >
> >The first question is: why?  sa-learn works on just about any format:
> >individual messages, multiple messages in a flat directory,
> >maildirs.  
> 
> the question here is whether results from sa-learn running over other
> (e.g. outlook) formats are useful when processing mbox format.

The OP wants to convert emails that are already in  "rfc822" format
into mbox.

Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
On 14.08.17 21:34, Ian Zimmerman wrote:
>On 2017-08-14 20:08, Scott wrote:
>
>> I would like to turn around and put those individual messages back
>> into mbox format, again, without changing their original headers.
>
>The first question is: why?  sa-learn works on just about any format:
>individual messages, multiple messages in a flat directory, maildirs.

the question here is whether results from sa-learn running over other
(e.g. outlook) formats are useful when processing mbox format.

if they are in some proprietary binary format, they are apparently not
useful.

If they are converted back to mailbox format, they are quite useful,
although some information may be lost - outlook kind of "sanitizes" the
mail, in which case many details helping to trace spam are lost.

The best approach is to catch mail before it hits Microsoft clients or servers.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Emacs is a complicated operating system without good text editor.

Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Scott <te...@msxc.com>.
tried:

#!/bin/bash
FILES=/home/mail/msg-1502747659-31280-0/*
OUT=/home/mail/test.out
# start with an empty output file
: > "$OUT"
for f in $FILES
do
  echo "Processing $f file..."
  # run each file through formail and append the result to the mbox
  formail < "$f" >> "$OUT"
done

Almost worked.  It adds the needed "From" line, and the legacy "mail -f
file" program then recognizes the messages.  The message sizes are listed as
they would be if the attachments were there, but they are not: it's truncating
the in-line attachments.  Tried every option I could think of for formail, no
dice.
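
One alternative that avoids formail is to build the mbox by hand.  The
following is a minimal sketch under stated assumptions, not a tested fix for
the truncation: the function name to_mbox is made up, each input file is taken
to be a single message that starts directly with its headers (no existing
"From " line), the sender on the separator line is a placeholder, and the
result uses mboxrd-style ">From " quoting rather than the Content-Length
headers that mboxcl2 relies on.  Apart from dropping carriage returns, header
lines are copied as they are.

```shell
#!/bin/sh
# Hypothetical helper: concatenate every regular file in directory $1 into
# a single mbox file $2. Keep $2 outside of $1 so it is not read as input.
to_mbox() {
  src_dir=$1
  out=$2
  : > "$out"                      # start with an empty output file
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue
    # mbox separator line; MAILER-DAEMON is a placeholder sender
    printf 'From MAILER-DAEMON %s\n' \
      "$(LC_ALL=C date -u '+%a %b %e %H:%M:%S %Y')" >> "$out"
    # Drop CRs, then quote lines that look like separators (mboxrd style:
    # "From " and ">From " lines each gain one leading ">").
    tr -d '\r' < "$f" | sed 's/^>*From />&/' >> "$out"
    # A blank line ends each message
    printf '\n' >> "$out"
  done
}
```

Usage would be along the lines of
"to_mbox /home/mail/msg-1502747659-31280-0 /home/mail/test.mbox", followed by
"mail -f /home/mail/test.mbox" for review and, once vetted,
"sa-learn --spam --mbox /home/mail/test.mbox".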






Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Ian Zimmerman <it...@very.loosely.org>.
On 2017-08-14 20:08, Scott wrote:

> I would like to turn around and put those individual messages back
> into mbox format, again, without changing their original headers.

The first question is: why?  sa-learn works on just about any format:
individual messages, multiple messages in a flat directory, maildirs.

If in spite of the above you _must_ have a mbox file, I would just set up
a trivial procmail config (maybe even an empty one, supplemented with
one or two environment variables including DEFAULT) and pipe the
messages through procmail one by one.

You probably need the -f option to force generation of the From_ mbox
delimiter.
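
A rough sketch of that procmail approach, under stated assumptions: the
function name procmail_to_mbox and both paths are placeholders, /dev/null
stands in for the "empty" rcfile, and -f- is used so that procmail refreshes
the "From " line or generates one if it is missing.  It has not been checked
against the messages discussed in this thread, so treat it as a starting
point only.

```shell
#!/bin/sh
# Hypothetical helper: deliver every regular file in directory $1 into the
# mbox file $2 by piping each one through procmail.
#   -f-          refresh the leading "From " line, or add one if missing
#   DEFAULT=...  mailbox that receives anything no recipe handles
#   /dev/null    an empty rcfile, so no recipes run
procmail_to_mbox() {
  src_dir=$1
  out=$2
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue
    procmail -f- "DEFAULT=$out" /dev/null < "$f" || return 1
  done
}
```

Called as "procmail_to_mbox /path/to/extracted /path/to/out.mbox", this
appends to the output file rather than replacing it, so remove any old copy
first.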

-- 
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
Do obvious transformation on domain to reply privately _only_ on Usenet.

Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by RW <rw...@googlemail.com>.
On Mon, 14 Aug 2017 20:08:46 -0700 (MST)
Scott wrote:

> I would like to turn around and put those individual messages back
> into mbox format, again, without changing their original headers.
> Anyone have a script or a method which will accomplish that?  I tried
> to figure out how to do it but was unsuccessful.

If you must use mbox, make sure you get it right:

 
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7445




Re: message/rfc822 to mbox script for use with sa-learn workflow

Posted by Merijn van den Kroonenberg <me...@web2all.nl>.
> I have a script that can take spam/ham messages forwarded as attachments
> from Outlook and turn them into rfc822 individual files.  It allows
> external users to send me Outlook spam/ham for review.  I will in turn
> feed sa-learn with those messages once vetted.  That part of the process
> is getting me the messages intact, as far as I can tell, as the user
> received them.

As long as you are aware: once Outlook touches a message, there is nothing
original left. It adds/removes/reorders headers and modifies mime parts
(even html).

> I could pipe those messages to sa-learn directly; that's what the script
> is designed to do.  But I don't trust the user's submissions, and prefer
> to review first.  FYI, the script that handles the separation of the
> attachments is from here:

For reviewing this sounds ok. But I am unsure what all the outlook
mangling does to the effectiveness of sa-learn. I guess it's better than
nothing as most of the tokens are probably still the same...

All the 'outlook' tokens trained is probably balanced by training ham
which arrives from people using outlook, so i guess that should cause no
problem...right?

> http://www.localside.net/sal-wrapper/
>
> I would like to turn around and put those individual messages back into
> mbox format, again, without changing their original headers.  Anyone have
> a script or a method which will accomplish that?  I tried to figure out
> how to do it but was unsuccessful.