
Posted to users@spamassassin.apache.org by MySQL Student <my...@gmail.com> on 2009/09/21 03:15:14 UTC

Re-running SA on an mbox

Hi,

I have an mbox with about 100 messages in it from a few days ago.
The mbox is a combination of spam and ham. What is the best way to run
SA through these messages again, so I can catch the ones that have
URLs in them that weren't on the blacklist at the time they were
received?

Must I break them all apart to do this, or can SA somehow parse the
whole mbox? If not, what program do you suggest I use to accomplish
this?

Thanks,
Alex

Re: Re-running SA on an mbox

Posted by John Hardin <jh...@impsec.org>.

On Tue, 22 Sep 2009, Jeff Mincy wrote:

>   From: MySQL Student <my...@gmail.com>
>   Date: Tue, 22 Sep 2009 15:38:47 -0400
>
>   > Try using a local SA setup for stripping the headers. By local, I mean
>   > don't use your main production SA - run a separate copy with its own
>   > (cut down) configuration and all data base accesses and UBL calls etc
>   > turned off.
>
>   Much better idea, thanks. Thanks for the script, too.
>   Alex
>
> formail can be used to remove headers, for example:
>
>       To remove all Received: fields from the header:
>              formail -I Received:
>
> The following should do what you wanted to remove the X-Spam headers:
>  formail -I X-Spam < msg

And if it's still in multiple-message format:

   formail -Yb -I X-Spam -s  <in >out

You can add more headers, like:

   formail -Yb -I X-Spam -I X-Greylist -s <in >out


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   A sword is never a killer, it is but a tool in the killer's hands.
                           -- Lucius Annaeus Seneca (Martial) 4BC-65AD
-----------------------------------------------------------------------
  Approximately 8761620 firearms legally purchased in the U.S. this year

Re: Re-running SA on an mbox

Posted by Jeff Mincy <je...@delphioutpost.com>.

   From: MySQL Student <my...@gmail.com>
   Date: Tue, 22 Sep 2009 15:38:47 -0400
   
   > Try using a local SA setup for stripping the headers. By local, I mean
   > don't use your main production SA - run a separate copy with its own
   > (cut down) configuration and all data base accesses and UBL calls etc
   > turned off.
   
   Much better idea, thanks. Thanks for the script, too.
   Alex

formail can be used to remove headers, for example:

       To remove all Received: fields from the header:
              formail -I Received:

The following should do what you wanted to remove the X-Spam headers:
  formail -I X-Spam < msg

-jeff

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

> Try using a local SA setup for stripping the headers. By local, I mean
> don't use your main production SA - run a separate copy with its own
> (cut down) configuration and all data base accesses and UBL calls etc
> turned off.

Much better idea, thanks. Thanks for the script, too.

Best,
Alex

Re: Re-running SA on an mbox

Posted by RW <rw...@googlemail.com>.

On Tue, 22 Sep 2009 13:03:16 +0100
Martin Gregorie <ma...@gregorie.org> wrote:


>         gawk '
>                 BEGIN           { act = "copy" }
>                 /^X-Spam/       { act = "skip" }
>                 /^[A-WYZ]/      { act = "copy" }
>                                 {  
>                                   if (act == "copy")
>                                         { print }
>                                 }
>         '

There are a few problems with that:

1 - it deletes all consecutive headers starting with an X that follow an
    X-Spam header e.g. "X-Delivered-to"

2 - if the bottom header is deleted, the header-body separator is
    also deleted

3 - ^X-Spam can match in the body, causing part of it to be deleted, in
    the worst case corrupting the MIME structure.


I think the following is a bit more robust:

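awk '
       # a line that is not a continuation starts a new header field
       /^[^[:space:]]/   { remove = 0 }
       # drop X-Spam-* fields, including their continuation lines
       /^X-Spam/         { remove = 1 }
       # the first blank line ends the header
       /^$/              { isbody = 1 }
       # always keep the body; keep header lines not marked for removal
       isbody || !remove { print }
    '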
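
The script above handles one message per run: once isbody is set it is never cleared, so in a multi-message mbox only the first header would be cleaned. A minimal sketch of an mbox-aware variant follows. The function name strip_xspam is made up for illustration, and it assumes that a "From " line preceded by a blank line (or at the very start) opens a new message. It uses [^ \t] instead of [[:space:]] only so that older awk implementations accept it.

```shell
# strip_xspam (hypothetical name): remove X-Spam-* header fields from every
# message in an mbox. Reads the mbox on stdin, writes the result to stdout.
strip_xspam() {
    awk '
        BEGIN { inhdr = 1; prevblank = 1 }
        # assumed boundary: a "From " line after a blank line opens a new message
        /^From / && prevblank { inhdr = 1; remove = 0 }
        # a header line that is not a continuation starts a new field
        inhdr && /^[^ \t]/    { remove = 0 }
        # mark X-Spam-* fields (and their continuation lines) for removal
        inhdr && /^X-Spam/    { remove = 1 }
        # the blank line ends the header and is always kept
        inhdr && /^$/         { inhdr = 0 }
        !inhdr || !remove     { print }
        { prevblank = ($0 == "") }
    '
}
```

Usage would be something like "strip_xspam <in.mbox >out.mbox". A body line that starts with an unescaped "From " after a blank line would be mistaken for a new message, which is the usual mbox ambiguity.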

Re: Re-running SA on an mbox

Posted by Martin Gregorie <ma...@gregorie.org>.

On Mon, 2009-09-21 at 23:18 -0400, MySQL Student wrote:
> How can I tell when another process is using the database and when it
> is free for my script to use?
> 
> Is there a faster way to run spamassassin just to strip the SA headers?
> 
Try using a local SA setup for stripping the headers. By local, I mean
don't use your main production SA - run a separate copy with its own
(cut down) configuration and all data base accesses and UBL calls etc
turned off.

By using a separate SA instance you'll avoid access conflicts with your
production SA and by using a minimal configuration it will initialise
and run faster than if it was setting up for a normal scan run.
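
A rough sketch of what such a cut-down setup might look like, assuming the stock spamassassin command-line script. The function names and the directory layout are made up for illustration. Whether "use_bayes 0" plus local-only tests is enough to avoid the Bayes lock error has not been verified here.

```shell
# make_strip_prefs DIR (hypothetical helper): write a cut-down prefs file
# intended only for header-stripping runs.
make_strip_prefs() {
    mkdir -p "$1" || return 1
    cat > "$1/user_prefs" <<'EOF'
# Cut-down preferences for markup removal only (assumption: disabling Bayes
# keeps this run away from the production Bayes database)
use_bayes 0
EOF
}

# strip_markup DIR MESSAGE (hypothetical helper): print MESSAGE with the
# SpamAssassin markup removed, using the prefs above.
#   -L  local tests only (no network lookups)
#   -p  alternative user preferences file
#   -d  remove SpamAssassin markup
strip_markup() {
    spamassassin -L -p "$1/user_prefs" -d < "$2"
}
```

For example: "make_strip_prefs /tmp/sa-strip" once, then "strip_markup /tmp/sa-strip msgs/00001 >clean/00001" per message.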

I have a similar spamc/spamd system that is only used for testing new
local rules. It works well and (important to me anyway) doesn't write
anything to the production maillog, so testing new rules doesn't
contaminate my daily SA performance report.

> Maybe there is a faster way, like passing the messages through the
> running amavisd instead of having to restart spamassassin each time to
> re-process each message?
> 
I maintain a cleaned spam corpus for developing and regression testing
local rules. I use the following script to delete SA headers from this
corpus:

========================= cleaner ===============================
#!/bin/bash

for f in data/*.txt
do
        echo "Cleaning $f" 
        gawk '
                BEGIN           { act = "copy" }
                /^X-Spam/       { act = "skip" }
                /^[A-WYZ]/      { act = "copy" }
                                {  
                                  if (act == "copy")
                                        { print }
                                }
        ' <"$f" >temp.txt
        mv temp.txt "$f"
done
====================== end of cleaner ===========================

This is certainly much faster than using SA for that job: it scans 167
spam messages in 2.3 seconds on a 1.6 GHz Core Duo laptop as compared
with a spamc/spamd run on the same corpus and host, which takes 155
seconds.  

Martin

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

It's certainly not a fast operation, but using the following will
split an mbox into individual messages:

export FILENO=00000
mkdir msgs
formail -s sh -c 'cat - >msgs/$FILENO' < mbox-name.mbox

I also created a loop that would strip all the SA headers from the messages:

for file in *; do echo "Processing: $file"; spamassassin -d < "$file" > "$file.txt"; done

This worked for a few hundred of the messages, but then started to
fail on my production system with:

[22135] warn: bayes: cannot open bayes databases
/home/user/.spamassassin/bayes_* R/W: lock failed: File exists

How can I tell when another process is using the database and when it
is free for my script to use?

Is there a faster way to run spamassassin just to strip the SA headers?

Maybe there is a faster way, like passing the messages through the
running amavisd instead of having to restart spamassassin each time to
re-process each message?

Thanks,
Alex

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

> IIRC you previously mentioned using Pine. Just in case you're not aware
> the default format for Pine/Alpine is MBX, an extended version of
> MBOX. You can tell the difference because MBX mailboxes start with a
> dummy email that's hidden by the software.

It seems that if you save messages into a separate folder it does not
add the DUMMY information at the top. I believe this is why the system
was set up to use "mbox" and not "mbx". Does this sound correct?

> I'd be very wary about allowing any tool to modify an MBX file unless
> you know it's safe. Where locking is an issue, Mark Crispin recommends
> that they only be accessed via the c-client library.

This isn't the actual spool file, but a copy in the home directory.

Thanks,
Alex

Re: Re-running SA on an mbox

Posted by RW <rw...@googlemail.com>.

On Sun, 20 Sep 2009 21:15:14 -0400
MySQL Student <my...@gmail.com> wrote:

> Hi,
> 
> I have an mbox with about a 100 messages in it from a few days ago.
> The mbox is a combination of spam and ham. What is the best way to run
> SA through these messages again, so I can catch the ones that have
> URLs in them that weren't on the blacklist at the time they were
> received?

IIRC you previously mentioned using Pine. Just in case you're not aware
the default format for Pine/Alpine is MBX, an extended version of
MBOX. You can tell the difference because MBX mailboxes start with a
dummy email that's hidden by the software. 

I'd be very wary about allowing any tool to modify an MBX file unless
you know it's safe. Where locking is an issue, Mark Crispin recommends
that they only be accessed via the c-client library.

Re: Re-running SA on an mbox

Posted by Matt Kettler <mk...@verizon.net>.

MySQL Student wrote:
> Hi,
>
>   
>> Do you just want to re-scan the whole mbox and see what rules hit now
>> for research reasons?
>>     
>
> That's a good start, but I'd like to see if I can break out the ham to
> train bayes.
>
>   
>> There's no way to (directly) get SA to modify email that's already in an
>> mbox file. The mass-check and sa-learn tools can read them, but nothing
>> in SA can write to that. However, there might be a utility out there to
>> do this (although I'm not aware of any)..
>>     
>
> Yeah, that's kind of what I thought. Maybe a program that can split
> each message back into an individual file? Would procmail even help
> here? Or even a simple shell script that looks for '^From ', redirects
> it to a file, runs spamassassin -d on it, then re-runs SA on each
> file? I could then concatenate each of them back together and pass it
> through sa-learn.
>   

That sounds like a good plan.

If you google around for "mbox split" or "mbox splitter" you can find
some sample code out there that does it. It's all just simple code
looking for the "^From " boundary.
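
For illustration, a splitter of that kind can be sketched in a few lines of awk. This is not any particular published tool; the name split_mbox and the five-digit file names are made up, and it assumes a message starts at a "From " line that follows a blank line or opens the file.

```shell
# split_mbox MBOX OUTDIR (hypothetical name): write each message of MBOX to
# OUTDIR/00001, OUTDIR/00002, ...
split_mbox() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        BEGIN { prevblank = 1 }
        # assumed boundary: "From " at the start of a line, after a blank line
        /^From / && prevblank {
            if (out != "") close(out)
            out = sprintf("%s/%05d", dir, ++n)
        }
        # anything before the first boundary is dropped
        out != "" { print > out }
        { prevblank = ($0 == "") }
    ' "$1"
}
```

Usage would be something like "split_mbox saved.mbox msgs". Each output file keeps its "From " line, so the pieces can be concatenated back into an mbox afterwards.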

Re: Re-running SA on an mbox

Posted by Matt Kettler <mk...@verizon.net>.

Theo Van Dinter wrote:
> You probably want "spamassassin --mbox". :)
> It won't modify the messages in-place, but you can do something like
> "spamassassin --mbox infile > outfile".
>
> If you're talking about sa-learn, though, it also knows --mbox.
>   
Yes, but he's got mixed spam and nonspam in one mbox. You've got to
split that before you can feed sa-learn.
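
One way to do that split, sketched under a few assumptions: the messages are already one per file and have been re-scanned, and the verdict is in a default-style "X-Spam-Status: Yes, ..." or "X-Spam-Status: No, ..." header. The name sort_by_status and the spam/ham subdirectories are made up for illustration.

```shell
# sort_by_status SRCDIR (hypothetical name): copy each re-scanned message in
# SRCDIR into SRCDIR/spam or SRCDIR/ham according to its X-Spam-Status header.
sort_by_status() {
    src=$1
    mkdir -p "$src/spam" "$src/ham" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        # look at the header only: stop at the first blank line
        verdict=$(awk '
            /^$/                  { exit }
            /^X-Spam-Status: Yes/ { print "spam"; exit }
            /^X-Spam-Status: No/  { print "ham";  exit }
        ' "$f")
        case $verdict in
            spam) cp "$f" "$src/spam/" ;;
            ham)  cp "$f" "$src/ham/" ;;
            *)    echo "no verdict: $f" >&2 ;;
        esac
    done
}
```

After checking the two directories by hand, they could be fed to "sa-learn --spam msgs/spam" and "sa-learn --ham msgs/ham". Training on SA's own verdict without a manual check would only reinforce whatever it already gets wrong.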

>
> On Sun, Sep 20, 2009 at 9:46 PM, MySQL Student <my...@gmail.com> wrote:
>   
>> Yeah, that's kind of what I thought. Maybe a program that can split
>> each message back into an individual file? Would procmail even help
>> here? Or even a simple shell script that looks for '^From ', redirects
>> it to a file, runs spamassassin -d on it, then re-runs SA on each
>> file? I could then concatenate each of them back together and pass it
>> through sa-learn.
>>     
>
>
>

Re: Re-running SA on an mbox

Posted by Mark Martinec <Ma...@ijs.si>.

On Tuesday September 22 2009 06:32:12 Benny Pedersen wrote:
> On man 21 sep 2009 20:33:57 CEST, MySQL Student wrote
> >> but this will invalidtate dkim headers if this headers
> >> is signed, are spamassassin aware of this problem ? (in general)
> >
> > Are you saying there is a bug?
> 
> partly yes, its not a bug as long you keep the orginal email
> but spamassassin --mbox < infile > outfile invalidate dkim signed mails
> no ?

It is neither common nor wise to have X-Spam-* header fields included in a
DKIM signature. Neither amavisd, dkim-milter/OpenDKIM, nor dkimproxy
would do it without special effort. I wouldn't expect stripping of
X-Spam-* header fields to be problematic in view of invalidating signatures.

What can be detrimental to signatures is modifications to existing
header fields like From or Subject by inserting 'tags' like **SPAM**.
Whether this matters or not depends on what will happen next with
such mail.

  Mark

Re: Re-running SA on an mbox

Posted by Benny Pedersen <me...@junc.org>.

On man 21 sep 2009 20:33:57 CEST, MySQL Student wrote
>> but this will invalidtate dkim headers if this headers
>> is signed, are spamassassin aware of this problem ? (in general)
> Are you saying there is a bug?

partly yes, it's not a bug as long as you keep the original email

but spamassassin --mbox < infile > outfile invalidates dkim signed mails

no ?

>> mutt -f mbox
>> in mutt save to another folder if missclassified
> Yes, I use pine for that, but would like to eliminate as many of
> the FNs as possible, particularly ones that I can't determine visually.

can pine sort mails by header contents ?

if yes it will be less manual work for you

-- 
xpoint

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

> but this will invalidtate dkim headers if this headers is signed, are
> spamassassin aware of this problem ? (in general)

Are you saying there is a bug?

> mutt -f mbox
>
> in mutt save to another folder if missclassified

Yes, I use pine for that, but would like to eliminate as many of the
FNs as possible, particularly ones that I can't determine visually.

Thanks,
Dave

Re: Re-running SA on an mbox

Posted by Benny Pedersen <me...@junc.org>.

On man 21 sep 2009 04:47:23 CEST, MySQL Student wrote

> Wait, my mistake. I read that too fast. Does that work, and rewrite
> the X-Spam-Status header?

imho spamassassin always removes its own known headers and adds its own
only once, so yes, the trick is to retest, where you will see if
it's still listed in an rbl :)

but this will invalidate dkim headers if these headers are signed, is
spamassassin aware of this problem ? (in general)

> Guess I could find out for myself, but it just contradicts my
> experience and info I've learned previously.

mutt -f mbox

in mutt save to another folder if misclassified

-- 
xpoint

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

>> You probably want "spamassassin --mbox". :)
>> It won't modify the messages in-place, but you can do something like
>> "spamassassin --mbox infile > outfile".
>
> My apologies if it wasn't clear, but these messages have already been

Wait, my mistake. I read that too fast. Does that work, and rewrite
the X-Spam-Status header?

Guess I could find out for myself, but it just contradicts my
experience and info I've learned previously.

Thanks again,
Alex

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

>> Thank you all for your help. The "mbox split" suggestion is a good
>> one. I'll follow that route and post my experience later.
>
> formail -s is the way to go.

I thought about that as a component of procmail. Sounds great.

Thanks,
Alex

Re: Re-running SA on an mbox

Posted by LuKreme <kr...@kreme.com>.

On Sep 20, 2009, at 20:45, MySQL Student <my...@gmail.com> wrote:
> Thank you all for your help. The "mbox split" suggestion is a good
> one. I'll follow that route and post my experience later.

formail -s is the way to go.

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

> You probably want "spamassassin --mbox". :)
> It won't modify the messages in-place, but you can do something like
> "spamassassin --mbox infile > outfile".

My apologies if it wasn't clear, but these messages have already been
marked by SA. Some are ham, and the rest are FNs that I'd like to
re-run through SA, in hopes of it now properly detecting them as spam.

Thank you all for your help. The "mbox split" suggestion is a good
one. I'll follow that route and post my experience later.

Thanks again,
Alex

Re: Re-running SA on an mbox

Posted by Theo Van Dinter <fe...@apache.org>.

You probably want "spamassassin --mbox". :)
It won't modify the messages in-place, but you can do something like
"spamassassin --mbox infile > outfile".

If you're talking about sa-learn, though, it also knows --mbox.
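
To see what the re-scan decided without opening every message, the output mbox can be summarised with a little awk. This is only a sketch: the name status_report is made up, it assumes default-style "X-Spam-Status: Yes, ..." / "No, ..." headers, and it assumes a new message starts at a "From " line that follows a blank line.

```shell
# status_report (hypothetical name): read a re-scanned mbox on stdin and print
# one "verdict<TAB>subject" line per message.
status_report() {
    awk '
        BEGIN { prevblank = 1 }
        # assumed boundary: "From " at the start of a line, after a blank line
        /^From / && prevblank       { inhdr = 1; status = "unknown"; subject = "" }
        # second word of the field is "Yes," or "No,"; strip the comma
        inhdr && /^X-Spam-Status: / { status = $2; sub(/,$/, "", status) }
        inhdr && /^Subject: /       { subject = substr($0, 10) }
        # end of header: report this message
        inhdr && /^$/               { print status "\t" subject; inhdr = 0 }
        { prevblank = ($0 == "") }
    '
}
```

Usage would be something like "spamassassin --mbox infile > outfile" followed by "status_report < outfile". Messages that carry no X-Spam-Status header show up as "unknown".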

On Sun, Sep 20, 2009 at 9:46 PM, MySQL Student <my...@gmail.com> wrote:
> Yeah, that's kind of what I thought. Maybe a program that can split
> each message back into an individual file? Would procmail even help
> here? Or even a simple shell script that looks for '^From ', redirects
> it to a file, runs spamassassin -d on it, then re-runs SA on each
> file? I could then concatenate each of them back together and pass it
> through sa-learn.

Re: Re-running SA on an mbox

Posted by MySQL Student <my...@gmail.com>.

Hi,

> Do you just want to re-scan the whole mbox and see what rules hit now
> for research reasons?

That's a good start, but I'd like to see if I can break out the ham to
train bayes.

> There's no way to (directly) get SA to modify email that's already in an
> mbox file. The mass-check and sa-learn tools can read them, but nothing
> in SA can write to that. However, there might be a utility out there to
> do this (although I'm not aware of any)..

Yeah, that's kind of what I thought. Maybe a program that can split
each message back into an individual file? Would procmail even help
here? Or even a simple shell script that looks for '^From ', redirects
it to a file, runs spamassassin -d on it, then re-runs SA on each
file? I could then concatenate each of them back together and pass it
through sa-learn.

Thanks,
Alex

Re: Re-running SA on an mbox

Posted by Matt Kettler <mk...@verizon.net>.

MySQL Student wrote:
> Hi,
>
> I have an mbox with about a 100 messages in it from a few days ago.
> The mbox is a combination of spam and ham. What is the best way to run
> SA through these messages again, so I can catch the ones that have
> URLs in them that weren't on the blacklist at the time they were
> received?
>
> Must I break them all apart to do this, or can SA somehow parse the
> whole mbox? If not, what program do you suggest I use to accomplish
> this?
>   
Do you just want to re-scan the whole mbox and see what rules hit now
for research reasons?

You could probably abuse the mass-check tool for that purpose:

http://svn.apache.org/repos/asf/spamassassin/branches/3.2/masses/

It's normally used to generate logs we feed into the score generation
process, but it can be run on a single mbox.

The downside is that all it does is generate a report, one line per message,
with a list of hits.

There's no way to (directly) get SA to modify email that's already in an
mbox file. The mass-check and sa-learn tools can read them, but nothing
in SA can write to that. However, there might be a utility out there to
do this (although I'm not aware of any).