Posted to users@spamassassin.apache.org by Martin Gregorie <ma...@gregorie.org> on 2008/02/01 00:54:21 UTC

Re: Bulk spam scan

> > spamassassin --mbox <mbox >scanned.mbox
> 
> No, SA doesn't know how to split up messages for scanning;  sa-learn
> is the only SA component that can extract messages from an mbox mail
> folder.
> 
In that case, what does the --mbox option do? Not what I expected,
evidently.

> If I accidentally mangled my own personal mail flow such that
> everything got put in my system inbox, for instance, I might just move
> my system mailbox file from /var/spool/mail to ~/spammy-inbox, and
> run:
> 
> $ formail -s procmail -m ~/.procmailrc < ~/spammy-inbox
> 
No accident: I've been collecting all inbound and outbound mail with an
"always_bcc" Postfix directive that pushes it through a procmail recipe
and shell script that stores it in a set of mbox files and switches
files when they get near the mbox size limit defined in Postfix.

Meanwhile I've built a proper archive system with a loader that can
extract mail from mbox files, split it up and index the messages. 

I'm pretty certain that some of the mbox files precede me installing SA,
so I'd like to push them through SA before pushing them through the
archive loader and, hopefully, end up with a similar spam-scanned set of
mbox files. 

> (I'd move the mailbox out of /var/spool/mail so I didn't keep
> appending old messages to the end of it over and over;  some mail
> *does* get delivered there.)
> 
Yes, that makes sense. Thanks for the formail tip. I can build a script
round that to do my scan and refiling job.
 
> Hmm.  I'm pretty sure it's pointed out in several places that SA does
> not know how to process more than one message per call, but I've been
> using it long enough that I just know that's how it works.  <g>
> 
I'd got that message for SA's normal operation and have looked at the
innards of spamc closely enough to see that it can only handle a single
message at a time. As I said above, it was the --mbox option that
confused me because, in general, an mbox file contains multiple
messages.

Given that I'm running spamc + spamd, I have two final questions:

- would it be better to use spamc/spamd for the scan in place of
  SpamAssassin?

- if spamd is the way to go, do I need to stop my normal mail
  system while the scan is running or will spamd keep the two
  streams separate? I assume it does, but it's always good to check.

TIA,
Martin



Re: Bulk spam scan

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jan 31, 2008 at 07:22:48PM -0500, Matt Kettler wrote:
> Ok, open mouth, insert foot.. there *IS* a --mbox option to spamassassin 
> in the 3.2 branch. I'm not sure if it will output in mbox format.. you 
> can give it a shot and see if the new mailbox file works..

Yes, yes it will.  The history is that originally "spamassassin" only
dealt with one mail at a time.  I thought this was pretty stupid, since
we have the ArchiveIterator (AI) module which handles file, mbox, mbx,
and dir formats.  sa-learn uses AI, but spamassassin did not.

So I fixed it.  In 2004.  :)

-- 
Randomly Selected Tagline:
'cd /usr/lib/X11 | more'            - Bill

Re: Bulk spam scan

Posted by Matt Kettler <mk...@verizon.net>.
Matt Kettler wrote:
> Martin Gregorie wrote:
>>>> spamassassin --mbox <mbox >scanned.mbox
>>>>       
>>> No, SA doesn't know how to split up messages for scanning;  sa-learn
>>> is the only SA component that can extract messages from an mbox mail
>>> folder.
>>>
>>>     
>> In that case, what does the --mbox option do? Not what I expected,
>> evidently.
>>   
> There is no --mbox option for spamassassin. Period.
>
> There *IS* a --mbox option for sa-learn, but that's ONLY sa-learn. It 
> works as you expect, but cannot be applied to spamassassin or spamc.
>
>
Ok, open mouth, insert foot.. there *IS* a --mbox option to spamassassin 
in the 3.2 branch. I'm not sure if it will output in mbox format.. you 
can give it a shot and see if the new mailbox file works..

However, it might just come out as a string of rfc-822 messages 
concatenated together..



Re: Bulk spam scan

Posted by Matt Kettler <mk...@verizon.net>.
Martin Gregorie wrote:
>>> spamassassin --mbox <mbox >scanned.mbox
>>>       
>> No, SA doesn't know how to split up messages for scanning;  sa-learn
>> is the only SA component that can extract messages from an mbox mail
>> folder.
>>
>>     
> In that case, what does the --mbox option do? Not what I expected,
> evidently.
>   
There is no --mbox option for spamassassin. Period.

There *IS* a --mbox option for sa-learn, but that's ONLY sa-learn. It 
works as you expect, but cannot be applied to spamassassin or spamc.


Re: Bulk spam scan

Posted by Martin Gregorie <ma...@gregorie.org>.
> Ummmm.  What do you mean by "keep the two streams separate"?  SA 
> processes what's handed to it, one message at a time;  the only reason I 
> could see trying to separate things is if running your archive through 
> SA would bog down the server enough to impact regular mail flow.  The 
> processed message goes back where it came from.
> 
That was a fairly dumb question on my part - it was getting late when I
asked it.

I've run a short (12 message) test file through spamassassin using the
--mbox option and capturing the output in another file. My
JavaMail+mstor application read the output OK. I'm currently putting my
very large, torture test message collection through spamassassin and
will write a fuller description when it finishes and the output has been
fed into my application. 
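
One way to sanity-check a run like this is to compare the message count of the input and output files and see how many messages were tagged. The sketch below is only an illustration in Python, not something taken from the thread. It assumes SpamAssassin's default markup, where a message judged to be spam carries an "X-Spam-Flag: YES" header, and the function name summarize_mbox is made up for the example.

```python
import mailbox


def summarize_mbox(path):
    """Return (total messages, messages whose X-Spam-Flag header is YES)."""
    # create=False: fail instead of silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        total = 0
        flagged = 0
        for message in box:
            total += 1
            # Assumption: the default SpamAssassin header is in use.
            flag = str(message.get("X-Spam-Flag", ""))
            if flag.strip().upper() == "YES":
                flagged += 1
        return total, flagged
    finally:
        box.close()
```

Calling summarize_mbox on both the original and the scanned file should give the same total if nothing was dropped or merged during the scan.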

Thanks for your help.


Martin





Re: Bulk spam scan

Posted by Kris Deugau <kd...@vianet.ca>.
Martin Gregorie wrote:
> I'd got that message for SA's normal operation and have looked at the
> innards of spamc closely enough to see that it can only handle a single
> message at a time. As I said above, it was the --mbox option that
> confused me because, in general, an mbox file contains multiple
> messages.

I have a feeling that's a leftover in the --help output from *way* back;
SA hasn't supported single-pass direct processing of multiple messages
like that for as long as I can recall.

> Given that I'm running spamc + spamd, I have two final questions:
> 
> - would it be better to use spamc/spamd for the scan in place of
>   SpamAssassin?

Probably better to call spamc, if only to speed up your processing.  For 
your usage the startup cost for calling spamassassin vs spamc isn't 
critical unless you're looking to finish the task as quickly as possible.
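
A minimal sketch of that approach, assuming Python's standard mailbox module does the splitting and that spamc is on the PATH with spamd running. The function name filter_mbox and its command parameter are invented for illustration. It starts one client process per message and appends whatever the client writes to stdout to a new mbox file. Note that the "From " separator lines are regenerated by the mailbox module rather than copied from the source file.

```python
import mailbox
import subprocess


def filter_mbox(src_path, dst_path, command=("spamc",)):
    """Pipe each message in src_path through command; append output to dst_path.

    Returns the number of messages processed.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # created if it does not exist yet
    count = 0
    try:
        for message in src:
            # One client call per message: the message goes in on stdin
            # and the (possibly rewritten) message comes back on stdout.
            result = subprocess.run(
                list(command),
                input=message.as_bytes(),
                stdout=subprocess.PIPE,
                check=True,
            )
            dst.add(result.stdout)
            count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

Something like filter_mbox("archive.mbox", "scanned.mbox") would then do the scan; passing a different command (for example one that runs spamassassin instead) only changes which program each message is piped through.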

> - if spamd is the way to go, do I need to stop my normal mail
>   system while the scan is running or will spamd keep the two
>   streams separate? I assume it does, but it's always good to check.

Ummmm.  What do you mean by "keep the two streams separate"?  SA 
processes what's handed to it, one message at a time;  the only reason I 
could see trying to separate things is if running your archive through 
SA would bog down the server enough to impact regular mail flow.  The 
processed message goes back where it came from.

For instance, I run two machines that call SA on final delivery via 
procmail.  Quite often there will be more than one message being run 
through SA at any given time;  spamd wouldn't be much use in an ISP 
environment if it couldn't handle this.

-kgd

Re: Bulk spam scan

Posted by mouss <mo...@netoyen.net>.
Martin Gregorie wrote:
>>> spamassassin --mbox <mbox >scanned.mbox
>>>       
>> No, SA doesn't know how to split up messages for scanning;  sa-learn
>> is the only SA component that can extract messages from an mbox mail
>> folder.
>>
>>     
> In that case, what does the --mbox option do? Not what I expected,
> evidently.
>   

it tells spamassassin that the mail is stored in mbox format.

from the man page:
       ... and files are assumed to be in file format, with a single
       message per file.

>   
>> If I accidentally mangled my own personal mail flow such that
>> everything got put in my system inbox, for instance, I might just move
>> my system mailbox file from /var/spool/mail to ~/spammy-inbox, and
>> run:
>>
>> $ formail -s procmail -m ~/.procmailrc < ~/spammy-inbox
>>
>>     
> No accident: I've been collecting all inbound and outbound mail with an
> "always_bcc" Postfix directive that pushes it through a procmail recipe
> and shell script that stores it in a set of mbox files and switches
> files when they get near the mbox size limit defined in Postfix.
>   

why not deliver these dups to maildir instead of mbox?

> Meanwhile I've built a proper archive system with a loader that can
> extract mail from mbox files, split it up and index the messages. 
>
> I'm pretty certain that some of the mbox files precede me installing SA,
> so I'd like to push them through SA before pushing them through the
> archive loader and, hopefully, end up with a similar spam-scanned set of
> mbox files. 
>
>   
>> (I'd move the mailbox out of /var/spool/mail so I didn't keep
>> appending old messages to the end of it over and over;  some mail
>> *does* get delivered there.)
>>
>>     
> Yes, that makes sense. Thanks for the formail tip. I can build a script
> round that to do my scan and refiling job.
>  
>   
>> Hmm.  I'm pretty sure it's pointed out in several places that SA does
>> not know how to process more than one message per call, but I've been
>> using it long enough that I just know that's how it works.  <g>
>>
>>     

man spamassassin-run (excerpt above).
> I'd got that message for SA's normal operation and have looked at the
> innards of spamc closely enough to see that it can only handle a single
> message at a time. As I said above, it was the --mbox option that
> confused me because, in general, an mbox file contains multiple
> messages.
>
> Given that I'm running spamc + spamd, I have two final questions:
>
> - would it be better to use spamc/spamd for the scan in place of
>   SpamAssassin?
>   

yes. This way, SA code is loaded once (spamd).
> - if spamd is the way to go, do I need to stop my normal mail
>   system while the scan is running or will spamd keep the two
>   streams separate? I assume it does, but it's always good to check.
>   

sorry, I don't understand. someone else probably will...