Posted to users@spamassassin.apache.org by Chris Hastie <li...@oak-wood.co.uk> on 2006/10/19 14:58:10 UTC

Spam Reporting - reducing the load

I have a number of spamtrap addresses that between them receive roughly 3000
to 6000 messages a day. Until recently I have used this mail simply to
populate a database of machines that have sent me spam in the last 48 hours,
which is used as part of a series of checks on incoming connections.

I've just decided to try to do something more with all these data, and add
reporting through SpamAssassin. It was quick and easy to add to the existing
script (a Perl script that mail is piped to).

    require Mail::SpamAssassin;

    # Build the SpamAssassin object used for reporting.
    my $spamtest = Mail::SpamAssassin->new({
        debug                => $sa_debug,
        dont_copy_prefs      => 1,
        home_dir_for_helpers => $helpers_home,
        stop_at_threshold    => 0,
        username             => $sa_user,
        userprefs_filename   => $sa_userprefs,
    });

    # Parse the message piped in on STDIN, then report it as spam.
    my $samail   = $spamtest->parse(\*STDIN);
    my $sastatus = $spamtest->report_as_spam($samail);

Trouble is, this is absolutely killing my server. Even with the MTA (Postfix)
configured to limit concurrent deliveries to the script to 2, it's eating all
the CPU and grinding things to a halt. Each report takes around 8 to 12
seconds, sometimes more.

I'm reporting using DCC (dccifd), Razor, SpamCop, and Bayes. Debug output shows
that there's an awful lot of parsing of the message going on. Is there a way to
avoid this? My guess is that the parsing is necessary for Bayes, but I can't
see why it's needed for the others. If I set

    bayes_learn_during_report 0

am I likely to see an improvement? Or would I be better off going back to first
principles and writing some non-SA based code to report to SpamCop, Razor and
DCC?
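
One way to answer the question empirically is to time a single report with and
without the setting. The following is only a minimal sketch: timed() is a
hypothetical helper (not part of Mail::SpamAssassin), and the workload is a
stand-in so the snippet runs without SpamAssassin installed. It reports wall
time and CPU time separately, since wall time also includes waiting on network
services.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run a code reference and return a hash ref holding its result together
# with the wall-clock and CPU seconds (user + system) it consumed.
# timed() is a hypothetical helper, not part of Mail::SpamAssassin.
sub timed {
    my ($code) = @_;
    my $wall_start = [gettimeofday];
    my ($u0, $s0)  = times;
    my $result     = $code->();
    my ($u1, $s1)  = times;
    return {
        wall   => tv_interval($wall_start),
        cpu    => ($u1 - $u0) + ($s1 - $s0),
        result => $result,
    };
}

# Stand-in workload so the sketch runs on its own.
my $t = timed(sub { my $n = 0; $n += $_ for 1 .. 1000; $n });
printf "wall %.3fs cpu %.3fs result %d\n", $t->{wall}, $t->{cpu}, $t->{result};
```

In the script above, the stand-in would be replaced by
sub { $spamtest->report_as_spam($samail) }, run once with the default
configuration and once with bayes_learn_during_report set to 0.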

Better still, has someone else done it? Is there some nice efficient fast code
out there for spamtraps?

Cheers

-- 
Chris Hastie

Re: Spam Reporting - reducing the load

Posted by Rich Puhek <rp...@etnsystems.com>.

Chris,

I've had decent luck with dropping my spamtrap email into a folder. A cron
job then runs the imap-sa-learn.pl script (I found the script somewhere,
probably on this list, then modified it slightly for my application). The
script connects to the IMAP spam trap folder, learns the spam, and moves it
to a spam archive folder (for reference, for building a corpus, and so that
I can unlearn any false positives). In my application, it also processes a
ham folder (I set up a shared IMAP folder system for training Bayes).

It wouldn't be difficult to also have the script run reporting.

The benefits are that you get to process the spam at a steady rate, at
whatever interval you choose (I run it as a cron job once every 20 minutes),
and you can offload it to whatever machine you want (I currently run it on
the primary SpamAssassin machine, which is separate from the MX servers).
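
The same decoupling can be sketched without IMAP, using a plain spool
directory in place of the IMAP folder. This is not imap-sa-learn.pl, only an
illustration of the idea: spool_message() and process_spool() are hypothetical
names, the delivery script would call the first, and a cron job would call the
second with a handler that does the actual learning or reporting.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempfile tempdir);

# Delivery side: write one message into the spool and return its final path.
# Writing under a temporary name and renaming afterwards means the batch job
# never picks up a half-written file.
sub spool_message {
    my ($spool_dir, $message) = @_;
    my ($fh, $tmp) = tempfile('tmp.XXXXXXXX', DIR => $spool_dir);
    print {$fh} $message or die "write $tmp: $!";
    close $fh or die "close $tmp: $!";
    my $name  = (File::Spec->splitpath($tmp))[2];
    my $final = File::Spec->catfile($spool_dir, "msg.$name");
    rename $tmp, $final or die "rename $tmp: $!";
    return $final;
}

# Cron side: pass at most $limit spooled messages, oldest first, to $handler.
# A file is deleted only when the handler returns true, so failures are
# retried on the next run. Returns the number of messages handled.
sub process_spool {
    my ($spool_dir, $limit, $handler) = @_;
    opendir my $dh, $spool_dir or die "opendir $spool_dir: $!";
    my @files = map  { File::Spec->catfile($spool_dir, $_) }
                grep { /\Amsg\./ } readdir $dh;
    closedir $dh;

    my %mtime;
    $mtime{$_} = (stat $_)[9] || 0 for @files;
    @files = sort { $mtime{$a} <=> $mtime{$b} || $a cmp $b } @files;
    splice @files, $limit if @files > $limit;

    my $done = 0;
    for my $file (@files) {
        open my $in, '<', $file or die "open $file: $!";
        my $message = do { local $/; <$in> };
        close $in;
        if ($handler->($message)) {
            unlink $file or warn "unlink $file: $!";
            $done++;
        }
    }
    return $done;
}

# Demonstration with a temporary spool and a handler that only counts bytes.
my $spool = tempdir(CLEANUP => 1);
spool_message($spool, "Subject: test $_\n\nbody\n") for 1 .. 5;
my $bytes   = 0;
my $handled = process_spool($spool, 3, sub { $bytes += length $_[0]; 1 });
print "handled $handled messages, $bytes bytes\n";
```

The per-run limit is what keeps the load steady: however fast spam arrives,
each cron run does a bounded amount of work. In a real setup the handler would
presumably parse the text and call report_as_spam; that wiring is an
assumption and is not shown here.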

Search the archives for the imap-sa-learn.pl script. If you don't find 
it, let me know, and I can post a link to my hacked script.

--Rich