You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Larry Starr <la...@fullcompass.com> on 2009/06/02 20:32:29 UTC

Question on add-to-blacklist

I have been using the AWL ( --add-addr-to-blacklist ) for some time, to bump 
new spam senders above the "Bayes-99" score.

My problem is that this feature seems, extreemly slow.

I'm now trying to use the "( --add-to-blacklist )" option and am finding that 
this is, equally, slow.

I'm running it as:
spamassassin  -d --progress --add-to-blacklist --mbox <mboxfile>

The <mboxfile> contains the messages whose senders I wish to blacklist, via 
AWL.

The process seems to take anywhere from 5 to 15 minutes, per address.

Can anyone offer a faster alternative?

Thanks,
-- 
Larry G. Starr - larrys@fullcompass.com or starrl@globaldialog.com
Software Engineer: Full Compass Systems LTD.
Phone: 608-831-7330 x 1347  FAX: 608-831-6330
===================================================================
There are only three sports: bullfighting, mountaineering and motor
racing, all the rest are merely games! - Ernest Hemmingway

Re: Question on add-to-blacklist

Posted by Larry Starr <la...@fullcompass.com>.
On Tuesday 02 June 2009, Adam Katz wrote:
> Larry Starr <la...@fullcompass.com> wrote:
> >> I have been using the AWL ( --add-addr-to-blacklist ) for some
> >> time, to bump new spam senders above the "Bayes-99" score.
>
> Theo Van Dinter responded:
> > Well, the first problem is that the AWL has no impact on Bayes.
> > They're totally independent.
> > Perhaps you want "sa-learn" ?
>
> I believe Larry, like myself, trains both Bayes and AWL (I also
> report, and my SpamCop account copies KnujOn since bug 6085 is yet to
> be addressed).  The AWL bump gives me a head start on that senders'
> future spams since they obviously beat SA in the past ... without that
> bump, AWL would actually be subtracting points!
>
> Note, I just started this practice, and the recent "83MB
> auto-whitelist?" thread's trim_whitelist script looks like it will
> help me tremendously.

You are correct, I use it as a temporary bump to score spam higher, which will 
come down if I receive any volume of HAM from the sender, or take care of 
itself as the sender is added to various DNSBL's.

The AWL is used for those messages that have been fed to Bayes, and still 
score below a threshold of 6.0 (an arbritrary number to optomize the process.


-- 
Larry G. Starr - larrys@fullcompass.com or starrl@globaldialog.com
Software Engineer: Full Compass Systems LTD.
Phone: 608-831-7330 x 1347  FAX: 608-831-6330
===================================================================
There are only three sports: bullfighting, mountaineering and motor
racing, all the rest are merely games! - Ernest Hemmingway

Re: Question on add-to-blacklist

Posted by Adam Katz <an...@khopis.com>.
Larry Starr <la...@fullcompass.com> wrote:
>> I have been using the AWL ( --add-addr-to-blacklist ) for some 
>> time, to bump new spam senders above the "Bayes-99" score.

Theo Van Dinter responded:
> Well, the first problem is that the AWL has no impact on Bayes. 
> They're totally independent.
> Perhaps you want "sa-learn" ?

I believe Larry, like myself, trains both Bayes and AWL (I also
report, and my SpamCop account copies KnujOn since bug 6085 is yet to
be addressed).  The AWL bump gives me a head start on that senders'
future spams since they obviously beat SA in the past ... without that
bump, AWL would actually be subtracting points!

Note, I just started this practice, and the recent "83MB
auto-whitelist?" thread's trim_whitelist script looks like it will
help me tremendously.

Re: Question on add-to-blacklist

Posted by Theo Van Dinter <fe...@apache.org>.
Well, the first problem is that the AWL has no impact on Bayes.
They're totally independent.
Perhaps you want "sa-learn" ?

On Tue, Jun 2, 2009 at 2:32 PM, Larry Starr <la...@fullcompass.com> wrote:
> I have been using the AWL ( --add-addr-to-blacklist ) for some time, to bump
> new spam senders above the "Bayes-99" score.
>
> My problem is that this feature seems, extreemly slow.
>
> I'm now trying to use the "( --add-to-blacklist )" option and am finding that
> this is, equally, slow.
>
> I'm running it as:
> spamassassin  -d --progress --add-to-blacklist --mbox <mboxfile>
>
> The <mboxfile> contains the messages whose senders I wish to blacklist, via
> AWL.
>
> The process seems to take anywhere from 5 to 15 minutes, per address.
>
> Can anyone offer a faster alternative?
>
> Thanks,
> --
> Larry G. Starr - larrys@fullcompass.com or starrl@globaldialog.com
> Software Engineer: Full Compass Systems LTD.
> Phone: 608-831-7330 x 1347  FAX: 608-831-6330
> ===================================================================
> There are only three sports: bullfighting, mountaineering and motor
> racing, all the rest are merely games! - Ernest Hemmingway
>

Re: Question on add-to-blacklist

Posted by Larry Starr <la...@fullcompass.com>.
Just wanted to pass along a thank you to those who helped out here and provide 
a few notes, on my experience, that may help anyone else that is looking at 
this.

By the way, converting to MySQL did alleviate the problems that I was seeing 
when attempting to apply updates to AWL ... processes that were taking hours 
are now taking seconds!

Along the way I discovered a few things:
1)  We have been running SA for several years and the AWL was far too large 
for any of the scripts, that I downloaded for conversion.
2)  The time to load the MySQL database, with per-row insert logic was 
daunting.

Note:  I made a few passes at this before hitting on a working solution.
    
I found it, much, faster, to unload the existing DB to a delimited file, and 
use "mysql load" to load the AWL from that file.
    
    What I wound up with was a scrit to:
    . Unload the DB (formatting a flat" delimited file):
      This was created from the "convert_awl_dbm_to_sql" script downloaded 
      from the spamassassin/tools.  Substituting output, to stdout, for the 
      mysql insert statement, and modifying the loop to use "each" rather than 
      attempting to build and array of "@k" as well as adding some trim 
      functionality:

        # my @k = grep(!/totscore$/,keys(%h));
        # for my $key (@k) {
        our $rowsread;
        our $rowsinsert;
        print stderr "Processing: $db \n";
        while(my ($key, $v) = each %h)
        {
          next if $key =~ /totscore$/; # Skip totscore entries 
          $rowsread++;
          next if ($v < 2);             # skip one off entries.
          .
          .
          .
          $rowsinsert++;
          print '"',$opt{'username'},'",','"',$email,'",','"',$ip,'",',
$count,',',$totscore,"\n";
          .
          .
          .

    . Drop and create the AWL table (the drop becuase of earlier experiments).
        drop table awl;
        CREATE TABLE awl (
          username varchar(100) NOT NULL default '',
          email varchar(200) NOT NULL default '',
          ip varchar(10) NOT NULL default '',
          count int(11) default '0',
          totscore float default '0',
          lastupdate timestamp(14) NOT NULL DEFAULT CURRENT_TIMESTAMP ON
                  UPDATE CURRENT_TIMESTAMP,
          PRIMARY KEY  (username,email,ip)
        ) TYPE=MyISAM;

    . load the MySQL database:
        LOAD DATA INFILE '/home/larrys/sa_tools/awl.load' 
            REPLACE 
            INTO TABLE awl 
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' 
            (username, email, ip, count, totscore) 
            SET lastupdate = CURRENT_TIMESTAMP;

Following this it was a simple matter of sliding the new "local.cf" into place 
and restarting processesses.

So far everything seems to be running well.

Thanks again
-- 
Larry G. Starr - larrys@fullcompass.com or starrl@globaldialog.com
Software Engineer: Full Compass Systems LTD.
Phone: 608-831-7330 x 1347  FAX: 608-831-6330
===================================================================
There are only three sports: bullfighting, mountaineering and motor
racing, all the rest are merely games! - Ernest Hemmingway

Re: Question on add-to-blacklist

Posted by Mike Cardwell <sp...@lists.grepular.com>.
LuKreme wrote:

>>  `ip` varchar(10) NOT NULL DEFAULT '',
> 
> 10?

I'm missing some of the context here, but usually if someone is storing 
an ip in 10 characters it's because they're storing the ip number rather 
than the ip address.

root@haven:~# perl -e 'print length(256*256*256*256)."\n";'
10
root@haven:~#

Still, if you were doing that, you'd want to use an integer rather than 
a varchar preferably.

-- 
Mike Cardwell - IT Consultant and LAMP developer
Cardwell IT Ltd. (UK Reg'd Company #06920226) http://cardwellit.com/

Re: Question on add-to-blacklist

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> On 3-Jun-2009, at 14:02, Jari Fredriksson wrote:
>>  `ip` varchar(10) NOT NULL DEFAULT '',

On 03.06.09 17:48, LuKreme wrote:
> 10?

7 could be enough for now, afaik AWL only stores /16 prefix...

PostgreSQL has a IPv4 type btw

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Honk if you love peace and quiet. 

Re: Question on add-to-blacklist

Posted by Jari Fredriksson <ja...@iki.fi>.
> On 3-Jun-2009, at 14:02, Jari Fredriksson wrote:
>>  `ip` varchar(10) NOT NULL DEFAULT '',
> 
> 
> 10?

It's on wiki

http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl?highlight=%28awl%29%7C%28sql%29



Re: Question on add-to-blacklist

Posted by LuKreme <kr...@kreme.com>.
On 3-Jun-2009, at 14:02, Jari Fredriksson wrote:
>  `ip` varchar(10) NOT NULL DEFAULT '',


10?

-- 
There is NO Rule six!


Re: Question on add-to-blacklist

Posted by Larry Starr <la...@fullcompass.com>.
On Wednesday 03 June 2009, Jari Fredriksson wrote:
> > On Tuesday 02 June 2009, Michael Scheidell wrote:
> > What "optional" fields are you refering to?
> >
> > I have seen this, on the spamassassin WIKI:
> >
> > CREATE TABLE awl (
> >  username varchar(100) NOT NULL default '',
> >  email varchar(200) NOT NULL default '',
> >  ip varchar(10) NOT NULL default '',
> >  count int(11) default '0',
> >  totscore float default '0',
> >  PRIMARY KEY  (username,email,ip)
> > ) TYPE=MyISAM;
> >
> > Is there a better reference?
>
>  CREATE TABLE `awl` (
>   `username` varchar(100) NOT NULL DEFAULT '',
>   `email` varchar(200) NOT NULL DEFAULT '',
>   `ip` varchar(10) NOT NULL DEFAULT '',
>   `count` int(11) DEFAULT '0',
>   `totscore` float DEFAULT '0',
>   `lastupdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
> CURRENT_TIMESTAMP, PRIMARY KEY (`username`,`email`,`ip`)
> ) ENGINE=InnoDB ;

Thanks, I can see that the 'lastupdate' field would prove handy over time.

I'm noticing that most of the scripts, that I've found for processing AWL 
data, including 'convert_awl_dbm_to_sql' will not run, for me, as written.

They seem to, invariably, use "tie %hash" and then use something like:

 my @k = grep(!/totscore$/,keys(%h));

to build a list of keys.   My existing AWL has something over 10Million 
records, and my system only contains 4G of RAM.

I've been rewriting the loops in the scripts that I'm trying to use from:
  my @k = grep(!/totscore$/,keys(%h));
  for my $key (@k) {

to:
  while(my ($key, $v) = each %h)

This may be, a bit slower, but scales better to large data volumes.

-- 
Larry G. Starr - larrys@fullcompass.com or starrl@globaldialog.com
Software Engineer: Full Compass Systems LTD.
Phone: 608-831-7330 x 1347  FAX: 608-831-6330
===================================================================
There are only three sports: bullfighting, mountaineering and motor
racing, all the rest are merely games! - Ernest Hemmingway

Re: Question on add-to-blacklist

Posted by Jari Fredriksson <ja...@iki.fi>.
> On Tuesday 02 June 2009, Michael Scheidell wrote:
> What "optional" fields are you refering to?
> 
> I have seen this, on the spamassassin WIKI:
> 
> CREATE TABLE awl (
>  username varchar(100) NOT NULL default '',
>  email varchar(200) NOT NULL default '',
>  ip varchar(10) NOT NULL default '',
>  count int(11) default '0',
>  totscore float default '0',
>  PRIMARY KEY  (username,email,ip)
> ) TYPE=MyISAM;
> 
> Is there a better reference?

 CREATE TABLE `awl` (
  `username` varchar(100) NOT NULL DEFAULT '',
  `email` varchar(200) NOT NULL DEFAULT '',
  `ip` varchar(10) NOT NULL DEFAULT '',
  `count` int(11) DEFAULT '0',
  `totscore` float DEFAULT '0',
  `lastupdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`username`,`email`,`ip`)
) ENGINE=InnoDB ;


Re: Question on add-to-blacklist

Posted by Larry Starr <la...@fullcompass.com>.
On Tuesday 02 June 2009, Michael Scheidell wrote:
> > I have been using the AWL ( --add-addr-to-blacklist ) for some time, to
> > bump new spam senders above the "Bayes-99" score.
> >
> > My problem is that this feature seems, extreemly slow.
> >
> > I'm now trying to use the "( --add-to-blacklist )" option and am finding
> > that this is, equally, slow.
> >
> > I'm running it as:
> > spamassassin  -d --progress --add-to-blacklist --mbox <mboxfile>
> >
> > The <mboxfile> contains the messages whose senders I wish to blacklist,
> > via AWL.
> >
> > The process seems to take anywhere from 5 to 15 minutes, per address.
>
> Most likely, you are still using the flat file (db2 , 3 or 4).  If you
> convert to MYSQL based file, and add in some of the optional fields and
> trim it weekly, it should be faster

Yes, I am still using the Flat file (default) DB.
I have been toying with setting it up in MySQL, but have not bitten that 
bullet yet.

What "optional" fields are you refering to?  

I have seen this, on the spamassassin WIKI:

CREATE TABLE awl (
  username varchar(100) NOT NULL default '',
  email varchar(200) NOT NULL default '',
  ip varchar(10) NOT NULL default '',
  count int(11) default '0',
  totscore float default '0',
  PRIMARY KEY  (username,email,ip)
) TYPE=MyISAM;

Is there a better reference?
-- 
Larry G. Starr - larrys@fullcompass.com or starrl@globaldialog.com
Software Engineer: Full Compass Systems LTD.
Phone: 608-831-7330 x 1347  FAX: 608-831-6330
===================================================================
There are only three sports: bullfighting, mountaineering and motor
racing, all the rest are merely games! - Ernest Hemmingway

Re: Question on add-to-blacklist

Posted by Michael Scheidell <sc...@secnap.net>.
> I have been using the AWL ( --add-addr-to-blacklist ) for some time, to bump
> new spam senders above the "Bayes-99" score.
> 
> My problem is that this feature seems, extreemly slow.
> 
> I'm now trying to use the "( --add-to-blacklist )" option and am finding that
> this is, equally, slow.
> 
> I'm running it as:
> spamassassin  -d --progress --add-to-blacklist --mbox <mboxfile>
> 
> The <mboxfile> contains the messages whose senders I wish to blacklist, via
> AWL.
> 
> The process seems to take anywhere from 5 to 15 minutes, per address.
> 
Most likely, you are still using the flat file (db2 , 3 or 4).  If you
convert to MYSQL based file, and add in some of the optional fields and trim
it weekly, it should be faster
-- 
Michael Scheidell, CTO
>|SECNAP Network Security
Finalist 2009 Network Products Guide Hot Companies
FreeBSD SpamAssassin Ports maintainer


_________________________________________________________________________
This email has been scanned and certified safe by SpammerTrap(r). 
For Information please see http://www.secnap.com/products/spammertrap/
_________________________________________________________________________