Posted to users@spamassassin.apache.org by Kris Deugau <kd...@webhart.net> on 2004/02/27 23:26:39 UTC

AWL bloat-reducer

I've been having user quota problems due to AWL bloat in a growing
number of accounts.  Most customers' AWL files include a *long* list of
one-off spam addresses which *SIGNIFICANTLY* increase disk usage.

I finally got disgusted with this, and hacked check_whitelist into
trim_whitelist.  It makes a backup copy of the "old" AWL db, creates a
fresh db and copies only those addresses that have a count greater than
1 from old to new.  It then moves the new db over the old one and makes
sure ownership of the new db is correct if running as root.  I didn't
want to autodelete the old db in case something broke.
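
The steps above can be sketched roughly as follows.  This is not the
trim_whitelist code itself, just a minimal Python illustration.  It
assumes (not stated in the post) that each address has a count entry
stored under the address key and a score entry under the same key with
"|totscore" appended.  The names trim_entries, replace_with_backup and
write_new_db are hypothetical, and a plain dict stands in for the db.

```python
import os
import shutil
import tempfile


def trim_entries(old, min_count=2):
    """Return a new dict with only the addresses seen at least min_count times.

    Assumed layout (an assumption, not confirmed by the post):
    old[addr] is the message count, old[addr + "|totscore"] the score total.
    """
    new = {}
    for key, value in old.items():
        if key.endswith("|totscore"):
            continue  # copied together with its count entry below
        if int(value) >= min_count:
            new[key] = value
            score_key = key + "|totscore"
            if score_key in old:
                new[score_key] = old[score_key]
    return new


def replace_with_backup(db_path, write_new_db):
    """Back up db_path, build a replacement beside it, then swap it in.

    write_new_db(path) is a hypothetical callback that writes the trimmed db.
    """
    backup_path = db_path + ".bak"
    shutil.copy2(db_path, backup_path)   # the old db is kept, never auto-deleted
    st = os.stat(db_path)                # remember the original owner
    new_path = db_path + ".new"
    write_new_db(new_path)
    os.replace(new_path, db_path)        # move the new db over the old one
    if hasattr(os, "geteuid") and os.geteuid() == 0:
        os.chown(db_path, st.st_uid, st.st_gid)  # fix ownership when run as root
    return backup_path
```

With the default min_count=2 this keeps "count greater than 1";
raising it would be one way to expose the count cutoff as an option.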

At the moment, it only understands AWL files in "Berkeley DB (Hash,
version 5, native byte-order)" format (or any other file-based hash with
files that end with .db), but it could probably be expanded to
understand others without too much trouble;  and could probably accept
other options to control which addresses it discards (e.g., anything with
a *really* high AWL entry likely doesn't need to be kept; choose the
count cutoff, etc.).  It could also be adapted to upgrade AWL dbs as
necessary.

Size reduction varied a LOT;  I checked it on a number of users whose
AWL db has grown to over 8M.  Typical reduction was ~8:1, with a few
dropping to ~300K (~27:1).  Smaller dbs showed even more drastic
reductions;  one went from 4500K to 86K (!!!).  Given that I have this
server set up for per-user AWLs, and a 20M per-user quota on the home
directory, this is pretty significant.  (I've had to move quite a few
users' SA directories into another partition, and symlink them back in
order to allow them 20M of "non-inbox" email folder space.)
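
For reference, the ratios quoted above can be checked with a few lines
of Python.  This assumes 1M = 1024K and treats "over 8M" as exactly 8M,
so the first figure is a lower bound; reduction_ratio is just an
illustrative helper name.

```python
def reduction_ratio(before_kb, after_kb):
    """Size reduction expressed as before:after."""
    return before_kb / after_kb


# 8M db trimmed to ~300K
print(round(reduction_ratio(8 * 1024, 300), 1))   # 27.3
# 4500K db trimmed to 86K
print(round(reduction_ratio(4500, 86), 1))        # 52.3
# share of a 20M quota used by an 8M AWL db, in percent
print(round(100 * 8 / 20))                        # 40
```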

If you or your users are running short on disk space due to ballooning
AWL files (in total, or within the system quota), you may want to play
with this.

Download at http://www.deepnet.cx/~kdeugau/spamtools/trim_whitelist

-kgd
-- 
"Sendmail administration is not black magic.  There are legitimate
technical reasons why it requires the sacrificing of a live chicken."
   - Unknown

Re: AWL bloat-reducer

Posted by Daniel Quinlan <qu...@pathname.com>.
Justin Mason <jm...@jmason.org> writes:

> BTW, I'm considering maybe we should have a command for running
> periodic expire tasks for Bayes and AWL, and other long-running modes
> of operation; this would:

The problem with this is that it goes against the goal of usability.

cron jobs?!?  The only reason we have cron jobs is because we're
software developers, system administrators, etc.  Think like a user who
might struggle through setting up a .procmailrc file.

>   (a) do bayes expires, if needed
>   (b) do AWL expires if needed
>   (c) other long-runtime tasks that may be suited to "offline" generation,
>       e.g. generating trusted_networks caches from a Bayes db dump
>       or similar
>   (d) possibly downloading frequently-updated data from a central
>       server if needed for future rules
>
> something like "sa-cron".

sa-update

At most, I might be able to live with a once-a-month type of program.
Anything that happens more often should not require a separate program,
I think.  No separate program would be better.

> Right now, we just suggest that large-scale bayes users can run
> "sa-learn --rebuild" from cron; strikes me that there'll be other jobs
> that may need that treatment too.

That's suboptimal too.

> Or should we just have some kind of inference code to do that stuff from
> the engine automatically, like we currently have for bayes?

Isn't there some way we can do the work in smaller amounts?  Argh.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: AWL bloat-reducer

Posted by Daniel Quinlan <qu...@pathname.com>.
Kris Deugau <kd...@webhart.net> writes:

> I've been having user quota problems due to AWL bloat in a growing
> number of accounts.  Most customers' AWL files include a *long* list of
> one-off spam addresses which *SIGNIFICANTLY* increase disk usage.

Definitely!
 
> I finally got disgusted with this, and hacked check_whitelist into
> trim_whitelist.  It makes a backup copy of the "old" AWL db, creates a
> fresh db and copies only those addresses that have a count greater than
> 1 from old to new.  It then moves the new db over the old one and makes
> sure ownership of the new db is correct if running as root.  I didn't
> want to autodelete the old db in case something broke.

Makes sense to me.
 
> At the moment, it only understands AWL files in "Berkeley DB (Hash,
> version 5, native byte-order)" format (or any other file-based hash with
> files that end with .db), but it could probably be expanded to
> understand others without too much trouble;  and could probably accept
> other options to control which addresses it discards (e.g., anything with
> a *really* high AWL entry likely doesn't need to be kept; choose the
> count cutoff, etc.).  It could also be adapted to upgrade AWL dbs as
> necessary.

Why not keep really high AWL entries?  It can't hurt.
 
> Size reduction varied a LOT;  I checked it on a number of users whose
> AWL db has grown to over 8M.  Typical reduction was ~8:1, with a few
> dropping to ~300K (~27:1).  Smaller dbs showed even more drastic
> reductions;  one went from 4500K to 86K (!!!).  Given that I have this
> server set up for per-user AWLs, and a 20M per-user quota on the home
> directory, this is pretty significant.  (I've had to move quite a few
> users' SA directories into another partition, and symlink them back in
> order to allow them 20M of "non-inbox" email folder space.)
> 
> If you or your users are running short on disk space due to ballooning
> AWL files, (in total, or within the system quota) you may want to play
> with this.
> 
> Download at http://www.deepnet.cx/~kdeugau/spamtools/trim_whitelist

Sounds like an initial version of what has been proposed in this bug:

  http://bugzilla.spamassassin.org/show_bug.cgi?id=3082

A separate program seems like the way to go, but I am very hesitant about
adding new commands/options to handle expiry rather than just doing it
all automatically behind the scenes.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: AWL bloat-reducer

Posted by Kris Deugau <kd...@webhart.net>.
Rob Mangiafico wrote:
> Just wanted to say thanks for this script! It works great,

So far.  I haven't yet set up any automated calls to it; I'm a little
too paranoid about losing customer AWL files.  <g>  (Although I haven't
heard any complaints yet.)  I've just used it in extreme cases where the
customer is getting close to the automated "You are almost out of disk
space for your spam folder" warning...

> can be run
> as root globally on all users,

Or by individual users;  although I've just realized a regular user
could attempt to run it on any .db file they have write access to.  :/

Not a *major* security problem, but it could be a nuisance.
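
One way to narrow that down would be an ownership check before touching
the file.  The sketch below is hypothetical (may_trim is not part of
trim_whitelist) and is written in Python purely for illustration.  It
only confirms that the target is a regular .db file owned by the
invoking user, or that the caller is root; it does not prove the file
is actually an AWL db.

```python
import os
import stat


def may_trim(db_path, uid=None):
    """Return True only for a regular .db file owned by uid (root may trim any)."""
    if uid is None:
        uid = os.getuid()        # POSIX only
    if not db_path.endswith(".db"):
        return False
    try:
        st = os.lstat(db_path)   # lstat: a symlinked file is reported as a link
    except OSError:
        return False             # missing or unreadable path
    if not stat.S_ISREG(st.st_mode):
        return False             # refuse symlinks, directories, devices
    return uid == 0 or st.st_uid == uid
```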

> Disk space saved was quite large, and it helps
> our users' quotas as well.

Which is why I wrote it.  <g>

-kgd
-- 
"Sendmail administration is not black magic.  There are legitimate
technical reasons why it requires the sacrificing of a live chicken."
   - Unknown

Re: AWL bloat-reducer

Posted by Rob Mangiafico <rm...@lexiconn.com>.
On Fri, 27 Feb 2004, Kris Deugau wrote:

> I've been having user quota problems due to AWL bloat in a growing
> number of accounts.  Most customers' AWL files include a *long* list of
> one-off spam addresses which *SIGNIFICANTLY* increase disk usage.
> ...
> I finally got disgusted with this, and hacked check_whitelist into
> trim_whitelist.  It makes a backup copy of the "old" AWL db, creates a
> fresh db and copies only those addresses that have a count greater than
> 1 from old to new.  It then moves the new db over the old one and makes
> sure ownership of the new db is correct if running as root.  I didn't
> want to autodelete the old db in case something broke.
> 
> Download at http://www.deepnet.cx/~kdeugau/spamtools/trim_whitelist

Just wanted to say thanks for this script! It works great, can be run as 
root globally on all users, and we even scripted it to run on multiple 
servers from one command. Disk space saved was quite large, and it helps 
our users' quotas as well.

Rob M.