Posted to users@spamassassin.apache.org by Tammy George <tm...@gmail.com> on 2007/07/19 02:30:25 UTC

huge auto-whitelist file etc

Hello.

Our Linux server is running SpamAssassin version 3.1.5.

Backups started dying with 'inactivity timeout'.  Dug around & found the
following:

drwx------   3 vscan  vscan            512 Jul 18 16:28 .
-rw-------   1 vscan  vscan  1099983372288 Jul 18 16:28 auto-whitelist
-rw-------   1 vscan  vscan     1205862400 Jul 18 16:28 bayes_seen
-rw-------   1 vscan  vscan       10846208 Jul 18 16:28 bayes_toks
-rw-------   1 vscan  vscan          18240 Jul 18 16:28 bayes_journal
drwxr-x---  12 vscan  vscan           1024 Jul 18 12:12 ..
-rw-------   1 vscan  vscan        2654208 Jan 26  2005 bayes_toks.expire42066
-rw-------   1 vscan  vscan         606208 Mar 30  2004 bayes_toks.expire93303
drwxr-xr-x   2 vscan  vscan            512 Jan 28  2004 old
-rw-r--r--   1 vscan  vscan           1165 Jan 27  2004 user_prefs

A du -k shows auto-whitelist as being 1747968.

Surprisingly, we aren't experiencing any problems other than the backups.
Our site handles A LOT of email.

After I send this email, I'm going to look into check_whitelist and
trim_whitelist (and probably sa-learn re: the bayes files), however, any
suggestions would be most appreciated!  Our sys admin is on vacation and
he's our expert.

Thanks in advance for any advice.

Re: huge auto-whitelist file etc

Posted by Matt Kettler <mk...@verizon.net>.
Tammy George wrote:
> Thanks for the responses.
>  
> Few questions - will running 'check_whitelist' affect our server's
> performance?
It really shouldn't. It's a pretty lightweight tool, and will only
remove "one off" entries by default.

If you're concerned, you can also execute it through the "nice" command,
which will lower its priority.
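
A minimal sketch of the "nice" suggestion above. The wrapper name run_low_priority is hypothetical, and the path is the placeholder used elsewhere in this thread. Note that nice only lowers CPU scheduling priority; it does not throttle disk I/O.

```shell
# Hypothetical wrapper: run any command at the lowest CPU priority.
run_low_priority() {
    # nice -n 19 adds the maximum niceness, so other processes
    # (such as mail scanning) are scheduled ahead of this one.
    nice -n 19 "$@"
}

# Intended use (placeholder path, as in the thread):
#   run_low_priority check_whitelist --clean /path/to/auto-whitelist
```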

> Do I risk creating other problems if I leave things as they are until
> our sys admin returns?  :)

Not likely. The large AWL file might slow things down a little, but if
your load average isn't really high I wouldn't worry.

Re: huge auto-whitelist file etc

Posted by Tammy George <tm...@gmail.com>.
Thanks for the responses.

Few questions - will running 'check_whitelist' affect our server's
performance?  Do I risk creating other problems if I leave things as they
are until our sys admin returns?  :)





Re: huge auto-whitelist file etc

Posted by Matt Kettler <mk...@verizon.net>.
Tammy George wrote:
> Hello.
>  
> Our Linux server is running SpamAssassin version 3.1.5. 
>  
> Backups started dying with 'inactivity timeout'.  Dug around & found
> the following:
>  
> drwx------   3 vscan  vscan            512 Jul 18 16:28 .
> -rw-------   1 vscan  vscan  1099983372288 Jul 18 16:28 auto-whitelist
> -rw-------   1 vscan  vscan     1205862400 Jul 18 16:28 bayes_seen
> -rw-------   1 vscan  vscan       10846208 Jul 18 16:28 bayes_toks
> -rw-------   1 vscan  vscan          18240 Jul 18 16:28 bayes_journal
> drwxr-x---  12 vscan  vscan           1024 Jul 18 12:12 ..
> -rw-------   1 vscan  vscan        2654208 Jan 26  2005 bayes_toks.expire42066
> -rw-------   1 vscan  vscan         606208 Mar 30  2004 bayes_toks.expire93303
> drwxr-xr-x   2 vscan  vscan            512 Jan 28  2004 old
> -rw-r--r--   1 vscan  vscan           1165 Jan 27  2004 user_prefs
>  
> A du -k shows auto-whitelist as being 1747968.
>  
> Surprisingly, we aren't experiencing any problems other than the
> backups.  Our site handles A LOT of email.
>  
> After I send this email, I'm going to look into check_whitelist and
> trim_whitelist (and probably sa-learn re: the bayes files), however,
> any suggestions would be most appreciated!  Our sys admin is on
> vacation and he's our expert.
For the auto-whitelist file you need to run this command:

    check_whitelist --clean /path/to/auto-whitelist

That said, IMHO, the AWL isn't really ready for production use on large
systems unless you're going to run it on SQL and use your own scripts to
do expiry.

The bayes_toks and bayes_journal files auto-expire, so you don't need to
do anything to them.

The bayes_seen file doesn't have any kind of date information, so it
can't auto-expire. However, you can remove the file reasonably safely.
This file is just a list of all the messages that have already been run
through sa-learn. The only drawback to deleting it is that it will allow
you to re-train a message that you've already learned. So if you
maintain a massive directory of files to be "relearned" but don't clean
it out, you might have a minor amount of over-learning (no big deal).



>  
> Thanks in advance for any advice.
>  


Re: huge auto-whitelist file etc

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Jul 18, 2007 at 09:30:25PM -0300, Tammy George wrote:
> -rw-------   1 vscan  vscan  1099983372288 Jul 18 16:28 auto-whitelist
> A du -k shows auto-whitelist as being 1747968.

Ah, the magic of sparse files. :)
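
To spell out the two numbers quoted above: ls -l reports the apparent size (1099983372288 bytes, roughly 1 TiB), while du -k reports the blocks actually allocated (1747968 KB, about 1.67 GiB). A backup tool that reads the file byte by byte may have to walk through the whole apparent size. Here is a rough sketch for checking a file yourself; the helper name sparse_report is made up, and it assumes GNU coreutils (stat -c) on Linux.

```shell
# Hypothetical helper: compare apparent size with allocated size.
# Assumes GNU stat (-c) and du (-k), as shipped on most Linux systems.
sparse_report() {
    file=$1
    apparent_bytes=$(stat -c %s "$file")        # the size ls -l shows
    allocated_kb=$(du -k "$file" | cut -f1)     # the size du -k shows
    allocated_bytes=$((allocated_kb * 1024))
    # Heuristic only: compression or inline storage of tiny files can
    # also make the allocated size smaller than the apparent size.
    if [ "$allocated_bytes" -lt "$apparent_bytes" ]; then
        echo "sparse apparent=$apparent_bytes allocated=$allocated_bytes"
    else
        echo "dense apparent=$apparent_bytes allocated=$allocated_bytes"
    fi
}

# Intended use (placeholder path, as in the thread):
#   sparse_report /path/to/auto-whitelist
```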

> After I send this email, I'm going to look into check_whitelist and
> trim_whitelist (and probably sa-learn re: the bayes files), however, any
> suggestions would be most appreciated!  Our sys admin is on vacation and
> he's our expert.

Removing entries from the DB will likely not decrease the size of the file;
that would likely require a "remove entries, db_dump, db_load" type thing.
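
A rough sketch of the db_dump / db_load step, offered as an illustration rather than a tested procedure. It assumes the AWL file is a Berkeley DB file that the installed db_dump can read, and that whatever is scanning mail has been stopped first. The function name rebuild_db and the DB_DUMP / DB_LOAD variables are hypothetical; they exist only because some distributions install the utilities under versioned names.

```shell
# Sketch: copy the live records of a Berkeley DB file into a new file.
# DB_DUMP / DB_LOAD are overridable; the defaults are assumptions.
DB_DUMP=${DB_DUMP:-db_dump}
DB_LOAD=${DB_LOAD:-db_load}

rebuild_db() {
    src=$1
    dst=$2
    # Never clobber an existing file; the caller swaps files in by hand.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 1
    fi
    # Dump every record as text and load it into a fresh file. Only live
    # records are written, so the new file has no holes or freed pages.
    # Caveat: the exit status is that of the load step only, so check the
    # result before replacing the original.
    "$DB_DUMP" "$src" | "$DB_LOAD" "$dst" >/dev/null
}

# Intended use, after removing entries with check_whitelist --clean:
#   rebuild_db /path/to/auto-whitelist /path/to/auto-whitelist.new
```

Keep the original file until the new one has been checked (for example by listing it with check_whitelist), then move the new file into place with the same owner and permissions.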
