Posted to users@spamassassin.apache.org by Chris Jett <ch...@jettfuel.net> on 2007/01/11 22:34:41 UTC

Huge File Size

I am seeing a problem where my bayes_seen and autowhitelist files are  
HUGE.  My bayes_seen is 2.05 GB and my autowhitelist file is 4.02  
GB.  Forcing an expiry on the database doesn't seem to do anything.   
What do I need to do?
--
Chris Jett
chris@jettfuel.net

Re: Huge File Size

Posted by Gary V <mr...@hotmail.com>.
>>I don't see why this method could not also be used for bayes_seen.
>>I was not aware bayes_seen would grow forever so I am going to implement 
>>this
>>on my own system next week.
>>
>>ALTER TABLE bayes_seen ADD lastupdate timestamp(14) NOT NULL;
>>
>>Then wait a few weeks before implementing:
>>
>>DELETE FROM bayes_seen WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 2 
>>MONTH);
>>
>>I am not that familiar with MySQL and Bayes however so I would appreciate 
>>it
>>if someone would point out potential problems with this.
>>
>
>OK, I do see one issue with bayes_seen. When a bayes_seen record is 
>created, the lastupdate field is set, but of course the time stamp does 
>not change when the record is simply read. So if you have the same message 
>getting learned every day (for example), cleaning bayes_seen on a regular 
>basis would not be a good idea. You could clean it up every four months or 
>so, however, by using the lastupdate field, but you would have to put up 
>with all the added lastupdate data.
>

I have to correct my correction. The issue is not how often the delete 
command is run, but rather how long the data is allowed to stay in the 
database. Maybe something like:
DELETE FROM bayes_seen WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 6 
MONTH);

This way all new bayes_seen records would stay in the database for 6 months, 
then get deleted.
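If you want that to run automatically, one option (my own addition, not 
something from the thread) is a MySQL scheduled event. This assumes MySQL 
5.1 or later with the event scheduler enabled; on older servers the same 
DELETE can simply be run from cron via the mysql client instead.

-- Sketch: run the 6-month cleanup once a week via the event scheduler.
-- Requires SET GLOBAL event_scheduler = ON; (MySQL 5.1+).
-- "expire_bayes_seen" is just a name picked for this example.
CREATE EVENT expire_bayes_seen
  ON SCHEDULE EVERY 1 WEEK
  DO
    DELETE FROM bayes_seen
    WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 6 MONTH);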

Gary V



Re: Huge File Size

Posted by Gary V <mr...@hotmail.com>.
>>I *think* you're in agreement with what I just said. Using last-accessed
>>time instead of hit-count makes substantially more sense.
>>
>
>By moving AWL to SQL this can be accomplished. Here is a sample for MySQL:
>Add a new field:
>ALTER TABLE awl ADD lastupdate timestamp(14) NOT NULL;
>
>If you have a small data set, optionally initialize existing records:
>UPDATE awl SET lastupdate = NOW( ) WHERE lastupdate < 1;
>
>NOTE: to avoid compounding the problem by adding all this extra lastupdate 
>data, if you have a large record set it would probably be better NOT to 
>initialize existing records and let only new records get time stamped. 
>Then be patient enough to wait a couple of weeks or so before deleting any 
>records (because the first command below should delete any records that 
>are not time stamped).
>
>then start daily or weekly maintenance:
>DELETE FROM awl WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 4 MONTH);
>DELETE FROM awl WHERE count = 1 AND lastupdate <= DATE_SUB(SYSDATE(), 
>INTERVAL 15 DAY);
>
>I don't see why this method could not also be used for bayes_seen.
>I was not aware bayes_seen would grow forever so I am going to implement 
>this
>on my own system next week.
>
>ALTER TABLE bayes_seen ADD lastupdate timestamp(14) NOT NULL;
>
>Then wait a few weeks before implementing:
>
>DELETE FROM bayes_seen WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 2 
>MONTH);
>
>I am not that familiar with MySQL and Bayes however so I would appreciate 
>it
>if someone would point out potential problems with this.
>
>Gary V
>

OK, I do see one issue with bayes_seen. When a bayes_seen record is created, 
the lastupdate field is set, but of course the time stamp does not change 
when the record is simply read. So if you have the same message getting 
learned every day (for example), cleaning bayes_seen on a regular basis 
would not be a good idea. You could clean it up every four months or so, 
however, by using the lastupdate field, but you would have to put up with 
all the added lastupdate data.

Gary V



Re: Huge File Size

Posted by Gary V <mr...@hotmail.com>.
>Benny Pedersen wrote:
> > On Fri, January 12, 2007 02:14, Matt Kettler wrote:
> >
> >
> >> form of expiry is one reason why I say the AWL isn't really ready for
> >> production use on any servers that have decent mail volume)
> >>
> >
> > If entries with a count of 1 are just deleted, when will there ever be
> > records with a count of 2?
> >
>I don't understand what you're saying here, at all. I'll take a wild
>guess at what you might mean..
>
>IMHO, the AWL should use atime based expiry, just like bayes. As it
>stands now, the "number of hits" based purge algorithm is an absurdly
>cheap hack at best and is a significant downside to the practical
>usability of the AWL for anyone with a decent-sized mailserver.
>
>This of course means the format of the AWL database needs to change,
>because right now it doesn't store atime.
> > The AWL is tricky but good; we have to live with it or make some changes
> > to how it's updated, e.g. if an email address was seen only a long time
> > ago and never again later, delete it from the AWL. Just deleting the
> > entries with a count of 1 does not make it work.
> >
> >
>I *think* you're in agreement with what I just said. Using last-accessed
>time instead of hit-count makes substantially more sense.
>

By moving AWL to SQL this can be accomplished. Here is a sample for MySQL:
Add a new field:
ALTER TABLE awl ADD lastupdate timestamp(14) NOT NULL;

If you have a small data set, optionally initialize existing records:
UPDATE awl SET lastupdate = NOW( ) WHERE lastupdate < 1;

NOTE: to avoid compounding the problem by adding all this extra lastupdate 
data, if you have a large record set it would probably be better NOT to 
initialize existing records and let only new records get time stamped. 
Then be patient enough to wait a couple of weeks or so before deleting any 
records (because the first command below should delete any records that 
are not time stamped).

then start daily or weekly maintenance:
DELETE FROM awl WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 4 MONTH);
DELETE FROM awl WHERE count = 1 AND lastupdate <= DATE_SUB(SYSDATE(), 
INTERVAL 15 DAY);
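One caveat to add (my own note, not benchmarked): without an index on the 
new column, each of those DELETEs has to scan the whole awl table, which 
could be slow once the table is large. Something like this may be worth 
adding; the index name is just an example:

-- Optional: index lastupdate so the periodic cleanup does not scan
-- the entire (potentially huge) awl table on every run.
ALTER TABLE awl ADD INDEX awl_lastupdate_idx (lastupdate);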

I don't see why this method could not also be used for bayes_seen.
I was not aware bayes_seen would grow forever so I am going to implement 
this
on my own system next week.

ALTER TABLE bayes_seen ADD lastupdate timestamp(14) NOT NULL;

Then wait a few weeks before implementing:

DELETE FROM bayes_seen WHERE lastupdate <= DATE_SUB(SYSDATE(), INTERVAL 2 
MONTH);
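The same caveat applies here (again my own note): an index on the new column 
should keep that periodic DELETE cheap even once bayes_seen holds millions 
of rows.

-- Optional, mirrors the awl index above; the name is just an example.
ALTER TABLE bayes_seen ADD INDEX bayes_seen_lastupdate_idx (lastupdate);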

I am not that familiar with MySQL and Bayes however so I would appreciate it
if someone would point out potential problems with this.

Gary V



Re: Huge File Size

Posted by Matt Kettler <mk...@verizon.net>.
Benny Pedersen wrote:
> On Fri, January 12, 2007 02:14, Matt Kettler wrote:
>
>   
>> form of expiry is one reason why I say the AWL isn't really ready for
>> production use on any servers that have decent mail volume)
>>     
>
> If entries with a count of 1 are just deleted, when will there ever be
> records with a count of 2?
>   
I don't understand what you're saying here, at all. I'll take a wild
guess at what you might mean..

IMHO, the AWL should use atime based expiry, just like bayes. As it
stands now, the "number of hits" based purge algorithm is an absurdly
cheap hack at best and is a significant downside to the practical
usability of the AWL for anyone with a decent-sized mailserver.

This of course means the format of the AWL database needs to change,
because right now it doesn't store atime.
> The AWL is tricky but good; we have to live with it or make some changes to
> how it's updated, e.g. if an email address was seen only a long time ago
> and never again later, delete it from the AWL. Just deleting the entries
> with a count of 1 does not make it work.
>
>   
I *think* you're in agreement with what I just said. Using last-accessed
time instead of hit-count makes substantially more sense.



Re: Huge File Size

Posted by Benny Pedersen <me...@junc.org>.
On Fri, January 12, 2007 02:14, Matt Kettler wrote:

> form of expiry is one reason why I say the AWL isn't really ready for
> production use on any servers that have decent mail volume)

If entries with a count of 1 are just deleted, when will there ever be 
records with a count of 2?

The AWL is tricky but good; we have to live with it or make some changes to 
how it's updated, e.g. if an email address was seen only a long time ago and 
never again later, delete it from the AWL. Just deleting the entries with a 
count of 1 does not make it work.

-- 
This message was sent using 100% recycled spam mails.


Re: Huge File Size

Posted by "Peter G." <pg...@fabel.dk>.
Christopher Jett <ch...@jettfuel.net> writes:

> OK - thanks.  So, for example, it's safe to delete the bayes_seen  file after it
> gets over a certain size?  Is there a particular size  after which performance
> degrades significantly?

From what I've googled, it should be OK to delete bayes_seen, provided no 
previously received emails need re-classification.

-- 
Regards,
Peter

Re: Huge File Size

Posted by Matt Kettler <mk...@verizon.net>.
Christopher Jett wrote:
>
>>
>>
>> For the autowhitelist database, grab the check_whitelist script out of
>> the tools subdirectory in the tarball.  Run check_whitelist --clean on
>> the AWL file. This will eliminate any "one-off" entries from it. Not
>> much of an expiry, but it's a start. (Note: the lack of any reasonable
>> form of expiry is one reason why I say the AWL isn't really ready for
>> production use on any servers that have decent mail volume)
>
> OK - thanks.  So, for example, it's safe to delete the bayes_seen file
> after it gets over a certain size?  Is there a particular size after
> which performance degrades significantly?

No, there's no "cliff" in it. It should be something like O(n) or
O(n log n).

i.e., the bigger bayes_seen gets, the larger the database that has to be
searched when performing learning, so the longer that takes, but there's
no cliff; it's probably a linear or close-to-linear relationship between
size and speed here.



Re: Huge File Size

Posted by Benny Pedersen <me...@junc.org>.
On Fri, January 12, 2007 03:35, Christopher Jett wrote:

> OK - thanks.  So, for example, it's safe to delete the bayes_seen
> file after it gets over a certain size?  Is there a particular size
> after which performance degrades significantly?

I remember that the file-based Bayes store gets huge, whereas the SQL-based 
one works with better expiry in all aspects, so you might try to use SQL for 
the Bayes/AWL data.

It sounds silly, but it works.
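For anyone who wants to try that route, the rough shape of the local.cf 
settings is sketched below. The directive names are the ones documented in 
the sql/README.bayes and sql/README.awl files shipped in the SpamAssassin 
tarball (the sql/ subdirectory also contains the table definitions); the 
DSN, username and password are placeholders for your own setup, so 
double-check everything against those READMEs.

# Bayes in MySQL (placeholder DSN and credentials, adjust for your server)
bayes_store_module      Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn           DBI:mysql:spamassassin:localhost
bayes_sql_username      sa_user
bayes_sql_password      sa_pass

# AWL in MySQL (same placeholders)
auto_whitelist_factory  Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn            DBI:mysql:spamassassin:localhost
user_awl_sql_username   sa_user
user_awl_sql_password   sa_pass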

-- 
This message was sent using 100% recycled spam mails.


Re: Huge File Size

Posted by Christopher Jett <ch...@jettfuel.net>.
On Jan 11, 2007, at 7:14 PM, Matt Kettler wrote:

> Chris Jett wrote:
>> I am seeing a problem where my bayes_seen and autowhitelist files are
>> HUGE.  My bayes_seen is 2.05 GB and my autowhitelist file is 4.02 GB.
>> Forcing an expiry on the database doesn't seem to do anything.  What
>> do I need to do?
>> --  
>
> SA doesn't, at present, support expiry on either file.
>
> For bayes_seen, you can just delete the file if you're using 3.0.0 or
> higher. However, be aware this will allow retraining of old, already
> learned messages, so make some effort to avoid doing so, but one or two
> dozen won't hurt much.
>
> For the autowhitelist database, grab the check_whitelist script out of
> the tools subdirectory in the tarball.  Run check_whitelist --clean on
> the AWL file. This will eliminate any "one-off" entries from it. Not
> much of an expiry, but it's a start. (Note: the lack of any reasonable
> form of expiry is one reason why I say the AWL isn't really ready for
> production use on any servers that have decent mail volume)

OK - thanks.  So, for example, it's safe to delete the bayes_seen  
file after it gets over a certain size?  Is there a particular size  
after which performance degrades significantly?
--
Chris Jett
chris@jettfuel.net

Re: Huge File Size

Posted by Matt Kettler <mk...@verizon.net>.
Chris Jett wrote:
> I am seeing a problem where my bayes_seen and autowhitelist files are
> HUGE.  My bayes_seen is 2.05 GB and my autowhitelist file is 4.02 GB. 
> Forcing an expiry on the database doesn't seem to do anything.  What
> do I need to do?
> -- 

SA doesn't, at present, support expiry on either file.

For bayes_seen, you can just delete the file if you're using 3.0.0 or
higher. However, be aware this will allow retraining of old, already
learned messages, so make some effort to avoid doing so, but one or two
dozen won't hurt much.

For the autowhitelist database, grab the check_whitelist script out of
the tools subdirectory in the tarball.  Run check_whitelist --clean on
the AWL file. This will eliminate any "one-off" entries from it. Not
much of an expiry, but it's a start. (Note: the lack of any reasonable
form of expiry is one reason why I say the AWL isn't really ready for
production use on any servers that have decent mail volume)