Posted to users@spamassassin.apache.org by Paul Boven <p....@chello.nl> on 2006/07/03 11:22:55 UTC

Re: Bayes_seen is 320MB

Paul Boven wrote:
> Hi everyone,
> 
> The message-IDs of mails that have been (auto-)learned by Bayes are 
> stored indefinitely in bayes_seen, which, over the years that we've used 
> SpamAssassin, has grown to a 320MB file. We're using site-wide Bayes 
> databases. What would be the best way to trim down this database, safely?
> Given that it only stores message-ID and spam status, I assume there is 
> no way to rescue more recent entries, and I'd have to wipe it altogether?

No replies yet, so I'll clarify my question a bit:

1.) How much of a performance impact would it have to have a Bayes_seen 
that is this large?

2.) What is the safest way of trimming it down? Can I simply stop 
SpamAssassin (called by Mimedefang in our case) and remove it, or do I 
need to recreate it in some way?

It would perhaps be useful if the Bayes seen database also had 
timestamps, so this kind of purging could be done automatically and 
properly.

Regards, Paul Boven.

Re: Bayes_seen is 320MB

Posted by Paul Boven <p....@chello.nl>.
Hi everyone,

Paul Boven wrote:
>> The message-IDs of mails that have been (auto-)learned by Bayes are 
>> stored indefinitely in bayes_seen, which, over the years that we've 
>> used SpamAssassin, has grown to a 320MB file. We're using 
>> site-wide Bayes databases. What would be the best way to trim down 
>> this database, safely?
>> Given that it only stores message-ID and spam status, I assume there 
>> is no way to rescue more recent entries, and I'd have to wipe it 
>> altogether?

Follow-up:
I deleted the Bayes_seen database two days ago: Bayes is still 
working and has created a new one. Looking at the performance graphs of 
my server, it hasn't made any difference in memory consumption or CPU load.

Regards, Paul Boven.

Re: Bayes_seen is 320MB

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Paul Boven wrote:
> Paul Boven wrote:
>> Hi everyone,
>>
>> The message-IDs of mails that have been (auto-)learned by Bayes are 
>> stored indefinitely in bayes_seen, which, over the years that we've 
>> used SpamAssassin, has grown to a 320MB file. We're using 
>> site-wide Bayes databases. What would be the best way to trim down 
>> this database, safely?
>> Given that it only stores message-ID and spam status, I assume there 
>> is no way to rescue more recent entries, and I'd have to wipe it 
>> altogether?
> 
> No replies yet, so I'll clarify my question a bit:
> 
> 1.) How much of a performance impact would it have to have a Bayes_seen 
> that is this large?

Depending on how busy your disk is, it could hurt a bit when learning.


> 2.) What is the safest way of trimming it down? Can I simply stop 
> SpamAssassin (called by Mimedefang in our case) and remove it, or do I 
> need to recreate it in some way?

IIRC you can do just that and SA will recreate a bayes_seen file.  Make 
sure all SA processes are killed off before doing it.

Of course, making a copy of all the bayes datafiles before doing so 
wouldn't hurt.
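
The steps above (stop SA, copy the Bayes files, remove bayes_seen) could be sketched roughly as follows. This is a minimal sketch, not a tested procedure: the function name reset_bayes_seen and both directory arguments are hypothetical, and it assumes the file-based Bayes store with files named bayes_* in a single directory.

```shell
#!/bin/sh
# Sketch: back up the Bayes data files, then remove bayes_seen so that
# SpamAssassin can recreate it on the next learn.
# ASSUMPTIONS (not from the thread): the function name and both paths are
# hypothetical. Stop every SpamAssassin process (e.g. MIMEDefang) first.

reset_bayes_seen() {
    bayes_dir=$1
    backup_dir=$2

    # Refuse to continue if there is nothing to remove.
    [ -f "$bayes_dir/bayes_seen" ] || {
        echo "no bayes_seen in $bayes_dir" >&2
        return 1
    }

    # Copy every bayes_* file first, preserving mode and timestamps.
    mkdir -p "$backup_dir" || return 1
    cp -p "$bayes_dir"/bayes_* "$backup_dir"/ || return 1

    # Remove only the seen database; the token database stays untouched.
    rm -f "$bayes_dir/bayes_seen"
}

# Example call (hypothetical paths):
# reset_bayes_seen /path/to/bayes /path/to/bayes-backup
```

Taking the copy before the rm means the old bayes_seen can be put back if anything goes wrong.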


> It would perhaps be useful if the Bayes seen database also had 
> timestamps, so this kind of purging could be done automatically and 
> properly.

Code welcome. :)
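
To make the timestamp idea concrete, here is a toy sketch of a seen-list that can be purged by age. It is emphatically not SpamAssassin's bayes_seen format or API: the flat-file layout ("<epoch> <s|h> <message-id>" per line) and the function names record_seen and purge_seen are assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch only: a hypothetical flat file, NOT the real bayes_seen database.
# One record per line: "<epoch> <s|h> <message-id>".

record_seen() {
    # Usage: record_seen FILE EPOCH STATUS MSGID
    printf '%s %s %s\n' "$2" "$3" "$4" >> "$1"
}

purge_seen() {
    # Usage: purge_seen FILE CUTOFF_EPOCH
    # Keeps only records whose timestamp is at or after the cutoff.
    tmp="$1.tmp.$$"
    awk -v cutoff="$2" '$1 + 0 >= cutoff + 0' "$1" > "$tmp" && mv "$tmp" "$1"
}

# Example (hypothetical file name): drop entries older than the given epoch.
# purge_seen ./seen.txt 1146000000
```

With a timestamp on each record, expiry becomes a simple filter; without one, as in the current bayes_seen, there is nothing to filter on.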


Daryl


Re: Bayes_seen is 320MB

Posted by Ralf Hildebrandt <Ra...@charite.de>.
* Paul Boven <p....@chello.nl>:

> That, as far as I can tell, only does an expire on the Bayes_tokens. The 
> Bayes_seen does not contain any timestamps and therefore can only keep 
> growing forever - there is no way to tell if a msgid is new or old.

Yep, my mistake. It would be useful, though. The face of spam is ever
changing, thus keeping 2-year-old records around won't help much.

Have you added a journal to your bayes DB?
This usually speeds things up!
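
For reference, journaling is switched on in the SpamAssassin configuration. A minimal fragment, assuming the file-based (DBM) Bayes store and a site-wide local.cf, neither of which is confirmed in this thread:

```
# local.cf fragment (sketch; where this file lives on your system is an assumption)
# Put learned data into a journal instead of writing the database directly.
bayes_learn_to_journal 1
```

The journal is merged into the database on a sync, for example via sa-learn --sync. This lowers lock contention on the token database; whether it helps with a large bayes_seen is a separate question.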

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums)         Ralf.Hildebrandt@charite.de
Charite - Universitätsmedizin Berlin            Tel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin    Fax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF                 send no mail to spamtrap@charite.de

Re: Bayes_seen is 320MB

Posted by Paul Boven <p....@chello.nl>.
Hi Ralf,

Thanks for your quick reply.

Ralf Hildebrandt wrote:

>> 1.) How much of a performance impact would it have to have a Bayes_seen 
>> that is this large?
> 
> Dunno, I use this:
> -rw------- 1 amavis amavis 1,3G 2006-07-03 11:26 bayes_seen

I'm certainly having performance problems, so I'm going to try to wipe 
my Bayes_seen.

>> 2.) What is the safest way of trimming it down? Can I simply stop 
>> SpamAssassin (called by Mimedefang in our case) and remove it, 
> Yes.
>> or do I need to recreate it in some way?
> No.

Ok, I'll do that tonight.

>> It would perhaps be useful if the Bayes seen database also had 
>> timestamps, so this kind of purging could be done automatically and 
>> properly.
> 
> It has:
> /usr/bin/sa-learn --sync --force-expire

That, as far as I can tell, only does an expire on the Bayes_tokens. The 
Bayes_seen does not contain any timestamps and therefore can only keep 
growing forever - there is no way to tell if a msgid is new or old.
AFAIK, this is an inherent limitation in the current Bayes setup.
I've just run the command you suggested and it certainly didn't shrink 
my Bayes_seen database.
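
A check like this can be repeated with a small helper that reports how many bytes a file gained or lost across a command. A minimal sketch: the helper names file_size and size_change are hypothetical, as is the path in the example.

```shell
#!/bin/sh
# Sketch: measure how a command changes the size of one file.
# ASSUMPTION: helper names and the example path are hypothetical.

file_size() {
    # wc -c prints the byte count; tr strips padding that some wc versions add.
    wc -c < "$1" | tr -d ' '
}

size_change() {
    # Usage: size_change FILE COMMAND [ARGS...]
    # Prints (size after) - (size before) in bytes.
    file=$1
    shift
    before=$(file_size "$file")
    "$@" || return 1
    after=$(file_size "$file")
    echo $((after - before))
}

# Example (requires SpamAssassin; hypothetical path):
# size_change /path/to/bayes_seen sa-learn --sync --force-expire
```

A result of 0 or more after the expire run would match the observation above that bayes_seen does not shrink.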

Regards, Paul Boven.





Re: Bayes_seen is 320MB

Posted by Ralf Hildebrandt <Ra...@charite.de>.
* Paul Boven <p....@chello.nl>:

> No replies yet, so I'll clarify my question a bit:
> 
> 1.) How much of a performance impact would it have to have a Bayes_seen 
> that is this large?

Dunno, I use this:
-rw------- 1 amavis amavis 1,3G 2006-07-03 11:26 bayes_seen
 
> 2.) What is the safest way of trimming it down? Can I simply stop 
> SpamAssassin (called by Mimedefang in our case) and remove it, 
Yes.

> or do I need to recreate it in some way?
No.

> It would perhaps be useful if the Bayes seen database also had 
> timestamps, so this kind of purging could be done automatically and 
> properly.

It has:
/usr/bin/sa-learn --sync --force-expire

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums)         Ralf.Hildebrandt@charite.de
Charite - Universitätsmedizin Berlin            Tel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin    Fax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF                 send no mail to spamtrap@charite.de