Posted to users@spamassassin.apache.org by Paul Boven <p....@chello.nl> on 2006/07/03 11:22:55 UTC
Re: Bayes_seen is 320MB
Paul Boven wrote:
> Hi everyone,
>
> The message-IDs of mails that have been (auto-)learned by Bayes are
> stored indefinitely in bayes_seen, which, over the years that we've used
> SpamAssassin, has grown to a 320MB file. We're using site-wide Bayes
> databases. What would be the best way to trim down this database, safely?
> Given that it only stores message-ID and spam status, I assume there is
> no way to rescue more recent entries, and I'd have to wipe it altogether?
No replies yet, so I'll clarify my question a bit:
1.) How much of a performance impact would it have to have a Bayes_seen
that is this large?
2.) What is the safest way of trimming it down? Can I simply stop
SpamAssassin (called by Mimedefang in our case) and remove it, or do I
need to recreate it in some way?
It would perhaps be useful if the Bayes seen database also had
timestamps, so this kind of purging could be done automatically and
properly.
Regards, Paul Boven.
Re: Bayes_seen is 320MB
Posted by Paul Boven <p....@chello.nl>.
Hi everyone,
Paul Boven wrote:
>> The message-IDs of mails that have been (auto-)learned by Bayes are
>> stored indefinitely in bayes_seen, which, over the years that we've
>> used SpamAssassin, has grown to a 320MB file. We're using
>> site-wide Bayes databases. What would be the best way to trim down
>> this database, safely?
>> Given that it only stores message-ID and spam status, I assume there
>> is no way to rescue more recent entries, and I'd have to wipe it
>> altogether?
Follow-up:
I deleted the Bayes_seen database two days ago: Bayes is still
working and has created a new one. Looking at the performance graphs of
my server, it hasn't made any difference in memory consumption or CPU load.
Regards, Paul Boven.
Re: Bayes_seen is 320MB
Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Paul Boven wrote:
> Paul Boven wrote:
>> Hi everyone,
>>
>> The message-IDs of mails that have been (auto-)learned by Bayes are
>> stored indefinitely in bayes_seen, which, over the years that we've
>> used SpamAssassin, has grown to a 320MB file. We're using
>> site-wide Bayes databases. What would be the best way to trim down
>> this database, safely?
>> Given that it only stores message-ID and spam status, I assume there
>> is no way to rescue more recent entries, and I'd have to wipe it
>> altogether?
>
> No replies yet, so I'll clarify my question a bit:
>
> 1.) How much of a performance impact would it have to have a Bayes_seen
> that is this large?
Depending on how busy your disk is, it could hurt a bit when learning.
> 2.) What is the safest way of trimming it down? Can I simply stop
> SpamAssassin (called by Mimedefang in our case) and remove it, or do I
> need to recreate it in some way?
IIRC you can do just that and SA will recreate a bayes_seen file. Make
sure all SA processes are killed off before doing it.
Of course, making a copy of all the bayes datafiles before doing so
wouldn't hurt.
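A minimal sketch of those steps as a shell function, assuming all the Bayes
files sit together in one directory and are named bayes_* (bayes_seen,
bayes_toks and so on). The function name, the backup directory name and the
example path are made up for illustration and are not anything SA ships.
Stopping every SA process (MIMEDefang in this case) is not handled here and
has to happen first.

```shell
# Sketch only: back up every bayes_* file in a directory, then remove bayes_seen.
# Assumption: all SpamAssassin processes using this directory are already stopped.
reset_bayes_seen() {
    bayes_dir=$1                                          # directory holding the bayes_* files
    backup_dir="$bayes_dir/backup.$(date +%Y%m%d%H%M%S)"  # hypothetical backup location

    # Refuse to do anything if there is no bayes_seen to remove.
    [ -f "$bayes_dir/bayes_seen" ] || { echo "no bayes_seen in $bayes_dir" >&2; return 1; }

    mkdir -p "$backup_dir" || return 1

    # Copy all Bayes data files first, preserving mode and timestamps.
    cp -p "$bayes_dir"/bayes_* "$backup_dir"/ || return 1

    # Only bayes_seen is removed; the other Bayes files stay in place.
    rm -f "$bayes_dir/bayes_seen"
}
```

Usage would be something like `reset_bayes_seen /path/to/your/bayes/dir`,
with the real directory filled in. The copy comes before the delete so that a
failed backup leaves everything untouched.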
> It would perhaps be useful if the Bayes seen database also had
> timestamps, so this kind of purging could be done automatically and
> properly.
Code welcome. :)
Daryl
Re: Bayes_seen is 320MB
Posted by Ralf Hildebrandt <Ra...@charite.de>.
* Paul Boven <p....@chello.nl>:
> That, as far as I can tell, only does an expire on the Bayes_tokens. The
> Bayes_seen does not contain any timestamps and therefore can only keep
> growing for ever - there is no way to tell if a msgid is new or old.
Yep, my mistake. It would be useful, though. The face of spam is ever
changing, thus keeping 2y old records around won't help much.
Have you added a journal to your bayes DB?
This usually speeds things up!
--
Ralf Hildebrandt (i.A. des IT-Zentrums) Ralf.Hildebrandt@charite.de
Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin Fax. +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to spamtrap@charite.de
Re: Bayes_seen is 320MB
Posted by Paul Boven <p....@chello.nl>.
Hi Ralf,
Thanks for your quick reply.
Ralf Hildebrandt wrote:
>> 1.) How much of a performance impact would it have to have a Bayes_seen
>> that is this large?
>
> Dunno, I use this:
> -rw------- 1 amavis amavis 1,3G 2006-07-03 11:26 bayes_seen
I'm certainly having performance problems, so I'm going to try to wipe
my Bayes_seen.
>> 2.) What is the safest way of trimming it down? Can I simply stop
>> SpamAssassin (called by Mimedefang in our case) and remove it,
> Yes.
>> or do I need to recreate it in some way?
> No.
Ok, I'll do that tonight.
>> It would perhaps be useful if the Bayes seen database also had
>> timestamps, so this kind of purging could be done automatically and
>> properly.
>
> It has:
> /usr/bin/sa-learn --sync --force-expire
That, as far as I can tell, only does an expire on the Bayes_tokens. The
Bayes_seen does not contain any timestamps and therefore can only keep
growing for ever - there is no way to tell if a msgid is new or old.
AFAIK, this is an inherent limitation in the current Bayes setup.
I've just done the command you suggested and it certainly didn't shrink
my Bayes_seen database.
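For anyone who wants to repeat that check, here is a rough sketch that prints
a file's size in bytes before and after running a command. The function name
and the path in the usage line are placeholders; it only measures the file
and says nothing about what is inside the database.

```shell
# Sketch only: report a file's size in bytes before and after running a command.
size_before_after() {
    file=$1
    shift                                   # the remaining arguments are the command to run

    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }

    before=$(wc -c < "$file" | tr -d ' ')   # tr strips the padding some wc versions add
    "$@"                                    # run the command exactly as given
    after=$(wc -c < "$file" | tr -d ' ')

    echo "before=$before after=$after"
}
```

For example: `size_before_after /path/to/bayes_seen sa-learn --sync --force-expire`.
If the two numbers match, the expire run did not shrink the file.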
Regards, Paul Boven.
Re: Bayes_seen is 320MB
Posted by Ralf Hildebrandt <Ra...@charite.de>.
* Paul Boven <p....@chello.nl>:
> No replies yet, so I'll clarify my question a bit:
>
> 1.) How much of a performance impact would it have to have a Bayes_seen
> that is this large?
Dunno, I use this:
-rw------- 1 amavis amavis 1,3G 2006-07-03 11:26 bayes_seen
> 2.) What is the safest way of trimming it down? Can I simply stop
> SpamAssassin (called by Mimedefang in our case) and remove it,
Yes.
> or do I need to recreate it in some way?
No.
> It would perhaps be useful if the Bayes seen database also had
> timestamps, so this kind of purging could be done automatically and
> properly.
It has:
/usr/bin/sa-learn --sync --force-expire