Posted to users@spamassassin.apache.org by Jason Frisvold <xe...@gmail.com> on 2006/11/23 17:19:54 UTC

Bayes - Autoexpiry, bayes_seen, and bayes_tok

Greetings,

Just a few quick questions.  First, I noticed that prior to 3.1.0
bayes_seen was not auto expiring.  That bug is marked as fixed, so is
it safe to say that bayes_seen is now expiring automatically and that
a 20+ meg bayes_seen file is valid?

Next, the bayes_tok database is over 3 Gig at this point.  I'd like to
cut that down a bit as the machine is having considerable trouble
dealing with it.  So I have a few questions concerning this.

First, can I modify the expiry time, causing an earlier expiration?
If so, what are the consequences of such an action?

Second, does the autoexpire run for every instance of spamassassin?
ie, does it run every time a message is processed?  If not, how does
it determine when to run it?  Would it be better to disable auto
expire and create a cron job that runs later in the evening to deal
with auto expire?

I noticed in the wiki that when forcing an expire, you should stop
spamassassin first.  Is this strictly necessary?  What are the
consequences of not doing this?

Any other suggestions for increasing the speed of the database?

Thanks!

-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
On 11/26/06, Matt Kettler <mk...@verizon.net> wrote:
> No, you can leave SA running.. however, while it's running sa-learn will
> have the R/W lock on the bayes database, so no autolearning will happen
> unless you're using the bayes_learn_to_journal option. (normally only
> atime updates are journaled.)

yeah, I think I determined what is likely a showstopper..  I'm running
bayes per-user, not per domain or globally..  So expiring is going to
be one hell of a chore...  *sigh*

> Yep..  at present the only way to "expire" it is to use check-whitelist
> --clean on the database file. This will purge all the "single hit"
> entries. This script comes in the "tools" subdir of the tarball, but
> isn't installed by default. It also only seems to work with db_file type
> AWL databases. No SQL support.

Hrm... I'm using MySQL so that script won't work, tho I'm sure I can
create a quick query to expire single hit entities...

> IMHO, this limitation makes the AWL not-ready for primetime. I would not
> use it on anything but fairly small-scale systems (ie: less than 5k
> messages a day) until a reasonable expiry system is added.

Heh..  so several thousand users isn't a good idea, eh?  Well I think
I'll add a timestamp field and start expiring that way...
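
A rough sketch of the "quick query" idea, in Python. It uses an in-memory
sqlite3 database purely as a stand-in for MySQL so that it can run anywhere.
The table and column names (awl, username, email, ip, count, totscore) are an
assumption based on the stock SQL AWL schema; check them against your own
database before running anything similar against live data.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory table shaped like the assumed AWL schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE awl ("
        " username TEXT, email TEXT, ip TEXT,"
        " count INTEGER, totscore REAL,"
        " PRIMARY KEY (username, email, ip))"
    )
    # Made-up rows for illustration only.
    conn.executemany(
        "INSERT INTO awl VALUES (?, ?, ?, ?, ?)",
        [
            ("user1", "a@example.com", "10.0", 1, 4.2),
            ("user1", "b@example.com", "10.1", 7, -3.5),
            ("user2", "c@example.com", "10.2", 1, 0.1),
        ],
    )
    conn.commit()
    return conn


def purge_single_hits(conn):
    """Delete entries seen exactly once and return how many rows were removed."""
    cur = conn.execute("DELETE FROM awl WHERE count = 1")
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    demo = make_demo_db()
    print("rows removed:", purge_single_hits(demo))
```

An added timestamp column would allow age-based deletes instead, but the
stock schema as assumed here has no such field, so that part is left out.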

-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Matt Kettler <mk...@verizon.net>.
Jason Frisvold wrote:
> Wow.. that sucked..  Finished message below..  My apologies for the
> previous unfinished message.
>
> On 11/26/06, Jason Frisvold <xe...@gmail.com> wrote:
>> On 11/25/06, Matt Kettler <mk...@verizon.net> wrote:
>> > Bayes_toks should trim itself automatically.
>>
>> I understand that, but I was wondering if it's possible to halt that
>> and do a manual expire at specific intervals so I can control the load
>> on the system.  Expiry seems to take a while...
>>
>> > Have you done a sa-learn --dump magic? How long has it been since expiry
>> > ran on your system? How many tokens are in the DB?
>>
>> I use spamc/spamd so I think expiry is happening all the time, tho I'm
>> not sure how to tell.
>
> I'm checking now how many tokens there are..  It's taking forever..
> If I get a result, I'll let you know...
>
>> > Try running sa-learn --force-expire. Does that run (probably for a long
>> > time) and then fix the problem?
>
> I'll give this a shot and see what happens..  Does spamassassin have
> to be halted to do this?
No, you can leave SA running.. however, while it's running sa-learn will
have the R/W lock on the bayes database, so no autolearning will happen
unless you're using the bayes_learn_to_journal option. (normally only
atime updates are journaled.)

>
>
> While we're looking at database usage..  Does the AWL expire?
Nope.
> There's
> only 17 million rows in my AWL database...  *grin*  There doesn't seem
> to be any sort of time related field by default, so adding one may
> help?
Yep..  at present the only way to "expire" it is to use check-whitelist
--clean on the database file. This will purge all the "single hit"
entries. This script comes in the "tools" subdir of the tarball, but
isn't installed by default. It also only seems to work with db_file type
AWL databases. No SQL support.

IMHO, this limitation makes the AWL not-ready for primetime. I would not
use it on anything but fairly small-scale systems (ie: less than 5k
messages a day) until a reasonable expiry system is added.



Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
Wow.. that sucked..  Finished message below..  My apologies for the
previous unfinished message.

On 11/26/06, Jason Frisvold <xe...@gmail.com> wrote:
> On 11/25/06, Matt Kettler <mk...@verizon.net> wrote:
> > Bayes_toks should trim itself automatically.
>
> I understand that, but I was wondering if it's possible to halt that
> and do a manual expire at specific intervals so I can control the load
> on the system.  Expiry seems to take a while...
>
> > Have you done a sa-learn --dump magic? How long has it been since expiry
> > ran on your system? How many tokens are in the DB?
>
> I use spamc/spamd so I think expiry is happening all the time, tho I'm
> not sure how to tell.

I'm checking now how many tokens there are..  It's taking forever..
If I get a result, I'll let you know...

> > Try running sa-learn --force-expire. Does that run (probably for a long
> > time) and then fix the problem?

I'll give this a shot and see what happens..  Does spamassassin have
to be halted to do this?

> > Do you call SA through MailScanner or amavis which might be interfering
> > with expire runs by timing out and killing SA? (This only matters for
> > tools that use the API.. Or technically a tool that uses the
> > spamassassin command line script, but no decent integration tools that
> > monitor execution time do that. Anything that calls spamc to access
> > spamd won't be able to kill off the expire process by killing spamc.)

Spamassassin is called via simscan which does not, to my knowledge,
halt the process at any point.  In addition, it uses spamc/spamd so by
your explanation, even if it did, it wouldn't affect the expiry
process.

While we're looking at database usage..  Does the AWL expire?  There's
only 17 million rows in my AWL database...  *grin*  There doesn't seem
to be any sort of time related field by default, so adding one may
help?



-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Magnus Holmgren <ho...@lysator.liu.se>.
On Sunday 26 November 2006 16:16, Jason Frisvold wrote:
> On 11/26/06, Matt Kettler <mk...@verizon.net> wrote:
> > Make sure you run the --force-expire as the proper userid.
> > run sa-learn --dump magic, as I asked. If you need help interpreting it,
> > post the output.
>
> This doesn't look right to me..  ?  Half are new and half old?  I'm
> going right now to google this to death..  :)
>
> [friz@mail1 ~]$ sudo sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0          0          0  non-token data: nspam
> 0.000          0          1          0  non-token data: nham
> 0.000          0         72          0  non-token data: ntokens
> 0.000          0 1106663054          0  non-token data: oldest atime
> 0.000          0 1106663054          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0          0          0  non-token data: last expiry atime
> 0.000          0          0          0  non-token data: last expire atime delta
> 0.000          0          0          0  non-token data: last expire reduction count

Looks like you're looking at the wrong database here. The above means that you 
have 72 tokens from 1 ham mail and no spam. 1106663054 is a unix timestamp 
meaning Tue, 25 Jan 2005 14:24:14 UTC.
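
For anyone who wants to check an atime value themselves, here is a minimal
Python sketch of the conversion (standard library only; the helper name
atime_to_utc is just for illustration):

```python
from datetime import datetime, timezone


def atime_to_utc(atime):
    """Render a Bayes atime (seconds since the Unix epoch) as a UTC string."""
    stamp = datetime.fromtimestamp(atime, tz=timezone.utc)
    return stamp.strftime("%a, %d %b %Y %H:%M:%S UTC")


if __name__ == "__main__":
    # The "oldest atime" / "newest atime" value from the dump above.
    print(atime_to_utc(1106663054))  # Tue, 25 Jan 2005 14:24:14 UTC
```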

su to the right user or use --dbpath (it works like bayes_path in local.cf).

-- 
Magnus Holmgren        holmgren@lysator.liu.se
                       (No Cc of list mail needed, thanks)

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
On 11/26/06, Matt Kettler <mk...@verizon.net> wrote:
> Erm.. That's not half old and half new...That's all the same age,
> because that's an almost completely empty database. It's only got the
> learning from ONE message in it. There are only 72 tokens, and they're
> all the same age (oldest and newest atime are the same, therefore all
> tokens are the same age)

Yeah, I understand the output now..  It seems my problems are a tad
bigger since I do per-user bayes rather than global bayes...

-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Matt Kettler <mk...@verizon.net>.
Jason Frisvold wrote:
> On 11/26/06, Matt Kettler <mk...@verizon.net> wrote:
>> Yes, you can do that.. you can set: bayes_auto_expire 0 and have a
>> cronjob call sa-learn --force-expire.
>
> Is this a recommended thing?
I do it.
>
>> Make sure you run the --force-expire as the proper userid.
>> run sa-learn --dump magic, as I asked. If you need help interpreting it,
>> post the output.
>
> This doesn't look right to me..  ?  Half are new and half old?  I'm
> going right now to google this to death..  :)
>
> [friz@mail1 ~]$ sudo sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0          0          0  non-token data: nspam
> 0.000          0          1          0  non-token data: nham
> 0.000          0         72          0  non-token data: ntokens
> 0.000          0 1106663054          0  non-token data: oldest atime
> 0.000          0 1106663054          0  non-token data: newest atime
>
Erm.. That's not half old and half new...That's all the same age,
because that's an almost completely empty database. It's only got the
learning from ONE message in it. There are only 72 tokens, and they're
all the same age (oldest and newest atime are the same, therefore all
tokens are the same age)  
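
To make that reading of the dump concrete, here is a small Python sketch that
pulls the counters out of the output and reports the atime span. The sample
text is copied from the dump quoted above; the function names and the regular
expression are illustrative only and assume the column layout shown there.

```python
import re

# First six lines of the `sa-learn --dump magic` output quoted in this thread.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0         72          0  non-token data: ntokens
0.000          0 1106663054          0  non-token data: oldest atime
0.000          0 1106663054          0  non-token data: newest atime
"""

# The third column holds the value; the label follows "non-token data:".
LINE = re.compile(r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$")


def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


def describe(values):
    """Summarise token count, message counts and the oldest-to-newest span."""
    span = values["newest atime"] - values["oldest atime"]
    return (f"{values['ntokens']} tokens from {values['nham']} ham and "
            f"{values['nspam']} spam; atime span {span} s")


if __name__ == "__main__":
    print(describe(parse_dump_magic(SAMPLE)))
```

A span of zero seconds between oldest and newest atime is what shows that
every token was learned at the same moment.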


Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
On 11/26/06, Matt Kettler <mk...@verizon.net> wrote:
> Yes, you can do that.. you can set: bayes_auto_expire 0 and have a
> cronjob call sa-learn --force-expire.

Is this a recommended thing?

> Make sure you run the --force-expire as the proper userid.
> run sa-learn --dump magic, as I asked. If you need help interpreting it,
> post the output.

This doesn't look right to me..  ?  Half are new and half old?  I'm
going right now to google this to death..  :)

[friz@mail1 ~]$ sudo sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0         72          0  non-token data: ntokens
0.000          0 1106663054          0  non-token data: oldest atime
0.000          0 1106663054          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count


-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Matt Kettler <mk...@verizon.net>.
Jason Frisvold wrote:
> On 11/25/06, Matt Kettler <mk...@verizon.net> wrote:
>> Bayes_toks should trim itself automatically.
>
> I understand that, but I was wondering if it's possible to halt that
> and do a manual expire at specific intervals so I can control the load
> on the system.  Expiry seems to take a while...
Yes, you can do that.. you can set: bayes_auto_expire 0 and have a
cronjob call sa-learn --force-expire.

Make sure you run the --force-expire as the proper userid.
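
With per-user databases, a cron job would have to repeat the expire for each
one. A hedged Python sketch of such a wrapper follows. The only sa-learn
options it uses are --force-expire and --dbpath, both mentioned in this
thread; the database paths in the usage comment are hypothetical, and the
script still has to run as a user that owns the files.

```python
import subprocess


def expire_command(db_path):
    """Build the sa-learn argument list for one Bayes database."""
    return ["sa-learn", "--dbpath", db_path, "--force-expire"]


def expire_all(db_paths, runner=subprocess.run):
    """Run a forced expire per database; return {db_path: exit status}.

    `runner` is injectable so the loop can be exercised without sa-learn.
    """
    results = {}
    for path in db_paths:
        completed = runner(expire_command(path), check=False)
        results[path] = completed.returncode
    return results


# Example call from a nightly cron script (paths are hypothetical):
#   expire_all(["/home/alice/.spamassassin/bayes",
#               "/home/bob/.spamassassin/bayes"])
```

Running the databases one after another keeps only a single expire active at
a time, which is the point of moving expiry out of normal mail processing.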
>
>> Have you done a sa-learn --dump magic? How long has it been since expiry
>> ran on your system? How many tokens are in the DB?
>
> I use spamc/spamd so I think expiry is happening all the time, tho I'm
> not sure how to tell.
run sa-learn --dump magic, as I asked. If you need help interpreting it,
post the output.

>
>> Try running sa-learn --force-expire. Does that run (probably for a long
>> time) and then fix the problem?
>>
>> Do you call SA through MailScanner or amavis which might be interfering
>> with expire runs by timing out and killing SA? (This only matters for
>> tools that use the API.. Or technically a tool that uses the
>> spamassassin command line script, but no decent integration tools that
>> monitor execution time do that. Anything that calls spamc to access
>> spamd won't be able to kill off the expire process by killing spamc.)


Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
On 11/25/06, Matt Kettler <mk...@verizon.net> wrote:
> Bayes_toks should trim itself automatically.

I understand that, but I was wondering if it's possible to halt that
and do a manual expire at specific intervals so I can control the load
on the system.  Expiry seems to take a while...

> Have you done a sa-learn --dump magic? How long has it been since expiry
> ran on your system? How many tokens are in the DB?

I use spamc/spamd so I think expiry is happening all the time, tho I'm
not sure how to tell.

> Try running sa-learn --force-expire. Does that run (probably for a long
> time) and then fix the problem?
>
> Do you call SA through MailScanner or amavis which might be interfering
> with expire runs by timing out and killing SA? (This only matters for
> tools that use the API.. Or technically a tool that uses the
> spamassassin command line script, but no decent integration tools that
> monitor execution time do that. Anything that calls spamc to access
> spamd won't be able to kill off the expire process by killing spamc.)


-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Matt Kettler <mk...@verizon.net>.
Jason Frisvold wrote:
>
> With respect to bayes_tok though, can that be trimmed at all with
> minimal impact?  3GB is a tad large for the database, though I guess
> that depends on the number of users.  I can't think of any way to
> limit that, though, and I wonder how even larger entities can deal
> with databases that must be much larger.
>
Bayes_toks should trim itself automatically.

Have you done a sa-learn --dump magic? How long has it been since expiry
ran on your system? How many tokens are in the DB?

Try running sa-learn --force-expire. Does that run (probably for a long
time) and then fix the problem?

Do you call SA through MailScanner or amavis which might be interfering
with expire runs by timing out and killing SA? (This only matters for
tools that use the API.. Or technically a tool that uses the
spamassassin command line script, but no decent integration tools that
monitor execution time do that. Anything that calls spamc to access
spamd won't be able to kill off the expire process by killing spamc.)



Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Nigel Frankcom <ni...@blue-canoe.net>.
On Sat, 25 Nov 2006 13:55:37 -0500, Theo Van Dinter
<fe...@apache.org> wrote:

>On Sat, Nov 25, 2006 at 01:41:50PM -0500, Jason Frisvold wrote:
>> With respect to bayes_tok though, can that be trimmed at all with
>> minimal impact?  3GB is a tad large for the database, though I guess
>> that depends on the number of users.  I can't think of any way to
>> limit that, though, and I wonder how even larger entities can deal
>> with databases that must be much larger.
>
>It depends why the file is 3GB.  Yes, that's *WAY* huge.
>
>So there's a few possibilities here:
>
>1) You have a huge (HUGE) number of tokens.
>2) It could be a sparse file, so "file size 3GB" does not mean "using
>   3GB on disk".
>3) Something is crazy with your installed Berkeley DB libs that causes
>   it to have huge files.
>
>So if you don't have a crazy huge number of tokens (on my system, ~500k tokens
>equates to ~10MB of DB fwiw), I'd look at the libdb/DB_File stuff.  Converting
>to SQL may also be useful.

125k+ tokens here takes 2.6 MB in mysql
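
As a back-of-the-envelope check, both figures work out to roughly 20 bytes
per token. The Python sketch below does the arithmetic using decimal
megabytes; it assumes size scales linearly with token count, which real
database files only approximate.

```python
def bytes_per_token(db_bytes, tokens):
    """Average storage per token for a database of the given size."""
    return db_bytes / tokens


# Figures quoted in this thread, in decimal units.
DB_FILE_RATE = bytes_per_token(10_000_000, 500_000)  # ~500k tokens in ~10 MB
MYSQL_RATE = bytes_per_token(2_600_000, 125_000)     # 125k tokens in 2.6 MB

# Token count a 3 GB file would imply at the DB_File rate.
IMPLIED_TOKENS = round(3_000_000_000 / DB_FILE_RATE)

if __name__ == "__main__":
    print(f"DB_File: {DB_FILE_RATE:.1f} bytes/token")
    print(f"MySQL:   {MYSQL_RATE:.1f} bytes/token")
    print(f"3 GB would imply about {IMPLIED_TOKENS:,} tokens")
```

At that rate a 3 GB file would mean on the order of 150 million tokens, so
either the token count really is enormous or the file size is not explained
by tokens alone.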

Nigel

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Theo Van Dinter <fe...@apache.org>.
On Sat, Nov 25, 2006 at 01:41:50PM -0500, Jason Frisvold wrote:
> With respect to bayes_tok though, can that be trimmed at all with
> minimal impact?  3GB is a tad large for the database, though I guess
> that depends on the number of users.  I can't think of any way to
> limit that, though, and I wonder how even larger entities can deal
> with databases that must be much larger.

It depends why the file is 3GB.  Yes, that's *WAY* huge.

So there's a few possibilities here:

1) You have a huge (HUGE) number of tokens.
2) It could be a sparse file, so "file size 3GB" does not mean "using
   3GB on disk".
3) Something is crazy with your installed Berkeley DB libs that causes
   it to have huge files.

So if you don't have a crazy huge number of tokens (on my system, ~500k tokens
equates to ~10MB of DB fwiw), I'd look at the libdb/DB_File stuff.  Converting
to SQL may also be useful.
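
One way to test possibility 2 is to compare the apparent file size with the
space actually allocated on disk. Here is a minimal Python sketch for POSIX
systems; the function names are illustrative, and the path in the usage
comment is hypothetical.

```python
import os


def size_on_disk(path):
    """Return (apparent_bytes, allocated_bytes) for a file on a POSIX system."""
    st = os.stat(path)
    # st_blocks counts 512-byte units, whatever the filesystem block size is.
    return st.st_size, st.st_blocks * 512


def looks_sparse(path):
    """True when fewer bytes are allocated than the file size claims."""
    apparent, allocated = size_on_disk(path)
    return allocated < apparent


# Example (hypothetical path):
#   print(size_on_disk("/home/alice/.spamassassin/bayes_toks"))
```

If the allocated figure is far below the apparent size, the 3GB is mostly
holes and the real disk usage is much smaller.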

-- 
Randomly Selected Tagline:
"It's a good cause... Cause it's good...?"      - Hardcore TV

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
On 11/24/06, Matt Kettler <mk...@verizon.net> wrote:
> It's not "fixed", it's only hack-fixed. There is no real expiry of
> bayes_seen, nor the AWL, in SA 3.1.x.
>
> It's now safe to delete bayes_seen, you won't corrupt your whole bayes
> DB if you do that. That's the only fix I know of that's been applied.
>
> See  http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2975

The very last comment combined with a status of resolved and
resolution of fixed made me think they actually fixed the problem...

Good to know I can trim this at will though.

With respect to bayes_tok though, can that be trimmed at all with
minimal impact?  3GB is a tad large for the database, though I guess
that depends on the number of users.  I can't think of any way to
limit that, though, and I wonder how even larger entities can deal
with databases that must be much larger.

-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Matt Kettler <mk...@verizon.net>.
It's not "fixed", it's only hack-fixed. There is no real expiry of
bayes_seen, nor the AWL, in SA 3.1.x.

It's now safe to delete bayes_seen, you won't corrupt your whole bayes
DB if you do that. That's the only fix I know of that's been applied.

See  http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2975

From the bottom:

---------------------
'We need to do something, but a full seen expiry system isn't going to happen
for 3.1.'

'I still like the idea of just letting bayes_seen be optional.  If people want
to trim it, let them delete the file and have it be recreated.  IIRC, the only
place that's an issue is when going r/o w/ the DB where it requires the file
right now.'
--------------------


Jason Frisvold wrote:
> No takers on this?  Have I hit upon a FAQ question?  I swear I looked
> and searched and I didn't find suitable answers...
>
> On 11/23/06, Jason Frisvold <xe...@gmail.com> wrote:
>> Greetings,
>>
>> Just a few quick questions.  First, I noticed that prior to 3.1.0
>> bayes_seen was not auto expiring.  That bug is marked as fixed, so is
>> it safe to say that bayes_seen is now expiring automatically and that
>> a 20+ meg bayes_seen file is valid?
>>
>> Next, the bayes_tok database is over 3 Gig at this point.  I'd like to
>> cut that down a bit as the machine is having considerable trouble
>> dealing with it.  So I have a few questions concerning this.
>>
>> First, can I modify the expiry time, causing an earlier expiration?
>> If so, what are the consequences of such an action?
>>
>> Second, does the autoexpire run for every instance of spamassassin?
>> ie, does it run every time a message is processed?  If not, how does
>> it determine when to run it?  Would it be better to disable auto
>> expire and create a cron job that runs later in the evening to deal
>> with auto expire?
>>
>> I noticed in the wiki that when forcing an expire, you should stop
>> spamassassin first.  Is this strictly necessary?  What are the
>> consequences of not doing this?
>>
>> Any other suggestions for increasing the speed of the database?
>>
>> Thanks!
>>
>> -- 
>> Jason 'XenoPhage' Frisvold
>> XenoPhage0@gmail.com
>>
>
>


Re: Bayes - Autoexpiry, bayes_seen, and bayes_tok

Posted by Jason Frisvold <xe...@gmail.com>.
No takers on this?  Have I hit upon a FAQ question?  I swear I looked
and searched and I didn't find suitable answers...

On 11/23/06, Jason Frisvold <xe...@gmail.com> wrote:
> Greetings,
>
> Just a few quick questions.  First, I noticed that prior to 3.1.0
> bayes_seen was not auto expiring.  That bug is marked as fixed, so is
> it safe to say that bayes_seen is now expiring automatically and that
> a 20+ meg bayes_seen file is valid?
>
> Next, the bayes_tok database is over 3 Gig at this point.  I'd like to
> cut that down a bit as the machine is having considerable trouble
> dealing with it.  So I have a few questions concerning this.
>
> First, can I modify the expiry time, causing an earlier expiration?
> If so, what are the consequences of such an action?
>
> Second, does the autoexpire run for every instance of spamassassin?
> ie, does it run every time a message is processed?  If not, how does
> it determine when to run it?  Would it be better to disable auto
> expire and create a cron job that runs later in the evening to deal
> with auto expire?
>
> I noticed in the wiki that when forcing an expire, you should stop
> spamassassin first.  Is this strictly necessary?  What are the
> consequences of not doing this?
>
> Any other suggestions for increasing the speed of the database?
>
> Thanks!
>
> --
> Jason 'XenoPhage' Frisvold
> XenoPhage0@gmail.com
>


-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com