Posted to users@spamassassin.apache.org by Ian Zimmerman <it...@buug.org> on 2015/05/24 18:32:27 UTC

Confused about Bayes expiry

I am very confused by the various features involving expiry from Bayes.

perldoc Mail::SpamAssassin::Conf :

       bayes_expiry_max_db_size      (default: 150000)

           What should be the maximum size of the Bayes tokens database?
           When expiry occurs, the Bayes system will keep either 75% of
           the maximum value, or 100,000 tokens, whichever has a larger
           value.  150,000 tokens is roughly equivalent to a 8Mb
           database file.

       bayes_auto_expire             (default: 1)

           If enabled, the Bayes system will try to automatically expire
           old tokens from the database.  Auto-expiry occurs when the
           number of tokens in the database surpasses the
           bayes_expiry_max_db_size value. If a bayes datastore backend
           does not implement individual key/value expirations, the
           setting is silently ignored.

       bayes_token_ttl               (default: 3w, i.e. 3 weeks)

           Time-to-live / expiration time in seconds for tokens kept in
           a Bayes database.  A numeric value is optionally suffixed by
           a time unit (s, m, h, d, w, indicating seconds (default),
           minutes, hours, days, weeks).

           If bayes_auto_expire is true and a Bayes datastore backend
           supports it (currently only Redis), this setting controls
           deletion of expired tokens from a bayes database. The value
           is observed on a best-effort basis, exact timing promises are
           not necessarily kept. If a bayes datastore backend does not
           implement individual key/value expirations, the setting is
           silently ignored.
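
As a worked example of the arithmetic in the bayes_expiry_max_db_size text
quoted above, here is a minimal Python sketch. It only restates the
documented rule (keep the larger of 75% of the maximum or 100,000 tokens).
The function name is made up for illustration, and this is not
SpamAssassin's actual expiry code.

```python
def tokens_kept_after_expiry(max_db_size: int = 150_000) -> int:
    """Return the token count the quoted docs say survives an expiry run."""
    # Documented rule: keep 75% of the maximum, or 100,000 tokens,
    # whichever is larger. Integer arithmetic avoids float rounding.
    return max(max_db_size * 3 // 4, 100_000)


if __name__ == "__main__":
    for size in (120_000, 150_000, 400_000):
        print(size, "->", tokens_kept_after_expiry(size))
```

With the default of 150,000 the rule keeps 112,500 tokens. For any maximum
below about 133,000 the 100,000-token floor applies instead.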

This really sounds as if expiry is a no-op for backends other than
Redis.  And yet Debian bug #334829 [1] exists, and has spawned a whole
subculture of solutions and work-arounds.  (Sorry for the slight
exaggeration.)  Clearly the users reporting these problems do not use
Redis; in fact, by all signs they use the default DB backend, as I do.
So should I be worried about the expiry overhead and set up a separate
--force-expire job?  I am confused.

[1]
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=334829

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.


Re: Confused about Bayes expiry

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>Ian> But, in fact I already have a cronjob running "sa-learn
>Ian> --force-expire".  The reason I would prefer to remove it (and so
>Ian> the reason for my original post) is that it does a journal sync as
>Ian> well, which I didn't intend and which interferes with other things.

>On 2015-05-25 09:43 +0200, Matus UHLAR - fantomas wrote:
>Matus> What other things? The journal is there to speed up database
>Matus> updates, not to avoid database writes. A journal that grows too
>Matus> big slows things down.
>
>Matus> The main reason to use manual expiry is to avoid the occasional
>Matus> delays with automatic expiry noted in the bug report you linked
>Matus> to.
>
>Matus> So, again, what are your reasons for avoiding journal syncs?

On 25.05.15 09:47, Ian Zimmerman wrote:
>I do the database updates in a batch fashion, learning each input
>message with --no-sync, then doing a --sync at the end.

How does this differ from running sa-learn over multiple messages, e.g. a
mailbox or maildir?

>  This --sync
>cannot wait too long because I want to defend against current spam.

I'd say the more messages are learned, the longer the sync takes.
Did you measure the times to see the difference?

>That is, it cannot wait as long as the typical time between expires.
>But if an explicit expiry happens to run at the same time, the result is
>a mess.

Explicit expiry happens only when you run it. If you do things in batch
mode, you can call sa-learn --force-expire at the very end of the batch,
which should not cause a mess...
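
One way to picture that ordering is a single batch job that learns with
--no-sync, syncs once, and only then expires. The Python sketch below is an
illustration under assumptions: the function names and the expire_due flag
are made up here, and how a job decides that expiry is due is left open.
The sa-learn options used (--spam, --no-sync, --sync, --force-expire) are
the ones discussed in this thread or listed in the sa-learn manual.

```python
import subprocess
from typing import List


def build_batch_commands(spam_files: List[str], expire_due: bool) -> List[List[str]]:
    """Return the sa-learn invocations for one batch, in execution order."""
    # Learn each message without syncing the journal.
    commands = [["sa-learn", "--spam", "--no-sync", path] for path in spam_files]
    # One journal sync after the whole batch has been learned.
    commands.append(["sa-learn", "--sync"])
    if expire_due:
        # Expire last, inside the same job, so it cannot overlap the learning.
        commands.append(["sa-learn", "--force-expire"])
    return commands


def run_batch(spam_files: List[str], expire_due: bool) -> None:
    """Run the batch sequentially, stopping at the first failing command."""
    for argv in build_batch_commands(spam_files, expire_due):
        subprocess.run(argv, check=True)
```

Since --force-expire reportedly performs a journal sync itself, the explicit
--sync is arguably redundant on runs where expiry is due. It is kept here so
that both paths look alike.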

>Of course there is a simple solution: have a single job which decides by
>itself whether it's time to expire, rather than relying on the cron
>schedule.  But it seemed to me that the two tasks were independent and
>so should be in separate jobs.  As was explained in the other
>subthread, I was wrong in that assumption.

Another simple solution is to use SQL or Redis storage for the Bayes
database :-)


-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
A day without sunshine is like, night.

Re: Confused about Bayes expiry

Posted by Ian Zimmerman <it...@buug.org>.
On 2015-05-25 09:43 +0200, Matus UHLAR - fantomas wrote:

Ian> But, in fact I already have a cronjob running "sa-learn
Ian> --force-expire".  The reason I would prefer to remove it (and so
Ian> the reason for my original post) is that it does a journal sync as
Ian> well, which I didn't intend and which interferes with other things.

Matus> What other things? The journal is there to speed up database
Matus> updates, not to avoid database writes. A journal that grows too
Matus> big slows things down.

Matus> The main reason to use manual expiry is to avoid the occasional
Matus> delays with automatic expiry noted in the bug report you linked
Matus> to.

Matus> So, again, what are your reasons for avoiding journal syncs?

I do the database updates in a batch fashion, learning each input
message with --no-sync, then doing a --sync at the end.  This --sync
cannot wait too long because I want to defend against current spam.
That is, it cannot wait as long as the typical time between expires.
But if an explicit expiry happens to run at the same time, the result is
a mess.

Of course there is a simple solution: have a single job which decides by
itself whether it's time to expire, rather than relying on the cron
schedule.  But it seemed to me that the two tasks were independent and
so should be in separate jobs.  As was explained in the other
subthread, I was wrong in that assumption.

Thanks.



Re: Confused about Bayes expiry

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On 2015-05-24 23:25 +0200, Mark Martinec wrote:
>Mark> With other bayes back-ends the traditional expiration mechanisms
>Mark> need to be used, either auto-expiration runs triggered from time
>Mark> to time by SpamAssassin, or explicit expiration runs, e.g. from a
>Mark> cron job. With these traditional back-ends the bayes_token_ttl
>Mark> setting has no effect.

On 24.05.15 15:26, Ian Zimmerman wrote:
>Perhaps this paragraph could be included verbatim in the podfile, and
>the current wording (especially about bayes_auto_expire) removed :-)

maybe re-worded, not removed.

>But, in fact I already have a cronjob running "sa-learn
>--force-expire".  The reason I would prefer to remove it (and so the
>reason for my original post) is that it does a journal sync as well,
>which I didn't intend and which interferes with other things.

What other things? The journal is there to speed up database updates, not
to avoid database writes. A journal that grows too big slows things down.

The main reason to use manual expiry is to avoid the occasional delays with
automatic expiry noted in the bug report you linked to.

So, again, what are your reasons for avoiding journal syncs?
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
It's now safe to throw off your computer.

Re: Confused about Bayes expiry

Posted by RW <rw...@googlemail.com>.
On Sun, 24 May 2015 15:26:32 -0700
Ian Zimmerman wrote:

> On 2015-05-24 23:25 +0200, Mark Martinec wrote:
> 
> Mark> With other bayes back-ends the traditional expiration mechanisms
> Mark> need to be used, either auto-expiration runs triggered from time
> Mark> to time by SpamAssassin, or explicit expiration runs, e.g. from
> Mark> a cron job. With these traditional back-ends the bayes_token_ttl
> Mark> setting has no effect.
> 
> Perhaps this paragraph could be included verbatim in the podfile, and
> the current wording (especially about bayes_auto_expire) removed :-)
> Thanks.
> 
> But, in fact I already have a cronjob running "sa-learn
> --force-expire".  The reason I would prefer to remove it (and so the
> reason for my original post) is that it does a journal sync as well,
> which I didn't intend and which interferes with other things.
> 
> Would "sa-learn --no-sync --force-expire" make sense?
> 

No, I'm not sure off-hand whether this is supported, but expiry needs a
sync to work properly. With the default setting of
bayes_learn_to_journal it's the only reason to have a journal. 

If you remove the cron entry and use auto-expiry, the expiry would
presumably do a sync as a side effect anyway.


Re: Confused about Bayes expiry

Posted by Ian Zimmerman <it...@buug.org>.
On 2015-05-24 23:25 +0200, Mark Martinec wrote:

Mark> With other bayes back-ends the traditional expiration mechanisms
Mark> need to be used, either auto-expiration runs triggered from time
Mark> to time by SpamAssassin, or explicit expiration runs, e.g. from a
Mark> cron job. With these traditional back-ends the bayes_token_ttl
Mark> setting has no effect.

Perhaps this paragraph could be included verbatim in the podfile, and
the current wording (especially about bayes_auto_expire) removed :-)
Thanks.

But, in fact I already have a cronjob running "sa-learn
--force-expire".  The reason I would prefer to remove it (and so the
reason for my original post) is that it does a journal sync as well,
which I didn't intend and which interferes with other things.

Would "sa-learn --no-sync --force-expire" make sense?



Re: Confused about Bayes expiry

Posted by Mark Martinec <Ma...@ijs.si>.
  Ian Zimmerman wrote:

> I am very confused by the various features involving expiry from Bayes.
> 
> perldoc Mail::SpamAssassin::Conf :
> 
>        bayes_expiry_max_db_size      (default: 150000)
> 
>            What should be the maximum size of the Bayes tokens database?
>            When expiry occurs, the Bayes system will keep either 75% of
>            the maximum value, or 100,000 tokens, whichever has a larger
>            value.  150,000 tokens is roughly equivalent to a 8Mb
>            database file.
> 
>        bayes_auto_expire             (default: 1)
> 
>            If enabled, the Bayes system will try to automatically expire
>            old tokens from the database.  Auto-expiry occurs when the
>            number of tokens in the database surpasses the
>            bayes_expiry_max_db_size value. If a bayes datastore backend
>            does not implement individual key/value expirations, the
>            setting is silently ignored.
> 
>        bayes_token_ttl               (default: 3w, i.e. 3 weeks)
> 
>            Time-to-live / expiration time in seconds for tokens kept in
>            a Bayes database.  A numeric value is optionally suffixed by
>            a time unit (s, m, h, d, w, indicating seconds (default),
>            minutes, hours, days, weeks).
> 
>            If bayes_auto_expire is true and a Bayes datastore backend
>            supports it (currently only Redis), this setting controls
>            deletion of expired tokens from a bayes database. The value
>            is observed on a best-effort basis, exact timing promises are
>            not necessarily kept. If a bayes datastore backend does not
>            implement individual key/value expirations, the setting is
>            silently ignored.
> 
> This really sounds as if expiry is a no-op for backends other than
> Redis.  And yet Debian bug #334829 [1] exists, and has spawned a whole
> subculture of solutions and work-arounds.  (Sorry for the slight
> exaggeration.)  Clearly the users reporting these problems do not use
> Redis; in fact, by all signs they use the default DB backend, as I do.
> So should I be worried about the expiry overhead and set up a separate
> --force-expire job?  I am confused.
>   [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=334829

The redis backend takes advantage of an auto-expiry mechanism of
key/value pairs as provided by a redis server internally (transparently
and automatically), so with this backend the bayes_token_ttl is the
only setting that matters, and SpamAssassin (auto)expiration runs
are not needed; in fact they are a no-op and should not be used.
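
To make the bayes_token_ttl value format concrete, here is a small Python
sketch that converts a value such as "3w" into seconds, following the unit
suffixes listed in the quoted perldoc text (s, m, h, d, w, with seconds as
the default). The function name is hypothetical and the parsing is only an
assumption based on that description; SpamAssassin's own parser may accept
more forms.

```python
import re

# Seconds per unit, following the suffixes listed in the perldoc text.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def ttl_to_seconds(value: str) -> int:
    """Convert a value such as '3w' or '3600' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised TTL value: {value!r}")
    number, unit = match.groups()
    # A bare number is taken as seconds, the documented default unit.
    return int(number) * _UNIT_SECONDS[unit or "s"]


if __name__ == "__main__":
    print(ttl_to_seconds("3w"))
```

The default of 3w works out to 1,814,400 seconds, which is the lifetime a
Redis server would apply to each token key under this setting.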

With other bayes back-ends the traditional expiration mechanisms
need to be used, either auto-expiration runs triggered from time
to time by SpamAssassin, or explicit expiration runs, e.g. from
a cron job. With these traditional back-ends the bayes_token_ttl
setting has no effect.

> and has spawned a whole subculture of solutions and work-arounds

Indeed. These mostly pre-date the availability of a Redis back-end.

   Mark