Posted to dev@spamassassin.apache.org by Michael Parker <pa...@pobox.com> on 2004/04/22 22:07:21 UTC

Re: svn commit: rev 10185 - in incubator/spamassassin/trunk/lib/Mail/SpamAssassin: . BayesStore

On Thu, Apr 22, 2004 at 07:47:56PM -0000, felicity@apache.org wrote:
> Author: felicity
> Date: Thu Apr 22 12:47:55 2004
> New Revision: 10185
> 
> Modified:
>    incubator/spamassassin/trunk/lib/Mail/SpamAssassin/BayesStore.pm
>    incubator/spamassassin/trunk/lib/Mail/SpamAssassin/BayesStore/DBM.pm
>    incubator/spamassassin/trunk/lib/Mail/SpamAssassin/BayesStore/SQL.pm
> Log:
> during expiry, handle the situation where the newest tokens is in the future by resetting the newest token stamp to current time, and changing any token atimes > newest atime to newest atime.
> 

Curious, should the atimes be updated before calculating the expire
delta?

Michael

Re: svn commit: rev 10185 - in incubator/spamassassin/trunk/lib/Mail/SpamAssassin: . BayesStore

Posted by Michael Parker <pa...@pobox.com>.
On Thu, Apr 22, 2004 at 04:36:12PM -0400, Theo Van Dinter wrote:
> On Thu, Apr 22, 2004 at 03:07:21PM -0500, Michael Parker wrote:
> > Curious, should the atimes be updated before calculating the expire
> > delta?
> 
> Umm.  It depends?  ;)
> 
> The problem I was trying to solve there is the "relative handful of tokens
> are in the future" issue, which causes expiry to not function properly.
> In that case, we're not going to expire the future tokens unless we can
> do a normal expire based on newest==current anyway, so there's no issue.
> 
> If, however, the problem is "relative lots of tokens in the future",
> then we'll be in the same "can't expire" state.   IMHO though, the
> solution for that is to blow away the DB and start over, since the 3.0
> code already should stop new tokens being added in the future, so we're
> really only talking about people who currently have the issue.

Ok, I'm cool with just handling the few anomalies.  I was just making
sure you weren't thinking about the really screwed up case, which would
never get hit due to the expire delta.

No prob then.

Michael

Re: svn commit: rev 10185 - in incubator/spamassassin/trunk/lib/Mail/SpamAssassin: . BayesStore

Posted by Theo Van Dinter <fe...@kluge.net>.
On Thu, Apr 22, 2004 at 03:07:21PM -0500, Michael Parker wrote:
> Curious, should the atimes be updated before calculating the expire
> delta?

Umm.  It depends?  ;)

The problem I was trying to solve there is the "relative handful of tokens
are in the future" issue, which causes expiry to not function properly.
In that case, we're not going to expire the future tokens unless we can
do a normal expire based on newest==current anyway, so there's no issue.

If, however, the problem is "relative lots of tokens in the future",
then we'll be in the same "can't expire" state.   IMHO though, the
solution for that is to blow away the DB and start over, since the 3.0
code already should stop new tokens being added in the future, so we're
really only talking about people who currently have the issue.
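
To make the "handful of future tokens" case concrete, here is a minimal
sketch in Python of the clamping described in the commit log. It is not
the actual BayesStore code: the dict standing in for the token DB and the
function name clamp_future_atimes are assumptions for illustration only.

```python
import time

def clamp_future_atimes(tokens, newest_atime, now=None):
    """Pull future-dated atimes back to the current time.

    tokens is a dict mapping token -> atime (a stand-in for the real DB).
    Returns the (possibly clamped) tokens and the newest atime to use.
    """
    now = int(time.time()) if now is None else now
    if newest_atime <= now:
        # Nothing is in the future, so expiry can proceed as usual.
        return tokens, newest_atime
    # Reset the newest stamp to "now" and clamp any atime beyond it.
    clamped = {tok: min(atime, now) for tok, atime in tokens.items()}
    return clamped, now
```

If most of the DB is dated in the future, this still leaves nearly every
token with the same atime, which is the "can't expire" state mentioned
above.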

-- 
Randomly Generated Tagline:
"She taught me Cuban, which is a lot like Spanish only without as many
 words for luxury items." - Emo Philips

Re: I may have less access to the net for a while, and a Bayes question

Posted by Loren Wilton <lw...@earthlink.net>.
How frequently CAN things expire, and how frequently MUST things expire?

If everything has to expire sooner or later, you don't need the high bits of
the time.  If things can expire once an hour, you don't need the low 11 or
12 bits of the time.

If you throw out the bottom 11 bits, save 16, and discard the top
32-11-16=5, you can store a time in 16 bits that will allow an expiration
about every 34 minutes (2^11 seconds) and allow a retention of about
4.2 years.

For that matter you could discard the bottom 11 and save 8 bits and have a 6
day max retention.  I assume that would be too low, but I don't know that
for a fact.
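
The arithmetic can be checked with a short Python sketch. This is only an
illustration of the bit-dropping idea, not SpamAssassin code; the names
pack_atime and coarse_age are made up here, and handling of wraparound
beyond one full cycle is left out.

```python
DROP_BITS = 11   # granularity: 2**11 s = 2048 s, about 34 minutes
KEEP_BITS = 16   # range: 2**(11+16) s, about 4.25 years
MASK = (1 << KEEP_BITS) - 1

def pack_atime(unix_time: int) -> int:
    """Return the 16-bit coarse stamp for a Unix time (wraps every 2**27 s)."""
    return (unix_time >> DROP_BITS) & MASK

def coarse_age(now: int, stamp: int) -> int:
    """Age in seconds of a stamp relative to now, allowing one wraparound."""
    ticks = (pack_atime(now) - stamp) & MASK
    return ticks << DROP_BITS

# The figures quoted above, computed rather than estimated.
granularity_minutes = (1 << DROP_BITS) / 60                      # about 34.1
retention_years = (1 << (DROP_BITS + KEEP_BITS)) / (365.25 * 86400)  # about 4.25
eight_bit_days = (1 << (DROP_BITS + 8)) / 86400                  # about 6.07
```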

        Loren

----- Original Message ----- 
From: "Michael Parker" <pa...@pobox.com>
To: "Sidney Markowitz" <si...@sidney.com>
Cc: "Spam Assassin Dev" <sp...@incubator.apache.org>
Sent: Thursday, April 22, 2004 9:03 PM
Subject: Re: I may have less access to the net for a while, and a Bayes
question


> On Fri, Apr 23, 2004 at 09:08:28AM +1200, Sidney Markowitz wrote:
> >
> > When I suggested making the atime in Bayes two bytes instead of four by
> > making it coarser grained than one second, somebody (Justin?) said that
> > it had been tried and produced problems in handling expiry.
> >
> > Can whoever knows about this post some details? It would make such a
> > difference in I/O requirements to not have to update the atime field on
> > every significant token in every message, that I don't want to just drop
> > the issue without trying to solve the problems with it. I have a hard
> > time imagining what would be wrong with expiring tokens on a day
> > boundary instead of one second.
> >
>
> I could definitely see folks expiring more than once a day on a busy
> site and that could cause a problem.
>
> Perhaps we could make it configurable, where we let the user select
> the buffer (1 day, 2 days, 1 week, 1 month, etc) with the default
> being 0.  The folks who want to lower their I/O requirements can
> adjust the value.
>
> That doesn't solve the issue of storing the value in a two byte int,
> I'll have to think on that one a little bit more.
>
> Michael
>


Re: I may have less access to the net for a while, and a Bayes question

Posted by Michael Parker <pa...@pobox.com>.
On Fri, Apr 23, 2004 at 09:08:28AM +1200, Sidney Markowitz wrote:
> 
> When I suggested making the atime in Bayes two bytes instead of four by 
> making it coarser grained than one second, somebody (Justin?) said that 
> it had been tried and produced problems in handling expiry.
> 
> Can whoever knows about this post some details? It would make such a 
> difference in I/O requirements to not have to update the atime field on 
> every significant token in every message, that I don't want to just drop 
> the issue without trying to solve the problems with it. I have a hard 
> time imagining what would be wrong with expiring tokens on a day 
> boundary instead of one second.
> 

I could definitely see folks expiring more than once a day on a busy
site and that could cause a problem.

Perhaps we could make it configurable, where we let the user select
the buffer (1 day, 2 days, 1 week, 1 month, etc) with the default
being 0.  The folks who want to lower their I/O requirements can
adjust the value.
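
Roughly what I have in mind, as a Python sketch. This is hypothetical:
no such option exists today, and the name should_update_atime and the
buffer_secs parameter are made up for illustration.

```python
def should_update_atime(stored_atime: int, msg_atime: int,
                        buffer_secs: int = 0) -> bool:
    """Decide whether a token's atime is worth rewriting.

    With buffer_secs=0 every newer message time triggers a write.
    A larger buffer skips the write until the stored atime is stale
    by more than the buffer, trading atime precision for less I/O.
    """
    return msg_atime - stored_atime > buffer_secs
```

With a one-day buffer (86400), a token seen many times in one day would
have its atime written at most about once per day.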

That doesn't solve the issue of storing the value in a two byte int,
I'll have to think on that one a little bit more.

Michael

Re: I may have less access to the net for a while, and a Bayes question

Posted by Kelsey Cummings <kg...@sonic.net>.
On Thu, Apr 22, 2004 at 03:11:49PM -0700, Daniel Quinlan wrote:
> Sidney Markowitz <si...@sidney.com> writes:
> 
> > When I suggested making the atime in Bayes two bytes instead of four by 
> > making it coarser grained than one second, somebody (Justin?) said that 
> > it had been tried and produced problems in handling expiry.
> 
> If I recall correctly, it makes expiry difficult when entire Bayes DBs
> are being cycled through faster than the granularity allows -- some
> sites get a lot of email.  Justin, Theo, and Kelsey might have some more
> information about this, but there might also be some old threads/bugs --
> search for atime in bugzilla and/or on the mailing list.

I don't have any useful input here.  Just that, in our case, without a
global bayes DB, we shouldn't see anything different than any individual
who gets a lot of email.  Our top users get up to a couple thousand a day;
most are nowhere near this.

-- 
Kelsey Cummings - kgc@sonic.net           sonic.net, inc.
System Administrator                      2260 Apollo Way
707.522.1000 (Voice)                      Santa Rosa, CA 95407
707.547.2199 (Fax)                        http://www.sonic.net/
Fingerprint = D5F9 667F 5D32 7347 0B79  8DB7 2B42 86B6 4E2C 3896

Re: I may have less access to the net for a while, and a Bayes question

Posted by Daniel Quinlan <qu...@pathname.com>.
Sidney Markowitz <si...@sidney.com> writes:

> When I suggested making the atime in Bayes two bytes instead of four by 
> making it coarser grained than one second, somebody (Justin?) said that 
> it had been tried and produced problems in handling expiry.

If I recall correctly, it makes expiry difficult when entire Bayes DBs
are being cycled through faster than the granularity allows -- some
sites get a lot of email.  Justin, Theo, and Kelsey might have some more
information about this, but there might also be some old threads/bugs --
search for atime in bugzilla and/or on the mailing list.
 
> Thanks, and wish me luck on the laptop.

Good luck.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

I may have less access to the net for a while, and a Bayes question

Posted by Sidney Markowitz <si...@sidney.com>.
My laptop is dying again and I am about to take it in for repairs 
(again), which means less convenient access to the 'net, and 
especially to email. That might mean I don't participate in a bug 
squashing session depending on the timing.

Before I try to complete a final incremental backup and shut it down, I 
would like to get a question out there. I apologize in advance for 
asking without doing a thorough search of the archives first.

When I suggested making the atime in Bayes two bytes instead of four by 
making it coarser grained than one second, somebody (Justin?) said that 
it had been tried and produced problems in handling expiry.

Can whoever knows about this post some details? It would make such a 
difference in I/O requirements to not have to update the atime field on 
every significant token in every message, that I don't want to just drop 
the issue without trying to solve the problems with it. I have a hard 
time imagining what would be wrong with expiring tokens on a day 
boundary instead of one second.

Thanks, and wish me luck on the laptop.

By the way, Toshiba doesn't make the Satellite Pro 6100 anymore, but if 
you ever have a chance to get one, run the other way.

  -- sidney