Posted to users@spamassassin.apache.org by "Rogers, Zoë A." <zo...@dns.co.uk> on 2004/07/20 12:31:12 UTC

Bayes database problem

Something is wrong with our Bayesian database.  The same spam email keeps getting through even though I am feeding it manually into the learner.  It's matching on BAYES_50 only.
 
I tried running sa-learn --dump data | sort > bayes_dump.txt, which should show every token in the database in ascending order of spam probability.  After I terminated the process, the text file contained nothing.  Running sa-learn --dump magic shows over 1000000 tokens, but sa-learn --dump data hangs.  Does anyone know what could be wrong here?  Is there a maximum database size?
 
When I run it with the debug option (-D), it hangs at "Initialising learner":
 
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: Mdsmtp,   P=[IPC], F=mDFMuXa%k5, S=EnvFromSMTP/HdrFromSMTP, R=EnvToSMTP, E=\r\n, L=990,
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: Mrelay,   P=[IPC], F=mDFMuXa8k, S=EnvFromSMTP/HdrFromSMTP, R=MasqSMTP, E=\r\n, L=2040,
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: use_razor1 0
debug: bayes: 8301 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_toks
debug: bayes: 8301 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_seen
debug: bayes: found bayes db version 2
debug: Score set 2 chosen.
debug: Initialising learner


 
sa-learn --force-expire -D
 
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: Mrelay,   P=[IPC], F=mDFMuXa8k, S=EnvFromSMTP/HdrFromSMTP, R=MasqSMTP, E=\r\n, L=2040,
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: use_razor1 0
debug: bayes: 8213 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_toks
debug: bayes: 8213 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_seen
debug: bayes: found bayes db version 2
debug: Score set 2 chosen.
debug: Initialising learner
debug: Initialising learner
debug: Syncing Bayes journal and expiring old tokens...
debug: lock: 8213 created /usr/local/share/spamassassin/run/bayes.lock.mss-mail-in-5.dnsmss.net.8213
debug: lock: 8213 trying to get lock on /usr/local/share/spamassassin/run/bayes with 0 retries
debug: lock: 8213 breaking stale /usr/local/share/spamassassin/run/bayes.lock: age=1090317256 now=1090317973
debug: lock: 8213 trying to get lock on /usr/local/share/spamassassin/run/bayes with 1 retries
debug: lock: 8213 link to /usr/local/share/spamassassin/run/bayes.lock: link ok
debug: bayes: 8213 tie-ing to DB file R/W /usr/local/share/spamassassin/run/bayes_toks
debug: bayes: 8213 tie-ing to DB file R/W /usr/local/share/spamassassin/run/bayes_seen
debug: bayes: found bayes db version 2
synced Bayes databases from journal in 0 seconds: 685 unique entries (887 total entries)
debug: bayes: expiry check keep size, 75% of max: 112500
debug: bayes: token count: 2273369, final goal reduction size: 2160869
debug: bayes: First pass?  Current: 1090317975, Last: 1090299462, atime: 736899888, count: 0, newdelta: 0, ratio: 0
debug: bayes: something fishy, calculating atime (first pass)
 
Then it hangs for ages until I have to terminate it.
 
Thanks,
Zoe


Re: Bayes database problem

Posted by Michael Parker <pa...@pobox.com>.
On Tue, Jul 20, 2004 at 02:39:23PM -0600, Lucas Albers wrote:
> 
> Theo Van Dinter said:
> 
> > IMO, BTW, if you're going to do sitewide, and it's a large db file,
> > just put the bayes db in sql (3.0 feature).  You'll probably use less
> > resources and get better performance.
> 
> Is there any documentation on upgrading from 2.63 to 3.0 and switching to
> a MySQL Bayes database as part of 3.0?
> 
> I looked at the install documentation, and was hoping for something better.

sql/README.bayes

Michael

Re: Bayes database problem

Posted by Lucas Albers <ad...@cs.montana.edu>.
Theo Van Dinter said:

> IMO, BTW, if you're going to do sitewide, and it's a large db file,
> just put the bayes db in sql (3.0 feature).  You'll probably use less
> resources and get better performance.

Is there any documentation on upgrading from 2.63 to 3.0 and switching to
a MySQL Bayes database as part of 3.0?

I looked at the install documentation, and was hoping for something better.

-- 
Luke Computer Science System Administrator
Security Administrator,College of Engineering
Montana State University-Bozeman,Montana



Re: Bayes database problem

Posted by Theo Van Dinter <fe...@kluge.net>.
On Tue, Jul 20, 2004 at 11:39:34AM -0400, Kris Deugau wrote:
> I think you've got bigger problems.  This looks like SA is reading your
> sendmail.cf file for some reason.

Yeah, you should be using /etc/mail/spamassassin, not /etc/mail ...

> > debug: bayes: 8213 tie-ing to DB file R/O
> > /usr/local/share/spamassassin/run/bayes_toks
> 
> > debug: bayes: 8213 tie-ing to DB file R/W
> > /usr/local/share/spamassassin/run/bayes_toks
> 
> Er...  This is odd.  Supposedly it already did this...

It went R/O before to see if it could do a scan.  To do a sync/expire
it needs to go R/W which requires the DB lock, etc.

> > debug: bayes: expiry check keep size, 75% of max: 112500
> > debug: bayes: token count: 2273369, final goal reduction size: 2160869
> > debug: bayes: First pass?  Current: 1090317975, Last: 1090299462,
> > atime: 736899888, count: 0, newdelta: 0, ratio: 0
> > debug: bayes: something fishy, calculating atime (first pass)

This is going to take quite a while, btw.  It wants to figure out how to
expire almost 2.2 million tokens from your ~2.3 million token db.
It looks like expiry hasn't been able to complete in a while. :(
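
The figures in that debug output can be reproduced with simple arithmetic.
The sketch below is illustrative only: expiry_goal is a made-up helper, not
SpamAssassin code, and the 150000-token maximum is inferred from the
"75% of max: 112500" line rather than read from any configuration file.

```python
def expiry_goal(token_count, max_db_size, keep_fraction=0.75):
    """Return (keep_size, goal_reduction) as reported in the debug log.

    Hypothetical helper: it mirrors the arithmetic visible in the log,
    not SpamAssassin's actual implementation.
    """
    keep_size = int(max_db_size * keep_fraction)
    goal_reduction = token_count - keep_size
    return keep_size, goal_reduction


# Values taken from the debug output in this thread; 150000 is an
# assumption inferred from "75% of max: 112500".
keep, reduction = expiry_goal(token_count=2273369, max_db_size=150000)
print(keep, reduction)
print(f"{reduction / 2273369:.1%} of the tokens would have to be expired")
```

With these inputs the keep size and goal reduction match the log (112500
and 2160869), which is why the expiry run has so much work to do.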

> a medium-size site Bayes should be ~40M _toks + whatever _seen takes
> up;  and a large sitewide Bayes may run up to ~100M.  I wouldn't go much
> higher due to the IO/memory/filesystem cache load.

IMO, BTW, if you're going to do sitewide, and it's a large db file,
just put the bayes db in sql (3.0 feature).  You'll probably use less
resources and get better performance.


Re: Bayes database problem

Posted by Kris Deugau <kd...@vianet.ca>.
"Rogers, Zoë A." wrote:
> debug: Failed to parse line in SpamAssassin configuration, skipping:
> A=TCP $h
> debug: Failed to parse line in SpamAssassin configuration, skipping:
> Mdsmtp,   P=[IPC], F=mDFMuXa%k5, S=EnvFromSMTP/HdrFromSMTP,
> R=EnvToSMTP, E=\r\n, L=990,

I think you've got bigger problems.  This looks like SA is reading your
sendmail.cf file for some reason.

Check your /etc/mail/spamassassin/*.cf files carefully;  then run
spamassassin --lint.  Repeat until you see NO output.
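
To see at a glance which lines SA is choking on, the skipped fragments can
be pulled out of the debug output.  This is only a rough sketch: the
function names are invented for illustration, the prefix is copied from the
debug lines quoted in this thread, and the "looks like sendmail" test is a
crude heuristic, not anything SpamAssassin provides.

```python
import re

# Prefix copied from the debug output quoted in this thread.
SKIP_RE = re.compile(
    r"^debug: Failed to parse line in SpamAssassin configuration, skipping: (.*)$"
)

# Hypothetical heuristic: sendmail.cf mailer definitions carry fields such
# as "P=[IPC]", "A=TCP $h" or "T=DNS/RFC822/SMTP".
SENDMAIL_HINT_RE = re.compile(r"\b[APT]=\S")


def skipped_lines(debug_output):
    """Return the config fragments reported as unparseable."""
    found = []
    for line in debug_output.splitlines():
        match = SKIP_RE.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


def looks_like_sendmail(fragment):
    """Rough guess at whether a skipped fragment came from sendmail.cf."""
    return bool(SENDMAIL_HINT_RE.search(fragment))


sample = """\
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: use_razor1 0
debug: bayes: found bayes db version 2
"""

for fragment in skipped_lines(sample):
    print(looks_like_sendmail(fragment), fragment)
```

Fragments flagged by the heuristic point at a stray sendmail.cf being read;
anything else, such as the use_razor1 line, is more likely a genuine
SpamAssassin setting that this version does not accept.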

> debug: bayes: 8213 tie-ing to DB file R/O
> /usr/local/share/spamassassin/run/bayes_toks
> debug: bayes: 8213 tie-ing to DB file R/O
> /usr/local/share/spamassassin/run/bayes_seen
> debug: bayes: found bayes db version 2
> debug: Score set 2 chosen.

This looks normal...

> debug: Initialising learner
> debug: Initialising learner

This is odd.  I don't recall seeing anything like this myself.

> debug: Syncing Bayes journal and expiring old tokens...

Looks like SA is setting up for a Bayes expiry run.  Depending on what
Bayes options you've set, this may take a while.  :/

> debug: lock: 8213 created
> /usr/local/share/spamassassin/run/bayes.lock.mss-mail-in-5.dnsmss.net.8213
> debug: lock: 8213 trying to get lock on
> /usr/local/share/spamassassin/run/bayes with 0 retries
> debug: lock: 8213 breaking stale
> /usr/local/share/spamassassin/run/bayes.lock: age=1090317256
> now=1090317973
> debug: lock: 8213 trying to get lock on
> /usr/local/share/spamassassin/run/bayes with 1 retries
> debug: lock: 8213 link to
> /usr/local/share/spamassassin/run/bayes.lock: link ok

Probably due to a past attempt to expire tokens, which was interrupted.
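
For a sense of how old that lock was, the two numbers in the "breaking
stale" line can be subtracted.  A small sketch, assuming both values are
epoch seconds and that "age=" holds the lock file's timestamp (the log does
not say so explicitly); lock_age_seconds is a made-up name.

```python
def lock_age_seconds(lock_timestamp, now):
    """Seconds between the lock file's timestamp and the current time."""
    return now - lock_timestamp


# Both values are copied from the "breaking stale" debug line.
age = lock_age_seconds(lock_timestamp=1090317256, now=1090317973)
print(age, "seconds, roughly", round(age / 60), "minutes")
```

That is about twelve minutes, which fits a previous expiry attempt that was
started and then killed shortly before this run.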

> debug: bayes: 8213 tie-ing to DB file R/W
> /usr/local/share/spamassassin/run/bayes_toks
> debug: bayes: 8213 tie-ing to DB file R/W
> /usr/local/share/spamassassin/run/bayes_seen
> debug: bayes: found bayes db version 2

Er...  This is odd.  Supposedly it already did this...

> synced Bayes databases from journal in 0 seconds: 685 unique entries
> (887 total entries)
> debug: bayes: expiry check keep size, 75% of max: 112500
> debug: bayes: token count: 2273369, final goal reduction size: 2160869
> debug: bayes: First pass?  Current: 1090317975, Last: 1090299462,
> atime: 736899888, count: 0, newdelta: 0, ratio: 0
> debug: bayes: something fishy, calculating atime (first pass)

This looks like SA is setting up for a Bayes expiry run.  If you're
running a sitewide Bayes, I'd STRONGLY suggest you add at least the
following to your local.cf:

bayes_learn_to_journal  1
bayes_auto_expire       0

If you set bayes_auto_expire like that, you'll have to set up a cron job
to run a manual expiry periodically.  I've found once a day keeps my
sitewide Bayes happy;  depending on your mail load you may need to make
that every 6 hours, possibly once an hour.  I've set up mine like so:

02 5 * * * root /usr/bin/sa-learn -p /root/.spamassassin/user_prefs --rebuild --force-expire

You may also want to specify how big the Bayes db can get:

bayes_expiry_max_db_size        1000000

1000000 tokens gives about a 40M db;  I just recently bumped it up a bit
to 1100000 and the db jumped to 80M.  YMMV.

How big are your bayes_* files on disk?  I would say personally that a
single-user set of Bayes files shouldn't be much more than 8-10M total; 
a medium-size site Bayes should be ~40M _toks + whatever _seen takes
up;  and a large sitewide Bayes may run up to ~100M.  I wouldn't go much
higher due to the IO/memory/filesystem cache load.

Overlarge bayes_* files have been known to cause problems, IIRC.

-kgd