Posted to users@spamassassin.apache.org by "Rogers, Zoë A." <zo...@dns.co.uk> on 2004/07/20 12:31:12 UTC
Bayes database problem
Something is wrong with our Bayesian database. The same spam email keeps getting through even though I am manually feeding copies of it into the learner. It's matching on BAYES_50 only.
I tried running sa-learn --dump data | sort > bayes_dump.txt, which should show every token in the database in ascending order of spam probability. After I terminate the process, the text file is empty. When I run sa-learn --dump magic it reports over 1000000 tokens, but sa-learn --dump data hangs. Does anyone know what could be wrong here? Is there a maximum database size?
When I run with the debug option on, it hangs at "Initialising learner".
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: Mdsmtp, P=[IPC], F=mDFMuXa%k5, S=EnvFromSMTP/HdrFromSMTP, R=EnvToSMTP, E=\r\n, L=990,
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: Mrelay, P=[IPC], F=mDFMuXa8k, S=EnvFromSMTP/HdrFromSMTP, R=MasqSMTP, E=\r\n, L=2040,
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: use_razor1 0
debug: bayes: 8301 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_toks
debug: bayes: 8301 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_seen
debug: bayes: found bayes db version 2
debug: Score set 2 chosen.
debug: Initialising learner
sa-learn --force-expire -D
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: Mrelay, P=[IPC], F=mDFMuXa8k, S=EnvFromSMTP/HdrFromSMTP, R=MasqSMTP, E=\r\n, L=2040,
debug: Failed to parse line in SpamAssassin configuration, skipping: T=DNS/RFC822/SMTP,
debug: Failed to parse line in SpamAssassin configuration, skipping: A=TCP $h
debug: Failed to parse line in SpamAssassin configuration, skipping: use_razor1 0
debug: bayes: 8213 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_toks
debug: bayes: 8213 tie-ing to DB file R/O /usr/local/share/spamassassin/run/bayes_seen
debug: bayes: found bayes db version 2
debug: Score set 2 chosen.
debug: Initialising learner
debug: Initialising learner
debug: Syncing Bayes journal and expiring old tokens...
debug: lock: 8213 created /usr/local/share/spamassassin/run/bayes.lock.mss-mail-in-5.dnsmss.net.8213
debug: lock: 8213 trying to get lock on /usr/local/share/spamassassin/run/bayes with 0 retries
debug: lock: 8213 breaking stale /usr/local/share/spamassassin/run/bayes.lock: age=1090317256 now=1090317973
debug: lock: 8213 trying to get lock on /usr/local/share/spamassassin/run/bayes with 1 retries
debug: lock: 8213 link to /usr/local/share/spamassassin/run/bayes.lock: link ok
debug: bayes: 8213 tie-ing to DB file R/W /usr/local/share/spamassassin/run/bayes_toks
debug: bayes: 8213 tie-ing to DB file R/W /usr/local/share/spamassassin/run/bayes_seen
debug: bayes: found bayes db version 2
synced Bayes databases from journal in 0 seconds: 685 unique entries (887 total entries)
debug: bayes: expiry check keep size, 75% of max: 112500
debug: bayes: token count: 2273369, final goal reduction size: 2160869
debug: bayes: First pass? Current: 1090317975, Last: 1090299462, atime: 736899888, count: 0, newdelta: 0, ratio: 0
debug: bayes: something fishy, calculating atime (first pass)
Then it hangs for ages until I have to terminate.
Thanks,
Zoe
Re: Bayes database problem
Posted by Michael Parker <pa...@pobox.com>.
On Tue, Jul 20, 2004 at 02:39:23PM -0600, Lucas Albers wrote:
>
> Theo Van Dinter said:
>
> > IMO, BTW, if you're going to do sitewide, and it's a large db file,
> > just put the bayes db in sql (3.0 feature). You'll probably use less
> > resources and get better performance.
>
> Is there any documentation on upgrading from 2.63 to 3.0 and switching to
> a MySQL Bayesian database as part of 3.0?
>
> I looked at the install documentation, and was hoping for something better.
sql/README.bayes
Michael
Re: Bayes database problem
Posted by Lucas Albers <ad...@cs.montana.edu>.
Theo Van Dinter said:
> IMO, BTW, if you're going to do sitewide, and it's a large db file,
> just put the bayes db in sql (3.0 feature). You'll probably use less
> resources and get better performance.
Is there any documentation on upgrading from 2.63 to 3.0 and switching to
a MySQL Bayesian database as part of 3.0?
I looked at the install documentation, and was hoping for something better.
--
Luke Computer Science System Administrator
Security Administrator, College of Engineering
Montana State University-Bozeman, Montana
Re: Bayes database problem
Posted by Theo Van Dinter <fe...@kluge.net>.
On Tue, Jul 20, 2004 at 11:39:34AM -0400, Kris Deugau wrote:
> I think you've got bigger problems. This looks like SA is reading your
> sendmail.cf file for some reason.
Yeah, you should be using /etc/mail/spamassassin, not /etc/mail ...
> > debug: bayes: 8213 tie-ing to DB file R/O
> > /usr/local/share/spamassassin/run/bayes_toks
>
> > debug: bayes: 8213 tie-ing to DB file R/W
> > /usr/local/share/spamassassin/run/bayes_toks
>
> Er... This is odd. Supposedly it already did this...
It went R/O before to see if it could do a scan. To do a sync/expire
it needs to go R/W which requires the DB lock, etc.
> > debug: bayes: expiry check keep size, 75% of max: 112500
> > debug: bayes: token count: 2273369, final goal reduction size: 2160869
> > debug: bayes: First pass? Current: 1090317975, Last: 1090299462,
> > atime: 736899888, count: 0, newdelta: 0, ratio: 0
> > debug: bayes: something fishy, calculating atime (first pass)
This is going to take quite a while, btw. It wants to figure out how to
expire almost 2.2 million tokens from your ~2.3 million token db.
It looks like expiry hasn't been able to complete in a while. :(
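For anyone following the numbers in that debug output, the arithmetic can be sketched roughly like this. It is only an illustration in Python, not SpamAssassin's own code: the function name is made up, and the 150000 maximum is inferred from the "75% of max: 112500" line (112500 / 0.75) rather than read from the poster's config.

```python
def expiry_goal(token_count, max_db_size, keep_fraction=0.75):
    """Return (keep_size, reduction_goal) matching the figures in the debug log.

    max_db_size is an assumption inferred from the log, not a confirmed setting.
    """
    keep_size = int(max_db_size * keep_fraction)
    # Everything above the keep size has to be expired.
    reduction_goal = max(token_count - keep_size, 0)
    return keep_size, reduction_goal


keep, goal = expiry_goal(token_count=2273369, max_db_size=150000)
print(keep, goal)  # 112500 2160869
print(f"share of tokens to expire: {goal / 2273369:.1%}")
```

The two values printed first are the same ones the log reports as "keep size" and "final goal reduction size".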
> a medium-size site Bayes should be ~40M _toks + whatever _seen takes
> up; and a large sitewide Bayes may run up to ~100M. I wouldn't go much
> higher due to the IO/memory/filesystem cache load.
IMO, BTW, if you're going to do sitewide, and it's a large db file,
just put the bayes db in sql (3.0 feature). You'll probably use less
resources and get better performance.
--
Randomly Generated Tagline:
Love isn't hopeless. Look, maybe I'm no expert on the subject, but there
was one time I got it right.
-- Homer Simpson
Another Simpson's Clip Show
Re: Bayes database problem
Posted by Kris Deugau <kd...@vianet.ca>.
"Rogers, Zoë A." wrote:
> debug: Failed to parse line in SpamAssassin configuration, skipping:
> A=TCP $h
> debug: Failed to parse line in SpamAssassin configuration, skipping:
> Mdsmtp, P=[IPC], F=mDFMuXa%k5, S=EnvFromSMTP/HdrFromSMTP,
> R=EnvToSMTP, E=\r\n, L=990,
I think you've got bigger problems. This looks like SA is reading your
sendmail.cf file for some reason.
Check your /etc/mail/spamassassin/*.cf files carefully; then run
spamassassin --lint. Repeat until you see NO output.
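If eyeballing the files is tedious, a rough Python sketch like the one below might help spot sendmail-style lines among the .cf files. The regular expression is only a guess based on the "Failed to parse line" messages quoted above, the function name is made up, and the directory is the one mentioned in this message; adjust it to wherever SA is actually reading its config from.

```python
import glob
import os
import re

# Heuristic only: matches mailer definitions such as "Mrelay, P=[IPC], ..."
# and continuation lines such as "A=TCP $h" or "T=DNS/RFC822/SMTP,".
SENDMAIL_LINE = re.compile(r"^\s*(M\w+,\s*P=|[A-Z]=\S)")


def find_sendmail_lines(config_dir):
    """Yield (path, line_number, text) for lines that look like sendmail.cf."""
    for path in sorted(glob.glob(os.path.join(config_dir, "*.cf"))):
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if SENDMAIL_LINE.match(line):
                    yield path, number, line.rstrip()


if __name__ == "__main__":
    # Assumed location; change it if your site config lives elsewhere.
    for path, number, text in find_sendmail_lines("/etc/mail/spamassassin"):
        print(f"{path}:{number}: {text}")
```

This does not replace spamassassin --lint; it only narrows down which file the stray lines are coming from.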
> debug: bayes: 8213 tie-ing to DB file R/O
> /usr/local/share/spamassassin/run/bayes_toks
> debug: bayes: 8213 tie-ing to DB file R/O
> /usr/local/share/spamassassin/run/bayes_seen
> debug: bayes: found bayes db version 2
> debug: Score set 2 chosen.
This looks normal...
> debug: Initialising learner
> debug: Initialising learner
This is odd. I don't recall seeing anything like this myself.
> debug: Syncing Bayes journal and expiring old tokens...
Looks like SA is setting up for a Bayes expiry run. Depending on what
Bayes options you've set, this may take a while. :/
> debug: lock: 8213 created
> /usr/local/share/spamassassin/run/bayes.lock.mss-mail-in-5.dnsmss.net.8213
> debug: lock: 8213 trying to get lock on
> /usr/local/share/spamassassin/run/bayes with 0 retries
> debug: lock: 8213 breaking stale
> /usr/local/share/spamassassin/run/bayes.lock: age=1090317256
> now=1090317973
> debug: lock: 8213 trying to get lock on
> /usr/local/share/spamassassin/run/bayes with 1 retries
> debug: lock: 8213 link to
> /usr/local/share/spamassassin/run/bayes.lock: link ok
Probably due to a past attempt to expire tokens, which was interrupted.
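As a quick check of how old that lock was, here is a small Python sketch. It assumes the "age=" value in the log is an epoch timestamp for the lock file rather than a duration, which its magnitude suggests; the function name is made up.

```python
from datetime import datetime, timezone


def lock_elapsed_seconds(lock_timestamp, now_timestamp):
    """Seconds between the lock's timestamp and the moment it was broken."""
    return now_timestamp - lock_timestamp


lock_time = 1090317256  # "age=" value from the log (assumed to be epoch time)
now = 1090317973        # "now=" value from the log

elapsed = lock_elapsed_seconds(lock_time, now)
print(f"lock was {elapsed} seconds old")  # 717 seconds, just under 12 minutes
print(datetime.fromtimestamp(lock_time, tz=timezone.utc).isoformat())
```

A lock roughly 12 minutes old fits an earlier run that was killed while it still held the lock.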
> debug: bayes: 8213 tie-ing to DB file R/W
> /usr/local/share/spamassassin/run/bayes_toks
> debug: bayes: 8213 tie-ing to DB file R/W
> /usr/local/share/spamassassin/run/bayes_seen
> debug: bayes: found bayes db version 2
Er... This is odd. Supposedly it already did this...
> synced Bayes databases from journal in 0 seconds: 685 unique entries
> (887 total entries)
> debug: bayes: expiry check keep size, 75% of max: 112500
> debug: bayes: token count: 2273369, final goal reduction size: 2160869
> debug: bayes: First pass? Current: 1090317975, Last: 1090299462,
> atime: 736899888, count: 0, newdelta: 0, ratio: 0
> debug: bayes: something fishy, calculating atime (first pass)
This looks like SA is setting up for a Bayes expiry run. If you're
running a sitewide Bayes, I'd STRONGLY suggest you add at least the
following to your local.cf:
bayes_learn_to_journal 1
bayes_auto_expire 0
If you set bayes_auto_expire like that, you'll have to set up a cron job
to run a manual expiry periodically. I've found once a day keeps my
sitewide Bayes happy; depending on your mail load you may need to make
that every 6 hours, possibly once an hour. I've set up mine like so
(this is a single crontab line):
02 5 * * * root /usr/bin/sa-learn -p /root/.spamassassin/user_prefs --rebuild --force-expire
You may also want to specify how big the Bayes db can get:
bayes_expiry_max_db_size 1000000
1000000 tokens gives about a 40M db; I just recently bumped it up a bit
to 1100000 and the db jumped to 80M. YMMV.
How big are your bayes_* files on disk? I would say personally that a
single-user set of Bayes files shouldn't be much more than 8-10M total;
a medium-size site Bayes should be ~40M _toks + whatever _seen takes
up; and a large sitewide Bayes may run up to ~100M. I wouldn't go much
higher due to the IO/memory/filesystem cache load.
Overlarge bayes_* files have been known to cause problems, IIRC.
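To answer the size question quickly, something like the following Python sketch would list the files. The function name is made up, and the directory is the one shown in the debug log earlier in the thread; ls -l on the same directory tells you the same thing.

```python
import glob
import os


def bayes_file_sizes(db_dir):
    """Return {filename: size in MiB} for every bayes_* file in db_dir."""
    sizes = {}
    for path in sorted(glob.glob(os.path.join(db_dir, "bayes_*"))):
        if os.path.isfile(path):
            sizes[os.path.basename(path)] = os.path.getsize(path) / (1024 * 1024)
    return sizes


if __name__ == "__main__":
    # Directory taken from the debug log; adjust for your own setup.
    for name, mib in bayes_file_sizes("/usr/local/share/spamassassin/run").items():
        print(f"{name}: {mib:.1f} MiB")
```

Compare the bayes_toks figure against the rough 40M and 100M guidelines above.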
-kgd
--
Get your mouse off of there! You don't know where that email has been!