Posted to users@spamassassin.apache.org by Jesper Wallin <je...@ifconfig.se> on 2011/11/21 23:31:37 UTC

A few questions regarding Bayes in 3.4.0

Hi,

I recently upgraded to SA 3.4.0-rsvnunknown (using
https://launchpad.net/~spamassassin/+archive/spamassassin-old on Ubuntu
10.04 LTS) from SA 3.3.2 on a different machine running ArchLinux. I use
MySQL to store user preferences as well as Bayes data. No AWL, no
autolearning of the Bayesian filter, and both machines run sa-update as
daily cron jobs.

I migrated my MySQL database containing all settings along with my 
/etc/spamassassin directory with my static settings/rules to the new 
machine, ran sa-update, sa-compile and restarted spamd. I was curious to 
see if 3.4.0 scored a certain message differently than 3.3.2, so I ran 
"cat spam | spamc -u jesper@ifconfig.se -R" in order to see the result.

To my surprise, the Bayesian filter only scored 60-80% (BAYES_60) where
it previously scored 95-99% (BAYES_95). Have there been any major
changes to the Bayesian engine in 3.4 (and/or its SQL storage backend)?
I copied my spam/ham corpus to the new machine and ran
sa-learn on top of the current database to see if that helped.
Shockingly, it now scored 1-5% (BAYES_05), so I decided to start over.
I ran "sa-learn --clear" to wipe out the old database and
re-ran sa-learn. Now it scored a perfect 99-100% (BAYES_99).

I also noticed that my old database only had 11k tokens while the new
one has about 60k (both the old and new servers have hapaxes enabled and
were trained using a corpus of about 600 spam and 200 ham).
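
For anyone wanting to compare two Bayes databases the same way, here is a minimal Python sketch that reads the counters printed by "sa-learn --dump magic". It assumes the usual layout of that output, where the value sits in the third numeric column of each "non-token data" line. The sample text and the function names parse_dump_magic and tokens_per_message are hypothetical, and the counts are only the rough figures mentioned above, not real output from either machine.

```python
# Illustrative input only: the layout mimics `sa-learn --dump magic`,
# and the counts are the approximate figures quoted in this thread.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        600          0  non-token data: nspam
0.000          0        200          0  non-token data: nham
0.000          0      60000          0  non-token data: ntokens
"""


def parse_dump_magic(text):
    """Return {counter name: int} for every 'non-token data' line."""
    stats = {}
    for line in text.splitlines():
        head, sep, name = line.partition("non-token data:")
        if not sep:
            continue  # not a counter line
        columns = head.split()
        if len(columns) < 3:
            continue  # unexpected layout, skip rather than guess
        # Assumption: the counter value is the third numeric column.
        stats[name.strip()] = int(columns[2])
    return stats


def tokens_per_message(stats):
    """Average number of stored tokens per learned message."""
    learned = stats["nspam"] + stats["nham"]
    return stats["ntokens"] / learned if learned else 0.0


if __name__ == "__main__":
    stats = parse_dump_magic(SAMPLE)
    print(stats["nspam"], stats["nham"], stats["ntokens"])
    print(round(tokens_per_message(stats), 2))
```

Running the same parser over the dump from each server makes a mismatch in nspam, nham or ntokens easy to spot before comparing scores.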

Any thoughts or ideas on what might have caused this?


Regards,
Jesper Wallin

Re: A few questions regarding Bayes in 3.4.0

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2011-11-21 at 23:31 +0100, Jesper Wallin wrote:
> I recently upgraded to SA 3.4.0-rsvnunknown (using
> https://launchpad.net/~spamassassin/+archive/spamassassin-old on Ubuntu
> 10.04 LTS) from SA 3.3.2 on a different machine running ArchLinux. I use
> MySQL to store user preferences as well as Bayes data. No AWL, no
> autolearning of the Bayesian filter, and both machines run sa-update as
> daily cron jobs.
> 
> I migrated my MySQL database containing all settings along with my

Maybe bug 6624? A MySQL server bug, that results in terrible Bayes
performance. The MySQL version of Ubuntu Lucid seems to match the
affected versions.
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624

Fixed in trunk / 3.4. Since your issue was with 3.4, this is kind of
backwards, though the database migration might have triggered this.

I don't see any other relevant changes.

And no, the Bayes sub-system in SA has not been changed since 3.3.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: A few questions regarding Bayes in 3.4.0

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2011-11-22 at 01:47 +0100, Jesper Wallin wrote:
> On 11/22/2011 12:35 AM, Karsten Bräckelmann wrote:

> > > I also noticed that my old database only had 11k tokens while the new
> > > one has about 60k (both the old and new servers have hapaxes enabled and
> > > were trained using a corpus of about 600 spam and 200 ham)
> > 
> > Is that "old" database the original one from the previous system, or old
> > as in "before learning from scratch", but *after* migrating the db?
> >
> > I'd guess the latter. 11k tokens is terribly low, and as you just
> > noticed even less than learning a handful from scratch.
> 
> I meant the original database, created by SA 3.3.2. It has about 11k
> tokens. Also, it runs MySQL 5.5.17 (as that machine runs ArchLinux), and
> I'm not sure about the last comment on the MySQL bug page; it doesn't
> really say whether it's fixed in 5.5.16.

Your Ubuntu system uses 5.1, though.

Anyway, I guess to ever find out if this might be the issue, Mark or
someone else needs to come up with some funky idea.

And regardless, 11k tokens is terribly low.




Re: A few questions regarding Bayes in 3.4.0

Posted by Jesper Wallin <je...@ifconfig.se>.
Hi again, and thanks for your quick reply.

On 11/22/2011 12:35 AM, Karsten Bräckelmann wrote:
> On Mon, 2011-11-21 at 23:31 +0100, Jesper Wallin wrote:
>> I also noticed that my old database only had 11k tokens while the new
>> one has about 60k (both the old and new servers have hapaxes enabled and
>> were trained using a corpus of about 600 spam and 200 ham)
> Is that "old" database the original one from the previous system, or old
> as in "before learning from scratch", but *after* migrating the db?
>
> I'd guess the latter. 11k tokens is terribly low, and as you just
> noticed even less than learning a handful from scratch.
I meant the original database, created by SA 3.3.2. It has about 11k
tokens. Also, it runs MySQL 5.5.17 (as that machine runs ArchLinux), and
I'm not sure about the last comment on the MySQL bug page; it doesn't
really say whether it's fixed in 5.5.16.
> Are you sure the database conversion went cleanly?
>
I used "mysqldump > db.sql" and "mysql < db.sql" to migrate my entire 
MySQL database. Maybe sa-learn would've been a more correct way? Though, 
if the Bayes-backend hasn't been touched, it shouldn't really matter?
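
For reference, sa-learn does offer --backup and --restore options that dump and reload a Bayes database as plain text, independent of the storage backend. Below is a minimal Python sketch of how such a migration could be scripted. It assumes sa-learn is on the PATH and that -u selects the SQL user; the function names and the file name bayes-backup.txt are hypothetical, and nothing here has been verified against the databases discussed in this thread.

```python
import subprocess


def backup_cmd(user):
    """Command line that writes the user's Bayes data to stdout."""
    return ["sa-learn", "-u", user, "--backup"]


def restore_cmd(user, path):
    """Command line that reloads a backup file for the given user."""
    return ["sa-learn", "-u", user, "--restore", path]


def backup(user, path):
    """Run on the old machine: save the backup text to `path`."""
    with open(path, "w") as out:
        subprocess.run(backup_cmd(user), stdout=out, check=True)


def restore(user, path):
    """Run on the new machine after copying `path` over."""
    subprocess.run(restore_cmd(user, path), check=True)


if __name__ == "__main__":
    # Only print the commands here; call backup()/restore() to run them.
    user = "jesper@ifconfig.se"
    print(" ".join(backup_cmd(user)))
    print(" ".join(restore_cmd(user, "bayes-backup.txt")))
```

Comparing the token count before the backup and after the restore would show whether anything was lost on the way, which a raw mysqldump copy does not reveal by itself.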


Regards,
Jesper Wallin

Re: A few questions regarding Bayes in 3.4.0

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2011-11-21 at 23:31 +0100, Jesper Wallin wrote:
> I also noticed that my old database only had 11k tokens while the new
> one has about 60k (both the old and new servers have hapaxes enabled and
> were trained using a corpus of about 600 spam and 200 ham)

Is that "old" database the original one from the previous system, or old
as in "before learning from scratch", but *after* migrating the db?

I'd guess the latter. 11k tokens is terribly low, and as you just
noticed even less than learning a handful from scratch.

Are you sure the database conversion went cleanly?

