Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/01/27 00:05:45 UTC

Re: hapaxes and chi2

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Bill Sickles writes:
> Hi,
> I have searched the users@spamassassin and devel@spamassassin archives but
> didn't come up with a lot on this topic. Sorry if I missed something
> obvious but I am wondering if anyone is using hapaxes. Through googling I
> did see some references to a user or two turning this off as the database
> got too large (and slow?) so I am looking for some current opinions. My
> current database is 10M, learning to journal, bayes_journal_max_size
> 204800, bayes_expiry_max_db_size 300000. Currently I am not seeing any
> performance issues. Before I turn on hapaxes I am wondering what I might
> expect in terms of machine resource consumption (CPU/memory) and
> successful spam hit rates as this feature claims to increase hit rates.
> I realize that I will need more disk (8 to 10 times current size). Also,
> has anyone noticed an increase in FPs, since this feature uses
> words/tokens that only occur once?
> 
> Is anyone using chi-squared combining? The few references I did hit in my
> searching seemed to have this turned on with hapaxes.

Everyone is using chi-squared combining and hapaxes.  They both improve
matters quite a lot, especially hapaxes, and they've been the default
settings since the initial release of SpamAssassin 2.50.

I'm not sure it's even possible to turn them off anymore without
hacking the source ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFD2VXJMJF5cimLx9ARAmQuAKCshWoZPObDhaRC0EfUuMjNlHpJigCaAgdR
fkAqYRFKFupXYSSfdVswYXM=
=wncR
-----END PGP SIGNATURE-----
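
For readers unfamiliar with the two terms in this thread, here is a minimal Python sketch of the general technique. It is not SpamAssassin's code (SpamAssassin is written in Perl), and it assumes the Robinson-Fisher form of chi-squared combining. The function names, the smoothing constants `s` and `x`, and the use of raw token counts without normalizing for corpus size are all illustrative assumptions. A hapax is a token seen only once in training; with smoothing, its probability is pulled toward the neutral prior, so it counts as weak evidence rather than being discarded.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-squared variable; dof must be even."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)  # may underflow to 0.0 for very large x2
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def token_prob(spam_count, ham_count, s=1.0, x=0.5):
    """Smoothed spam probability for one token.

    s (strength of the prior) and x (prior probability) are illustrative
    values, not SpamAssassin's constants.
    """
    n = spam_count + ham_count
    raw = spam_count / n if n else x
    return (s * x + n * raw) / (s + n)


def combine(probs):
    """Combine per-token probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Clamp to avoid log(0) on extreme inputs.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0


if __name__ == "__main__":
    hapax = token_prob(1, 0)      # seen once, in spam only
    frequent = token_prob(20, 0)  # seen twenty times, in spam only
    print(hapax, frequent)
    print(combine([hapax, frequent, 0.9]))
```

In this sketch the hapax gets a probability of 0.75 while the frequently seen token gets about 0.98, which shows how a single sighting still contributes but with less weight. Strong evidence in both directions tends to cancel toward 0.5 under this combining scheme.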