Posted to users@spamassassin.apache.org by Bill Sickles <bi...@linuxseclabs.com> on 2006/01/25 21:23:40 UTC

hapaxes and chi2

Hi,
I have searched the users@spamassassin and devel@spamassassin archives but
didn't come up with a lot on this topic. Sorry if I missed something
obvious but I am wondering if anyone is using hapaxes. Through googling I
did see references to a user or two turning this off because the database
got too large (and slow?), so I am looking for some current opinions. My
current database is 10M, with learning to journal,
bayes_journal_max_size 204800, and bayes_expiry_max_db_size 300000.
Currently I am not seeing any performance issues. Before I turn on hapaxes,
I am wondering what I might expect in terms of machine resource
consumption (CPU/memory) and spam hit rates, as this feature claims to
increase hit rates. I realize that I will need more disk (8 to 10 times
the current size). Also, has anyone noticed an increase in FPs, since this
feature uses words/tokens that occur only once?

Is anyone using chi-squared combining? The few references I did hit in my
searching seemed to have this turned on with hapaxes.

TIA,

Bill

Re: hapaxes and chi2

Posted by Bill Sickles <bi...@linuxseclabs.com>.

On Wed, 25 Jan 2006, Matt Kettler wrote:

> Bill Sickles wrote:
> > Hi,
> > I have searched the users@spamassassin and devel@spamassassin archives but
> > didn't come up with a lot on this topic. Sorry if I missed something
> > obvious but I am wondering if anyone is using hapaxes.
>
> Pretty much everyone does. They're on by default. You have to explicitly disable
> hapaxes to not use them.
>
>
> > Is anyone using chi-squared combining? The few references I did hit in my
> > searching seemed to have this turned on with hapaxes.
>
> Again.. nearly everyone does.. this is the default.
>
Doh.
    bayes_use_hapaxes (default: 1)
I could have sworn I saw a '0' instead of a '1'. Well that's my humbling
moment for the day :)

Thanks Matt



Re: hapaxes and chi2

Posted by Matt Kettler <mk...@evi-inc.com>.

Bill Sickles wrote:
> Hi,
> I have searched the users@spamassassin and devel@spamassassin archives but
> didn't come up with a lot on this topic. Sorry if I missed something
> obvious but I am wondering if anyone is using hapaxes. 

Pretty much everyone does. They're on by default. You have to explicitly disable
hapaxes to not use them.


> Is anyone using chi-squared combining? The few references I did hit in my
> searching seemed to have this turned on with hapaxes.

Again.. nearly everyone does.. this is the default.
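
For reference, explicitly turning either feature off would look roughly
like the following in local.cf. This is a minimal sketch, assuming a
SpamAssassin 3.1-era install: bayes_use_hapaxes is the option quoted
earlier in the thread, while bayes_use_chi2_combining is an assumed option
name taken from that era's configuration docs and may not exist in other
releases.

    # Both settings default to 1 (enabled); 0 disables them.
    bayes_use_hapaxes         0
    # Assumed option name for chi-squared combining (check your version's docs).
    bayes_use_chi2_combining  0

Leaving both lines out keeps the defaults described above.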