Posted to users@spamassassin.apache.org by Jeff Rice <py...@finity.org> on 2009/04/07 19:55:17 UTC

Bayes training strategy

Hi,
I'm wondering about the best training strategy for the bayes engine. 
Most bayes classifiers seem to recommend that spam/ham be fed in either 
alternating or random order.  SA seems to suggest that all of one type be 
trained, and then all of the other type.  In my experience with other 
programs (CRM114, for example) this really hurts the accuracy.

What are your thoughts on this?  I've been randomizing my spam/ham when 
I train or retrain, but I don't have enough experience with SA to say if 
this is beneficial, useless, or detrimental.

Jeff


Re: Bayes training strategy

Posted by John Hardin <jh...@impsec.org>.
On Tue, 7 Apr 2009, Jeff Rice wrote:

> I'm wondering about the best training strategy for the bayes engine. 
> Most bayes classifiers seem to recommend that spam/ham be fed in either 
> alternating or random order.  SA seems to suggest that all of one type be 
> trained, and then all of the other type.  In my experience with other 
> programs (CRM114, for example) this really hurts the accuracy.
>
> What are your thoughts on this?  I've been randomizing my spam/ham when 
> I train or retrain, but I don't have enough experience with SA to say if 
> this is beneficial, useless, or detrimental.

<knowitall>

I would say order of training is fairly meaningless, as SA needs a minimum 
of 200 each of ham and spam before bayes starts scoring.

Train your 200 or more of each to get bayes started, then train FPs and 
FNs as they happen.
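
One rough way to check whether the 200-of-each minimum has been reached is 
to read the counts from the output of `sa-learn --dump magic`. The Python 
sketch below assumes the usual layout of that output, in which the `nspam` 
and `nham` lines carry the count in the third column. The function names, 
the constant, and the sample text are made up for illustration.

```python
import re

# Assumed threshold, taken from the "200 of each" figure above.
MIN_PER_CLASS = 200


def parse_magic_counts(dump_text):
    """Extract the nspam and nham counts from `sa-learn --dump magic` style text."""
    counts = {}
    for line in dump_text.splitlines():
        match = re.search(r"non-token data:\s+(nspam|nham)\s*$", line)
        if match:
            # Assumed column layout: the third whitespace-separated field is the count.
            counts[match.group(1)] = int(line.split()[2])
    return counts


def bayes_ready(counts, minimum=MIN_PER_CLASS):
    """Return True once both classes have at least `minimum` trained messages."""
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum


# Hypothetical sample output, not taken from a real database.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        412          0  non-token data: nspam
0.000          0        187          0  non-token data: nham
0.000          0      58321          0  non-token data: ntokens
"""
```

With the sample above, `bayes_ready(parse_magic_counts(SAMPLE))` is false, 
because only 187 ham messages have been learned.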

Autolearning can be helpful in large userbases if you keep an eye on it - 
it can magnify errors over time if you're not careful, and it's probably a 
good idea to leave autolearn turned off during initial training and until 
you get a feel for how things are being scored.

About all we recommend is keeping the ratio of ham:spam fairly balanced or 
perhaps somewhat skewed towards learning more spam, as spam is (sadly) the 
vast majority of most people's raw mail stream.
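
As a small illustration of the ratio advice, the Python sketch below flags 
a training set that has drifted too far from balance. The band limits 
(0.8 to 3.0 spam per ham) are arbitrary placeholders chosen for the 
example, not values recommended by SA, and the function name is 
hypothetical.

```python
def ham_spam_ratio_ok(nham, nspam, min_spam_per_ham=0.8, max_spam_per_ham=3.0):
    """Return True when training is roughly balanced or moderately spam-heavy.

    The default limits are illustrative assumptions, not SA settings.
    """
    if nham <= 0 or nspam <= 0:
        # Nothing meaningful to compare until both classes have been trained.
        return False
    spam_per_ham = nspam / nham
    return min_spam_per_ham <= spam_per_ham <= max_spam_per_ham
```

For example, 1000 ham and 1200 spam passes, while 1000 ham and 5000 spam 
does not.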

If you're manually training a large corpus, then the ham/spam order will 
only matter during the time you've learned one and are working on learning 
the other. That time window should be fairly short, and you should have 
autolearn turned off while you're doing that. In fact, you might want to 
temporarily disable bayes if you're going to be manually training a large 
corpus.
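
For anyone who still wants the randomized order described in the original 
question, a minimal Python sketch follows. It only builds a shuffled list 
of (path, label) pairs. The function name and the `seed` parameter are 
hypothetical, and how each pair is then handed to `sa-learn --spam` or 
`sa-learn --ham` is left out.

```python
import random


def shuffled_training_order(spam_paths, ham_paths, seed=None):
    """Return (path, label) pairs for both classes in random order."""
    labeled = [(path, "spam") for path in spam_paths]
    labeled += [(path, "ham") for path in ham_paths]
    # A private Random instance keeps the shuffle reproducible when a seed is given.
    rng = random.Random(seed)
    rng.shuffle(labeled)
    return labeled
```

Passing the same `seed` twice gives the same order, which makes a retraining 
run repeatable.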

</knowitall>

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Vista "security improvements" consist of attempting to shift blame
   onto the user when things go wrong.
-----------------------------------------------------------------------
  6 days until Thomas Jefferson's 266th Birthday