Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2005/02/11 21:51:43 UTC

Re[2]: question about bayes and awl.

Hello Matías,

Friday, February 11, 2005, 10:29:13 AM, you wrote:

>> It sounds to me like you're training a central Bayes database, while
>> the users are using their individual databases.
>> 
>> Individual databases are better (non-spam to the business department
>> may look like spam to the bio-chemistry teachers and v.v.), and a
>> central Bayes system would also work, but you need to use one method
>> or the other -- training a central database which isn't being used by
>> the users will do no good.

MLB> Let me see if I get this right. If I'm training the Bayesian filter I
MLB> can do two things:
MLB> 1) Train the filter with a central bayes db for all the SA users.
and have all users use that central database, or
MLB> 2) Train the users bayes db one by one, with the same info...
and have each user use their own database.
MLB> Something like a "for" with an "awk" output from /etc/passwd and the
MLB> ham/spam data will do the trick??

MLB> Is there not a third option, where I have the bayes db for the user and
MLB> also a central bayes db for all the users??

No, in current versions SA will use one and only one bayes database
for any given user.

MLB> Which one will give the best performance in terms of spam
MLB> detection??

That depends on how diverse your users are.

If one person's spam is another person's non-spam, then individual
databases are best, and you should NOT train the databases with the
same emails. After all, if an email is both spam and non-spam for
different users, what will you train it as?

In that case, you should isolate your spam and non-spam by user, and
train those emails into their own individual Bayes databases.
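
As a rough illustration of per-user training, here is a minimal Python sketch that only builds the sa-learn command lines, one pair per user. The corpus layout (<corpus_root>/<username>/spam and .../ham) and the function name are assumptions made up for this example; the database location assumes the default of ~/.spamassassin/bayes, so adjust it if your bayes_path differs.

```python
from pathlib import Path


def per_user_learn_commands(users, corpus_root):
    """Build sa-learn commands that train each user's own Bayes database.

    users       -- iterable of (username, home_directory) pairs
    corpus_root -- hypothetical directory holding <username>/spam and
                   <username>/ham, sorted by that user's own judgement
    """
    commands = []
    for username, home in users:
        # Assumed default per-user database location.
        dbpath = Path(home) / ".spamassassin" / "bayes"
        for kind in ("spam", "ham"):
            # Each user's database only ever sees that user's own mail.
            mail_dir = Path(corpus_root) / username / kind
            commands.append(
                ["sa-learn", f"--{kind}", "--dbpath", str(dbpath), str(mail_dir)]
            )
    return commands
```

The (username, home) pairs could come from pwd.getpwall() (the pw_name and pw_dir fields), and each command could then be passed to subprocess.run(cmd, check=True). Running the commands as the user concerned rather than as root is probably safer, so that the database files end up owned by the account that has to read them.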

However, if your users are fairly homogeneous, and in good agreement
about what is or isn't spam, then using one central database will a)
save lots of disk space, and b) allow user A to benefit from learning
user B's spam.
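
For comparison, a similar sketch for the central case: every user's mail is trained into one shared database. The corpus layout, the shared path and the function name are again hypothetical, and this only covers the training side. The users' SpamAssassin setup would also have to point at the same database (for example through a site-wide bayes_path setting) with file permissions that let every user access it.

```python
from pathlib import Path


def central_learn_commands(usernames, corpus_root, shared_dbpath):
    """Build sa-learn commands that train one shared Bayes database.

    usernames     -- iterable of user names
    corpus_root   -- hypothetical directory holding <username>/spam and
                     <username>/ham
    shared_dbpath -- hypothetical location of the central database
    """
    names = list(usernames)
    commands = []
    for kind in ("spam", "ham"):
        # All users' mail of one kind goes into the same database.
        dirs = [str(Path(corpus_root) / name / kind) for name in names]
        commands.append(
            ["sa-learn", f"--{kind}", "--dbpath", str(shared_dbpath), *dirs]
        )
    return commands
```

This produces two commands in total instead of two per user, which mirrors the trade-off above: less to store and maintain, but only sensible when the users agree on what counts as spam.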

Bob Menschel