Posted to users@spamassassin.apache.org by OliverScott <ol...@fhsinternet.com> on 2007/08/27 16:09:17 UTC

Some thoughts on Bayesian Setup...

Site Wide Bayes or Per User Bayes?

This is something I have been thinking about, and I thought I would share it
to see what other people think...

Site wide bayes has one database. Per User bayes has one per user or domain
(depending on how your server is configured). For example, if you have 40
users with a 10 MB bayes database each, then you either have to read and
write these to and from disk when an email comes in, or load all 400 MB of
data into memory.
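
The arithmetic behind that example can be sketched as follows. This is only a
rough illustration: the function name is hypothetical, and it assumes the
site-wide database is the same size as a single per-user one, which the post
does not claim (a shared database may well grow larger).

```python
# Rough storage arithmetic for the example above (40 users, 10 MB each).
# Assumption: a site-wide database is as large as one per-user database.

def bayes_footprint_mb(users: int, db_size_mb: float, site_wide: bool) -> float:
    """Total Bayes data that would have to be resident if every database
    were loaded at once."""
    return db_size_mb if site_wide else users * db_size_mb

per_user_total = bayes_footprint_mb(40, 10, site_wide=False)
site_wide_total = bayes_footprint_mb(40, 10, site_wide=True)
print(f"per-user: {per_user_total} MB, site-wide: {site_wide_total} MB")
```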

1. Most users don't know how, aren't allowed, or can't be bothered to train
Bayes. In most cases spamassassin is left to auto-train bayes.

2. Most people would consider the same emails to be SPAM. 90% of what I
think is spam would also be what you think is spam, with only a small
percentage of emails that we disagree on.

3. The emails which we would disagree on would probably be newsletters and
advertising emails from legitimate companies. Unwanted newsletters and
advertising emails which people have deliberately (possibly due to
stupidity) signed up to should not be trained as SPAM, but should be
manually blacklisted if necessary.

4. Site wide bayes saves disk space and more importantly it saves
significantly on disk IO or memory requirements.

5. A larger database leads to more accurate Bayesian identification - I am
guessing this is right?

Do you agree or disagree with the five above statements?

Based on the five above statements I would suggest that:
Site wide bayes is as good as if not slightly better (due to a potentially
larger single database) than per user bayes when it comes to identifying
SPAM emails.

1. What I think of as HAM emails could be widely different from what you
think of as HAM emails - if I were to sort your inbox by hand (without
knowing you personally) I would probably delete some good emails by mistake
while getting rid of the spam.

2. If a server has one customer who is a plumber and one who is an artist,
site wide bayes would learn that emails containing the words pipes or canvas
are good. The plumber will get emails with the word canvas in them tagged as
bayes_00 and vice versa.

3. If per user bayes is chosen then bayes_00 will only fire on emails
containing words which have occurred in emails which YOU have received in
the past and which scored low enough to be autolearned. 

4. If a HAM email is misclassified as SPAM then users are more likely to
report this to their admin or to train the filter themselves than they are
for SPAM emails which are not tagged. People will ignore a few spam slipping
through but not false positives!

Do you agree or disagree with the four above statements?

Based on the four above statements I would suggest that:
Per User bayes is better than Site Wide bayes when it comes to correctly
identifying HAM emails.
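
The plumber/artist case in statement 2 can be put into numbers with a small
sketch. The counts are invented for illustration, and the per-token formula is
a simplified textbook estimate, not SpamAssassin's actual computation.

```python
from collections import Counter

def token_spam_prob(token, spam_counts, n_spam, ham_counts, n_ham):
    """Simplified per-token spam probability: the share of the token's
    occurrence rate that comes from spam. Unseen tokens are neutral (0.5)."""
    spam_rate = spam_counts[token] / n_spam if n_spam else 0.0
    ham_rate = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5
    return spam_rate / (spam_rate + ham_rate)

# Invented counts: each user has learned 100 ham and 100 spam messages.
plumber_ham = Counter({"pipes": 60})
artist_ham = Counter({"canvas": 60})
plumber_spam = Counter({"canvas": 10})
artist_spam = Counter({"canvas": 10})

# Per-user: the plumber has never seen "canvas" in ham, only in spam.
per_user = token_spam_prob("canvas", plumber_spam, 100, plumber_ham, 100)

# Site-wide: the artist's ham makes "canvas" look good for everybody.
site_wide = token_spam_prob("canvas", plumber_spam + artist_spam, 200,
                            plumber_ham + artist_ham, 200)
print(per_user, site_wide)
```

With these made-up numbers the plumber's own database treats "canvas" as
strongly spammy, while the merged database treats it as mostly hammy.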


If my various assumptions are correct then perhaps there should be a third
type of bayes to choose from in spamassassin? Namely one where:
SPAM tokens are stored on a server wide basis - can be a LARGE database if
this helps
HAM tokens are stored on a per user basis - probably only needs a 1-2 MB file
per user.
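
A minimal sketch of what such a split store might look like, assuming a
simplified per-token probability. Nothing like this exists in SpamAssassin as
far as this thread says; the class and method names are hypothetical.

```python
from collections import Counter

class SplitBayesStore:
    """Hypothetical store: spam token counts are shared by the whole
    server, ham token counts are kept separately for each user."""

    def __init__(self):
        self.spam_tokens = Counter()  # shared by all users
        self.n_spam = 0               # spam messages learned server-wide
        self.ham_tokens = {}          # user -> Counter of ham tokens
        self.n_ham = Counter()        # user -> ham messages learned

    def learn_spam(self, tokens):
        self.spam_tokens.update(set(tokens))
        self.n_spam += 1

    def learn_ham(self, user, tokens):
        self.ham_tokens.setdefault(user, Counter()).update(set(tokens))
        self.n_ham[user] += 1

    def token_prob(self, user, token):
        """Spam probability of one token for one user (0.5 if unseen)."""
        spam_rate = self.spam_tokens[token] / self.n_spam if self.n_spam else 0.0
        user_ham = self.ham_tokens.get(user, Counter())
        ham_rate = user_ham[token] / self.n_ham[user] if self.n_ham[user] else 0.0
        if spam_rate + ham_rate == 0:
            return 0.5
        return spam_rate / (spam_rate + ham_rate)

store = SplitBayesStore()
store.learn_spam(["cheap", "canvas", "prints"])
store.learn_ham("artist", ["canvas", "gallery"])
store.learn_ham("plumber", ["pipes", "invoice"])
print(store.token_prob("artist", "canvas"), store.token_prob("plumber", "canvas"))
```

In this toy run the shared spam data flags "canvas" for the plumber, while the
artist's own ham pulls the same token back to neutral.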

Any comments?

PS. I am not up to coding anything like this myself so don't bother
suggesting that I try it and report back!


Re: Some thoughts on Bayesian Setup...

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> On Mon, 27 Aug 2007, OliverScott wrote:
> >1. Most users don't know how, aren't allowed, or can't be bothered to train
> >Bayes. In most cases spamassassin is left to auto-train bayes.

On 27.08.07 09:46, Chris St. Pierre wrote:
> Disagree.  With proper training -- or if you make it trivially easy,
> like GMail/Yahoo's "Report as Spam" links -- then users will train Bayes.

Yes, but according to what YOU mention in point 2, this may be
counterproductive...

> >2. Most people would consider the same emails to be SPAM. 90% of what I
> >think is spam would also be what you think is spam, with only a small
> >percentage of emails that we disagree on.

> Strongly disagree.  Many users consider anything they don't want to be
> spam, including all sorts of solicited email.

The fact that users don't distinguish between spam and mail they subscribed
to may speak against a personalized BAYES database, since some users will
taint their database and it will become less and less effective. Of course,
their reporting should go to a personal bayes, not the shared one. If they
have to teach a bayes database, they should teach their own.

However users should be well-informed that "report as spam" may be
problematic in such ways.

> >4. Site wide bayes saves disk space and more importantly it saves
> >significantly on disk IO or memory requirements.
> 
> Not sure on this one.  None of the performance statistics I gather saw
> any noticeable hit when I switched from sitewide to per-user.

A shared database will take less disk space (and less memory when loaded) and
will probably stay in memory most of the time, so it won't need to be loaded
very often. However, I don't think this will help efficiency much...
 
> >5. A larger database leads to more accurate Bayesian identification - I am
> >guessing this is right?
> 
> "It depends." :)  With Bayes poisoning all the rage, it sometimes
> helps to avoid a really huge database.

Someone mentioned here that bayes poisoning is a myth... I'm not sure how
much truth there is in that, but my BAYES filter has worked well for some
time...

> So what's important is having a well-tuned database -- not necessarily
> a large database.

A large well-tuned database is much better than a small fine-tuned database.
For many users it has to be larger, because many users receive many different
kinds of e-mail.

> If Joe and Jane User get different kinds of mail, disagree on what spam
> is, etc., then they should have different databases.  (What if Joe
> receives a legitimate newsletter on stock tips, for instance?)

How can Jane get a legitimate newsletter on stock tips when she didn't ask
for it? How can it be legitimate if she does not want it?
(Provided she did what she could to avoid receiving it.)

> With a diverse user base, any sort of one-size-fits-all filtering is
> bound to increase FPs and FNs.

Yes, however the default scores for the BAYES rules are not that big, so a
shared database won't change the score that much :)

Also, note that a single word will never change the BAYES score that much, so
I would not be too worried that the one word "viagra" would change much in
the final score.
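
The limited effect of a single token can be illustrated with a small sketch.
It uses a plain naive-Bayes combination in log-odds form with equal priors,
which is an assumption made for illustration; SpamAssassin's own combining
method is not reproduced here, and the token probabilities are invented.

```python
import math

def combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    by summing their log-odds, assuming independent tokens and equal priors."""
    log_odds = sum(math.log(p / (1.0 - p)) for p in probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Invented example: 20 mildly hammy tokens, then the same mail plus one
# strongly spammy token such as "viagra".
hammy_tokens = [0.3] * 20
without_viagra = combine(hammy_tokens)
with_viagra = combine(hammy_tokens + [0.99])
print(without_viagra, with_viagra)
```

Under these assumptions the extra token raises the combined probability, but
the message still lands far below any spam threshold.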

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Due to unexpected conditions Windows 2000 will be released
in first quarter of year 1901

Re: Some thoughts on Bayesian Setup...

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.
On Mon, 27 Aug 2007, OliverScott wrote:

> 1. Most users don't know how, aren't allowed, or can't be bothered to train
> Bayes. In most cases spamassassin is left to auto-train bayes.

Disagree.  With proper training -- or if you make it trivially easy,
like GMail/Yahoo's "Report as Spam" links -- then users will train Bayes.

> 2. Most people would consider the same emails to be SPAM. 90% of what I
> think is spam would also be what you think is spam, with only a small
> percentage of emails that we disagree on.

Strongly disagree.  Many users consider anything they don't want to be
spam, including all sorts of solicited email.  I had one user who,
rather than turn off email notifications from Facebook, reported them
as spam until they started getting blocked.  Since we've implemented a
system where reporting a message as spam automatically blacklists the
sender for the reporting user, I've had a number of reports of
students blacklisting their professors because they didn't want some
notification they got sent.

Perhaps you and I might agree on what spam is, but Joe User does _not_.

> 3. The emails which we would disagree on would probably be newsletters and
> advertising emails from legitimate companies. Unwanted newsletters and
> advertising emails which people have deliberately (possibly due to
> stupidity) signed up to should not be trained as SPAM, but should be
> manually blacklisted if necessary.

Again, you and I would probably agree on this situation, but you and Joe
User (or I and Joe User) would not.

> 4. Site wide bayes saves disk space and more importantly it saves
> significantly on disk IO or memory requirements.

Not sure on this one.  None of the performance statistics I gather saw
any noticeable hit when I switched from sitewide to per-user.

> 5. A larger database leads to more accurate Bayesian identification - I am
> guessing this is right?

"It depends." :)  With Bayes poisoning all the rage, it sometimes
helps to avoid a really huge database.  A few months ago, we started
over and, for the first week or two, spam went up, but then it dropped
to below previous levels; cleaning out the crap can help from
time-to-time.

So what's important is having a well-tuned database -- not necessarily
a large database.  If Joe and Jane User get different kinds of mail,
disagree on what spam is, etc., then they should have different
databases.  (What if Joe receives a legitimate newsletter on stock
tips, for instance?)

> 1. What I think of as HAM emails could be widely different from what you
> think of as HAM emails - if I were to sort your inbox by hand (without
> knowing you personally) I would probably delete some good emails by mistake
> while getting rid of the spam.

I again disagree.  We retain all of the messages that users report as
FPs and FNs, and, in general, the FPs are more obvious and certainly
easier to agree on.  I would never use the FNs as a spam corpus, for
aforementioned reasons, but I think the FPs would be pretty reliable.

> 2. If a server has one customer who is a plumber and one who is an artist,
> site wide bayes would learn that emails containing the words pipes or canvas
> are good. The plumber will get emails with the word canvas in them tagged as
> bayes_00 and vice versa.

Agree, mostly.

If you have one customer who is a day trader and one who works with
Pfizer Canada, then they'll constantly be fighting each other because
the former doesn't want spam about Viagra from our neighbors to the
north and the latter doesn't want spam about the latest stock that's
about to blow up.  (This is obviously a contrived example, but you get
the idea.)

With a diverse user base, any sort of one-size-fits-all filtering is
bound to increase FPs and FNs.

> 3. If per user bayes is chosen then bayes_00 will only fire on emails
> containing words which have occurred in emails which YOU have received in
> the past and which scored low enough to be autolearned.

..or were expressly learned by the user.  Agree.

> 4. If a HAM email is misclassified as SPAM then users are more likely to
> report this to their admin or to train the filter themselves than they are
> for SPAM emails which are not tagged. People will ignore a few spam slipping
> through but not false positives!

For some value of "few," I agree.

> SPAM tokens are stored on a server wide basis - can be a LARGE database if
> this helps
> HAM tokens are stored on a per user basis - probably only needs a 1-2 MB file
> per user.

I think users would be just as adept at poisoning such a split
database as they would be at poisoning a unified, site-wide database.
In any reasonably diverse user base, what my fellow user thinks is
spam should not affect what I get in my mailbox.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University