Posted to users@spamassassin.apache.org by ks...@ndr.de on 2005/04/27 12:50:04 UTC

Site-wide training with same message, different recipients and different classification

Hi,

We're evaluating SpamAssassin 3.02 on a Linux mail gateway.
The mailboxes are not on this gateway but on a Lotus Notes server,
to which the mail is forwarded.
Training is done by copying mails into a separate mail folder,
which is emptied via POP3 using fetchmail from the SA gateway.
The headers are then modified via procmail.

Overall we're happy with SA, but some questions still
arise from time to time.

So, the current question is:
What happens if two users receive the same mail
but classify it differently (one as ham and
the other as spam), and both feed it back into SA to learn it?

Remember that learning is only done as the user under
whose privileges SA runs, so it's not user-specific.

In which direction will the score for the next mail
from that sender be pushed? Up or down?
Spam or ham?

Thanks for any hints,
Jan-Uwe


Re: Site-wide training with same message, different recipients and different classification

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello ks,

Wednesday, April 27, 2005, 3:50:04 AM, you wrote:

> We're evaluating SpamAssassin 3.02 on a Linux mail gateway. The
> mailboxes are not on this gateway but on a Lotus Notes server, to
> which the mail is forwarded. Training is done by copying mails into
> a separate mail folder, which is emptied via POP3 using fetchmail
> from the SA gateway. The headers are then modified via procmail.

> Overall we're happy with SA, but some questions still arise from
> time to time.

> So, the current question is:
> What happens if two users receive the same mail but classify it
> differently (one as ham and the other as spam), and both feed it
> back into SA to learn it?

> Remember that learning is only done as the user under whose
> privileges SA runs, so it's not user-specific.

So if I understand you correctly, there's a generic userid which is
used for both scoring and for learning, which has nothing to do with
the users who receive that email.

> In which direction will the score for the next mail from that sender
> be pushed? Up or down? Spam or ham?

Is the mail identical down to the Message-ID?

If so, then the last learning "wins", since to the best of my
knowledge Bayes tracks messages by Message-ID.

If both users put their ham or spam into the learning queue at about
the same time, and the system just happens to learn the spam queue
first, the message will be learned as spam, and when the system then
learns the ham queue, the message will be unlearned as spam and
learned as ham.
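
The ordering effect can be sketched with a toy model. This is only an illustration, assuming the learner keeps a record keyed on the Message-ID and forgets an earlier, opposite classification before learning the new one. The function name, the queue layout, and the Message-ID are hypothetical, and this is not SpamAssassin's actual code or storage format.

```python
# Toy model only: not SpamAssassin's code or database format.
def process_queues(queues):
    """Apply learning queues in order and return {message_id: final_label}."""
    seen = {}  # message_id -> label the message is currently learned as
    for label, message_ids in queues:
        for mid in message_ids:
            if seen.get(mid) == label:
                continue  # already learned with this label, nothing to do
            # An earlier opposite label is simply replaced ("unlearned").
            seen[mid] = label
    return seen


mid = "<abc123@example.org>"  # hypothetical Message-ID shared by both users

spam_first = process_queues([("spam", [mid]), ("ham", [mid])])
ham_first = process_queues([("ham", [mid]), ("spam", [mid])])

print(spam_first[mid])  # ham
print(ham_first[mid])   # spam
```

Whichever queue is processed last determines the final label in this model.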

However, the impact will probably be small -- Bayes is statistical,
and while the From header has some weight, it's only one token, of
which several/many are used to determine whether an email is ham or
spam.
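
To illustrate why a single token tends to carry limited weight, here is a rough sketch that combines per-token spam probabilities by summing log-odds. The probabilities are made up for illustration, the function name is hypothetical, and this simple combiner is a stand-in: SpamAssassin's real combining method is different and is not reproduced here.

```python
import math


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities by summing their log-odds."""
    log_odds = sum(math.log(p / (1.0 - p)) for p in token_probs)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up values: ten other tokens that each lean towards spam.
other_tokens = [0.9] * 10

# Same message, with the From token learned as spammy versus hammy.
from_as_spam = combined_spam_probability(other_tokens + [0.9])
from_as_ham = combined_spam_probability(other_tokens + [0.1])

print(from_as_spam, from_as_ham)
```

With these invented numbers, flipping the From token lowers the combined result only marginally. With fewer tokens, or tokens closer to neutral, the effect of one token would be larger.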

An almost identical message (same From, same path, same or similar
subject, To, and body content) would tend to be pushed in that
direction (ham in my example). A reasonably different message (same
From, same path, same or similar To, but a mostly different subject
and a significantly different body) might go in either direction.

I track spam/ham on a system with hundreds of domains and all of their
users, with one central Bayes database, and I've not seen any problems
caused by this type of sequence learning.

Bob Menschel




Re: Site-wide training with same message, different recipients and different classification

Posted by Kelson <ke...@speed.net>.
ks.service.int2@ndr.de wrote:
> In which direction will the score for the next mail
> from that sender be pushed? Up or down?
> Spam or ham?

It depends on which gets run first.  Each time sa-learn runs on the same
message, it overrides the previous training.  (That way you don't have
to run sa-learn --forget every time something needs to be reclassified.)

So if user A trains it as spam, and user B trains it as ham, and user A
goes first, user B will override it and it will be treated as ham.

Now that's if one message has two recipients.  If, on the other hand, 
the two users get separate messages with the same content (but different 
message IDs), SA will learn from both and the two sets of data will 
balance each other out.
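
The difference between the two cases can be sketched with a toy token counter. This is a simplified illustration, assuming training is tracked per message ID and that relearning with the opposite label first removes the earlier token counts. The class, its methods, and the message IDs are hypothetical and do not reflect SpamAssassin's internals.

```python
from collections import Counter


class ToyBayes:
    """Toy token counter; not SpamAssassin's real implementation."""

    def __init__(self):
        self.seen = {}  # message_id -> (label, tokens) from the last training
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, message_id, label, text):
        tokens = set(text.lower().split())
        if message_id in self.seen:
            old_label, old_tokens = self.seen[message_id]
            if old_label == label:
                return  # already learned this way
            # Same message, opposite label: undo the earlier training first.
            self.counts[old_label].subtract(old_tokens)
        self.counts[label].update(tokens)
        self.seen[message_id] = (label, tokens)

    def token_counts(self, token):
        """Return (spam_count, ham_count) for one token."""
        return self.counts["spam"][token], self.counts["ham"][token]


body = "special offer inside"

# Case 1: one message, two recipients, same message ID.
same_id = ToyBayes()
same_id.learn("<id-1@example.org>", "spam", body)
same_id.learn("<id-1@example.org>", "ham", body)
print(same_id.token_counts("offer"))      # (0, 1): the later training wins

# Case 2: same content, but separate messages with different IDs.
separate_ids = ToyBayes()
separate_ids.learn("<id-1@example.org>", "spam", body)
separate_ids.learn("<id-2@example.org>", "ham", body)
print(separate_ids.token_counts("offer"))  # (1, 1): the two balance out
```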

-- 
Kelson Vibber
SpeedGate Communications <www.speed.net>