Posted to users@spamassassin.apache.org by Stand H <hs...@yahoo.com> on 2006/07/27 06:27:09 UTC

Re: Non-english mail and Bayes


--- Kelson <ke...@speed.net> wrote:

> Stand H wrote:
>  > I'm not sure if I can feed non-english email to
>  > sa-learn.
> 
> Bowie Bailey wrote:
> > Let it learn as much ham and spam as you can manage
> > and don't worry about languages.
> 
> One thing to look out for:  Try to get both ham and
> spam for each language.  The last thing you want is
> for Bayes to decide that common, let's say, German
> words are signs of spam because the only German text
> it's ever seen is spam.
> 
> As Bowie points out, Bayes doesn't care about the
> languages themselves -- it's the tokens (for practical
> purposes, the words).  It doesn't care whether
> "Necesito ir a casa a las dos y media." is Spanish, it
> only cares whether it's seen the words "Necesito",
> "ir", "casa", etc. more often in ham or in spam.
> 
> -- 
> Kelson Vibber
> SpeedGate Communications <www.speed.net>
> 
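
To make sure I follow the point about tokens, here is a
minimal Python sketch of the idea. It is not
SpamAssassin's real tokenizer or scoring: the tokens(),
learn() and seen_more_in_ham() helpers are hypothetical
names of mine, and the "spam" sentence is made up for
illustration. It only counts how often each word was
seen in ham and in spam, without knowing the language.

```python
from collections import Counter

PUNCT = '.,!?"'

def tokens(text):
    # Hypothetical tokenizer: split on whitespace, strip basic
    # punctuation, lowercase. Real filters are more elaborate.
    stripped = (word.strip(PUNCT) for word in text.split())
    return [word.lower() for word in stripped if word]

ham = Counter()   # token -> times seen in ham
spam = Counter()  # token -> times seen in spam

def learn(counter, text):
    # Record every token of one message in the given corpus counter.
    counter.update(tokens(text))

def seen_more_in_ham(token):
    # True only if the token was seen strictly more often in ham.
    return ham[token] > spam[token]

learn(ham, "Necesito ir a casa a las dos y media.")
learn(spam, "Necesito dinero rapido")  # made-up spam text
```

With these two messages, "casa" has only been seen in
ham and "dinero" only in spam, while "necesito" is seen
equally in both, which is why having both ham and spam
per language matters.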
Hi Kelson and Bowie,

Thank you for your reply.

If the sender's mail client doesn't encode the message
properly, should I still train on it?

Some users receive messages with a subject like
¿‚Ü‚©‚₽‚ç‚Ü‚â‚Í, which is treated as illegal and
hits SUBJ_ILLEGAL_CHARS. When the subject is encoded
properly, it looks like
=?iso-2022-jp?B?GyRCJEokKyQ/JDckYyRpGyhC?=

And the body is encoded as =82=BF=82=DC=82=A9

So in these cases, does it make sense to train on the
message? I'm curious how Bayes works with these illegal
characters and encoded characters.
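
For reference, this is how I looked at what the encoded
forms contain, using only the Python standard library.
It is a rough sketch, not a description of what
SpamAssassin does internally. The subject is written
with the leading "=?" that RFC 2047 requires, the
decode_subject() helper is my own, and treating the body
bytes as Shift_JIS is my assumption, since the message's
charset header is not shown here.

```python
import quopri
from email.header import decode_header

def decode_subject(raw):
    # Decode an RFC 2047 encoded Subject header into text.
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            # Fall back to ASCII when no charset is declared.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)

subject = decode_subject("=?iso-2022-jp?B?GyRCJEokKyQ/JDckYyRpGyhC?=")

# Quoted-printable turns "=82" back into the single byte 0x82.
body_bytes = quopri.decodestring(b"=82=BF=82=DC=82=A9")

# ASSUMPTION: the body charset is Shift_JIS (not stated in the message).
body_text = body_bytes.decode("shift_jis")
```

The properly encoded subject decodes to six Japanese
characters, and the six raw body bytes become three
characters under the Shift_JIS assumption. A badly
encoded subject carries the same kind of raw bytes with
no charset declared, so there is nothing to tell a
program how to decode them.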

Another thing: say a friend forwards an email to me (he
just wants to let me know the info in the message) and
I want to train his email as ham. Should I just train it
as-is, or remove some of the headers first?

Thank you.
Stand
