Posted to users@spamassassin.apache.org by Axel Werner <ma...@awerner.homeip.net> on 2007/12/05 17:00:41 UTC

the opposit of "ok_locales" ??

I'm looking for an "opposite" parameter to "ok_locales", one that
makes SpamAssassin mark all incoming mail in specific charsets or
language settings (locales) as spam by default.

For example: since I live in western Europe, I never expect mail from
eastern Europe, Asia, Africa or the like, especially if it
uses the local charset and so on.

So I'm looking for a parameter that does

"if locale is not western-europe or western, mark mail as spam"

Is there something I did not find in all the manuals and Google searches?


greets
Axel

Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
jidanni@jidanni.org wrote:
> M> or does he think
> All we know is users don't think like we do. http://www.useit.com/alertbox/
>   
Fundamentally, SpamAssassin is a tool written by system administrators,
for system administrators and advanced users.

Like it or not, the project's primary goal has always been to be a good
spamfilter. Clear documentation is nice, but it's always been a
secondary priority to making it work well. Given that the project has
limited resources, there's often a choice to be made between writing
features that improve SA's accuracy, and trying to make the
documentation easy to read.

You've pointed out some good flaws in the docs, and some things which
IMHO are awfully pedantic given the general poor state of the rest of
the documentation.
> M> how will you benefit from contact with this broader spectrum if
> M> they're emailing you in a character set you can't read?
>
> * Sternstone recalls: I was only 20 years old and had my name in Tamil
> in my .signature or something. Well, it turns out Dr. Futzweiler, may
> he rest in peace, had been plagued by Russian spam, and was using
> "Spam Assassin", which had a bug or something that clobbers more than
> just Russian. Anyway, he never got the mail and I ended up joining the
> Malawi space program, and the rest is history.

You're making a good argument not to use the ok_locales or ok_languages
feature at all. Personally I agree with that. But if you're gonna filter
this way, IMO, you may as well filter off everything you can't read. But
of course, there are some cases where that might be difficult, which is
why I've created:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5743

side note: odds are the above problem was ok_languages, not ok_locales.
ok_languages isn't very precise, but ok_locales is. This is because it's
hard to accurately guess language based on letter groupings like
ok_languages does. It's really easy to accurately figure out that the
Cyrillic character set is Russian.

Also, bear in mind that there's not many locales in SA (unlike
languages). There's only 6 of them. en, ja, ko, ru, th, zh. That's it.
There's no fr, se, de, or whatever else. So, at least in the case of
ok_locales, which this thread is about, it's pretty easy to list
all-but-one.
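
Listing all-but-one can be done mechanically. The following is a minimal Python sketch, not part of SpamAssassin: the helper name `ok_locales_excluding` is hypothetical, and the locale codes are simply the six named above.

```python
# The six locale codes named in this post.
SA_LOCALES = ["en", "ja", "ko", "ru", "th", "zh"]


def ok_locales_excluding(unwanted):
    """Build an ok_locales line listing every locale except the unwanted ones.

    Hypothetical helper for illustration only; it is not a SpamAssassin API.
    """
    unknown = set(unwanted) - set(SA_LOCALES)
    if unknown:
        raise ValueError(f"not a known locale: {sorted(unknown)}")
    kept = [loc for loc in SA_LOCALES if loc not in unwanted]
    return "ok_locales " + " ".join(kept)


# Everything except Russian:
print(ok_locales_excluding(["ru"]))
```

The resulting line can be pasted into a configuration file, which gives the "everything but X" effect without a separate option.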





Re: the opposit of "ok_locales" ??

Posted by ji...@jidanni.org.
M> or does he think
All we know is users don't think like we do. http://www.useit.com/alertbox/

M> how will you benefit from contact with this broader spectrum if
M> they're emailing you in a character set you can't read?

* Sternstone recalls: I was only 20 years old and had my name in Tamil
in my .signature or something. Well, it turns out Dr. Futzweiler, may
he rest in peace, had been plagued by Russian spam, and was using
"Spam Assassin", which had a bug or something that clobbers more than
just Russian. Anyway, he never got the mail and I ended up joining the
Malawi space program, and the rest is history.

* "Just don't send it in Russian" was the last thing he ever said to me.
I didn't. I don't know what my Taiwanese ISP was appending, but it
wasn't Russian. It was many years later when I read he was now living
in Greenland with the Duchess of Nabisco. I would have been jealous
had I not married Google.

* If AustinPowers is the right address, why does it keep replying
"Scorin' too high, baby"? I suppose he only likes Russian girls, like
in the movie.

* I tried to send the IRAN files to Director Snortscough, but
apparently his mailbox was being jammed by the RUSSIANS or something.
He went ahead and pushed the red button, and the rest is BOOM

Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2007-12-06 at 22:42 -0500, Matt Kettler wrote:
> jidanni@jidanni.org wrote:

> And those who really want this effect can just list every locale except
> the one they dislike, if that's really what they want.

Given the really small number of locales (character sets), this isn't
unreasonable to expect from the user or admin, IMHO.


> > Anyway, currently it's not even like one could just use "--" to obtain
> > "+". And even if it was, our basic user is still looking for his
> > blacklist_locales.
> 
> Is he really? Or does he think ok_locales = whitelist_locales?

It seems even jidanni is confusing these...

ok_locales is *not* a whitelist. The fundamental difference is, that
whitelists result in a negative score, since it is a strong sign of
being not spammy. There is no whitelist part. Thus, there is no
blacklist counterpart. It just doesn't come in pairs. ok_locales is a
rather neutral setting.

With something like a charset, it just is not a strong sign for a ham.
Unless someone positively can confirm, he never received a spam using a
western charset...


While I do see that not_ok_locales *might* serve as a shortcut, the
corresponding ok_locales line usually won't be any longer. However, it
might come slightly more natural to the user, who failed to read the
documentation and examples (sic) carefully...

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sat, 2007-12-08 at 03:40 +0100, Karsten Bräckelmann wrote:
> It's a rather twisted logic. You don't define what's good or bad (that
> again would be a black/whitelist), you leave out what's bad...

Hmm, maybe not so twisted after all. ok_locales equals "these are the
charset classes I probably can read" from a user's point of view. Could
it be more positive? [1]

  guenther


[1] as in UI design, and avoiding boolean options with negations



Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
Karsten Bräckelmann wrote:
>
>>> Maybe the devs can briefly explain how the charset is being determined.
>>> Or at least, where exactly in the code one could find it...
>>>       
>
> Matt, also, I got a feeling, that logic is what the OP is actually
> about. He does not want to leave out what he wants to be scored on. But
> (positively) define it.
>   

That much is easy. It's done by looking at various character-set tags or
encoding marks in the message.  These explicitly specify which character
set to use when interpreting the text.

Re-quoting myself from 11/26 (and elaborating with more examples):

CHARSET_FARAWAY:
Underlying eval function: check_for_faraway_charset() in MIMEEval.pm
Detects based on: character set in the mime Content-Type: of the message
header.

Example (in a message header): 
Content-Type: text/plain;
	charset="iso-2022-jp"

which specifies Japanese text for a single-part message.
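
To see which charset a message declares at the top level, the header can be read with Python's standard `email` package. This is only a sketch of the idea, not SpamAssassin's own code (which is Perl); the sample message and the function name `header_charset` are made up for illustration.

```python
from email import message_from_string

# Made-up sample message with a folded Content-Type header.
RAW = (
    "Subject: test\n"
    "Content-Type: text/plain;\n"
    '\tcharset="iso-2022-jp"\n'
    "\n"
    "body\n"
)


def header_charset(raw_message):
    """Return the charset named in the top-level Content-Type header, or None."""
    return message_from_string(raw_message).get_content_charset()


print(header_charset(RAW))
```

`get_content_charset()` returns the value lowercased, so comparisons against a list of "faraway" charsets do not need to worry about case.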

MIME_CHARSET_FARAWAY
Underlying eval function: check_for_mime('mime_faraway_charset') in
MIMEEval.pm
Detects based on: character set in the mime Content-Type: of the message
attachments

Example (in a mime-section header): 
Content-Type: text/plain;
	charset="iso-2022-jp"

which specifies Japanese text for this part of a multi-part message.
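
The per-part case can be sketched the same way by walking the MIME tree. Again this is a rough Python illustration under assumptions, not the check_for_mime code itself; the sample message and the name `part_charsets` are hypothetical.

```python
from email import message_from_string

# Made-up two-part message; the second part declares a Japanese charset.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: text/plain;\n"
    '\tcharset="iso-2022-jp"\n'
    "\n"
    "part two\n"
    "--XX--\n"
)


def part_charsets(raw_message):
    """List the declared charset of every non-multipart section, in order."""
    msg = message_from_string(raw_message)
    return [p.get_content_charset() for p in msg.walk() if not p.is_multipart()]


print(part_charsets(RAW))
```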



HTML_CHARSET_FARAWAY
Underlying eval function: html_charset_faraway() in HTMLEval.pm
Detects based on: character set in the Content-Type: of a meta
http-equiv tag embedded in HTML.

Example:
<META http-equiv=Content-Type content="text/html; charset=iso-2022-jp">

which specifies Japanese text for this html document.
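
A rough way to pull the same information out of an HTML body, using Python's standard `html.parser`. This is an illustrative sketch only; the class and function names are hypothetical and the matching is deliberately simple, so it should not be read as what HTMLEval.pm actually does.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collect charsets declared in <meta http-equiv=Content-Type ...> tags."""

    def __init__(self):
        super().__init__()
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        for piece in (attr.get("content") or "").split(";"):
            name, _, value = piece.strip().partition("=")
            if name.lower() == "charset" and value:
                self.charsets.append(value.strip().lower())


def html_meta_charsets(html_text):
    """Return every charset named in a Content-Type meta tag."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.charsets


SAMPLE = '<META http-equiv=Content-Type content="text/html; charset=iso-2022-jp">'
print(html_meta_charsets(SAMPLE))
```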


CHARSET_FARAWAY_HEADER
Underlying eval function: check_for_faraway_charset_in_headers()
Detects based on: Embedded character encoding marks in the Subject and
From: headers. You'd have to look at the raw message source to see it,
but it's generally things like this somewhere in the header:

=?GB2312?

Which indicates encoded simplified Chinese text follows.
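
These marks follow the `=?charset?encoding?text?=` pattern, so the charset can be picked out with a regular expression. The following Python sketch assumes that simplified pattern; the function name and the sample Subject value are made up, and it is not the logic SpamAssassin uses.

```python
import re

# Simplified encoded-word pattern: =?charset?B-or-Q?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[bBqQ]\?[^?\s]*\?=")


def encoded_word_charsets(header_value):
    """Return the charsets of all encoded words in a raw header value, lowercased."""
    return [charset.lower() for charset in ENCODED_WORD.findall(header_value)]


# Made-up Subject value carrying one GB2312 encoded word.
print(encoded_word_charsets("=?GB2312?B?xOO6ww==?="))
```

A header with no encoded words simply yields an empty list, which is why a plain ASCII Subject never trips a check like this.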



Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sat, 2007-12-08 at 02:05 +0100, Stefan Jakobs wrote:
> On Saturday 08 December 2007 01:15, Karsten Bräckelmann wrote:

> > > Ok. My fault I mistook charsets with country codes. But replace se with
> > > ru or ch or greek7. The result is the same. You want one charset to be
> > > considered as "not ham" and you have to give the whole list to the
> > > parameter. And I think it is a long and ugly to read list (see:
> > > http://www.iana.org/assignments/character-sets)
> >
> > Yes, that list indeed is ugly. However, that is *not* what we are
> > talking about. The list of valid locales for ok_locales can be found in
> > the docs -- and totals 6, including en...
> 
> Only 6? Yes, I found it in the docs. (Yeah, I know: RTFM before you ask 
> around). I apologize; with only 6 charsets it is not useful to have a 
> not_ok_locales option.

You just looked at the wrong docs... ;)

Basically, the coarse distinction ok_locales boils down to, from a user's
point of view, is "can I decipher that?". As in, I don't speak Chinese,
and I have a hard time telling Chinese apart from Japanese. I don't speak
Swedish either, but I do recognize the symbols. And with some luck, I'll
even understand a couple of words... [1]


> > > I only want to say that there can be a situation in which you only know
> > > that you don't want to consider the XXX charset as an indicator for ham.
> >
> > Despite its name, ok_locales is *not* about certain charsets being "an
> > indicator for ham". The opposite is true. It does not assign a negative
> > score. All it does is assigning a positive score for charsets "not in
> > the ok list".
> 
> Maybe I should have said: "an indicator for NOT spam" ? Sh.., there are too 
> many double negations and I'm too tired for that.

not spam == ham

Do you actually mean "not an indicator for ham/spam/anything"? Cause
that's what ok_locales is -- whatever is in that list is being treated
neutral, neither taken as an indicator for ham nor spam. Anything that
is *not* in that list, however, is an indicator for spam.
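
The neutral-versus-penalty behaviour described here can be written down as a tiny function. This is a sketch of the logic only, in Python, with an arbitrary illustrative score; the function name and the value 3.0 are assumptions, not SpamAssassin's real rule scores.

```python
def locale_score(charset_locale, ok_locales, faraway_score=3.0):
    """Score contribution of a message's charset locale.

    Listed locales are neutral (0.0), never negative. Unlisted locales get a
    positive, spam-leaning score. faraway_score is an illustrative value only.
    """
    if "all" in ok_locales or charset_locale in ok_locales:
        return 0.0
    return faraway_score


print(locale_score("en", {"en", "zh"}))  # listed: neutral
print(locale_score("ru", {"en", "zh"}))  # not listed: counts towards spam
```

Note that there is no branch returning a negative number, which is the difference from a whitelist.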

It's a rather twisted logic. You don't define what's good or bad (that
again would be a black/whitelist), you leave out what's bad...


> > Maybe the devs can briefly explain how the charset is being determined.
> > Or at least, where exactly in the code one could find it...

Matt, also, I got a feeling, that logic is what the OP is actually
about. He does not want to leave out what he wants to be scored on. But
(positively) define it.

  guenther


[1] As someone who has dealt with user filed bug reports in bugzilla
    extensively, I know, there is a chance to grok the general topic
    even if you don't know the language.



Re: the opposit of "ok_locales" ??

Posted by Stefan Jakobs <st...@rus.uni-stuttgart.de>.
On Saturday 08 December 2007 01:15, Karsten Bräckelmann wrote:
<snip>

> > Ok. My fault I mistook charsets with country codes. But replace se with
> > ru or ch or greek7. The result is the same. You want one charset to be
> > considered as "not ham" and you have to give the whole list to the
> > parameter. And I think it is a long and ugly to read list (see:
> > http://www.iana.org/assignments/character-sets)
>
> Yes, that list indeed is ugly. However, that is *not* what we are
> talking about. The list of valid locales for ok_locales can be found in
> the docs -- and totals 6, including en...

Only 6? Yes, I found it in the docs. (Yeah, I know: RTFM before you ask 
around). I apologize; with only 6 charsets it is not useful to have a 
not_ok_locales option.

> > I only want to say that there can be a situation in which you only know
> > that you don't want to consider the XXX charset as an indicator for ham.
>
> Despite its name, ok_locales is *not* about certain charsets being "an
> indicator for ham". The opposite is true. It does not assign a negative
> score. All it does is assigning a positive score for charsets "not in
> the ok list".

Maybe I should have said: "an indicator for NOT spam" ? Sh.., there are too 
many double negations and I'm too tired for that.

> > > Anyway, this whole example is non-realistic as is. As Matt pointed out
> > > in a later post, we are talking character sets here, not languages. In
> > > the world of ok_locales, there is no distinction between en and se,
> > > which is just en to ok_locales...
> >
> > As I say I got confused with it (and be it maybe still).
> >
> > Other question: How does Spamassassin know which charset it should use.
> > Provides it a list of all charsets and compares or does it try it to find
> > the information in the header of the mail or ...?
>
> Unfortunately, I don't know either. Although I'd like to...
>
> As per my counter example above, I do not want CHARSET_FARAWAY and
> friends to score on mail, just because a fellow hacker happens to have
> his original name in his sig or From: header. And it probably doesn't
> come as a surprise, that the example actually is real life. ;)
>
>
> Maybe the devs can briefly explain how the charset is being determined.
> Or at least, where exactly in the code one could find it...
>
>   guenther  - who is too lazy to dig through all the code right now :)

Bye
Stefan

Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2007-12-07 at 23:36 +0100, Stefan Jakobs wrote:
> On Friday 07 December 2007 20:42, Karsten Bräckelmann wrote:

> > Let's assume, one of them happens to be Swedish. And even though the
> > entire communication is English, that ignorant bastard dares to have his
> > real name at the bottom of his mail -- which includes Swedish chars.
> >
> > Do you hear that flushing sound of catching spam?
> 
> Do you mean: If I have one false positive I should throw my spam filter in a 
> trash can?

No. And I am not talking a single FP either.

My point is, that above approach is prone to hit hard on a lot of
totally legitimate mail. There is a *huge* difference between Cyrillic
or even Chinese or Japanese symbols -- and sub types of latin.


> > Swedish chars are a superset of English chars. As are German and many
> > others. To see that this is not an artificial, made up example please
> > have a look at my real name. :)
> 
> Ok. My fault I mistook charsets with country codes. But replace se with ru or 
> ch or greek7. The result is the same. You want one charset to be considered 
> as "not ham" and you have to give the whole list to the parameter. And I 
> think it is a long and ugly to read list (see: 
> http://www.iana.org/assignments/character-sets)

Yes, that list indeed is ugly. However, that is *not* what we are
talking about. The list of valid locales for ok_locales can be found in
the docs -- and totals 6, including en...


> I only want to say that there can be a situation in which you only know that 
> you don't want to consider the XXX charset as an indicator for ham.

Despite its name, ok_locales is *not* about certain charsets being "an
indicator for ham". The opposite is true. It does not assign a negative
score. All it does is assigning a positive score for charsets "not in
the ok list".


> > Anyway, this whole example is non-realistic as is. As Matt pointed out
> > in a later post, we are talking character sets here, not languages. In
> > the world of ok_locales, there is no distinction between en and se,
> > which is just en to ok_locales...
> 
> As I say I got confused with it (and be it maybe still).

> Other question: How does Spamassassin know which charset it should use. 
> Provides it a list of all charsets and compares or does it try it to find the 
> information in the header of the mail or ...?

Unfortunately, I don't know either. Although I'd like to...

As per my counter example above, I do not want CHARSET_FARAWAY and
friends to score on mail, just because a fellow hacker happens to have
his original name in his sig or From: header. And it probably doesn't
come as a surprise, that the example actually is real life. ;)


Maybe the devs can briefly explain how the charset is being determined.
Or at least, where exactly in the code one could find it...

  guenther  - who is too lazy to dig through all the code right now :)




Re: the opposit of "ok_locales" ??

Posted by Stefan Jakobs <st...@rus.uni-stuttgart.de>.
On Friday 07 December 2007 20:42, Karsten Bräckelmann wrote:
> On Fri, 2007-12-07 at 08:38 -0500, Matt Kettler wrote:
> > Stefan Jakobs wrote:
> > > Let's assume you running a mailrelay for a university and your users
> > > are from different countries. Lets assume further on you have no
> > > Swedish people at your university (and you get a lot of spam from
> > > Sweden). Then it would be nice to have a not_ok_locales option, because
> > > you see immediately which locale character set is considered as
> > > possible spam.
>
> Now let's further assume, your students are able to speak English. And
> they are collaborating with an Open Source project, discussing with a
> lot of people from all over the world.
>
> Let's assume, one of them happens to be Swedish. And even though the
> entire communication is English, that ignorant bastard dares to have his
> real name at the bottom of his mail -- which includes Swedish chars.
>
> Do you hear that flushing sound of catching spam?

Do you mean: if I have one false positive I should throw my spam filter in the 
trash can? Of course it can happen that a mail is caught by rules which 
were not made for it, especially at universities, where you have a great range 
of different types of mail.

> Swedish chars are a superset of English chars. As are German and many
> others. To see that this is not an artificial, made up example please
> have a look at my real name. :)

OK. My fault, I mistook charsets for country codes. But replace se with ru or 
ch or greek7. The result is the same. You want one charset to be considered 
as "not ham" and you have to give the whole list to the parameter. And I 
think it is a long list that is ugly to read (see: 
http://www.iana.org/assignments/character-sets)

I only want to say that there can be a situation in which you only know that 
you don't want to consider the XXX charset as an indicator for ham.

> > Now that sounds like a valid reason to me.
>
> It doesn't to me...
>
> Anyway, this whole example is non-realistic as is. As Matt pointed out
> in a later post, we are talking character sets here, not languages. In
> the world of ok_locales, there is no distinction between en and se,
> which is just en to ok_locales...

As I said, I got confused by it (and maybe still am).
>
>   guenther

Another question: how does SpamAssassin know which charset it should use? 
Does it provide a list of all charsets and compare, or does it try to find the 
information in the header of the mail, or ...?

Greetings
Stefan

Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2007-12-07 at 08:38 -0500, Matt Kettler wrote:
> Stefan Jakobs wrote:
> > Let's assume you running a mailrelay for a university and your users are from 
> > different countries. Lets assume further on you have no Swedish people at 
> > your university (and you get a lot of spam from Sweden). Then it would be 
> > nice to have a not_ok_locales option, because you see immediately which 
> > locale character set is considered as possible spam.

Now let's further assume, your students are able to speak English. And
they are collaborating with an Open Source project, discussing with a
lot of people from all over the world.

Let's assume, one of them happens to be Swedish. And even though the
entire communication is English, that ignorant bastard dares to have his
real name at the bottom of his mail -- which includes Swedish chars.

Do you hear that flushing sound of catching spam?


Swedish chars are a superset of English chars. As are German and many
others. To see that this is not an artificial, made up example please
have a look at my real name. :)


> Now that sounds like a valid reason to me.

It doesn't to me...


Anyway, this whole example is non-realistic as is. As Matt pointed out
in a later post, we are talking character sets here, not languages. In
the world of ok_locales, there is no distinction between en and se,
which is just en to ok_locales...

  guenther




Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
Matt Kettler wrote:
> Stefan Jakobs wrote:
>   
>> Let's assume you running a mailrelay for a university and your users are from 
>> different countries. Lets assume further on you have no Swedish people at 
>> your university (and you get a lot of spam from Sweden). Then it would be 
>> nice to have a not_ok_locales option, because you see immediately which 
>> locale character set is considered as possible spam.
>>
>> If you have a list of: af ax al dz as ad ao ai aq ag ar am aw ac au at az bs 
>> bh bb by be bz bm bt bo ba ... ve vn vg vi wf eh ye yu zm zw
>> Do you see, that Sweden is the only country which is missing?  I know it 
>> maybe, but what happens when I quit my job. And somebody else should find the 
>> mistake, why some mails from Sweden are considered as spam. This can be trap.
>>
>> I know this is a case with a lot of "if", but I mean it is better to have good 
>> readable configuration than to prevent a second parameter which does nearly 
>> the same as the first one.
>>
>>   
>>     
> Now that sounds like a valid reason to me. The only problem is if you
> use not_ok_locales, then you should not use ok_locales.. This might get
> confusing to someone who thinks they're white/blacklists.
>
> It would be a harmless confusion, but if you specified:
>
> not_ok_locales se
> ok_locales en
>
> The ok_locales would do nothing at all.  We'll have to document that
> *very* carefully.
>   

FYI, an enhancement request has been created for this:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5743




Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2007-12-07 at 09:23 -0500, Matt Kettler wrote:
> Also, keep in mind that it's perfectly valid to have multiple ok_locales
> statements so:

No. :)

According to the documentation, "if there are multiple ok_locales lines,
only the last one is used."
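
The "last line wins" rule quoted above can be illustrated with a few lines of Python. This is a sketch of the documented behaviour under assumptions, not SpamAssassin's configuration parser; the function name is hypothetical, and "all" is taken as the default as stated elsewhere in this thread.

```python
def effective_ok_locales(config_text):
    """Return the locale list that is in effect after reading all lines.

    Each ok_locales line replaces the previous one; nothing is merged.
    """
    locales = ["all"]  # default when no ok_locales line is present
    for line in config_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if fields and fields[0] == "ok_locales":
            locales = fields[1:]
    return locales


CONFIG = """\
ok_locales en ja
ok_locales en
"""
print(effective_ok_locales(CONFIG))
```

With the two lines above, only `en` is in effect; `ja` from the first line is discarded rather than combined.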

  guenther




Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
Daniel J McDonald wrote:
>
>> It would be a harmless confusion, but if you specified:
>>
>> not_ok_locales se
>> ok_locales en
>>
>> The ok_locales would do nothing at all.  We'll have to document that
>> *very* carefully.
>>     
>
> Maybe something like:
> ok_locales !se all
>   

Hmm, that's a bit confusing to me, as the "all" would appear to be
redundant (it's the default)

Also, keep in mind that it's perfectly valid to have multiple ok_locales
statements so:

ok_locales !ja all
ok_locales !ko all


Would be confusing. Note: I also dropped the use of "se" here, as there
is no "se" locale. ok_locales works on character sets, not languages,
and Swedish is just 'en' to it.





Re: the opposit of "ok_locales" ??

Posted by Daniel J McDonald <da...@austinenergy.com>.
On Fri, 2007-12-07 at 08:38 -0500, Matt Kettler wrote:
> Stefan Jakobs wrote:
> > Let's assume you running a mailrelay for a university and your users are from 
> > different countries. Lets assume further on you have no Swedish people at 
> > your university (and you get a lot of spam from Sweden). Then it would be 
> > nice to have a not_ok_locales option, because you see immediately which 
> > locale character set is considered as possible spam.
> >
> > If you have a list of: af ax al dz as ad ao ai aq ag ar am aw ac au at az bs 
> > bh bb by be bz bm bt bo ba ... ve vn vg vi wf eh ye yu zm zw
> > Do you see, that Sweden is the only country which is missing?  I know it 
> > maybe, but what happens when I quit my job. And somebody else should find the 
> > mistake, why some mails from Sweden are considered as spam. This can be trap.
> >
> > I know this is a case with a lot of "if", but I mean it is better to have good 
> > readable configuration than to prevent a second parameter which does nearly 
> > the same as the first one.
> >
> >   
> Now that sounds like a valid reason to me. The only problem is if you
> use not_ok_locales, then you should not use ok_locales.. This might get
> confusing to someone who thinks they're white/blacklists.

> It would be a harmless confusion, but if you specified:
> 
> not_ok_locales se
> ok_locales en
> 
> The ok_locales would do nothing at all.  We'll have to document that
> *very* carefully.

Maybe something like:
ok_locales !se all

Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
Stefan Jakobs wrote:
> Let's assume you running a mailrelay for a university and your users are from 
> different countries. Lets assume further on you have no Swedish people at 
> your university (and you get a lot of spam from Sweden). Then it would be 
> nice to have a not_ok_locales option, because you see immediately which 
> locale character set is considered as possible spam.
>
> If you have a list of: af ax al dz as ad ao ai aq ag ar am aw ac au at az bs 
> bh bb by be bz bm bt bo ba ... ve vn vg vi wf eh ye yu zm zw
> Do you see, that Sweden is the only country which is missing?  I know it 
> maybe, but what happens when I quit my job. And somebody else should find the 
> mistake, why some mails from Sweden are considered as spam. This can be trap.
>
> I know this is a case with a lot of "if", but I mean it is better to have good 
> readable configuration than to prevent a second parameter which does nearly 
> the same as the first one.
>
>   
Now that sounds like a valid reason to me. The only problem is if you
use not_ok_locales, then you should not use ok_locales.. This might get
confusing to someone who thinks they're white/blacklists.

It would be a harmless confusion, but if you specified:

not_ok_locales se
ok_locales en

The ok_locales would do nothing at all.  We'll have to document that
*very* carefully.







Re: the opposit of "ok_locales" ??

Posted by Stefan Jakobs <st...@rus.uni-stuttgart.de>.
On Friday 07 December 2007 04:42, Matt Kettler wrote:
> jidanni@jidanni.org wrote:
> > MK> I'll be happy to change my assumptions, but can you name any good
> > reason MK> why they would want to do so?
> >
> > The Matt theme: restrict oneself from getting mail from any but a few
> > safe people, languages, or whatever. Life goes on in its familiar grey
> > days. But alas, the software knows best.
>
> Erm.. No. The Matt theme is to only add options if they have a use. I
> have yet to see a sensible argument for this..
>
> > The jidanni theme: open up life to a rainbow of possibilities. New
> > styles, new friends, new colors. Don't let the minor fact that we
> > filter out a tiny part of the spectrum cause us to miss out on new
> > contacts from who knows where.
>
> At that point, set ok_locales to all because you might miss out on new
> contacts from that tiny spectrum too.
>
> Also, how will you benefit from contact with this broader spectrum if
> they're emailing you in a character set you can't read?
>
> Now really. Can you make a serious argument why this configuration
> option would be useful. I'm being serious here. I honestly don't see a
> valid need for the option.
>
> And those who really want this effect can just list every locale except
> the one they dislike, if that's really what they want.

Let's assume you are running a mail relay for a university and your users are from 
different countries. Let's further assume you have no Swedish people at 
your university (and you get a lot of spam from Sweden). Then it would be 
nice to have a not_ok_locales option, because you would see immediately which 
locale character set is considered possible spam.

If you have a list of: af ax al dz as ad ao ai aq ag ar am aw ac au at az bs 
bh bb by be bz bm bt bo ba ... ve vn vg vi wf eh ye yu zm zw
Do you see that Sweden is the only country missing?  I may know it, 
but what happens when I quit my job and somebody else has to find out 
why some mails from Sweden are considered spam? This can be a trap.

I know this is a case with a lot of "ifs", but I think it is better to have a 
readable configuration than to avoid a second parameter which does nearly 
the same as the first one.

Greetings Stefan

Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
jidanni@jidanni.org wrote:
> MK> I'll be happy to change my assumptions, but can you name any good reason
> MK> why they would want to do so?
>
> The Matt theme: restrict oneself from getting mail from any but a few
> safe people, languages, or whatever. Life goes on in its familiar grey
> days. But alas, the software knows best.
>   
Erm.. No. The Matt theme is to only add options if they have a use. I
have yet to see a sensible argument for this..

> The jidanni theme: open up life to a rainbow of possibilities. New
> styles, new friends, new colors. Don't let the minor fact that we
> filter out a tiny part of the spectrum cause us to miss out on new
> contacts from who knows where.
>   
At that point, set ok_locales to all because you might miss out on new
contacts from that tiny spectrum too.

Also, how will you benefit from contact with this broader spectrum if
they're emailing you in a character set you can't read?

Now really. Can you make a serious argument why this configuration
option would be useful. I'm being serious here. I honestly don't see a
valid need for the option.

And those who really want this effect can just list every locale except
the one they dislike, if that's really what they want.

> Anyway, currently it's not even like one could just use "--" to obtain
> "+". And even if it was, our basic user is still looking for his
> blacklist_locales.
>   
Is he really? Or does he think ok_locales = whitelist_locales?




Re: the opposit of "ok_locales" ??

Posted by Dave Pooser <da...@pooserville.com>.
> The jidanni theme: open up life to a rainbow of possibilities.

Y'know, at the risk of being rude, does the rainbow of possibilities include
the possibility of READING the expletive-deleted CONF FILE? Just asking.

> But the basic user is not in the business of understanding things.

Then he shouldn't be tweaking SpamAssassin conf files, or most other server
settings. The world has enough Mouse Clicking System Engineers.
-- 
Dave Pooser
Cat-Herder-in-Chief, Pooserville.com
"...Life is not a journey to the grave with the intention of arriving
safely in one pretty and well-preserved piece, but to slide across the
finish line broadside, thoroughly used up, worn out, leaking oil, and
shouting GERONIMO!!!" -- Bill McKenna



Re: the opposit of "ok_locales" ??

Posted by ji...@jidanni.org.
MK> I'll be happy to change my assumptions, but can you name any good reason
MK> why they would want to do so?

The Matt theme: restrict oneself from getting mail from any but a few
safe people, languages, or whatever. Life goes on in its familiar grey
days. But alas, the software knows best.

The jidanni theme: open up life to a rainbow of possibilities. New
styles, new friends, new colors. Don't let the minor fact that we
filter out a tiny part of the spectrum cause us to miss out on new
contacts from who knows where.

Anyway, currently it's not even like one could just use "--" to obtain
"+". And even if it was, our basic user is still looking for his
blacklist_locales.

Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
jidanni@jidanni.org wrote:
>
> MK> Let's say you speak English and Chinese, and hate Russian because you
> MK> get lots of spam in that text format and don't speak it.
>
> That's me, English and Chinese, and hate Russian.
>
> MK> In this situation, why would you want "not_ok_locales ru" instead of
> MK> "ok_locales en zh"? Is there a reason you'd want to allow character sets
> MK> like Thai, Korean, etc, even though you don't understand them any better
> MK> than Russian?  No.
>
> You make assumptions about people's lifestyles.
>
> And what if they did?
>   
I'll be happy to change my assumptions, but can you name any good reason
why they would want to do so?



Re: the opposit of "ok_locales" ??

Posted by ji...@jidanni.org.
The basic user understands whitelist_from and blacklist_from. But when
he encounters the locales, he wonders why there cannot be
whitelist_locales and blacklist_locales. He does not want to learn the
superior logic of why his wish is not smart. He just wants to find the
commands for whitelist_locales and blacklist_locales, and can only
find half.

MK> The answer is to read the Conf manpage and understand it. It
MK> doesn't mention it in the exact wording you want, but there is an
MK> answer and ok_locales is exactly the answer you want.

But the basic user is not in the business of understanding things. He
is just looking for the pair whitelist_locales and blacklist_locales,
or whatever devious name they are called, and can only find half of
the pair.

Perhaps deep down some macro could be made so the user can finally
find such a pair, without having to understand anything.

MK> Quite frankly, a "not_ok_locales" option doesn't make any useful sense
MK> anyway. If you want to restrict the locales, restrict it to the ones you
MK> speak. Don't bother singling out just ones you dislike...

...just because the software can't do it yet.

MK> Let's say you speak English and Chinese, and hate Russian because you
MK> get lots of spam in that text format and don't speak it.

That's me, English and Chinese, and hate Russian.

MK> In this situation, why would you want "not_ok_locales ru" instead of
MK> "ok_locales en zh"? Is there a reason you'd want to allow character sets
MK> like Thai, Korean, etc, even though you don't understand them any better
MK> than Russian?  No.

You make assumptions about people's lifestyles.

And what if they did?

Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
jidanni@jidanni.org wrote:
> Anyway, Mail::SpamAssassin::Conf should admit that it doesn't mention
> "What if I hate a specific language, people, culture. Is there e.g., a
> not_ok_locales?"
>
> Don't put the answer here, put it on Mail::SpamAssassin::Conf, even if
> the answer is that there is no answer. Thank you.
>   
Well, at the risk of sounding more rude than I intend, the answer is to
read the Conf manpage and understand it. It doesn't mention it in the
exact wording you want, but there is an answer and ok_locales is exactly
the answer you want.

Quite frankly, a "not_ok_locales" option doesn't make any useful sense
anyway. If you want to restrict the locales, restrict it to the ones you
speak. Don't bother singling out just ones you dislike.

Let's say you speak English and Chinese, and hate Russian because you
get lots of spam in that text format and don't speak it.

In this situation, why would you want "not_ok_locales ru" instead of
"ok_locales en zh"? Is there a reason you'd want to allow character sets
like Thai, Korean, etc, even though you don't understand them any better
than Russian?  No.

Re: the opposit of "ok_locales" ??

Posted by ji...@jidanni.org.
Anyway, Mail::SpamAssassin::Conf should admit that it doesn't mention
"What if I hate a specific language, people, culture. Is there e.g., a
not_ok_locales?"

Don't put the answer here, put it on Mail::SpamAssassin::Conf, even if
the answer is that there is no answer. Thank you.

Re: the opposit of "ok_locales" ??

Posted by Matt Kettler <mk...@verizon.net>.
Axel Werner wrote:
> im looking for some "opposit" parameters of "ok_locales" to
> make spamassassin mark all incoming mail of some specific charsets or
> language settings (locales) to get marked by default.
>
> for example: since i life in western europe i never expect mails from
> eastern europe, asia, afrika or something like that. especialy if they
> use their locale charset n stuff.
>
> so im looking for some parameter doing a
>
> "if locate is not western-europe or western mark mail as spam"
>
> is there something i did not found in all the manuals and google
> searches??! 

You need ok_locales. This is *EXACTLY* how it works. It is not a
whitelist, it's a list of exceptions to a blacklist.

Re-read man Mail::SpamAssassin::Conf where it says:
"This option is used to specify which locales are considered OK for
incoming mail. Mail using the *character sets* that are allowed by this
option will not be marked as possibly being spam in a foreign language."

i.e., if the locale of the message isn't in ok_locales, CHARSET_FARAWAY
and friends will fire off, giving the message a positive score.

Now, technically the score of CHARSET_FARAWAY defaults to 3.2, which
isn't enough to mark a message as spam by itself. You can always override
the score, but I'd suggest trying it out with the default score first.
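
To make the arithmetic concrete, a minimal Python sketch of that behaviour
follows. It is a simplified model, not SpamAssassin's code: the
charset-to-locale table is a small illustrative assumption, only the single
CHARSET_FARAWAY score of 3.2 quoted above is modelled, and 5.0 is assumed as
the spam threshold (the stock required_score).

```python
# Simplified model of ok_locales; NOT SpamAssassin's implementation.
# Illustrative charset -> locale table (an assumption, far from complete).
CHARSET_LOCALE = {
    "us-ascii": "en",
    "iso-8859-1": "en",
    "koi8-r": "ru",
    "big5": "zh",
    "euc-kr": "ko",
}
CHARSET_FARAWAY_SCORE = 3.2  # default score quoted in this thread
REQUIRED_SCORE = 5.0         # assumed threshold (stock required_score)


def faraway_score(charset, ok_locales):
    """Score added when the charset's locale is not listed in ok_locales."""
    if "all" in ok_locales:
        return 0.0
    locale = CHARSET_LOCALE.get(charset.lower())
    if locale is None or locale in ok_locales:
        return 0.0  # this sketch leaves unknown charsets alone
    return CHARSET_FARAWAY_SCORE


def is_spam(other_score, charset, ok_locales):
    """True when the other rules plus the charset rule reach the threshold."""
    total = other_score + faraway_score(charset, ok_locales)
    return total >= REQUIRED_SCORE


print(is_spam(0.0, "koi8-r", ["en"]))        # rule fires alone: 3.2 < 5.0
print(is_spam(2.0, "koi8-r", ["en"]))        # 2.0 + 3.2 reaches 5.0
print(is_spam(2.0, "koi8-r", ["en", "ru"]))  # ru is allowed, nothing added
```

In this model a message in an unlisted charset still needs about two more
points from other rules before it is tagged, which is why trying the default
score first is a reasonable starting point.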

Re: the opposit of "ok_locales" ??

Posted by Per Jessen <pe...@computer.org>.
Jonathan Armitage wrote:

> Provided it is possible with your MTA, you could consider rejecting
> such email at that level, thus relieving SA of the burden of having to
> scan it at all.
> 
> This is easy in Exim, but I don't know if other mailers can do the
> same thing.

In postfix, a header or a body check would do it. 


/Per Jessen, Zürich
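
As an illustration only, the kind of pattern such a header check would rely on
can be sketched in Python. This is not Postfix syntax; the charset list and
the function name should_reject are assumptions chosen for the example.

```python
import re

# Charsets to refuse (illustrative choice, not a recommendation).
UNWANTED = ("koi8-r", "windows-1251", "big5", "gb2312", "euc-kr")

# Match a Content-Type header whose charset parameter is one of UNWANTED.
PATTERN = re.compile(
    r'^Content-Type:.*\bcharset\s*=\s*"?('
    + "|".join(re.escape(name) for name in UNWANTED)
    + r')"?',
    re.IGNORECASE,
)


def should_reject(header_line):
    """Return True if the header declares one of the unwanted charsets."""
    return PATTERN.search(header_line) is not None


print(should_reject('Content-Type: text/plain; charset="KOI8-R"'))
print(should_reject("Content-Type: text/plain; charset=iso-8859-1"))
```

A check like this only sees the declared charset of the header it is given;
it does not look at charsets declared in individual MIME parts.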


Re: the opposit of "ok_locales" ??

Posted by Jonathan Armitage <jo...@hepworthband.co.uk>.
Karsten Bräckelmann wrote:
> On Wed, 2007-12-05 at 17:00 +0100, Axel Werner wrote:
>> im looking for some "opposit" parameters of "ok_locales" to
>> make spamassassin mark all incoming mail of some specific charsets or 
>> language settings (locales) to get marked by default.
>>
Provided it is possible with your MTA, you could consider rejecting such email 
at that level, thus relieving SA of the burden of having to scan it at all.

This is easy in Exim, but I don't know if other mailers can do the same thing.

Jon

Re: the opposit of "ok_locales" ??

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2007-12-05 at 17:00 +0100, Axel Werner wrote:
> im looking for some "opposit" parameters of "ok_locales" to
> make spamassassin mark all incoming mail of some specific charsets or 
> language settings (locales) to get marked by default.
> 
> for example: since i life in western europe i never expect mails from 
> eastern europe, asia, afrika or something like that. especialy if they 
> use their locale charset n stuff.
> 
> so im looking for some parameter doing a
> 
> "if locate is not western-europe or western mark mail as spam"
                ^^^                   ^^^^^^^
> is there something i did not found in all the manuals and google searches??!

The documentation itself?
 http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Conf.html#language_options

ok_locales en  # Western character sets in general


However, note that unlike your desired behavior, this will NOT "mark
mail as spam". It will cause non-Western charsets to trigger some
rules (there are a couple) and thus add to the score. No single rule
marks a mail as spam.

The combined scored rules are likely to pitchfork the mail beyond the
spam threshold, though, since the default scores for CHARSET_FARAWAY*
aren't particularly lightweight.

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}