Posted to users@spamassassin.apache.org by Philip Prindeville <ph...@redfish-solutions.com> on 2006/11/14 02:44:13 UTC

Re: פריצת דרך מאתגרת

At the risk of appearing to be (or revealing myself to be ;-) an
anti-Windows bigot (actually, I'm more of a pro-Open Standards
cheerleader), we score all of the "charset=Windows-125[0-8]"
messages at 4.85...

Why?  Because none of the Windows charsets do anything that
the ISO-8859-x charsets don't already do...  and at least one
Internet Draft suggests requiring the following encoding rules:

* ASCII (NVT) should be encoded as USASCII

* Anything that can fit into ISO-8859-1 must be encoded as this
  (assuming it doesn't fit into USASCII, of course);

* All else should be encoded as UTF8.  Period.  Full-stop.

Makes sense to me.
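
For what it's worth, the three encoding rules above can be sketched in a
few lines of Perl.  This is only an illustration: pick_charset is a
made-up helper name, not anything that exists in SpamAssassin or in any
mailer, and it assumes its input is an already-decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): return the narrowest
# charset label that can represent a decoded Perl character string.
sub pick_charset {
    my ($text) = @_;
    return 'US-ASCII'   if $text !~ /[^\x00-\x7F]/;   # pure 7-bit
    return 'ISO-8859-1' if $text !~ /[^\x00-\xFF]/;   # fits Latin 1
    return 'UTF-8';                                   # everything else
}

print pick_charset('plain text'), "\n";                        # ASCII only
print pick_charset("caf\x{E9}"), "\n";                         # e-acute
print pick_charset("\x{05E9}\x{05DC}\x{05D5}\x{05DD}"), "\n";  # Hebrew
```

The point of the ordering is that a message never gets a wider label
than it needs, so an all-ASCII message is never tagged with an 8-bit
charset.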

It's easy enough (it's a single registry setting) to force Outlook
to encode via either UTF-8 or Latin 1.

Should the following test be included in the distribution with a
token score (0.01 rather than 0.00, since a score of 0 disables a
rule), so that we can crank it up based on what ham vs. spam
differentiation indicates?

# flag text parts declared with a windows-125x charset...
mimeheader __CTYPE_MH_WIN1252   Content-Type =~ /charset="?windows-125[0-8]"?/i
meta WIN_CHARSET                ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
describe WIN_CHARSET            Content-Type is Windows-specific text
score WIN_CHARSET               0.01


-Philip




Robert Nicholson wrote:

>You may have misunderstood, but that's the point.
>
>The message was _not_ being filtered out like it should be and that  
>was because of the very generic /WINDOWS/ match.
>
>so that method doesn't really obey the locales you have set.
>
>when I take out the generic /WINDOWS/ match it does then screen it out.
>
>or rather is tagged against the rule.
>
>On Sep 11, 2006, at 8:40 AM, David Baron wrote:
>
>  
>
>>Locale for HEBREW is not in this list.
>>
>>    
>>
>>>Windows-1255
>>>
>>>and apparently with locales
>>>
>>>DB<6> x @locales
>>>0  'en'
>>>1  'th'
>>>2  'it'
>>>3  'en_US'
>>>
>>>Mail::SpamAssassin::Locales::is_charset_ok_for_locales($1, @locales)
>>>
>>>returns true
>>>
>>>Mail::SpamAssassin::Locales::is_charset_ok_for_locales(/home/robert/
>>>lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Locales.pm:91):
>>>91:       return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
>>>
>>>what?
>>>
>>>On Sep 10, 2006, at 4:38 PM, Robert Nicholson wrote:
>>>      
>>>
>>>>Why didn't foreign charset rules catch this?
>>>>
>>>>Begin forwarded message:
>>>>        
>>>>
>>>>>From: ariel@kini12.com
>>>>>Date: September 10, 2006 2:17:51 PM CDT
>>>>>To: robert@elastica.com
>>>>>Subject: פריצת דרך מאתגרת
>>>>>X-Spam-Dcc: : grub.camros.com 1113; Body=5 Fuz1=5 Fuz2=3
>>>>>X-Spam-Flag: YES
>>>>>X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on
>>>>>grub.camros.com
>>>>>X-Spam-Level: *****
>>>>>X-Spam-Status: Yes, score=5.7 required=0.6
>>>>>tests=BAYES_95,FRONTPAGE,
>>>>>HTML_90_100,HTML_IMAGE_RATIO_02,HTML_MESSAGE,HTML_TITLE_SUBJ_DIFF,
>>>>>MIME_HTML_ONLY,NO_REAL_NAME,UNPARSEABLE_RELAY autolearn=no
>>>>>version=3.1.1
>>>>>X-Spam-Report: *  1.0 NO_REAL_NAME From: does not include a real
>>>>>name *  0.0 UNPARSEABLE_RELAY Informational: message has
>>>>>unparseable relay *      lines *  0.5 HTML_IMAGE_RATIO_02 BODY:
>>>>>HTML has a low ratio of text to image *      area *  0.1
>>>>>HTML_90_100 BODY: Message is 90% to 100% HTML *  0.0 HTML_MESSAGE
>>>>>BODY: HTML included in message *  3.0 BAYES_95 BODY: Bayesian spam
>>>>>probability is 95 to 99% *      [score: 0.9667] *  0.0
>>>>>MIME_HTML_ONLY BODY: Message only has text/html MIME parts *  0.9
>>>>>FRONTPAGE RAW: Frontpage used to create the message *  0.3
>>>>>HTML_TITLE_SUBJ_DIFF HTML_TITLE_SUBJ_DIFF
>>>>>Received: (qmail 10557 invoked from network); 10 Sep 2006 18:17:08
>>>>>-0000
>>>>>Received: from  (HELO kini12.com) (208.53.131.241) by 64.34.193.12
>>>>>with SMTP; 10 Sep 2006 18:17:08 -0000
>>>>>Message-Id: <20...@kini12.com>
>>>>>Mime-Version: 1.0
>>>>>Content-Type: text/html; charset="windows-1255"
>>>>>Content-Transfer-Encoding: quoted-printable
>>>>>Lines: 124
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>להגיע למיליון לקוחות ?גם אתם רוצים
>>>>>נא לחצו כאן
>>>>>
>>>>>
>>>>>מתנצלים אם גרמנו להפרעה, להסרה
>>>>>מרשימת הדיוורנמען נכבד, אנו לחץ
>>>>>
>>>>>להסרה לחצו כאן
>>>>>          
>>>>>

Re: פריצת דרך מאתגרת

Posted by Bob Proulx <bo...@proulx.com>.

Benny Pedersen wrote:
> Subject: Re: פריצת דרך מאתגרת

Of course I can see the glyphs but I can't read the meaning.  Care to
clue us in?  I don't see the original message to which you were
replying and neither does it seem to be in the mailing list archive.

> Philip Prindeville wrote:
> > At the risk of appearing to be (or revealing myself to be ;-) an
> > anti-Windows bigot (actually, I'm more of a pro-Open Standards
> > cheerleader), we mark all of the "charset=Windows-125[0-8]"
> > messages by 4.85...

"They called me mad, and I called them mad, and damn them, they
outvoted me." -- Nathaniel Lee (on being consigned to a mental
institution, circa 17th c.)

While I admire the idea in its perverse concept I don't think it will
work.  (And people complained about collateral damage from SPEWS! :-)

> got this in the mailheaders of your mail:
> 
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> that screws my head all night :-)

Looks fine to me.

> 8bit encoding and at the same time utf-8
> I believe one of them needs to be 7bit, I just don't know which one :-)

Nope.  That is perfectly valid.  The only way to transport it 7bit
would be with an encoding such as base64 or quoted-printable.  I think
all internet transports handle 8bit fine these days.  Gratuitously
encoding messages is a spam sign.
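
To make the distinction concrete, here is a small sketch using only
core Perl modules (Encode, MIME::QuotedPrint, MIME::Base64).  The
has_8bit helper is a name invented for this example.  UTF-8 is the
charset, 8bit/quoted-printable/base64 are transfer encodings, and the
two are independent of each other:

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::QuotedPrint qw(encode_qp);
use MIME::Base64 qw(encode_base64);

# A short Hebrew word, written with escapes so this source stays ASCII.
my $text  = "\x{05E9}\x{05DC}\x{05D5}\x{05DD}\n";
my $bytes = encode('UTF-8', $text);   # the octets that go out as "8bit"

# Hypothetical helper: true if any octet has the high bit set,
# i.e. the data could not be sent over a strictly 7bit transport.
sub has_8bit {
    my ($octets) = @_;
    return $octets =~ /[\x80-\xFF]/ ? 1 : 0;
}

my $qp  = encode_qp($bytes);       # same UTF-8 text, 7bit-safe
my $b64 = encode_base64($bytes);   # same UTF-8 text, 7bit-safe

printf "raw UTF-8:        8-bit octets? %s\n", has_8bit($bytes) ? 'yes' : 'no';
printf "quoted-printable: 8-bit octets? %s\n", has_8bit($qp)    ? 'yes' : 'no';
printf "base64:           8-bit octets? %s\n", has_8bit($b64)   ? 'yes' : 'no';
```

Only the raw UTF-8 octets contain bytes above 0x7F; the other two forms
carry the same text in plain ASCII.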

> Poor SquirrelMail, which can't quote Unicode in the subject but displays it fine :(

But your reply munged the content of the subject.  I restored it for
this reply posting, mostly just to lend support for UTF-8.

Bob

Re: ????? ??? ??????

Posted by Benny Pedersen <me...@junc.org>.

On Tue, November 14, 2006 02:44, Philip Prindeville wrote:
> At the risk of appearing to be (or revealing myself to be ;-) an
> anti-Windows bigot (actually, I'm more of a pro-Open Standards
> cheerleader), we mark all of the "charset=Windows-125[0-8]"
> messages by 4.85...

I got this in the mail headers of your mail:

Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

That has been messing with my head all night :-)

8bit encoding and at the same time UTF-8?

I believe one of them needs to be 7bit, I just don't know which one :-)

Poor SquirrelMail, which can't quote Unicode in the subject but displays it fine :(

-- 
This message was sent using 100% recycled spam mails.

RE: ????? ??? ??????

Posted by Giampaolo Tomassoni <Gi...@Tomassoni.biz>.

> At the risk of appearing to be (or revealing myself to be ;-) an
> anti-Windows bigot (actually, I'm more of a pro-Open Standards
> cheerleader), we mark all of the "charset=Windows-125[0-8]"
> messages by 4.85...
> 
> Why?  Because none of the Windows charsets do anything that
> the ISO-8859-x charsets don't already do...  and at least one
> Internet Draft suggests requiring the following encoding rules:
> 
> * ASCII (NVT) should be encoded as USASCII
> 
> * Anything that can fit into ISO-8859-1 must be encoded as this
>   (assuming it doesn't fit into USASCII, of course);
> 
> * All else should be encoded as UTF8.  Period.  Full-stop.
> 
> Makes sense to me.
> 
> It's easy enough (it's a single registry setting) to force Outlook
> to encode via either UTF-8 or Latin 1.

You don't need the registry for that. I don't know about Outlook Express, but Outlook has its own setting for this somewhere under the Tools menu.


> Should the following test be included in the distribution (with a
> score of 0.00) and we can crank it up based on what ham vs.
> spam differentiation indicates?
> 
> # don't allow windows-1252 text attachments...
> mimeheader __CTYPE_MH_WIN1252   Content-Type =~ /charset=(\"windows-125[0-8]\"|windows-125[0-8])/i
> meta WIN_CHARSET                ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
> describe WIN_CHARSET            Content-Type is Windows-specific text
> score WIN_CHARSET               0.01

To stress SA's inclination toward open standards and dispel any impression of anti-Windows bigotry, I would suggest naming the test something like NOT_RFCxxx_CHARSET and testing for the charset not being one of the allowed ones.

Even IBM has (had?) its own charsets...
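
A rough sketch of what such an allowlist test could look like, written
as plain Perl rather than as an SA rule.  Everything here is an
assumption for illustration: the helper names charset_of and
is_nonstandard_charset are invented, and the allowed set is simply the
three charsets proposed earlier in this thread.

```perl
use strict;
use warnings;

# Assumption: the allowed set is the one proposed earlier in the thread.
my %allowed = map { $_ => 1 } qw(us-ascii iso-8859-1 utf-8);

# Hypothetical helper: extract the charset parameter from a Content-Type
# header value, or return undef when none is present.
sub charset_of {
    my ($ctype) = @_;
    return $ctype =~ /\bcharset\s*=\s*"?([^\s";]+)"?/i ? lc($1) : undef;
}

# True when a charset is declared and it is not on the allowlist.
sub is_nonstandard_charset {
    my ($ctype) = @_;
    my $cs = charset_of($ctype);
    return (defined $cs && !$allowed{$cs}) ? 1 : 0;
}

for my $h ('text/html; charset="windows-1255"',
           'text/plain; charset=UTF-8',
           'text/plain') {
    printf "%-36s => %d\n", $h, is_nonstandard_charset($h);
}
```

Testing against what is allowed, instead of against one vendor's names,
would also catch the IBM charsets without any extra rule.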

-----------------------------------
Giampaolo Tomassoni - IT Consultant
Piazza VIII Aprile 1948, 4
I-53044 Chiusi (SI) - Italy
Ph: +39-0578-21100

MAI inviare una e-mail a:
NEVER send an e-mail to:
 rainbowl@tomassoni.eu

> 
> 
> -Philip
> 
> 
> 
> 
> Robert Nicholson wrote:
> 
> >You may have misunderstood, but that's the point.
> >
> >The message was _not_ being filtered out like it should be and that  
> >was because of the very generic /WINDOWS/ match.
> >
> >so that method doesn't really obey the locales you have set.
> >
> >when I take out the generic /WINDOWS/ match it does then screen it out.
> >
> >or rather is tagged against the rule.
> >
> >On Sep 11, 2006, at 8:40 AM, David Baron wrote:
> >
> >  
> >
> >>Locale for HEBREW is not in this list.
> >>
> >>    
> >>
> >>>Windows-1255
> >>>
> >>>and apparently with locales
> >>>
> >>>DB<6> x @locales
> >>>0  'en'
> >>>1  'th'
> >>>2  'it'
> >>>3  'en_US'
> >>>
> >>>Mail::SpamAssassin::Locales::is_charset_ok_for_locales($1, @locales)
> >>>
> >>>returns true
> >>>
> >>>Mail::SpamAssassin::Locales::is_charset_ok_for_locales(/home/robert/
> >>>lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Locales.pm:91):
> >>>91:       return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
> >>>
> >>>what?
> >>>
> >>>On Sep 10, 2006, at 4:38 PM, Robert Nicholson wrote:
> >>>      
> >>>
> >>>>Why didn't foreign charset rules catch this?
> >>>>
> >>>>Begin forwarded message:
> >>>>        
> >>>>
> >>>>>From: ariel@kini12.com
> >>>>>Date: September 10, 2006 2:17:51 PM CDT
> >>>>>To: robert@elastica.com
> >>>>>Subject: פריצת דרך מאתגרת
> >>>>>X-Spam-Dcc: : grub.camros.com 1113; Body=5 Fuz1=5 Fuz2=3
> >>>>>X-Spam-Flag: YES
> >>>>>X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on
> >>>>>grub.camros.com
> >>>>>X-Spam-Level: *****
> >>>>>X-Spam-Status: Yes, score=5.7 required=0.6
> >>>>>tests=BAYES_95,FRONTPAGE,
> >>>>>HTML_90_100,HTML_IMAGE_RATIO_02,HTML_MESSAGE,HTML_TITLE_SUBJ_DIFF,
> >>>>>MIME_HTML_ONLY,NO_REAL_NAME,UNPARSEABLE_RELAY autolearn=no
> >>>>>version=3.1.1
> >>>>>X-Spam-Report: *  1.0 NO_REAL_NAME From: does not include a real
> >>>>>name *  0.0 UNPARSEABLE_RELAY Informational: message has
> >>>>>unparseable relay *      lines *  0.5 HTML_IMAGE_RATIO_02 BODY:
> >>>>>HTML has a low ratio of text to image *      area *  0.1
> >>>>>HTML_90_100 BODY: Message is 90% to 100% HTML *  0.0 HTML_MESSAGE
> >>>>>BODY: HTML included in message *  3.0 BAYES_95 BODY: Bayesian spam
> >>>>>probability is 95 to 99% *      [score: 0.9667] *  0.0
> >>>>>MIME_HTML_ONLY BODY: Message only has text/html MIME parts *  0.9
> >>>>>FRONTPAGE RAW: Frontpage used to create the message *  0.3
> >>>>>HTML_TITLE_SUBJ_DIFF HTML_TITLE_SUBJ_DIFF
> >>>>>Received: (qmail 10557 invoked from network); 10 Sep 2006 18:17:08
> >>>>>-0000
> >>>>>Received: from  (HELO kini12.com) (208.53.131.241) by 64.34.193.12
> >>>>>with SMTP; 10 Sep 2006 18:17:08 -0000
> >>>>>Message-Id: <20...@kini12.com>
> >>>>>Mime-Version: 1.0
> >>>>>Content-Type: text/html; charset="windows-1255"
> >>>>>Content-Transfer-Encoding: quoted-printable
> >>>>>Lines: 124
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>להגיע למיליון לקוחות ?גם אתם רוצים
> >>>>>נא לחצו כאן
> >>>>>
> >>>>>
> >>>>>מתנצלים אם גרמנו להפרעה, להסרה
> >>>>>מרשימת הדיוורנמען נכבד, אנו לחץ
> >>>>>
> >>>>>להסרה לחצו כאן
> >>>>>          
> >>>>>
>

Accurately deprecating charsets

Posted by Philip Prindeville <ph...@redfish-solutions.com>.

I'll ask again...  Can someone who handles a fair mix of
email content (i.e. not just western European languages)
do a triage (individually) of the rules below for ham versus
spam?

I'd suspect that very little genuine ham contains "IBM852"
or "Unicode" or "CP125[0-8]" these days.

Thanks,

-Philip



Robert Nicholson wrote:

> so what is the conclusion to this issue?
>
> why when I set ok_locales to it th en does it allow any Charset with
> "Windows" in the name
> to bypass that setting?
>
> Why is it that is_charset_ok_for_locales written to give exceptions
>
> sub is_charset_ok_for_locales {
>   my ($cs, @locales) = @_;
>
>   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
>   $cs =~ s/^3D//gs;             # broken by quoted-printable
>   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st
>
>   study $cs;
>   #warn "JMD $cs";
>
>   # always OK (the net speaks mostly roman charsets)
>   return 1 if ($cs eq 'USASCII');
>   return 1 if ($cs =~ /^ISO8859/);
>   return 1 if ($cs =~ /^ISO10646/);
>   return 1 if ($cs =~ /^UTF/);
>   return 1 if ($cs =~ /^UCS/);
>   return 1 if ($cs =~ /^CP125/);
>   return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
>   return 1 if ($cs eq 'IBM852');
>   return 1 if ($cs =~ /^UNICODE11UTF[78]/);     # wtf? never heard of it
>   return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit
>   return 1 if ($cs eq 'ISO');   # Magellan, sending as 'charset=iso 8859-15'. grr
>
>   foreach my $locale (@locales) {
>     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
>     $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh
>
>     my $ok_for_loc = $charsets_for_locale{$locale};
>     next if (!defined $ok_for_loc);
>
>     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
>       return 1;
>     }
>   }
>
>   return 0;
> }

Re: ????? ??? ??????

Posted by Philip Prindeville <ph...@redfish-solutions.com>.

You'd think, wouldn't you????

-Philip


Robert Nicholson wrote:

> This is Japanese
>
>   # Japanese: Peter Evans writes: iso-2022-jp = rfc approved, rfc 1468, created
>   # by Jun Murai in 1993 back when he didnt have white hair!  rfc approved.
>   # (rfc 2237) <-- by M$.
>   'ja' => 'EUCJP JISX020119760 JISX020819830 JISX020819900 JISX020819970 '.
>         'JISX021219900 JISX021320001 JISX021320002 SHIFT_JIS SHIFTJIS '.
>         'ISO2022JP SJIS JIS7 JISX0201 JISX0208 JISX0212',
>
> Surely the MUA only changes the charset to Windows-1255 once it sees
> there are glyphs in which case you'd expect seldom to see Windows-1255
> when there are no glyphs present?
>
> On Nov 16, 2006, at 4:24 PM, Philip Prindeville wrote:
>
>> Windows-1256... but a sane mailer would detect that a message
>> all fits into 7-bits and use USASCII instead.
>

Re: ????? ??? ??????

Posted by Robert Nicholson <ro...@elastica.com>.

This is Japanese

   # Japanese: Peter Evans writes: iso-2022-jp = rfc approved, rfc 1468, created
   # by Jun Murai in 1993 back when he didnt have white hair!  rfc approved.
   # (rfc 2237) <-- by M$.
   'ja' => 'EUCJP JISX020119760 JISX020819830 JISX020819900 JISX020819970 '.
         'JISX021219900 JISX021320001 JISX021320002 SHIFT_JIS SHIFTJIS '.
         'ISO2022JP SJIS JIS7 JISX0201 JISX0208 JISX0212',

Surely the MUA only changes the charset to Windows-1255 once it sees
non-ASCII glyphs, in which case you'd expect to seldom see
Windows-1255 when no such glyphs are present?

On Nov 16, 2006, at 4:24 PM, Philip Prindeville wrote:

> Windows-1256... but a sane mailer would detect that a message
> all fits into 7-bits and use USASCII instead.

Re: ????? ??? ??????

Posted by Philip Prindeville <ph...@redfish-solutions.com>.

I would say that this issue in general (and this file in particular) is
more than overdue for a revisiting.

I haven't seen UCS, CP125?, or IBM852 for a long time.  Likewise
for "UNICODE" or "XUNKNOWN".

As for "ISO" (tout court) from Magellan... that's broken, and if it
hasn't been fixed by now, then it's their problem, not ours.  Easier to
whitelist the few users still clinging to broken mailers than to
continue to compromise spamproofness.

As for Windows...  I would change the test from:

$cs =~ /^WINDOWS/

to:

$cs eq 'WINDOWS1252'

instead (no hyphen, because $cs has already had every non-alphanumeric
character stripped at that point).  There is no reason to use any of
the other Windows character sets:  they offer nothing that UTF doesn't
already have.
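
One detail worth checking before changing that test: the function
upper-cases the charset and strips everything that is not A-Z or 0-9
before any comparison runs.  The sketch below repeats just those first
two normalization steps (normalize_charset is a name made up for this
example) to show what an exact match has to compare against:

```perl
use strict;
use warnings;

# Same first two normalization steps as the quoted
# is_charset_ok_for_locales; the helper name itself is hypothetical.
sub normalize_charset {
    my ($cs) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;   # drops '-', '_', quotes, spaces ...
    return $cs;
}

my $cs = normalize_charset('"windows-1252"');
print "$cs\n";
print(($cs eq 'WINDOWS-1252') ? "hyphenated literal matches\n"
                              : "hyphenated literal never matches\n");
print(($cs eq 'WINDOWS1252')  ? "normalized literal matches\n"
                              : "normalized literal does not match\n");
```

So an exact-match test inside that function needs the normalized
spelling, with no hyphen.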

Being liberal in what you accept is good if interoperability is
your goal.  If security and integrity, however, are paramount, then
being paranoid in what you accept might actually be more
appropriate.

Is there anyone out there (preferably in Central/Eastern Europe)
that handles a high volume of traffic that can tell us if
any of these encodings are still in legitimate use?  Like "ISO10646"
or "UCS" or ISO-8859-8 or CP125?, etc.

The alternative is to add checks per language for each of the
Windows-125[0-8] types.  Yes, you can encode English in
Windows-1256... but a sane mailer would detect that a message
all fits into 7-bits and use USASCII instead.

If it doesn't, then it's broken and needs to be fixed.
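
As a rough idea of what the per-language alternative might look like:
the script that each of code pages 1250-1258 was designed for is well
known, but the locale table below is only an illustrative, incomplete
assumption, and windows_charset_ok is an invented name.  The charset is
expected in the normalized form used by Locales.pm (upper case, no
punctuation).

```perl
use strict;
use warnings;

# Windows code pages 1250-1258 and the scripts they were designed for.
my %script_for_cp = (
    1250 => 'Central European', 1251 => 'Cyrillic', 1252 => 'Western European',
    1253 => 'Greek',            1254 => 'Turkish',  1255 => 'Hebrew',
    1256 => 'Arabic',           1257 => 'Baltic',   1258 => 'Vietnamese',
);

# Assumption: an illustrative, incomplete locale-to-code-page table.
my %cp_for_locale = (
    pl => 1250, ru => 1251, en => 1252, it => 1252, el => 1253,
    tr => 1254, he => 1255, ar => 1256, lt => 1257, vi => 1258,
);

# True if the normalized charset is a Windows-125x code page that
# matches at least one of the configured locales.
sub windows_charset_ok {
    my ($cs, @locales) = @_;
    return 0 unless $cs =~ /^WINDOWS(125[0-8])$/;
    my $cp = $1;
    my $hits = grep { ($cp_for_locale{ substr($_, 0, 2) } // 0) == $cp } @locales;
    return $hits ? 1 : 0;
}

my @locales = qw(en th it en_US);
for my $cs (qw(WINDOWS1252 WINDOWS1255)) {
    my ($cp) = $cs =~ /(\d+)$/;
    printf "%s (%s): %s\n", $cs, $script_for_cp{$cp},
        windows_charset_ok($cs, @locales) ? 'ok' : 'not ok for these locales';
}
```

With the locale list from the original report, the Hebrew code page is
rejected while the Western European one still passes.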

I'm not against reinventing the wheel when a new design is
offered that's better.  But I'm not convinced that Windows-1252
is an improvement over Latin-1.  For instance, the glyphs "oe"
and "OE" aren't a unique letter:  they are a presentation (i.e.
ligature) that renders (displays) differently from writing "o" and
"e" separately... but it is in fact just the two letters "o" and "e"
that are being represented (similarly for "ij" in Dutch, etc)
without kerning between them.

The bottom line is you don't need specific characters for
"oe" and "ij", etc.  You just need a rendering engine that
understands when using a ligature is appropriate (same
as with "ss" in German, or "ff", "fl", etc. in English).

Making these distinct characters was folly.

But I digress.

Just out of curiosity, what are the charsets_for_locale{'en'}
anyway?  If it were up to me, I'd limit it to USASCII,
ISO-8859-1, and UTF-8.  Period.

Likewise, for Japanese, how many UA's use anything other
than ISO2022JP?  This is the blessed standard.  Anything else
is out-of-date and requires a fix.

-Philip

Robert Nicholson wrote:

> so what is the conclusion to this issue?
>
> why when I set ok_locales to it th en does it allow any Charset with
> "Windows" in the name
> to bypass that setting?
>
> Why is it that is_charset_ok_for_locales written to give exceptions
>
> sub is_charset_ok_for_locales {
>   my ($cs, @locales) = @_;
>
>   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
>   $cs =~ s/^3D//gs;             # broken by quoted-printable
>   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st
>
>   study $cs;
>   #warn "JMD $cs";
>
>   # always OK (the net speaks mostly roman charsets)
>   return 1 if ($cs eq 'USASCII');
>   return 1 if ($cs =~ /^ISO8859/);
>   return 1 if ($cs =~ /^ISO10646/);
>   return 1 if ($cs =~ /^UTF/);
>   return 1 if ($cs =~ /^UCS/);
>   return 1 if ($cs =~ /^CP125/);
>   return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
>   return 1 if ($cs eq 'IBM852');
>   return 1 if ($cs =~ /^UNICODE11UTF[78]/);     # wtf? never heard of it
>   return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit
>   return 1 if ($cs eq 'ISO');   # Magellan, sending as 'charset=iso 8859-15'. grr
>
>   foreach my $locale (@locales) {
>     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
>     $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh
>
>     my $ok_for_loc = $charsets_for_locale{$locale};
>     next if (!defined $ok_for_loc);
>
>     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
>       return 1;
>     }
>   }
>
>   return 0;
> }
>
> On Nov 13, 2006, at 8:30 PM, Giampaolo Tomassoni wrote:
>
>>> # don't allow windows-1252 text attachments...
>>> mimeheader __CTYPE_MH_WIN1252   Content-Type =~ /charset=(\"windows-125[0-8]\"|windows-125[0-8])/i
>>> meta WIN_CHARSET                ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
>>> describe WIN_CHARSET            Content-Type is Windows-specific text
>>> score WIN_CHARSET               0.01
>>>
>

Re: ????? ??? ??????

Posted by Robert Nicholson <ro...@elastica.com>.

So what is the conclusion to this issue?

Why, when I set ok_locales to "it th en", does it allow any charset with
"Windows" in the name to bypass that setting?

Why is is_charset_ok_for_locales written to grant these exceptions?

sub is_charset_ok_for_locales {
   my ($cs, @locales) = @_;

   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
   $cs =~ s/^3D//gs;             # broken by quoted-printable
   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st

   study $cs;
   #warn "JMD $cs";

   # always OK (the net speaks mostly roman charsets)
   return 1 if ($cs eq 'USASCII');
   return 1 if ($cs =~ /^ISO8859/);
   return 1 if ($cs =~ /^ISO10646/);
   return 1 if ($cs =~ /^UTF/);
   return 1 if ($cs =~ /^UCS/);
   return 1 if ($cs =~ /^CP125/);
   return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
   return 1 if ($cs eq 'IBM852');
   return 1 if ($cs =~ /^UNICODE11UTF[78]/);     # wtf? never heard of it
   return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit
   return 1 if ($cs eq 'ISO');   # Magellan, sending as 'charset=iso 8859-15'. grr

   foreach my $locale (@locales) {
     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
     $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh

     my $ok_for_loc = $charsets_for_locale{$locale};
     next if (!defined $ok_for_loc);

     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
       return 1;
     }
   }

   return 0;
}
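
To show the effect I am asking about, here is a cut-down sketch of the
same lookup with the blanket /^WINDOWS/ exception made switchable.  It
is not the real code: charset_ok is an invented name, the exception
list is shortened, and the %charsets_for_locale entries are stand-ins
(the 'he' entry in particular is hypothetical), since the real table is
not shown in this thread.

```perl
use strict;
use warnings;

# Assumption: a tiny stand-in for SpamAssassin's %charsets_for_locale.
my %charsets_for_locale = (
    en => 'ISO88591',
    he => 'WINDOWS1255 ISO88598',   # hypothetical entry
);

# Sketch of the lookup with the blanket Windows exception made optional.
sub charset_ok {
    my ($cs, $blanket_windows, @locales) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;
    return 1 if $cs eq 'USASCII' || $cs =~ /^UTF/;
    return 1 if $blanket_windows && $cs =~ /^WINDOWS/;
    for my $locale (@locales) {
        (my $lang = $locale) =~ s/^([a-z][a-z]).*$/$1/;   # en_US => en
        my $ok = $charsets_for_locale{$lang};
        next unless defined $ok;
        return 1 if $ok =~ /(?:\A| )\Q$cs\E(?: |\z)/;
    }
    return 0;
}

my @locales = qw(en th it en_US);
print "with blanket exception:    ", charset_ok('windows-1255', 1, @locales), "\n";
print "without blanket exception: ", charset_ok('windows-1255', 0, @locales), "\n";
```

With the exception in place the locale list is never consulted for
Windows-1255; without it, the charset is only accepted when some
configured locale lists it.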

On Nov 13, 2006, at 8:30 PM, Giampaolo Tomassoni wrote:

>> # don't allow windows-1252 text attachments...
>> mimeheader __CTYPE_MH_WIN1252   Content-Type =~ /charset=(\"windows-125[0-8]\"|windows-125[0-8])/i
>> meta WIN_CHARSET                ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
>> describe WIN_CHARSET            Content-Type is Windows-specific text
>> score WIN_CHARSET               0.01
