
Posted to users@spamassassin.apache.org by ha...@t-online.de on 2014/07/04 08:08:51 UTC

Re: More text/plain questions

>> >> I got the following MIME body part below, and I'm wondering if it would make sense to filter on this as well.
>> >> Given that it's text/plain with an implicit charset="us-ascii" and an implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} doesn't really parse into a 16-bit character, would it? That would be a broken MUA that made such a leap...
>> >> Wouldn't that normally render as the character '&', '#', 'x', etc. rather than the unicode16 or UTF-8 character with that hex value?
>> >> There might be times when someone has sent an attachment improperly encoded this way which might have embedded binary values in it, but that's kind of buggy anyway… it should have been done as base64 and application/octet-stream in the worst of cases if it has arbitrary binary data.
>> >> I wouldn't want a message where someone gives a couple of examples of encoding &#x0400 for instance being flagged as SPAM, but if the text is 20% or more of these sequences then I would say that's SPAM-sign.
>> >> Anyway, here's the body I saw:
>> >> --1388-8200-b67c-e579-9c27-df36-12fa-a2eb
>> Content-Type: text/plain;
>> >> Th&#x0435; R&#x0435;&#x0430;l R&#x0435;&#x0430;&#x0455;&#x043E;nTh&#x0435; &#x13DF;&#x043E;m&#x0456;ng &#x13DF;&#x043E;ll&#x0430;&#x0440;&#x0455;&#x0435;...Th&#x0435; r&#x0435;&#x0430;l r&#x0435;&#x0430;&#x0455;&#x043E;n &#x13B3;HY H&#x043E;m&#x0435;l&#x0430;ndS&#x0435;cur&#x0456;t&#x0443; r&#x0435;c&#x0435;ntl&#x0443; &#x0440;urch&#x0430;&#x0455;&#x0435;d1.7 B&#x0456;ll&#x0456;&#x043E;n R&#x043E;und&#x0455; &#x043E;f &#x0430;mmun&#x0456;t&#x0456;&#x043E;n...&#x13B3;h&#x0430;t Y&#x043E;u Mu&#x0455;t D&#x043E; T&#x043E; &#x13AC;n&#x0455;ur&#x0435; Y&#x043E;urS&#x0430;f&#x0435;t&#x0443;H&#x043E;m&#x0435;l&#x0430;nd &#x0455;&#x0435;cur&#x0456;t&#x0443; &#x0456;&#x0455; th&#x0435;r&#x0435; t&#x043E; &#x0455;&#x0435;cur&#x0435;th&#x0435; h&#x043E;m&#x0435;l&#x0430;nd &#x043E;nl&#x0443;... S&#x043E; th&#x0435;&#x0455;&#x0435; &#x042C;ull&#x0435;t&#x0455;&#x0430;r&#x0435; r&#x0435;&#x0430;l&#x0443; m&#x0435;&#x0430;nt f&#x043E;r th&#x0435;Th&#x0456;&#x0455; &#x0456;&#x0455; &#x0430;n &#x0435;m&#x0430;&#x0456;l&#x0430;dv&#x0435;rt&#x0456;&#x0455;&#x0435;m&#x0435;nt th&#x0430;t w&#x0430;&#x0455; &#x0455;&#x0435;nt t&#x043E; &#x0443;&#x043E;u &#x042C;&#x0443; &#x03A1;&#x0430;tr&#x0456;&#x043E;t Surv&#x0456;v&#x0430;l &#x03A1;l&#x0430;n. 
If &#x0443;&#x043E;uw&#x0456;&#x0455;h t&#x043E; n&#x043E;l&#x043E;ng&#x0435;r r&#x0435;c&#x0435;&#x0456;v&#x0435; m&#x0435;&#x0455;&#x0455;&#x0430;g&#x0435;&#x0455; th&#x0430;t &#x0440;r&#x043E;m&#x043E;t&#x0435; &#x0455;urv&#x0456;v&#x0430;l t&#x0456;&#x0440;&#x0455;, &#x0440;l&#x0435;&#x0430;&#x0455;&#x0435;cl&#x0456;ck h&#x0435;r&#x0435; t&#x043E; un&#x0455;u&#x042C;&#x0455;cr&#x0456;&#x042C;&#x0435;.4 Unstable as water, thou shalt not excel because thou wentest up to thy fathers bed then defiledst thou it he went up to my couch.34 And Pharaohnechoh made Eliakim the son of Josiah king in the room of Josiah his father, and turned his name to Jehoiakim, and took Jehoahaz away and he came to Egypt, and died there.37  And the thing was good in the eyes of Pharaoh, and in the eyes o!
>> f all his servants.
>> >> --1388-8200-b67c-e579-9c27-df36-12fa-a2eb
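
A minimal sketch of the 20% heuristic proposed above, written in Python purely to illustrate the arithmetic (it is not a SpamAssassin rule or plugin). The function names and the threshold parameter are hypothetical; the pattern is the one from the question, with an optional trailing semicolon added as an assumption.

```python
import re

# Hex numeric character references with exactly four hex digits,
# e.g. "&#x0435;" or "&#x0400" (trailing semicolon optional).
HEX_NCR = re.compile(r"&#x[0-9A-Fa-f]{4};?")


def ncr_fraction(text: str) -> float:
    """Return the fraction of characters in `text` that are part of a hex NCR."""
    if not text:
        return 0.0
    covered = sum(len(m.group(0)) for m in HEX_NCR.finditer(text))
    return covered / len(text)


def looks_obfuscated(text: str, threshold: float = 0.20) -> bool:
    """True when at least `threshold` of the text consists of hex NCRs."""
    return ncr_fraction(text) >= threshold
```

A message that merely quotes one or two such sequences stays well below the threshold, while a body like the sample above is dominated by them.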

Hi,

while this is certainly not correct - and likely does not display in every mail client - it would
probably work in several webmailers. Perhaps this is the configuration the author of that
crap tested.
Now, I am somewhat reluctant to classify badly formatted mails as spam: there are many
systems around, even from major players, that send legitimate mails like order confirmations,
delivery notifications, or opted-in newsletters but get many of the formal things more wrong than right.
On the other hand, looking at the actual characters shows that the message is spam: these are
Cyrillic letters that happen to look exactly like Western ones (a, e, o or such), so the obvious intent
is to avoid detection of the strings. We have seen the same with IDN domain names that might
use a Cyrillic a to register a domain that looks like, e.g., paypal.com.
The list of characters is fairly short, so maybe checking for these characters in all commonly
used variants (HTML entities, UTF-8 encoded, +u0430, \u0430, IDN encoded) would be a good
spam indication.

Regards
Wolfgang
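
One way to sketch the look-alike check described above, in Python: decode HTML entities first, then flag words that mix Latin and Cyrillic letters. This is an illustrative assumption about how such a check could work, not an existing SpamAssassin test; the function names are hypothetical, and scripts are derived from Unicode character names as a simplification.

```python
import html
import re
import unicodedata


def scripts_in_word(word: str) -> set:
    """Return the scripts used by the letters of `word`.

    The script is approximated by the first word of the Unicode character
    name, e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def mixed_script_words(text: str) -> list:
    """Decode HTML entities, then list words mixing Latin and Cyrillic letters."""
    decoded = html.unescape(text)
    return [
        word
        for word in re.findall(r"\w+", decoded)
        if {"LATIN", "CYRILLIC"} <= scripts_in_word(word)
    ]
```

Legitimate Russian text next to English text is not flagged, because each individual word stays within one script. The other variants mentioned above (raw UTF-8, \u0430 escapes, IDN) would need their own decoding step before the same check.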

Re: More text/plain questions

Posted by "David F. Skoll" <df...@roaringpenguin.com>.

On Mon, 07 Jul 2014 19:29:11 -0400
Daniel Staal <DS...@usa.net> wrote:

> Just to start the discussion: I'd say default to UTF-8 if not
> otherwise specified and can't be worked out.  (How hard to work on
> 'working it out' is a question, of course.)  It's the growing
> standard, as far as I can tell.

+1.  UTF-8 is the best choice.  (Modern) Perl handles it very nicely.
Even non-UTF-8 messages should be recoded into UTF-8 for body rules;
otherwise, making a rule that looks for things like "抵押" will be
well-nigh impossible.

Regards,

David.
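
A small Python sketch of why recoding helps, under the assumption that the charset of the body part is known: the same phrase has different byte sequences in GB2312 and UTF-8, so one rule written against decoded text covers both. The names RULE and body_matches are hypothetical and only illustrate the idea; SpamAssassin's own implementation is in Perl.

```python
import re

# The example phrase from the message above, written once as text.
RULE = re.compile("抵押")


def body_matches(raw: bytes, charset: str) -> bool:
    """Decode `raw` with the given charset, then apply the rule to the text.

    Undecodable bytes become U+FFFD instead of raising, so a wrong charset
    produces a non-match rather than an error.
    """
    text = raw.decode(charset, errors="replace")
    return RULE.search(text) is not None
```

Without this step, a rule would have to list the byte pattern of the phrase separately for every charset it might arrive in.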

Re: More text/plain questions

Posted by Daniel Staal <DS...@usa.net>.

--As of July 7, 2014 5:20:01 PM -0400, Kevin A. McGrail is alleged to have 
said:

> On 7/7/2014 5:09 PM, Philip Prindeville wrote:
>> On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail <KM...@PCCC.com> wrote:
>>
>>> On 7/7/2014 2:28 AM, John Wilcock wrote:
>>>> Le 05/07/2014 19:08, Philip Prindeville a écrit :
>>>>> As for encoding a cyrillic small a: there are many ways to do this.
>>>>> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
>>>>> would be very efficient—there are just too many charsets possible.
>>>> Normalising the input message to UTF-8 before body checks would help
>>>> somewhat with that. I seem to remember there's been talk of doing this.
>>>>
>>> Yes, or utf-16...  I think that will be necessary to keep SA effective
>>> in the modern world sooner than later.
>>
>> Okay, but… if the message body is non-ASCII and the CTE is 8bit or
>> base64 and no explicit charset has been given, how do you know which
>> translation to perform?
>>
>> I get a lot of Han SPAM in GB2312 where the charset is never specified
>> (apparently it’s a national default in China, despite the requirements
>> stated in RFC-2045 and -2046).
> Sorry, I haven't even started delving into the devilish details but I
> know it's looming as a needed feature.

--As for the rest, it is mine.

Just to start the discussion: I'd say default to UTF-8 if not otherwise 
specified and can't be worked out.  (How hard to work on 'working it out' 
is a question, of course.)  It's the growing standard, as far as I can tell.

Even if it's wrong in a particular case, it would probably be useful: It 
would give rule writers something to work with.

Daniel T. Staal
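
The "default to UTF-8 if it can't be worked out" idea could look roughly like the Python sketch below. The order of attempts and the gb2312 fallback are assumptions made for illustration (the fallback reflects the unlabeled GB2312 mail mentioned earlier in the thread); decode_body is a hypothetical name, not part of any real API.

```python
from typing import Optional, Tuple


def decode_body(
    raw: bytes,
    declared: Optional[str] = None,
    fallbacks: Tuple[str, ...] = ("gb2312",),
) -> Tuple[str, str]:
    """Return (text, charset_used) for a body part.

    Attempts, in order: the declared charset (if any), strict UTF-8,
    site-specific fallbacks, and finally UTF-8 with replacement characters
    so that rule writers always get some text to work with.
    """
    candidates = ([declared] if declared else []) + ["utf-8", *fallbacks]
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Strict UTF-8 goes before the fallbacks because non-ASCII text in another charset rarely happens to be valid UTF-8. Single-byte charsets that accept every byte value (such as latin-1) would need to come last, since they never fail.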

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Re: More text/plain questions

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 7/7/2014 5:09 PM, Philip Prindeville wrote:
> On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail <KM...@PCCC.com> wrote:
>
>> On 7/7/2014 2:28 AM, John Wilcock wrote:
>>> Le 05/07/2014 19:08, Philip Prindeville a écrit :
>>>> As for encoding a cyrillic small a: there are many ways to do this.
>>>> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
>>>> would be very efficient—there are just too many charsets possible.
>>> Normalising the input message to UTF-8 before body checks would help somewhat with that. I seem to remember there's been talk of doing this.
>>>
>> Yes, or utf-16...  I think that will be necessary to keep SA effective in the modern world sooner than later.
>
> Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and no explicit charset has been given, how do you know which translation to perform?
>
> I get a lot of Han SPAM in GB2312 where the charset is never specified (apparently it’s a national default in China, despite the requirements stated in RFC-2045 and -2046).
Sorry, I haven't even started delving into the devilish details but I 
know it's looming as a needed feature.

regards,
KAM

Re: More text/plain questions

Posted by Philip Prindeville <ph...@redfish-solutions.com>.

On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail <KM...@PCCC.com> wrote:

> On 7/7/2014 2:28 AM, John Wilcock wrote:
>> Le 05/07/2014 19:08, Philip Prindeville a écrit :
>>> As for encoding a cyrillic small a: there are many ways to do this.
>>> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
>>> would be very efficient—there are just too many charsets possible.
>> 
>> Normalising the input message to UTF-8 before body checks would help somewhat with that. I seem to remember there's been talk of doing this.
>> 
> Yes, or utf-16...  I think that will be necessary to keep SA effective in the modern world sooner than later.

Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and no explicit charset has been given, how do you know which translation to perform?

I get a lot of Han SPAM in GB2312 where the charset is never specified (apparently it’s a national default in China, despite the requirements stated in RFC-2045 and -2046).

-Philip

Re: More text/plain questions

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 7/7/2014 2:28 AM, John Wilcock wrote:
> Le 05/07/2014 19:08, Philip Prindeville a écrit :
>> As for encoding a cyrillic small a: there are many ways to do this.
>> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
>> would be very efficient—there are just too many charsets possible.
>
> Normalising the input message to UTF-8 before body checks would help 
> somewhat with that. I seem to remember there's been talk of doing this.
>
Yes, or utf-16...  I think that will be necessary to keep SA effective 
in the modern world sooner than later.

Re: More text/plain questions

Posted by John Wilcock <jo...@tradoc.fr>.

Le 05/07/2014 19:08, Philip Prindeville a écrit :
> As for encoding a cyrillic small a: there are many ways to do this.
> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
> would be very efficient—there are just too many charsets possible.

Normalising the input message to UTF-8 before body checks would help 
somewhat with that. I seem to remember there's been talk of doing this.

-- 
John

Re: More text/plain questions

Posted by Philip Prindeville <ph...@redfish-solutions.com>.

On Jul 4, 2014, at 12:08 AM, hamann.w@t-online.de wrote:

> 
> Hi,
> 
> while this is certainly not correct - and likely does not display in every mail client - it would
> probably work in several webmailers. Perhaps this is the configuration the author of that
> crap tested.
> Now, I am somewhat reluctant to classify badly formatted mails as spam: there are many
> systems around, even from major players, that send legitimate mails like order confirmations,
> delivery notifications, or opted-in newsletters but get many of the formal things more wrong than right.
> On the other hand, looking at the actual characters shows that the message is spam: these are
> Cyrillic letters that happen to look exactly like Western ones (a, e, o or such), so the obvious intent
> is to avoid detection of the strings. We have seen the same with IDN domain names that might
> use a Cyrillic a to register a domain that looks like, e.g., paypal.com.
> The list of characters is fairly short, so maybe checking for these characters in all commonly
> used variants (HTML entities, UTF-8 encoded, +u0430, \u0430, IDN encoded) would be a good
> spam indication.
> 
> Regards
> Wolfgang
> 
> 

I think you’re overlooking what a lot of tests already do: test for poor formatting.

INVALID_DATE
UNPARSEABLE_RELAY
HTML_MISSING_CTYPE
MISSING_HEADERS
MISSING_DATE

As for encoding a cyrillic small a: there are many ways to do this. iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this would be very efficient—there are just too many charsets possible.

-Philip
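
To make the "too many charsets" concern concrete, the Python sketch below encodes the Cyrillic small a (U+0430) in several charsets and collects the resulting bytes. The charset list uses Python codec names picked for illustration and is not the exact list from the message; encodings_of is a hypothetical helper. Charsets that have no Cyrillic letters at all, such as Windows-1252, map to None.

```python
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A


def encodings_of(ch: str, charsets) -> dict:
    """Map each charset name to the bytes encoding `ch`, or None if it cannot."""
    result = {}
    for cs in charsets:
        try:
            result[cs] = ch.encode(cs)
        except UnicodeEncodeError:
            result[cs] = None  # this charset has no such character
    return result


table = encodings_of(
    CYRILLIC_A,
    ["utf-8", "koi8-r", "cp1251", "iso8859-5", "gb2312", "cp1252"],
)
```

Every charset that can represent the letter yields a different byte sequence, which is why matching on raw bytes would need one pattern per charset, while matching after normalisation needs only one.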