Posted to users@spamassassin.apache.org by Reindl Harald <h....@thelounge.net> on 2015/10/08 17:34:34 UTC

charset=utf-16 tricks out SA

Content-Type: text/plain; charset=utf-16
Content-Transfer-Encoding: base64

no custom body rules hit like they do for ISO/UTF8 :-(





Re: charset=utf-16 tricks out SA

Posted by Reindl Harald <h....@thelounge.net>.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7252 with the sample 
and link to this list thread - major because the sample is just an 
English mail tricking out SA, and if spammers find that information I 
expect a flood sooner or later - not disclosing the problem so it can 
get fixed won't make things better over the long run

On 10.10.2015 at 03:03, RW wrote:
> On Fri, 09 Oct 2015 14:22:18 +0200
> Mark Martinec wrote:
>
>> The problem with this message is that it declares the encoding
>> as UTF-16, i.e. without explicitly stating endianness like
>> UTF-16BE or UTF-16LE, and there is no BOM at the
>> beginning of each textual part, so endianness cannot be
>> determined. RFC 2781 says that big-endian encoding
>> should be assumed in the absence of a BOM.
>> See https://en.wikipedia.org/wiki/UTF-16
>>
>> In the provided message the actual endianness is LE, and
>> BOM is missing, so decoding as UTF-16BE fails and the
>> rule does not hit. Garbage-in, garbage-out.
>
>
> I'm not seeing any body tokens, even after training.
>
> I was expecting that the text would be tokenized as individual UTF-8
> sequences. ASCII characters encoded as UTF-16 and decoded with the
> wrong endianness are still valid UTF-16. Normalizing them into
> UTF-8 should produce completely multi-byte UTF-8 without whitespace or
> punctuation (not counting U+2000 inside UTF-8).
>
> If I add John Hardin's diagnostic rule
>
> body     __ALL_BODY     /.*/
> tflags   __ALL_BODY     multiple
>
> I get:
>
> ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_
> _p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_
> _p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_
> _...
>
> It looks like it's still UTF-16, and Bayes is seeing individual
> letters (which are too short to be tokens) separated by nulls.
>
> If I change the mime to utf-16le it works correctly, except that the
> subject isn't converted - including the copy in the body.  If I set the
> mime to utf-16be I get what appears to be the multi-byte UTF-8 I was
> expecting.
>
> So SA isn't falling back to big-endian; it won't normalize without an
> explicit endianness.
>
>
> BTW with normalize_charset 0 it looks like a spammer can effectively
> turn off body tokenization by using UTF-16 (with correct endianness).


Re: charset=utf-16 tricks out SA

Posted by RW <rw...@googlemail.com>.
On Sat, 10 Oct 2015 10:56:14 +0200
Mark Martinec wrote:


> > BTW with normalize_charset 0 it looks like a spammer can effectively
> > turn off body tokenization by using UTF-16 (with correct
> > endianness).
> 
> Yes. There are also other tricks that a spammer can play.
> It's not possible to emulate all the different behaviours of
> various mail reading programs. Still, in the case at hand
> it would make sense to also try utf-16le, since that is
> the default endianness on Windows.

It might be sensible to strip nulls. That way if text
contains unconverted UTF-16 (either because conversion failed or
normalization is off), encoded ASCII characters get converted correctly
into single bytes. Most body rules will then work, and Bayes can
tokenize the text. 
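
A minimal sketch of the idea in Perl - an illustration only, not a
patch against SA; the sample text here is made up:

use strict;
use warnings;
use Encode qw(encode);

# hypothetical payload, encoded the way the spam sample declares it
my $raw = encode('UTF-16LE', 'Dear potencial partner');

# deleting NUL bytes recovers the ASCII text whatever the endianness
(my $stripped = $raw) =~ tr/\x00//d;
print "$stripped\n";    # prints: Dear potencial partner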


Re: charset=utf-16 tricks out SA

Posted by Mark Martinec <Ma...@ijs.si>.
2015-10-10 03:03, RW wrote:

> I'm not seeing any body tokens, even after training.
> 
> I was expecting that the text would be tokenized as individual UTF-8
> sequences. ASCII characters encoded as UTF-16 and decoded with the
> wrong endianness are still valid UTF-16. Normalizing them into
> UTF-8 should produce completely multi-byte UTF-8 without whitespace or
> punctuation (not counting U+2000 inside UTF-8).
> 
> If I add John Hardin's diagnostic rule
> 
> body     __ALL_BODY     /.*/
> tflags   __ALL_BODY     multiple
> 
> I get:
> 
> ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_
> _p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_
> _p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_
> _...
> 
> It looks like it's still UTF-16, and Bayes is seeing individual
> letters (which are too short to be tokens) separated by nulls.

The way it works now: if decoding as declared fails, and
some guessing fails too, it falls back to Windows-1252,
a single-byte encoding (a superset of ISO-8859-1) that
can't fail, which gives you the result you are seeing
(spaced out by null characters).
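
In outline the logic is something like this - a simplified sketch of
the behaviour just described, not the actual SA code:

use strict;
use warnings;
use Encode qw(decode FB_CROAK);

sub normalize_part {
    my ($bytes, $declared) = @_;
    # a bare "utf-16" part carries no BOM here, so try the declared
    # charset first, then big-endian, as described above
    for my $enc ($declared, 'UTF-16BE') {
        my $text = eval { decode($enc, $bytes, FB_CROAK) };
        return $text if defined $text;
    }
    # single-byte superset of ISO-8859-1; every byte maps to a
    # character, so this final step cannot fail
    return decode('Windows-1252', $bytes);
}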

> If I change the mime to utf-16le it works correctly, except that the
> subject isn't converted - including the copy in the body.  If I set the
> mime to utf-16be I get what appears to be the multi-byte UTF-8 I was
> expecting.

The encoded-word in the Subject header field needs to be
declared as utf-16le too, then it works (tried on trunk).

> So SA isn't falling back to big-endian; it won't normalize without an
> explicit endianness.

It tries as BE, and when Encode::decode reports a failure, it
decodes as Windows-1252.

> BTW with normalize_charset 0 it looks like a spammer can effectively
> turn off body tokenization by using UTF-16 (with correct endianness).

Yes. There are also other tricks that a spammer can play.
It's not possible to emulate all the different behaviours of
various mail reading programs. Still, in the case at hand
it would make sense to also try utf-16le, since that is
the default endianness on Windows.

   Mark

Re: charset=utf-16 tricks out SA

Posted by RW <rw...@googlemail.com>.
On Fri, 09 Oct 2015 14:22:18 +0200
Mark Martinec wrote:
 
> The problem with this message is that it declares the encoding
> as UTF-16, i.e. without explicitly stating endianness like
> UTF-16BE or UTF-16LE, and there is no BOM at the
> beginning of each textual part, so endianness cannot be
> determined. RFC 2781 says that big-endian encoding
> should be assumed in the absence of a BOM.
> See https://en.wikipedia.org/wiki/UTF-16
> 
> In the provided message the actual endianness is LE, and
> BOM is missing, so decoding as UTF-16BE fails and the
> rule does not hit. Garbage-in, garbage-out.


I'm not seeing any body tokens, even after training.

I was expecting that the text would be tokenized as individual UTF-8
sequences. ASCII characters encoded as UTF-16 and decoded with the
wrong endianness are still valid UTF-16. Normalizing them into
UTF-8 should produce completely multi-byte UTF-8 without whitespace or
punctuation (not counting U+2000 inside UTF-8).
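
That expectation is easy to check with Encode directly (a standalone
demo with made-up sample text, not SA's actual code path):

use strict;
use warnings;
use Encode qw(encode decode);

# made-up sample text; in ASCII 'D' is 0x44 and ' ' is 0x20
my $le_bytes = encode('UTF-16LE', 'Dear potencial partner');

# decode with the wrong endianness: each ASCII pair 0xNN 0x00 read
# big-endian becomes the valid code point U+NN00, so nothing fails
my $swapped = decode('UTF-16BE', $le_bytes);

printf "U+%04X U+%04X\n",
    ord(substr($swapped, 0, 1)),    # 'D' -> U+4400
    ord(substr($swapped, 4, 1));    # ' ' -> U+2000 (EN QUAD, whitespace)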

If I add John Hardin's diagnostic rule

body     __ALL_BODY     /.*/
tflags   __ALL_BODY     multiple

I get:

ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_
_p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_
_p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_
_...

It looks like it's still UTF-16, and Bayes is seeing individual
letters (which are too short to be tokens) separated by nulls.
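
For anyone reproducing this: the hit line above comes from debug
output, so with the sample saved to a (hypothetical) msg.eml,
something along these lines should show it:

spamassassin -t -D < msg.eml 2>&1 | grep __ALL_BODY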

If I change the mime to utf-16le it works correctly, except that the
subject isn't converted - including the copy in the body.  If I set the
mime to utf-16be I get what appears to be the multi-byte UTF-8 I was
expecting.

So SA isn't falling back to big-endian; it won't normalize without an
explicit endianness.


BTW with normalize_charset 0 it looks like a spammer can effectively
turn off body tokenization by using UTF-16 (with correct endianness).

Re: charset=utf-16 tricks out SA

Posted by RW <rw...@googlemail.com>.
On Fri, 9 Oct 2015 14:47:53 +0200
Reindl Harald wrote:


> > In the provided message the actual endianness is LE, and
> > BOM is missing, so decoding as UTF-16BE fails and the
> > rule does not hit. Garbage-in, garbage-out.
> >
> > If you manually edit the sample and replace UTF-16
> > with UTF-16LE (and normalize is enabled), your rule should
> > hit - at least it does so in the current trunk code.
> 
> yes, but since Thunderbird shows the message and it doesn't contain
> special chars....


I suspect that it's falling back to native x86 little-endianness.

Personally, I don't get any UTF-16 mail, so I'd be happy to score
it.
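
Untested sketch of such a rule - it needs the stock MIMEHeader plugin
loaded (normally done in v320.pre), and the rule name here is made up:

loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

mimeheader  LOCAL_UTF16_PART  Content-Type =~ /charset="?utf-16/i
score       LOCAL_UTF16_PART  1.5
describe    LOCAL_UTF16_PART  Message part declares a UTF-16 charset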

Re: charset=utf-16 tricks out SA

Posted by Reindl Harald <h....@thelounge.net>.

On 09.10.2015 at 14:22, Mark Martinec wrote:
> Reindl Harald wrote:
>
>>>> no custom body rules hit like they do for ISO/UTF8 :-(
>>> What is your normalize_charset setting?
>>
>> enabled, that's what I meant by "like they do for ISO/UTF8" and
>> adding "dear potencial partner" to CUST_BODY_17 did not change the
>> score
>>
>> see attached sample and rule below
>>
>> body      CUST_BODY_17    /.*(1st page ranking of google|dear potencial partner).*/i
>> score     CUST_BODY_17    1.0
>> describe  CUST_BODY_17    Contains Low
>
> The problem with this message is that it declares the encoding
> as UTF-16, i.e. without explicitly stating endianness like
> UTF-16BE or UTF-16LE, and there is no BOM at the
> beginning of each textual part, so endianness cannot be
> determined. RFC 2781 says that big-endian encoding
> should be assumed in the absence of a BOM.

spammers are known to make mistakes, and usually those are the things 
that get scored - this case is the opposite

> In the provided message the actual endianness is LE, and
> BOM is missing, so decoding as UTF-16BE fails and the
> rule does not hit. Garbage-in, garbage-out.
>
> If you manually edit the sample and replace UTF-16
> with UTF-16LE (and normalize is enabled), your rule should
> hit - at least it does so in the current trunk code.

yes, but since Thunderbird shows the message and it doesn't contain 
special chars....

> If this seems to be common in the wild, please open a
> bug ticket, as Kevin suggested, and attach the sample there.

that was a message from the wild; it hit BAYES_999 but not enough to 
exceed the milter-reject score - hence the relevance of the body rule 
that didn't fire

will write a bug report as soon as I find some spare time (lots of stuff 
currently going on around me...)


Re: charset=utf-16 tricks out SA

Posted by Reindl Harald <h....@thelounge.net>.

On 11.10.2015 at 22:46, @lbutlr wrote:
> On Oct 10, 2015, at 3:59 AM, Linda A. Walsh <sa...@tlinx.org> wrote:
>
> [bollocks and tripe snipped]
>
>> But the big-iron struck back by pushing through an unrealistic default
>> for non-BOM UTF16 files... and yeah, it's in the standard, but
>> in the real world, it's not the default.
>
> Only if you consider the assbackward Microsoft as “the real world”.
>
> Hint: the vast majority of mail servers in the world are not running Microsoft OSes

how does it matter what software the majority of mail servers are 
running when you can easily trick out a mail server running Linux / 
SpamAssassin?

and even if I hate it and don't understand it: the majority of (relevant) 
mail servers seems to be running Microsoft, judging by the huge amount of 
backscatter with "unknown user" instead of rejecting such messages to make 
proper bounce management possible - *you* may not notice it, mail admins 
who receive backscatter for classifying mail do




Re: charset=utf-16 tricks out SA

Posted by "@lbutlr" <kr...@kreme.com>.
On Oct 10, 2015, at 3:59 AM, Linda A. Walsh <sa...@tlinx.org> wrote:

[bollocks and tripe snipped]

> But the big-iron struck back by pushing through an unrealistic default
> for non-BOM UTF16 files... and yeah, it's in the standard, but
> in the real world, it's not the default. 

Only if you consider the assbackward Microsoft as “the real world”.

Hint: the vast majority of mail servers in the world are not running Microsoft OSes.

-- 
“They always say time changes things, but you actually have to change
them yourself.” - Andy Warhol


Re: charset=utf-16 tricks out SA

Posted by "Linda A. Walsh" <sa...@tlinx.org>.

Mark Martinec wrote:
> Reindl Harald wrote:
> 
>>>> no custom body rules hit like they do for ISO/UTF8 :-(
>>> What is your normalize_charset setting?
> 
> The problem with this message is that it declares the encoding
> as UTF-16, i.e. without explicitly stating endianness like
> UTF-16BE or UTF-16LE, and there is no BOM at the
> beginning of each textual part, so endianness cannot be
> determined. RFC 2781 says that big-endian encoding
> should be assumed in the absence of a BOM.
> See https://en.wikipedia.org/wiki/UTF-16
> 
> In the provided message the actual endianness is LE, and
> BOM is missing, so decoding as UTF-16BE fails and the
> rule does not hit. Garbage-in, garbage-out.
----
	In the real world, RFC 2781 is full of bovine excrement.

The most common and real-world default is UTF-16LE, as blessed by
MS.  And the big-endian fanboys who are MS-haters have always hated
that fact -- but that doesn't change what any intelligent person
would assume about UTF-16 in the real world.

So you can follow rules written for the large-iron days before the PC,
or you can follow the real world.  I've encountered multiple UTF16
files in the wild that came from before BOM marks were used in an attempt
to tame MS, and use of BOM marks is not inherent in the core NT OS,
it's always a win32-consumer addon -- sorta like how MS's Unicode
support is still, most fully, only Unicode 2.0, with spurious additions
in the later versions (w/Unicode being @ version 8 now).

So it basically boils down to whether or not you want to go with
reality, or with last-generation losers.  I ran into this stupidity
when the perl community came out with a supposed replacement 
for iconv.  Except that it wasn't compatible w/the defaults.

iconv's output for UTF-16 is LE (w/BOM), and UCS2 = UTF-16 w/no BOM.
UCS2 = MS's full Unicode Standard 2.  And that's been the standard since
MS came out with their full UCS2 support in the UCS2 charset (except that
UCS3 and beyond wouldn't fit in 2 bytes, so they had to go with an encoding
similar to UTF-8 in UTF-16 -- where they still used UCS2 byte ordering (LE)).

But the big-iron struck back by pushing through an unrealistic default
for non-BOM UTF16 files... and yeah, it's in the standard, but
in the real world, it's not the default.  Unfortunately,
SA is written in Perl, which goes against real-world
usage and was about 10 years late to the game w/UTF-8
support when they reactively, completely reverted UTF-8
support in perl-5.8.0.  They have only in the past few years
restored somewhat proper function of assuming the locale encoding
on console-centric byte streams, while requiring files opened
w/open (as opposed to reading <> or writing to STDOUT/STDERR) to declare
a text encoding if you didn't want perl's unicode bug of reading
& writing binary data (0-255) as latin1 but issuing runtime
warnings or fatal errors if you manipulate that binary data
such that a charval > 255 ends up in the stream.  In such
a case, perl writes out chars > 255 in UTF-8 encoding, but all
chars < 255 are written out in incompatible latin1 -- unless
you pre-define the output charset as one or the other -- leaving
the default case to always generate wrong output on mixed usage
of chars < 255 and those > 255.

So that's the rough history, and it's still a problem today:
the real world vs. after-the-fact standards.







> 
> If you manually edit the sample and replace UTF-16
> with UTF-16LE (and normalize is enabled), your rule should
> hit - at least it does so in the current trunk code.
> 
> If this seems to be common in the wild, please open a
> bug ticket, as Kevin suggested, and attach the sample there.
> 
>   Mark
> 

Re: charset=utf-16 tricks out SA

Posted by Mark Martinec <Ma...@ijs.si>.
Reindl Harald wrote:

>>> no custom body rules hit like they do for ISO/UTF8 :-(
>> What is your normalize_charset setting?
> 
> enabled, that's what I meant by "like they do for ISO/UTF8" and
> adding "dear potencial partner" to CUST_BODY_17 did not change the
> score
> 
> see attached sample and rule below
> 
> body      CUST_BODY_17    /.*(1st page ranking of google|dear potencial partner).*/i
> score     CUST_BODY_17    1.0
> describe  CUST_BODY_17    Contains Low

The problem with this message is that it declares the encoding
as UTF-16, i.e. without explicitly stating endianness like
UTF-16BE or UTF-16LE, and there is no BOM at the
beginning of each textual part, so endianness cannot be
determined. RFC 2781 says that big-endian encoding
should be assumed in the absence of a BOM.
See https://en.wikipedia.org/wiki/UTF-16

In the provided message the actual endianness is LE, and
BOM is missing, so decoding as UTF-16BE fails and the
rule does not hit. Garbage-in, garbage-out.
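
For the record, Encode's behaviour on such a part can be seen
standalone (illustration with made-up text, not the SA code path):

use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# made-up sample, encoded as the message actually is: LE, no BOM
my $bytes = encode('UTF-16LE', 'Dear potencial partner');

# plain 'UTF-16' needs a BOM to pick an endianness; without one,
# Encode::Unicode gives up and dies
my $text = eval { decode('UTF-16', $bytes, FB_CROAK) };
print defined $text ? "decoded ok\n" : "decode failed: $@";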

If you manually edit the sample and replace UTF-16
with UTF-16LE (and normalize is enabled), your rule should
hit - at least it does so in the current trunk code.

If this seems to be common in the wild, please open a
bug ticket, as Kevin suggested, and attach the sample there.

   Mark

Re: charset=utf-16 tricks out SA

Posted by Reindl Harald <h....@thelounge.net>.

On 09.10.2015 at 08:10, John Wilcock wrote:
> On 08/10/2015 17:34, Reindl Harald wrote:
>> Content-Type: text/plain; charset=utf-16
>> Content-Transfer-Encoding: base64
>>
>> no custom body rules hit like they do for ISO/UTF8 :-(
>
> What is your normalize_charset setting?

enabled, that's what I meant by "like they do for ISO/UTF8" and adding 
"dear potencial partner" to CUST_BODY_17 did not change the score

see attached sample and rule below

body      CUST_BODY_17    /.*(1st page ranking of google|dear potencial partner).*/i
score     CUST_BODY_17    1.0
describe  CUST_BODY_17    Contains Low

bayes_path /var/lib/spamass-milter/.spamassassin/bayes
bayes_file_mode 0600
use_learner 1
use_bayes 1
use_bayes_rules 1
bayes_use_hapaxes 1
bayes_expiry_max_db_size 50000000
bayes_auto_expire 0
bayes_auto_learn 0
bayes_learn_during_report 0
bayes_learn_to_journal 1
bayes_token_sources all
normalize_charset 1

Re: charset=utf-16 tricks out SA

Posted by John Wilcock <jo...@tradoc.fr>.
On 08/10/2015 17:34, Reindl Harald wrote:
> Content-Type: text/plain; charset=utf-16
> Content-Transfer-Encoding: base64
>
> no custom body rules hit like they do for ISO/UTF8 :-(

What is your normalize_charset setting?

-- 
John

Re: charset=utf-16 tricks out SA

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
Please open a bug, especially if this is seen in the wild!

On 10/8/2015 11:34 AM, Reindl Harald wrote:
> Content-Type: text/plain; charset=utf-16
> Content-Transfer-Encoding: base64
>
> no custom body rules hit like they do for ISO/UTF8 :-(