Posted to users@spamassassin.apache.org by Alex <my...@gmail.com> on 2014/06/10 19:53:35 UTC

Operations on headers in UTF-8

Hi all,
I'm not very familiar with how to manage language encoding, and hoped
someone could help. Some time ago I wrote a rule that looks for subjects
that consist of a single word that's more than N characters. It works, but
I'm learning that it's performed before the content of the subject is
converted into something human-readable. Instead, it operates on something
like:

Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=

How can I write a header rule that operates on the decoded utf content?

header          __SUB_NOSPACE   Subject =~ /^.\S+$/
header          __SUB_VERYLONG  Subject =~ /^.{20,200}\S+$/
meta            LOC_SUBNOSPACE  (__SUB_VERYLONG && __SUB_NOSPACE)
describe        LOC_SUBNOSPACE  Subject with no space and one long word
score           LOC_SUBNOSPACE  0.8

Thanks,
Alex

Re: Operations on headers in UTF-8

Posted by Daniel Staal <DS...@usa.net>.
--As of June 11, 2014 4:25:31 AM +0200, Karsten Bräckelmann is alleged to 
have said:

> On Tue, 2014-06-10 at 21:22 -0400, Daniel Staal wrote:
>> --As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged
>> to  have said:
>> >     Worse, enabling charset normalization completely breaks UTF-8 chars
>> >     in the regex. At least in my ad-hoc --cf command line testing.
>>
>> --As for the rest, it is mine.
>>
>> This sounds like something where `use feature 'unicode_strings'` might
>> have an effect
>
> Possibly.
>
>> enabling normalization is probably setting the internal utf8
>> flag on incoming text, which could change the semantics of the regex
>> matching.
>
> Nope. *digging into code*
>
> This option mainly affects rendered textual parts and headers, treating
> them with Encode::Detect. More complex than just setting an internal
> flag. What exactly made the ad-hoc regex rules fail is beyond the scope
> of tonight's code-diving.

Right.  And as a side-effect, Encode::Detect (as documented in Encode) is 
probably setting the utf8 flag on the Perl string.

Note I mean internal to *perl* itself, not to one of the modules or the SA 
code.  The utf8 flag affects what semantics perl uses when it compares 
strings, including in regexes.
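
As a quick stand-alone illustration (my own sketch, nothing from the SA code 
base), the same data behaves differently in a regex depending on whether the 
string carries that flag:

  use strict;
  use warnings;
  use Encode qw(decode);

  my $bytes = "\xe7\x8e\xaf";           # UTF-8 bytes of the single char 环
  my $chars = decode('UTF-8', $bytes);  # decoded string, utf8 flag set

  printf "as bytes: length %d, /^.\$/ %s\n", length($bytes),
      $bytes =~ /^.$/ ? "matches" : "does not match";
  printf "as chars: length %d, /^.\$/ %s\n", length($chars),
      $chars =~ /^.$/ ? "matches" : "does not match";

The byte string is 3 long and /^.$/ fails; the decoded string is 1 long and 
matches.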

>> If that's the case, it raises the question of whether we want SpamAssassin
>> to require Perl 5.12 (which includes that feature) - the current base
>> version  is 5.8.1.  Unicode support has been evolving in Perl; 5.8
>> supports it  generally, but there were bugs.  I think 5.12 got most of
>> them, but I'm not  sure.  (And of course it's not the current version of
>> Perl.)
>
> The normalize_charset option requires Perl 5.8.5.
>
> All the ad-hoc rule testing in this thread has been done with SA 3.3.2
> on Perl 5.14.2 (debian 7.5). So this is not an issue of requiring a more
> recent Perl version.

`use feature 'unicode_strings'`, as a feature, only tangentially cares 
about what version of Perl you are running.  Yes, you need a new enough 
version to use it, but since features are not enabled by default, any effect 
they might have doesn't occur unless they are requested.
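
For illustration, a minimal stand-alone Perl sketch (mine, not SA code) of 
the feature being purely lexical:

  use strict;
  use warnings;

  my $e_acute = "\xE9";    # e-acute as a single byte, utf8 flag not set

  # Default (byte) semantics: uc() leaves the upper 128 byte values alone.
  printf "outside: %s\n", uc($e_acute) eq "\xC9" ? "folded" : "unchanged";

  {
      use feature 'unicode_strings';    # in effect only inside this block
      # Unicode semantics: uc("\xE9") now yields "\xC9".
      printf "inside:  %s\n", uc($e_acute) eq "\xC9" ? "folded" : "unchanged";
  }

Outside the block nothing changes; inside it, the very same string suddenly 
case-folds.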

> While of course something to potentially improve on in itself, the topic
> of charset normalization is just a by-product of explaining the original
> issue: header rules and string encoding, with a grain of charset-encoding
> salt.

True.  I was just thinking aloud, as it were, and wondering if an 
explanation could be found for why UTF-8 chars break in the regex.

Daniel T. Staal


Re: Operations on headers in UTF-8

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2014-06-10 at 21:22 -0400, Daniel Staal wrote:
> --As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged to 
> have said:
> >     Worse, enabling charset normalization completely breaks UTF-8 chars
> >     in the regex. At least in my ad-hoc --cf command line testing.
> 
> --As for the rest, it is mine.
> 
> This sounds like something where `use feature 'unicode_strings'` might have 
> an effect

Possibly.

> enabling normalization is probably setting the internal utf8 
> flag on incoming text, which could change the semantics of the regex 
> matching.

Nope. *digging into code*

This option mainly affects rendered textual parts and headers, treating
them with Encode::Detect. More complex than just setting an internal
flag. What exactly made the ad-hoc regex rules fail is beyond the scope
of tonight's code-diving.


> If that's the case, it raises the question of whether we want SpamAssassin 
> to require Perl 5.12 (which includes that feature) - the current base version 
> is 5.8.1.  Unicode support has been evolving in Perl; 5.8 supports it 
> generally, but there were bugs.  I think 5.12 got most of them, but I'm not 
> sure.  (And of course it's not the current version of Perl.)

The normalize_charset option requires Perl 5.8.5.

All the ad-hoc rule testing in this thread has been done with SA 3.3.2
on Perl 5.14.2 (debian 7.5). So this is not an issue of requiring a more
recent Perl version.


While of course something to potentially improve on in itself, the topic
of charset normalization is just a by-product of explaining the original
issue: header rules and string encoding, with a grain of charset-encoding
salt.




Re: Operations on headers in UTF-8

Posted by Daniel Staal <DS...@usa.net>.
--As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged to 
have said:

>     Worse, enabling charset normalization completely breaks UTF-8 chars
>     in the regex. At least in my ad-hoc --cf command line testing.

--As for the rest, it is mine.

This sounds like something where `use feature 'unicode_strings'` might have 
an effect - enabling normalization is probably setting the internal utf8 
flag on incoming text, which could change the semantics of the regex 
matching.

If that's the case, it raises the question of whether we want SpamAssassin 
to require Perl 5.12 (which includes that feature) - the current base version 
is 5.8.1.  Unicode support has been evolving in Perl; 5.8 supports it 
generally, but there were bugs.  I think 5.12 got most of them, but I'm not 
sure.  (And of course it's not the current version of Perl.)

Daniel T. Staal


Re: Operations on headers in UTF-8

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2014-06-10 at 17:39 -0400, Alex wrote:
> On Tue, Jun 10, 2014 at 3:25 PM, Karsten Bräckelmann wrote:

> It's here where I'm starting to lose you:

Reading through your reply, I see we need to get to the basics first.
You are massively confusing different types of encoding and not fully
realizing the difference between a character and a byte.


Character Encoding

ASCII and its 8-bit extensions are fixed width character encodings. A
simple lookup table, where each char corresponds directly to one numeric
value. ASCII proper defines 128 chars in 7 bits; the 8-bit extensions
use the full byte, so a char is exactly 8 bit, 1 byte.

Since it is a fixed width encoding, there's an upper limit on the number
of different chars it can represent: 256. This becomes a problem when
you want to support more chars: regional latin based chars, like the
German Umlauts, chars specific to French, Spanish, Norwegian, etc. And
Greek, Cyrillic, the Hebrew alphabet, Chinese and Japanese characters...

That's where UTF-8 enters the picture. It's a variable length charset
encoding. For backward compatibility, the first 128 (7 bit) are
identical to ASCII, covering the common latin chars. The 8th bit is used
to extend the number of characters available, by extending the
bit-length: The following byte becomes part of the same character.

In simple terms, there are 128 different byte values with the 8th bit
set. Each of these includes the following byte to form a 16 bit encoded
char. Since that second byte can hold 256 different bit-strings, we end
up with 128*256 characters 16 bit wide, plus the 128 characters 8 bit
wide. (The actual encoding is slightly different, and UTF-8 characters
can range from 1 up to 4 bytes in length.)

Character Encoding is usually transparent, not visible to the user. The
Chinese(?) chars in this thread are an example. The Umlaut in my name is
another. You see a single character, which internally is represented by
2 or more bytes.
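
A tiny Perl sketch (illustration only) of that character/byte split, using an 
Umlaut and one of the chars from this thread's Subject:

  use strict;
  use warnings;
  use Encode qw(encode);

  # One character each, but 2 and 3 bytes respectively in UTF-8.
  for my $char ("\x{E4}", "\x{73AF}") {       # a-umlaut and 环
      my $utf8 = encode('UTF-8', $char);
      printf "U+%04X: %d char, %d bytes\n",
          ord($char), length($char), length($utf8);
  }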


String Encoding

Base64 and quoted-printable are string encodings. They are unaware of a
possible (variable length) charset encoding. They don't even care about
the original string at all. They work strictly on a byte basis, not
differentiating between "human readable text" in whatever charset and a
binary blob.

String encodings are commonly used to ensure there are no bytes with the
8th bit set.


In raw mail headers, encoded words look like "=?utf-8?B?...?=".

The "B" indicates Base64 string encoding of the "..." encoded text.
This string encoding is what SA decodes by default for header
rules.
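
To make the two layers concrete, a rough stand-alone Perl sketch (not how SA 
does it internally) that peels them apart for the Subject in this thread:

  use strict;
  use warnings;
  use MIME::Base64 qw(decode_base64);
  use Encode qw(decode);

  my $word  = "44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4";
  my $bytes = decode_base64($word);     # undo the Base64 *string* encoding
  my $text  = decode('UTF-8', $bytes);  # undo the UTF-8 *charset* encoding

  printf "%d bytes, %d characters\n", length($bytes), length($text);

That prints "39 bytes, 13 characters", matching the counts below.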


> > I assume your actual problem is with the SUB_VERYLONG rule hitting.
> > Since the above test rule shows the complete decoded Subject, we can
> > tell it's 13 chars long, clearly below the "verylong" threshold of 20
> > chars.
> 
> Visually counting the individual characters, including the colon, is
> indeed 13. However, there are spaces, which should have negated the
> rule (\S), no? Also, wc shows me the string is 41 chars long.

Correct, 13 characters. There are no spaces, though.

The string is 39 *bytes* long. (You included some line endings.) 'wc'
historically defaults to bytes, but supports chars, too.

$ echo -n "《环球旅讯》原创:在线旅游" | wc --chars --bytes
     13      39


> > That is not caused by the encoding, though, but because the regex
> > operates on bytes rather than characters.
> 
> Is not each character exactly one byte? Or are you referring to the
> fact that it takes untold multiple bytes to produce one encoded
> character?

With UTF-8 charset encoding, a character may be up to 4 bytes long.
Doing the math, it is quite obvious the Chinese(?) chars each take 3
bytes.

(The "encoding" I referred to in that quote is the Base64 string encoding
blowing up the length of the encoded string.)


> > To make the regex matching aware of UTF-8 encoding, and match chars
> > instead of (raw) bytes, we will need the normalize_charset option.
> >
> >   header TEST Subject =~ /^.{10}/
> >   normalize_charset 1
> 
> Why wouldn't it make sense for this to be the default? What is the
> utility in trying to match on an encoded string?
> 
> I think I'm also confused by your reference above that header rules
> are matched against the decoded string. What then would be the purpose
> of normalize_charset here? Does normalize here mean to decode it?

It probably makes sense to enable normalize_charset by default. It is
disabled for historical reasons, and due to possible side-effects or
increased resource usage (particularly a concern with earlier Perl
versions). This needs to be investigated.


It's worth pointing out that charset normalization is *not* a magical
UTF-8 support switch.

(a) Charset normalization does affect regex wildcard matching, changing
the meaning of /./ from byte to UTF-8 character. This is mainly relevant
with repetition ranges like {20,200}, as can be seen here.

Also, it makes a much more significant difference with e.g. Chinese
than with European languages. The occasional 2 bytes a German Umlaut
takes make almost no difference overall, even in German text. (Think an
arbitrary boundary of 200 chars in pure US-ASCII, or one of 195 chars in
UTF-8 encoded German text, both being 200 bytes.)

(b) Normalization is not needed for using UTF-8 encoded strings in
regex-based rules. You can almost freely write REs including multi-byte
UTF-8 chars [1]. While internally represented by more than one byte,
your editor or shell will show you a single character.

Try it with a test header rule directly matching one of those Chinese(?)
characters.
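
For instance, a throw-away rule like this (hypothetical name, the chars are 
copied from the Subject in this thread) hits even without normalize_charset, 
as long as the rule file itself is saved as UTF-8:

  header  __TEST_CJK  Subject =~ /环球/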


> > Along with yet another modification of the test rule, now matching the
> > first 10 chars only.
> >
> >   got hit: "《环球旅讯》原创:在"
> >
> > The effect is clear. That 10 (chars) long match with normalize_charset
> > enabled is even longer than the above 20 (byte) match.
> 
> Okay, I think I understand. So I don't want to avoid scanning encoded
> headers, [...]

Nit-picking, but in the spirit of the whole thread: decoded. ;)

Header rules are matched against decoded (readable) strings of encoded
(gibberish) headers in the raw mail.


[1] Validating that claim of "freely" before sending, and trying to
    break it: there are caveats.
    First of all, a multi-byte char cannot simply be made optional by a
    trailing question mark. The multi-byte char needs to be enclosed in
    a group, so that the optional quantifier affects all of its bytes.
    Worse, enabling charset normalization completely breaks UTF-8 chars
    in the regex. At least in my ad-hoc --cf command line testing.
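
    A hypothetical pair of rules illustrating the grouping caveat: in
    the first one the trailing ? applies only to the last *byte* of 》,
    so the character as a whole never becomes optional; in the second
    the group makes the whole character optional.

      header  __TEST_OPT_BYTE  Subject =~ /讯》?原/
      header  __TEST_OPT_CHAR  Subject =~ /讯(?:》)?原/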



Re: Operations on headers in UTF-8

Posted by Alex <my...@gmail.com>.
Hi,

On Tue, Jun 10, 2014 at 3:25 PM, Karsten Bräckelmann <gu...@rudersport.de>
wrote:
>
> On Tue, 2014-06-10 at 13:53 -0400, Alex wrote:
> > I'm not very familiar with how to manage language encoding, and hoped
> > someone could help. Some time ago I wrote a rule that looks for
> > subjects that consist of a single word that's more than N characters.
> > It works, but I'm learning that it's performed before the content of
> > the subject is converted into something human-readable.
>
> This is not true. Header rules are matched against the decoded string by
> default. To prevent decoding of quoted-printable or base-64 encoded
> headers, the :raw modifier needs to be appended to the header name.

I've also realized I made the improper assumption that, even if it was
operating on the encoded string, the decoded string would still have
been more than 20 chars.

> > Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=
>
> That's a base-64 encoded UTF-8 string, decoded for header rules. To see
> for yourself, just echo your test header into
>
>   spamassassin -D -L --cf="header TEST Subject =~ /.+/"
>
> and the debug output will show you what it matched.
>
>   dbg: rules: ran header rule TEST ======> got hit: "《环球旅讯》原创:在线旅游"

Great info, thanks.

It's here where I'm starting to lose you:

> > How can I write a header rule that operates on the decoded utf
> > content?
> >
> > header          __SUB_NOSPACE   Subject =~ /^.\S+$/
> > header          __SUB_VERYLONG  Subject =~ /^.{20,200}\S+$/
> > meta            LOC_SUBNOSPACE  (__SUB_VERYLONG && __SUB_NOSPACE)
>
> Again, header rules by default operate on the decoded string.
>
> I assume your actual problem is with the SUB_VERYLONG rule hitting.
> Since the above test rule shows the complete decoded Subject, we can
> tell it's 13 chars long, clearly below the "verylong" threshold of 20
> chars.

Visually counting the individual characters, including the colon, is indeed
13. However, there are spaces, which should have negated the rule (\S), no?
Also, wc shows me the string is 41 chars long.

> That is not caused by the encoding, though, but because the regex
> operates on bytes rather than characters.

Is not each character exactly one byte? Or are you referring to the fact
that it takes untold multiple bytes to produce one encoded character?

> Let's see what a 20-byte chunk of that UTF-8 string looks like. A
> modified rule will match the first 20 bytes only:
>
>   header TEST Subject =~ /^.{20}/
>
> The result shows the string is longer than 20 bytes, and the match even
> ends right within a single UTF-8 encoded char.
>
>   got hit: "《环球旅讯》<E5><8E>"

Yes, on my xterm it just renders that unintelligible partial character as a
question mark.

> To make the regex matching aware of UTF-8 encoding, and match chars
> instead of (raw) bytes, we will need the normalize_charset option.
>
>   header TEST Subject =~ /^.{10}/
>   normalize_charset 1

Why wouldn't it make sense for this to be the default? What is the utility
in trying to match on an encoded string?

I think I'm also confused by your reference above that header rules are
matched against the decoded string. What then would be the purpose of
normalize_charset here? Does normalize here mean to decode it?

> Along with yet another modification of the test rule, now matching the
> first 10 chars only.
>
>   got hit: "《环球旅讯》原创:在"
>
> The effect is clear. That 10 (chars) long match with normalize_charset
> enabled is even longer than the above 20 (byte) match.

Okay, I think I understand. So I don't want to avoid scanning encoded
headers, but it's also very unlikely to find a 20-byte string of Japanese
characters in any spam message, so I don't really know what I should do
with this.

I've also investigated a bit further, and it appears to hit quite a bit of
ham (really_long_spreadsheet.xls, for example), so maybe I need to meta it
with something, or just abandon it.

Thanks,
Alex

Re: Operations on headers in UTF-8

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2014-06-10 at 13:53 -0400, Alex wrote:
> I'm not very familiar with how to manage language encoding, and hoped
> someone could help. Some time ago I wrote a rule that looks for
> subjects that consist of a single word that's more than N characters.
> It works, but I'm learning that it's performed before the content of
> the subject is converted into something human-readable.

This is not true. Header rules are matched against the decoded string by
default. To prevent decoding of quoted-printable or base-64 encoded
headers, the :raw modifier needs to be appended to the header name.
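
For illustration (a made-up rule name, not from any stock rule set), a rule 
using the :raw modifier would see the still-encoded text instead:

  header  __SUBJ_RAW_B64  Subject:raw =~ /=\?utf-8\?B\?/i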


> Instead, it operates on something like:
> 
> Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=

That's a base-64 encoded UTF-8 string, decoded for header rules. To see
for yourself, just echo your test header into

  spamassassin -D -L --cf="header TEST Subject =~ /.+/"

and the debug output will show you what it matched.

  dbg: rules: ran header rule TEST ======> got hit: "《环球旅讯》原创:在线旅游"


> How can I write a header rule that operates on the decoded utf
> content?
> 
> header          __SUB_NOSPACE   Subject =~ /^.\S+$/
> header          __SUB_VERYLONG  Subject =~ /^.{20,200}\S+$/
> meta            LOC_SUBNOSPACE  (__SUB_VERYLONG && __SUB_NOSPACE)

Again, header rules by default operate on the decoded string.

I assume your actual problem is with the SUB_VERYLONG rule hitting.
Since the above test rule shows the complete decoded Subject, we can
tell it's 13 chars long, clearly below the "verylong" threshold of 20
chars.

That is not caused by the encoding, though, but because the regex
operates on bytes rather than characters.


Let's see what a 20-byte chunk of that UTF-8 string looks like. A
modified rule will match the first 20 bytes only:

  header TEST Subject =~ /^.{20}/

The result shows the string is longer than 20 bytes, and the match even
ends right within a single UTF-8 encoded char.

  got hit: "《环球旅讯》<E5><8E>"


To make the regex matching aware of UTF-8 encoding, and match chars
instead of (raw) bytes, we will need the normalize_charset option.

  header TEST Subject =~ /^.{10}/
  normalize_charset 1

Along with yet another modification of the test rule, now matching the
first 10 chars only.

  got hit: "《环球旅讯》原创:在"

The effect is clear. That 10 (chars) long match with normalize_charset
enabled is even longer than the above 20 (byte) match.
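
So, as a sketch under that option (your rule names, thresholds as in your 
post, untested), the original rules would then count characters rather than 
bytes:

  normalize_charset 1

  header          __SUB_NOSPACE   Subject =~ /^.\S+$/
  header          __SUB_VERYLONG  Subject =~ /^.{20,200}\S+$/
  meta            LOC_SUBNOSPACE  (__SUB_VERYLONG && __SUB_NOSPACE)
  describe        LOC_SUBNOSPACE  Subject with no space and one long word
  score           LOC_SUBNOSPACE  0.8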

