Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2015/10/05 13:21:39 UTC

[Bug 7249] New: Decode MIME-encoded filenames in attachments

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

            Bug ID: 7249
           Summary: Decode MIME-encoded filenames in attachments
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: azotov@geolink-group.com

Attachments with non-ASCII filenames carry the filename MIME-encoded in the
attachment headers.
Example:

Content-Type: application/octet-stream; name=
    "=?utf-8?B?0LTQvtC60YPQvNC10L3RgtGLINC00LvRjyDQvtGC0LTQ?=
    =?utf-8?B?tdC70LAg0LrQsNC00YDQvtCyLnBkZg==?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=
    "=?utf-8?B?0LTQvtC60YPQvNC10L3RgtGLINC00LvRjyDQvtGC0LTQ?=
    =?utf-8?B?tdC70LAg0LrQsNC00YDQvtCyLnBkZg==?="

The "real" filename in this example is 'документы для отдела кадров.pdf'.

But SpamAssassin does not decode such filenames. As a result, Bayes misses the
actual spam words used in the filename, and the PDFInfo plugin ignores such
attachments completely because it does not find the .pdf extension in the
MIME-encoded version of the filename.
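
The effect on an extension check can be sketched in a few lines. This is a
Python illustration only, not SpamAssassin's Perl code. The helper name
decode_b64_words is hypothetical, and the sketch assumes every encoded-word
is UTF-8 with base64 ("B") encoding, as in the sample above.

```python
import base64
import re

# Raw parameter value from the sample above (two adjacent encoded-words).
RAW_NAME = (
    "=?utf-8?B?0LTQvtC60YPQvNC10L3RgtGLINC00LvRjyDQvtGC0LTQ?= "
    "=?utf-8?B?tdC70LAg0LrQsNC00YDQvtCyLnBkZg==?="
)

# Assumption: only UTF-8 / base64 encoded-words occur in the value.
ENCODED_WORD = re.compile(r"=\?utf-8\?B\?([A-Za-z0-9+/=]+)\?=", re.IGNORECASE)


def decode_b64_words(value: str) -> str:
    """Hypothetical helper: base64-decode each encoded-word, join the
    octets, and only then decode them as UTF-8."""
    octets = b"".join(base64.b64decode(m) for m in ENCODED_WORD.findall(value))
    return octets.decode("utf-8")


decoded_name = decode_b64_words(RAW_NAME)

# The extension check fails on the raw value and succeeds on the decoded one.
raw_looks_like_pdf = RAW_NAME.lower().endswith(".pdf")
decoded_looks_like_pdf = decoded_name.lower().endswith(".pdf")
```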

Suggested patch for decoding mime-encoded filenames in attachments:

--- lib/Mail/SpamAssassin/Message.pm.orig       2015-04-28 22:56:49.000000000 +0300
+++ lib/Mail/SpamAssassin/Message.pm    2015-10-05 13:56:42.000000000 +0300
@@ -1046,6 +1046,7 @@
   elsif ($ct[3]) {
     $msg->{'name'} = $ct[3];
   }
+  if ($msg->{'name'}) { $msg->{'name'} = Encode::decode("MIME-Header", $msg->{'name'}); }

   $msg->{'boundary'} = $boundary;

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #9 from Mark Martinec <Ma...@ijs.si> ---
Btw (for reference), Bug 7307 is related.


(In reply to John Wilcock from comment #8)
> (In reply to azotov from comment #4)
> > This example was taken from real spam message which was created by some
> > non-rfc-compliant software.
> 
> Is there a mechanism in place for SA to not only work around such RFC
> violations but also flag the fact that it has done so, because the violation
> might be a spam sign worth scoring?
> 
> If not, is it worth raising a new bug to record the idea?

I'm using the following two rules to check for such violations:

# RFC 2047 section 5:
#   Each 'encoded-word' MUST represent an integral number of characters.
#   A multi-octet character may not be split across adjacent 'encoded-word's.

header L_SPLIT_UTF8_SUBJ  Subject:raw =~ m{(=\?UTF-8) (?: \* [^?=<>, \t]* )? (\?Q\?) [^ ?]* =[89A-F][0-9A-F] \?= \s* \1 (?: \* [^ ?=]* )? \2 =[89AB][0-9A-F]}xsmi
describe L_SPLIT_UTF8_SUBJ  UTF-8 char split across QP encoded-words in Subject
score  L_SPLIT_UTF8_SUBJ  1.5

header L_SPLIT_UTF8_FROM  From:raw =~ m{(=\?UTF-8) (?: \* [^?=<>, \t]* )? (\?Q\?) [^ ?]* =[89A-F][0-9A-F] \?= \s* \1 (?: \* [^ ?=]* )? \2 =[89AB][0-9A-F]}xsmi
describe L_SPLIT_UTF8_FROM  UTF-8 char split across QP encoded-words in From
score  L_SPLIT_UTF8_FROM  1.5
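
For anyone who wants to try the pattern outside SpamAssassin, here is a rough
Python transliteration. It is only a sketch: the two sample Subject values are
made up for illustration, and Python's re module stands in for Perl's regex
engine.

```python
import re

# Same pattern as in the rules above. Perl's /x /s /m /i flags map to
# re.VERBOSE, re.DOTALL, re.MULTILINE and re.IGNORECASE.
SPLIT_UTF8_QP = re.compile(
    r"""(=\?UTF-8) (?: \* [^?=<>, \t]* )?
        (\?Q\?) [^ ?]* =[89A-F][0-9A-F] \?= \s* \1 (?: \* [^ ?=]* )? \2
        =[89AB][0-9A-F]""",
    re.VERBOSE | re.DOTALL | re.MULTILINE | re.IGNORECASE,
)

# Made-up samples: U+0434 is the octet pair D0 B4 in UTF-8.
# Here the pair is split across two encoded-words:
split_subject = "=?UTF-8?Q?abc=D0?= =?UTF-8?Q?=B4ef?="
# Here every character stays inside one encoded-word:
intact_subject = "=?UTF-8?Q?abc=D0=B4?= =?UTF-8?Q?=D0=B5f?="

split_hit = SPLIT_UTF8_QP.search(split_subject) is not None
intact_hit = SPLIT_UTF8_QP.search(intact_subject) is not None
```

The pattern fires when the second encoded-word begins with a continuation
octet (=80 to =BF), so split_hit is true and intact_hit is false.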



The L_SPLIT_UTF8_FROM hit only 4 times in the last three weeks
(of 5 million messages processed at my site during that time),
all in spam which already scored pretty high by other rules.

The L_SPLIT_UTF8_SUBJ hit 62 times, almost all of them spam.

In our case the score of 1.5 seems to work fine. The hit rate might
be higher in countries using multibyte character sets, depending
on how poorly mail clients there (and bulk mail generating software)
implement RFC 2047.


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #6 from Mark Martinec <Ma...@ijs.si> ---
> The "real" filename in this example is 'документы для отдела кадров.pdf'.
> But spamassassin does not decode such filenames. This bug/feature_missing
> leads to Bayes missing actual spam words used in filename and PDFInfo plugin
> completely ignoring such attachments because it does not find .pdf extension
> in MIME-encoded version of filename.

These words are also missing from Bayes tokens, although the code
path there is different: header decoding goes through decoding in
Message::Node::_decode_header, which intentionally avoids decoding
MIME-words in Content-* header fields. The reason is probably in
RFC 2047, which explicitly excluded the use of MIME-words there,
although a later RFC 2184 introduced such encodings.

Will see what can be done with __decode_header() and _normalize()
to get such names decoded.

Interestingly, at some point in the distant past it seems to have been decided
that Encode::decode("MIME-Header",...) may not be the best choice, and our own
decoding was implemented instead (Mail::SpamAssassin::Util::qp_decode,
Mail::SpamAssassin::Util::base64_decode, __decode_header). Not sure
what the rationale was, possibly some bug in Encode::MIME::Header
back then. It now seems suboptimal to use two different decoding
implementations for the same header field in two places.

>> Btw, the MIME encoding in the provided sample is incorrect, it breaks
>> the RFC 2047 section 5 requirement:
>
> This example was taken from real spam message which was created by some
> non-rfc-compliant software.

I made some modifications to my copy of Message::Node.pm to better
deal with it: just mangle the split character instead of giving up
on UTF-8 decoding entirely and falling back to Windows 1250, which
yields true gibberish. Needs some more testing.

> Do you recommend reverting the change?

It can definitely stay in trunk/4.0 but needs more work to deal with
such cases elsewhere in the code. I'm slowly working through the characters
vs. octets choices, and this is one more welcome piece of the puzzle.

On a quick look it seems the $msg->{'name'} is hardly used anywhere
except in the PDFInfo plugin, so a change there for the 3.4 branch
will likely only have a local effect in this plugin, so it is probably
alright. It might be safer to encode the obtained characters into
UTF-8 octets for the 3.4 branch, so that octets stay octets.


> Perhaps a normalize_charset config true check encapsulating the
> change can help then?

The normalize_charset is not involved in this code path, so it should
not matter whether it is on or off.


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com

--- Comment #3 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to Mark Martinec from comment #2)
> > This is a trivial patch and won't need CLA. Committed to 4.0 and 3.4.2
> > branches.
> > Committed revision 1706850.
> > Committed revision 1706851.
> 
> Thanks. I agree this needs decoding, although some caution is warranted,
> as the result of Encode::decode("MIME-Header",...) are perl characters
> (utf8 flag on), which when potentially concatenated elsewhere with
> some UTF-8 encoded string (octets, non-decoded) yields doubly encoding
> of UTF-8 octets, i.e. a real mojibake mess.
> 
> Btw, the MIME encoding in the provided sample is incorrect, it breaks
> the RFC 2047 section 5 requirement:
> 
>   Each 'encoded-word' MUST represent an integral number of characters.
>   A multi-octet character may not be split across adjacent 'encoded-word's.
> 
> When decoded according to rules it yields:
> 
> "0LTQvtC60YPQvNC10L3RgtGLINC00LvRjyDQvtGC0LTQ" -> документы для отд�
> "tdC70LAg0LrQsNC00YDQvtCyLnBkZg=="            -> �ла кадров.pdf

Do you recommend reverting the change?


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #5 from Kevin A. McGrail <km...@pccc.com> ---
Perhaps wrapping the change in a check that normalize_charset is enabled in
the config would help, then?


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |4.0.0

--- Comment #7 from Mark Martinec <Ma...@ijs.si> ---
> These words are also missing from Bayes tokens, although the code
> path there is different: header decoding goes through decoding in
> Message::Node::_decode_header, which intentionally avoids decoding
> MIME-words in Content-* header fields. The reason is probably in
> RFC 2047, which explicitly excluded the use of MIME-words there,
> although a later RFC 2184 introduced such encodings.
> 
> Will see what can be done with __decode_header() and _normalize()
> to get such names decoded.

Enhanced _decode_header() and _normalize(), committed below.

> Interestingly some time in the far past it seems to have been decided
> that Encode::decode("MIME-Header",...) may not be the best choice,
> but have implemented own decoding (Mail::SpamAssassin::Util::qp_decode,
> Mail::SpamAssassin::Util::base64_decode, __decode_header). Not sure
> what was the rationale, possibly some bug in the Encode::MIME::Header
> back then. Seems suboptimal now to use two different decoding
> implementations for decoding of the same header field in two places.

Checked the changelog of Encode::MIME::Header. Seems that most of the
problematic cases were fixed with a version of Encode that came with
perl 5.10.1, although there are still some unresolved issues, like
incorrectly discarding whitespace on header folding. Our code does it
better, especially with the new code just committed.

> I made some modifications to my copy of Message::Node.pm to better
> deal with it: just mangle the split character instead of giving up
> on UTF-8 decoding entirely and falling back to Windows 1250, which
> yields true gibberish. Needs some more testing.

trunk:
  _decode_header:
    deal with invalid splicing of multibyte characters in encoded-words,
    allow language info in encoded words (RFC 2231),
    decode Content-* for benefit of Bayes;
  renamed __decode_header to _decode_mime_encoded_word
Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1707593.
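
To make the three points above concrete, here is a small Python sketch. It is
not the committed Perl code and it takes a simpler route than mangling the
split character: it joins the octets of adjacent encoded-words that share a
charset before decoding them. The names word_octets and decode_parameter are
hypothetical, and the sketch assumes the value consists only of
whitespace-separated encoded-words (any plain text between them is ignored).

```python
import base64
import binascii
import re

# encoded-word = "=?" charset ["*" language] "?" encoding "?" text "?="
# The optional "*language" part is the RFC 2231 extension.
ENCODED_WORD = re.compile(
    r"=\?([^?*\s]+)(?:\*[^?\s]*)?\?([BbQq])\?([^?\s]*)\?="
)


def word_octets(encoding: str, text: str) -> bytes:
    """Undo the B (base64) or Q (quoted-printable-like) transfer encoding."""
    if encoding.upper() == "B":
        return base64.b64decode(text)
    # header=True makes "_" decode to a space, as Q encoding requires.
    return binascii.a2b_qp(text, header=True)


def decode_parameter(value: str) -> str:
    """Decode a value made up of whitespace-separated encoded-words."""
    runs = []  # [charset, octets] pairs; adjacent same-charset words merge
    for charset, encoding, text in ENCODED_WORD.findall(value):
        octets = word_octets(encoding, text)
        if runs and runs[-1][0] == charset.lower():
            runs[-1][1] += octets  # join octets before charset decoding
        else:
            runs.append([charset.lower(), octets])
    # errors="replace" mangles whatever is still invalid instead of failing.
    return "".join(o.decode(cs, errors="replace") for cs, o in runs)


# Made-up sample: language tag "ru", Q encoding, U+0434 split as D0 | B4.
sample = "=?UTF-8*ru?Q?=D0?= =?UTF-8*ru?Q?=B4_ok?="
decoded = decode_parameter(sample)
```

Joining before decoding recovers a split character; anything that is still
invalid afterwards is replaced rather than aborting the whole decode.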


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #10 from azotov@geolink-group.com ---
> In our case the score of 1.5 seems to work fine. The hit rate might
> be higher in countries using multibyte character sets, depending
> on how poorly mail clients there (and bulk mail generating software)
> implement RFC 2047.

I have tested L_SPLIT_UTF8_SUBJ and L_SPLIT_UTF8_FROM on my corpus, which
consists mostly of messages with Cyrillic characters (34460 spam messages and
20354 ham). About 90% of the messages have a MIME-encoded Subject and about 50%
have a MIME-encoded From, but most of them use windows-1251 or koi8-r encoding;
fewer than 5% of the messages use UTF-8.

L_SPLIT_UTF8_SUBJ got 3 hits in ham messages and 50 hits in spam, a nice
result.
L_SPLIT_UTF8_FROM got 46 hits in ham and 19 hits in spam, which is too many
false positives. The false positives came from badly written PHP mail robots
and bulk mail software sending ham messages.

So L_SPLIT_UTF8_SUBJ seems to be a good addition for Cyrillic spam, at least
for my mail flow.
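
For reference, the share of each rule's hits that landed on spam can be worked
out from the counts above. A minimal Python sketch (the dictionary layout is
just for illustration):

```python
# Hit counts reported above for the mostly Cyrillic corpus.
hits = {
    "L_SPLIT_UTF8_SUBJ": {"spam": 50, "ham": 3},
    "L_SPLIT_UTF8_FROM": {"spam": 19, "ham": 46},
}

# Share of each rule's hits that landed on spam.
spam_share = {
    rule: counts["spam"] / (counts["spam"] + counts["ham"])
    for rule, counts in hits.items()
}

for rule, share in spam_share.items():
    print(f"{rule}: {share:.1%} of hits were spam")
```

That works out to roughly 94% for L_SPLIT_UTF8_SUBJ (50 of 53 hits) and roughly
29% for L_SPLIT_UTF8_FROM (19 of 65 hits).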


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #2 from Mark Martinec <Ma...@ijs.si> ---
> This is a trivial patch and won't need CLA. Committed to 4.0 and 3.4.2
> branches.
> Committed revision 1706850.
> Committed revision 1706851.

Thanks. I agree this needs decoding, although some caution is warranted:
the result of Encode::decode("MIME-Header",...) is a string of Perl characters
(utf8 flag on). If it is later concatenated with some UTF-8 encoded string
(octets, not decoded), the result is double encoding of the UTF-8 octets,
i.e. a real mojibake mess.
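
The hazard can be reproduced outside Perl. The Python sketch below only
imitates what Perl does when a character string is concatenated with a byte
string (the bytes are upgraded as if they were Latin-1 characters). The
variable names are illustrative.

```python
# "д" (U+0434) once as decoded characters and once as raw UTF-8 octets.
chars = "д"
octets = "д".encode("utf-8")          # the two octets D0 B4

# Imitate the implicit upgrade: each octet becomes one Latin-1 character.
upgraded = octets.decode("latin-1")   # two characters, U+00D0 and U+00B4
mixed = chars + upgraded

# Encoding the mixed string for output double-encodes the former octets.
wire = mixed.encode("utf-8")
```

The first character survives as D0 B4, while the former octets come out as
C3 90 C2 B4, which displays as "Ð´" instead of "д".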

Btw, the MIME encoding in the provided sample is incorrect, it breaks
the RFC 2047 section 5 requirement:

  Each 'encoded-word' MUST represent an integral number of characters.
  A multi-octet character may not be split across adjacent 'encoded-word's.

When decoded according to rules it yields:

"0LTQvtC60YPQvNC10L3RgtGLINC00LvRjyDQvtGC0LTQ" -> документы для отд�
"tdC70LAg0LrQsNC00YDQvtCyLnBkZg=="            -> �ла кадров.pdf

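The same result can be checked with a few lines of Python (an illustration
only, using the two base64 payloads from the sample; variable names are made
up):

```python
import base64

first = base64.b64decode("0LTQvtC60YPQvNC10L3RgtGLINC00LvRjyDQvtGC0LTQ")
second = base64.b64decode("tdC70LAg0LrQsNC00YDQvtCyLnBkZg==")

# The first word ends with a lone lead octet; the second starts with the
# continuation octet that belongs to it.
tail, head = first[-1], second[0]

# Decoded separately, each half of the split character becomes U+FFFD.
per_word = [w.decode("utf-8", errors="replace") for w in (first, second)]

# Decoded after joining the octets, the character is recovered.
joined = (first + second).decode("utf-8")
```

Here tail is 0xD0 and head is 0xB5, the two halves of the letter "е".
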

[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

John Wilcock <jo...@tradoc.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |john@tradoc.fr

--- Comment #8 from John Wilcock <jo...@tradoc.fr> ---
(In reply to azotov from comment #4)
> This example was taken from real spam message which was created by some
> non-rfc-compliant software.

Is there a mechanism in place for SA to not only work around such RFC
violations but also flag the fact that it has done so, because the violation
might be a spam sign worth scoring?

If not, is it worth raising a new bug to record the idea?


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

Joe Quinn <jq...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
                 CC|                            |jquinn+SAbug@pccc.com

--- Comment #1 from Joe Quinn <jq...@pccc.com> ---
This is a trivial patch and won't need CLA. Committed to 4.0 and 3.4.2
branches.

Committed revision 1706850.
Committed revision 1706851.


[Bug 7249] Decode MIME-encoded filenames in attachments

Posted by bu...@bugzilla.spamassassin.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

azotov@geolink-group.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |azotov@geolink-group.com

--- Comment #4 from azotov@geolink-group.com ---
(In reply to Mark Martinec from comment #2)
> Thanks. I agree this needs decoding, although some caution is warranted,
> as the result of Encode::decode("MIME-Header",...) are perl characters
> (utf8 flag on), which when potentially concatenated elsewhere with
> some UTF-8 encoded string (octets, non-decoded) yields doubly encoding
> of UTF-8 octets, i.e. a real mojibake mess.

I tested the patch on a live SpamAssassin installation with
"normalize_charset 1" in the config, and at least it did not produce any Perl
errors over a couple of months. But some hidden problems with UTF-8
characters vs. octets may still occur, so the patch needs careful inspection by
somebody with better knowledge of Perl UTF-8 handling and the SpamAssassin
sources than me :)

> Btw, the MIME encoding in the provided sample is incorrect, it breaks
> the RFC 2047 section 5 requirement:

This example was taken from real spam message which was created by some
non-rfc-compliant software.
