Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/09/05 01:11:20 UTC

[Bug 5083] New: spamd REPORT fails if content preview contains wide characters

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5083

           Summary: spamd REPORT fails if content preview contains wide
                    characters
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P3
         Component: spamc/spamd
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: richard+spamassassin@musicbox.net


The spamd daemon fails (with the error "Wide character in syswrite"), and hence
allows the message through, if it is asked to REPORT (or REPORT_IFSPAM) on a
message which contains wide characters in the part shown as a content preview. A
REPORT request is generated by "spamc -R", and by the Exim 4 "spam=<user>" ACL.
This gives a fairly trivial way for a spammer to bypass any check done by Exim
or a similar SA client - just include a (correctly encoded) wide character in
the first few lines of the spam.

The reason is that Mail::SpamAssassin::PerMsgStatus::get_report() returns a
native Perl string, which might (depending on the format of the original email)
contain wide characters. Attempting to send such a string back through the
socket connection to the client fails, since syswrite only handles octets. The
problem doesn't arise for other types of request (eg. PROCESS, the default
action for spamc), because Mail::SpamAssassin::PerMsgStatus::rewrite_mail()
always yields an octet stream suitable for sending through a network socket.

We don't actually know what character encoding the client wants (perhaps there's
scope for an extension to the spamd API for this?). The attached patch encodes
the report as UTF-8, though perhaps stripping out wide characters altogether, or
attempting a transliteration to US-ASCII, would be even safer.
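
The failure and the proposed fix can be sketched in isolation. This is a minimal illustration, not the attached patch: a pipe stands in for the spamd client socket, and $report is a made-up report string containing U+2122.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical report text containing a wide character (U+2122).
my $report = "Content preview: Brand\x{2122} pills";

# A pipe stands in for the client socket (assumption for this sketch).
pipe(my $reader, my $writer) or die "pipe failed: $!";

# Writing the character string directly may fail with
# "Wide character in syswrite"; eval traps the error if it is raised.
my $raw_ok = eval { syswrite($writer, $report); 1 } ? 1 : 0;

# Encoding first yields a plain octet string that syswrite accepts.
my $octets  = Encode::encode_utf8($report);
my $written = syswrite($writer, $octets);
close $writer;

printf "raw write %s; encoded write sent %d octets for %d characters\n",
    ($raw_ok ? 'succeeded' : 'failed'), $written, length $report;
```

U+2122 occupies three octets in UTF-8, so the encoded string is two octets longer than the character count.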



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

------- Additional Comments From richard+spamassassin@musicbox.net  2006-09-04 23:12 -------
Created an attachment (id=3686)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3686&action=view)
Convert wide characters in spamd REPORT output to UTF-8




------- Additional Comments From jm@jmason.org  2006-09-05 12:17 -------
'The "$]<5.008" should take care of 5.6.x compatibility: I copied the code from
PerMsgStatus, and in fact the whole concept (ie. forcing an explicit octet
encoding on any data intended to be passed back out of SA) is already done in
rewrite_mail().'

ah, sorry.  I hadn't seen that there (and I'm surprised it works! ;) 

Anyway, the problem is that perl *itself* is supposed to know how to write utf-8
data to a network socket, taking care of the conversions required.  Having to
explicitly upgrade and downgrade strings inside our libs is entirely the wrong
approach, since that way leads to double-encoding.


Could you try editing spamd.raw and adding a

  binmode($client, ":utf8");

just before the call to $spamtest->check($mail), and report if that works
instead of the patch?

Also, a test case would be very helpful, for others to test on other platforms.
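
The binmode() suggestion above can be tried outside spamd with a short sketch. It assumes a pipe in place of the client socket and uses print rather than syswrite, since print is the call that goes through PerlIO layers; $report is a made-up string.

```perl
use strict;
use warnings;

# Hypothetical report text containing U+2122.
my $report = "Brand\x{2122}";

pipe(my $reader, my $writer) or die "pipe failed: $!";
binmode($reader);              # read raw octets on the receiving side
binmode($writer, ":utf8");     # let PerlIO encode characters on output

print {$writer} $report;
close $writer;

# Slurp whatever octets arrived at the other end.
my $bytes = do { local $/; <$reader> };
printf "%d characters sent, %d octets received\n",
    length($report), length($bytes);
```

The receiving side sees more octets than the sender's length($report) reports, which matters for any length header computed from the character string.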



------- Additional Comments From richard+spamassassin@musicbox.net  2006-09-05 15:33 -------
No, that doesn't work unfortunately.  Leaving $msg_resp flagged as native Perl
Unicode causes the length() function to give the number of Unicode characters,
and hence the Content-Length header sent back to the client doesn't match the
number of UTF-8 bytes in the response body.  spamc whinges, although other
clients might either be using protocol 1.2 or below (no Content-Length) or just
ignore the discrepancy and carry on anyway.

How about explicitly turning *off* the :utf8 layer on the socket (ie. stop the
magic that appears to be happening on Linux but not on BSD), and do the UTF-8
conversion as in the original patch?  At least that way we won't be potentially
double-encoding anything.

Or we could use "length(Encode::encode_utf8($msg_resp))" or "use bytes;
length($msg_resp)" to try to ensure the Content-Length matches the data
generated by the :utf8 output layer.  However, that feels even hackier than
doing the conversion ourselves once, and then using the same resulting
byte-string both for Content-Length and for verbatim output.
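
A sketch of the "convert once, use the same byte-string twice" approach follows. The header line is a simplified stand-in for a spamd response, and $msg_resp holds a made-up body; neither is taken from spamd itself.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical response body containing U+2122.
my $msg_resp = "Brand\x{2122}";

# length() on the character string counts characters, not octets.
my $char_len = length($msg_resp);

# Convert once, then use the same octet string for both the
# length header and the body that goes out on the wire.
my $octets   = Encode::encode_utf8($msg_resp);
my $byte_len = length($octets);

# Simplified, illustrative header; real spamd responses carry more lines.
my $response = "Content-length: $byte_len\r\n\r\n" . $octets;

print "characters: $char_len, octets: $byte_len\n";
```

Because the header value and the body come from the same octet string, they cannot disagree regardless of which layers are on the socket.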

I agree that in general we should let Perl get on with the details of character
conversion, but one other advantage of doing it ourselves in the *particular*
case of spamd, rather than relying on PerlIO layers, is that we can control the
encoding more precisely.  This would be useful if announcement of the encoding
becomes part of the spamc protocol (ie. Content-Transfer-Encoding).  We can also
be more certain about when the encoding happens, which might be an issue if a
different encoding is needed in each direction: as you pointed out, even with
the current API the best place for a binmode() call is *after* the client
request has been read, since binmode() will apply the translation layer to both
input and output, and it's only output we want to be UTF-8-encoded.




richard+spamassassin@musicbox.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         OS/Version|All                         |FreeBSD




------- Additional Comments From richard+spamassassin@musicbox.net  2006-09-05 09:35 -------
Just checked on Linux with Perl 5.8.4 (problem originally identified on FreeBSD
with Perl 5.8.8): on Linux the output stream from spamd seems to be set to utf8
mode by default, and syswrite allows wide characters through without error.
Arguably that's a bug in the OS, or at least something that depends on the
environment, so I still think something is needed to give a reliable and
deterministic result on all platforms - either the patch above, or perhaps
explicitly setting the socket output stream to utf8 mode?
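
To compare platforms, the layers on a handle can be inspected directly. The sketch below uses a pipe rather than the spamd socket, and has_utf8_layer is a hypothetical helper name introduced here.

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe failed: $!";

# Hypothetical helper: count 'utf8' entries among the output layers.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh, output => 1);
}

my $before = has_utf8_layer($writer);   # depends on the environment

binmode($writer, ':utf8');              # switch the layer on explicitly
my $after = has_utf8_layer($writer);

binmode($writer);                       # no layer argument: back to raw
my $cleared = has_utf8_layer($writer);

print "before=$before after=$after cleared=$cleared\n";
```

Printing the "before" value on each platform would show whether the handle starts out in utf8 mode there.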



------- Additional Comments From jm@jmason.org  2006-09-05 16:10 -------
'concatenating a marked-as-UTF-8
string and a marked-as-non-UTF-8 string will result in double encoding, iirc.'

er, I mean it'll result in double encoding, if the latter string contains UTF-8
text that has been downgraded into "marked-as-non-UTF-8" byte data.
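
The double-encoding case described here can be reproduced in a few lines. This is an illustrative sketch with made-up variable names, not code from SA.

```perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{2122}";                       # character string, 1 character
my $bytes = Encode::encode_utf8("\x{2122}");  # 3 octets: E2 84 A2

# Concatenation treats each octet of $bytes as a Latin-1 character,
# so the result is 4 characters rather than 2.
my $mixed  = $chars . $bytes;
my $double = Encode::encode_utf8($mixed);     # the 3 octets are encoded again

# Decoding the octets first keeps everything as characters until output.
my $clean = Encode::encode_utf8($chars . Encode::decode_utf8($bytes));

printf "mixed chars=%d, double-encoded octets=%d, clean octets=%d\n",
    length($mixed), length($double), length($clean);
```

Each of the three octets is above 0x7F, so re-encoding turns each into two octets, giving 9 instead of the intended 6.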



------- Additional Comments From richard+spamassassin@musicbox.net  2006-09-05 15:40 -------
Created an attachment (id=3687)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3687&action=view)
Test case spam giving a wide character in Content preview

The spam itself is US-ASCII: the wide character in this case is generated
internally by SA in expanding the HTML entity &trade; (Unicode 0x2122, UTF-8
0xe2 0x84 0xa2).
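
The byte values quoted above can be checked with a short sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode ();

# U+2122 TRADE MARK SIGN, the character that &trade; expands to.
my $trade = chr(0x2122);

# Encode to UTF-8 and format each resulting octet as hex.
my $hex = join ' ',
    map { sprintf '0x%02x', ord }
    split //, Encode::encode_utf8($trade);

print "$hex\n";
```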




------- Additional Comments From jm@jmason.org  2006-09-05 09:48 -------
oops, that's "LC_ALL=en_US.UTF-8", with uppercase UTF-8.



------- Additional Comments From jm@jmason.org  2006-09-05 09:44 -------
yep, if you're seeing this, please post platform and perl version -- UTF-8
handling differs widely between perl releases, unfortunately :(

I would suggest checking the locales set for spamd and spamc on both platforms
-- for example, setting "LC_ALL=en_US.utf-8" in the environment is required on
some OSes to get correct UTF-8 support in the libc APIs and in the perl interpreter.

the patch, btw, will probably cause fatal errors under perl 5.6.1, which is a
supported platform for Spamassassin -- any patches need to work on that release,
too...



------- Additional Comments From richard+spamassassin@musicbox.net  2006-09-05 11:13 -------
The "$]<5.008" should take care of 5.6.x compatibility: I copied the code from
PerMsgStatus, and in fact the whole concept (ie. forcing an explicit octet
encoding on any data intended to be passed back out of SA) is already done in
rewrite_mail().
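
A version guard of this kind might look like the sketch below. It is not the code from the patch or from PerMsgStatus; to_octets is a hypothetical helper name, and the choice to leave unflagged strings untouched is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return an octet string for output.
sub to_octets {
    my ($str) = @_;

    # Older perls have no Encode module, so pass the string through.
    return $str if $] < 5.008;

    require Encode;
    # Only strings flagged as character data are encoded; anything
    # else is assumed to be octets already (assumption of this sketch).
    return Encode::is_utf8($str) ? Encode::encode_utf8($str) : $str;
}

printf "%d octets\n", length to_octets("Brand\x{2122}");
```

Loading Encode with require inside the guarded branch keeps the file compilable on a perl that lacks the module.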

I agree that UTF-8 support in Perl and the underlying OS can be a minefield
generally... but in this particular case, we're dealing with a network socket,
which is defined to pass only octets. Therefore, I think SA should "do the right
thing" (for some definition of "right"!) deterministically, independently of the
environment, and not rely on any implicit conversions (that might or might not
happen on different platforms) when it comes to sending data back to the client.

On both of my test platforms, btw, the environment is empty (spamd is started
with "env -"), so any differences are down to the vanilla default behaviour of
Perl/libc.  If any aspect of SA processing really needs a specific environment
to work properly, then perhaps SA *itself* should set it (eg. based on a test in
Makefile.PL)?



------- Additional Comments From jm@jmason.org  2006-09-05 15:55 -------
This idiom:

  { use bytes; $len = length($msg_resp) }

is actually the recommended way to get the length in bytes of a Unicode string.
We definitely need to be using that, alright.
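
A small sketch of the idiom, with made-up strings. One caveat worth checking: "use bytes" reports the length of Perl's internal representation, which matches the UTF-8 output length for a string already stored as UTF-8 but not necessarily for one still stored as Latin-1.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical response containing U+2122; perl stores it as UTF-8.
my $msg_resp = "Brand\x{2122}";
my $chars    = length($msg_resp);
my $len;
{ use bytes; $len = length($msg_resp) }

# A string whose only non-ASCII character fits in Latin-1 is stored
# with one byte per character here, so the two measures differ.
my $latin1 = "caf\xE9";
my $internal;
{ use bytes; $internal = length($latin1) }
my $as_utf8 = length(Encode::encode_utf8($latin1));

print "wide: chars=$chars bytes=$len; latin1: internal=$internal utf8=$as_utf8\n";
```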

The difficulty with manually doing the utf8 conversions in our own code is that
there are other places where a string will be "upgraded" automatically in the
other parts of the perl API.  For example, concatenating a marked-as-UTF-8
string and a marked-as-non-UTF-8 string will result in double encoding, iirc.

However, the idea of explicitly turning off the utf-8 layer on the spamd/spamc
socket, then performing the UTF-8 downgrade, is not a bad one, I think.  That
may work well, but we'd have to be sure to do this at the last minute, before
writing the strings to the sockets.
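
One way to sketch the "turn the layer off, then encode at the last minute" idea is shown below. send_response is a hypothetical helper, a pipe stands in for the spamd/spamc socket, and the :utf8 layer is switched on first only to imitate the Linux behaviour reported in this bug.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strip layers, encode, and write all octets.
sub send_response {
    my ($fh, $text) = @_;
    binmode($fh);                            # remove any :utf8 layer
    my $octets = Encode::encode_utf8($text); # encode just before writing
    my $offset = 0;
    while ($offset < length $octets) {
        my $n = syswrite($fh, $octets, length($octets) - $offset, $offset);
        die "syswrite failed: $!" unless defined $n;
        $offset += $n;                       # handle partial writes
    }
    return length $octets;
}

pipe(my $reader, my $writer) or die "pipe failed: $!";
binmode($reader);
binmode($writer, ':utf8');                   # imitate a handle in utf8 mode

my $sent = send_response($writer, "Brand\x{2122}");
close $writer;
my $received = do { local $/; <$reader> };

printf "sent %d octets, received %d\n", $sent, length $received;
```

Keeping the text as a character string until the helper is called leaves only one place where encoding happens.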

(BTW, other SA developers with experience in working with utf-8 strings --
particularly in the SA code, or in spamd -- please speak up here...)


