Posted to users@spamassassin.apache.org by wolfgang <me...@gmx.net> on 2005/04/30 13:27:39 UTC

character set / encoding problem?

Again and again, we receive messages that contain stuff like
<a href=3d"http://advinc-ma=2enetfirms=2ecom/">
instead of
<a href="http://advinc-ma.netfirms.com/">

That prevents uri / body rules like e.g.
/netfirms\.com/
and URIBL rules from being triggered. I wonder if there is some "function" to 
automatically "de-code" such items instead of having to use stuff like
/netfirms(?:\.|=2e)com/ and how I could use it with SA.

regards,
wolfgang


Re: character set / encoding problem?

Posted by Fred <sp...@freddyt.com>.
wolfgang wrote:
> In an older episode (Saturday 30 April 2005 14:45), Theo Van Dinter
> wrote:
>> "=3d" is quoted-printable encoding for "=", "=2e" for ".", etc...
>> SA handles "proper" encoding (it handles a lot of non-proper encoding
>> as well), but doesn't make guesses if the MIME part says there is no
>> encoding.

I remember a discussion a while back about this: =2e is invalid while =2E is
valid.

But then I searched and found this:

Rule #1: (General 8-bit representation) Any octet, except those
      indicating a line break according to the newline convention of the
      canonical (standard) form of the data being encoded, may be
      represented by an "=" followed by a two digit hexadecimal
      representation of the octet's value.  The digits of the
      hexadecimal alphabet, for this purpose, are "0123456789ABCDEF".
      Uppercase letters must be used when sending hexadecimal data,
      though a robust implementation may choose to recognize lowercase
      letters on receipt.  Thus, for example, the value 12 (ASCII form
      feed) can be represented by "=0C", and the value 61 (ASCII EQUAL
      SIGN) can be represented by "=3D".  Except when the following
      rules allow an alternative encoding, this rule is mandatory.


It's this line: "Uppercase letters must be used when sending hexadecimal
data, though a robust implementation may choose to recognize lowercase
letters on receipt."
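
The difference between the strict reading of Rule #1 and the "robust
implementation" it permits can be sketched in a few lines of Python. This is
only an illustration, not SpamAssassin's decoder: the helper name decode_qp
and its lenient flag are made up for this example, and it handles just the
"=XX" escapes and soft line breaks.

```python
import re

# A "=" at the very end of a line is a soft line break and is simply removed.
_SOFT_BREAK = re.compile(rb"=\r?\n")
# Strict form: uppercase hex digits only, as Rule #1 requires of senders.
_STRICT_ESCAPE = re.compile(rb"=([0-9A-F]{2})")
# Lenient form: also accept lowercase hex digits on receipt.
_LENIENT_ESCAPE = re.compile(rb"=([0-9A-Fa-f]{2})")


def decode_qp(data: bytes, lenient: bool = True) -> bytes:
    """Decode quoted-printable escapes (hypothetical helper, illustration only)."""
    pattern = _LENIENT_ESCAPE if lenient else _STRICT_ESCAPE
    joined = _SOFT_BREAK.sub(b"", data)
    # Replace each "=XX" with the single byte whose hex value is XX.
    return pattern.sub(lambda m: bytes([int(m.group(1), 16)]), joined)


sample = b'<a href=3d"http://advinc-ma=2enetfirms=2ecom/">'
print(decode_qp(sample, lenient=False))  # lowercase escapes are left untouched
print(decode_qp(sample, lenient=True))   # lowercase escapes are decoded
```

With the strict pattern the sample from the original post comes back
unchanged, which is the situation a /netfirms\.com/ rule runs into; with the
lenient pattern the dots and the "=" are restored.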


Re: character set / encoding problem?

Posted by wolfgang <me...@gmx.net>.
In an older episode (Saturday 30 April 2005 14:45), Theo Van Dinter wrote:
> "=3d" is quoted-printable encoding for "=", "=2e" for ".", etc...
> SA handles "proper" encoding (it handles a lot of non-proper encoding
> as well), but doesn't make guesses if the MIME part says there is no
> encoding.
> 
> Without samples of the message, it's hard to comment on why something does
> or does not work.

the message headers say
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable

I enclose the message for reference, local user data obfuscated with "xxx".

regards,

wolfgang
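
The point about not guessing when the MIME part declares no encoding can be
illustrated with Python's standard email package, which likewise decodes a
body only according to the declared Content-Transfer-Encoding. This is a
sketch of the general idea, not of SpamAssassin's parser; the helper name
decoded_body and the two test messages are assumptions built from the sample
line in this thread.

```python
from email import message_from_string


def decoded_body(raw_message: str) -> bytes:
    """Return the body after applying the declared transfer encoding, if any."""
    msg = message_from_string(raw_message)
    # decode=True applies the Content-Transfer-Encoding header; with no such
    # header the payload is returned as-is, with no guessing.
    return msg.get_payload(decode=True)


BODY = '<a href=3d"http://advinc-ma=2enetfirms=2ecom/">\n'

declared = (
    "Content-Type: text/html\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n" + BODY
)
undeclared = "Content-Type: text/html\n\n" + BODY

print(decoded_body(declared))
print(decoded_body(undeclared))
```

Python's decoder is one of the lenient ones and accepts the lowercase hex
digits, so the first message comes out with "netfirms.com" restored, while
the second keeps its "=2e" sequences because nothing told the parser to
decode them.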

Re: character set / encoding problem?

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sat, Apr 30, 2005 at 02:41:57PM -0500, David B Funk wrote:
> We've already gone 'round this issue in past discussions on this list; the
> devs' reply was: maybe 'fixed' in future releases.

Ok, fair enough.  Then FYI: 3.1 handles the lowercase version. :)

-- 
Randomly Generated Tagline:
"I was up all night trying to round off infinity." - Bob Lazarus

Re: character set / encoding problem?

Posted by wolfgang <me...@gmx.net>.
In an older episode (Saturday 30 April 2005 21:41), David B Funk wrote:

> In the meantime, I've coded local rules that explicitly target this bogus
> encoding as a spam sign:
> 
> body L_BOGUS_QP1        /\b=2e(?:com|biz|info|net|org|us)[:\/]\b/
> describe L_BOGUS_QP1    Bogus QuotedPrintable encoding
> score L_BOGUS_QP1       1.1
> 
> meta L_BOGUS_QP2        (L_BOGUS_QP1 && HTML_MESSAGE)
> describe L_BOGUS_QP2    HTML message that uses Bogus QP
> score L_BOGUS_QP2       1.5

they don't work for me with the message I enclosed earlier.

why "\b=2e" by the way?

regards,

wolfgang
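
One way to look at the "\b" question is to run the quoted pattern outside SA.
The sketch below uses Python's re module, whose \b behaves like Perl's for
plain ASCII text; it only examines the regex itself and says nothing about
which text SpamAssassin actually hands to a body rule. The example.com lines
are made-up test strings.

```python
import re

# Pattern copied from the L_BOGUS_QP1 rule quoted above.
BOGUS_QP = re.compile(r"\b=2e(?:com|biz|info|net|org|us)[:\/]\b")

samples = [
    'http://advinc-ma=2enetfirms=2ecom/">',  # "/" is followed by a quote
    "http://example=2ecom/index.html",       # "/" is followed by a letter
    "http://example=2ecom:8080/",            # ":" is followed by a digit
]

for text in samples:
    print(bool(BOGUS_QP.search(text)), text)
```

The leading \b needs a word character directly before the "=", so the
escape has to follow the letters or digits of a host name. The trailing \b
needs a word character directly after the ":" or "/". In the sample from the
original post the "/" is followed by a double quote, so the pattern cannot
match there, while a URL that continues with a path or a port does match.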


Re: character set / encoding problem?

Posted by David B Funk <db...@engineering.uiowa.edu>.
On Sat, 30 Apr 2005, Theo Van Dinter wrote:

> On Sat, Apr 30, 2005 at 01:27:39PM +0200, wolfgang wrote:
> > Again and again, we receive messages that contain stuff like
> > <a href=3d"http://advinc-ma=2enetfirms=2ecom/">
> > instead of
> > <a href="http://advinc-ma.netfirms.com/">
> >
> > I wonder if there is some "function" to
> > automatically "de-code" such items instead of having to use stuff like
> > /netfirms(?:\.|=2e)com/ and how i could use it with SA.
>
> "=3d" is quoted-printable encoding for "=", "=2e" for ".", etc...
> SA handles "proper" encoding (it handles a lot of non-proper encoding
> as well), but doesn't make guesses if the MIME part says there is no
> encoding.

No, '=3d' is BOGUS; it is not RFC-compliant quoted-printable encoding.
The MIME RFC states clearly that the hex characters MUST be CAPS
(e.g. '=3D' is valid QP, '=3d' is not). SA does not handle the bogus form,
although many mail clients do.
We've already gone 'round this issue in past discussions on this list; the
devs' reply was: maybe 'fixed' in future releases.

In the meantime, I've coded local rules that explicitly target this bogus
encoding as a spam sign:

body L_BOGUS_QP1        /\b=2e(?:com|biz|info|net|org|us)[:\/]\b/
describe L_BOGUS_QP1    Bogus QuotedPrintable encoding
score L_BOGUS_QP1       1.1

meta L_BOGUS_QP2        (L_BOGUS_QP1 && HTML_MESSAGE)
describe L_BOGUS_QP2    HTML message that uses Bogus QP
score L_BOGUS_QP2       1.5



-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: character set / encoding problem?

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sat, Apr 30, 2005 at 01:27:39PM +0200, wolfgang wrote:
> Again and again, we receive messages that contain stuff like
> <a href=3d"http://advinc-ma=2enetfirms=2ecom/">
> instead of
> <a href="http://advinc-ma.netfirms.com/">
> 
> I wonder if there is some "function" to 
> automatically "de-code" such items instead of having to use stuff like
> /netfirms(?:\.|=2e)com/ and how i could use it with SA.

"=3d" is quoted-printable encoding for "=", "=2e" for ".", etc...
SA handles "proper" encoding (it handles a lot of non-proper encoding
as well), but doesn't make guesses if the MIME part says there is no
encoding.

Without samples of the message, it's hard to comment on why something does or
does not work.

-- 
Randomly Generated Tagline:
"M: Can anyone tell us the lesson that has been learned here?
  S: Yes Master, not a single one of us could defeat you.
  M: You gain wisdom child ... "            - The Frantics

Re: character set / encoding problem?

Posted by wolfgang <me...@gmx.net>.
In an older episode (Sunday 01 May 2005 02:07), Loren Wilton wrote:
> > Again and again, we receive messages that contain stuff like
> > <a href=3d"http://advinc-ma=2enetfirms=2ecom/">
> > instead of
> > <a href="http://advinc-ma.netfirms.com/">
> >
> > That prevents uri / body rules like e.g.
> 
> no/yes
> 
> > /netfirms\.com/
> > and URIBL rules from being triggered. I wonder if there is some "function" to
> > automatically "de-code" such items instead of having to use stuff like
> > /netfirms(?:\.|=2e)com/ and how i could use it with SA.
> 
> URI rules should be hitting already; certainly on 3.0. 

indeed, URI rules hit, thanks for the hint.

> Body rules on 3.0 
> may be failing.  But then, I'm not sure that 3.0 will have the uri in the
> body text.  Rawbody and full rules will certainly fail.  After all, that is
> the whole reason the spammers do that extraneous encoding.
> 
> But then, it is nice that they put that extra encoding in the uris.  Makes
> it easy to add points for useless uri encoding.  :-)
> 
>         Loren
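
The idea of adding points for useless encoding can be sketched as a small
check for escapes that did not need to be escaped. RFC 2045 lets octets 33
to 60 and 62 to 126 appear literally, so an escape that decodes into that
range (such as "=2e" for ".") was optional. The function name
needless_escapes is hypothetical, and this is a standalone Python
illustration rather than an SA rule.

```python
import re
from typing import List

# Any "=XX" escape, with upper- or lowercase hex digits.
_ESCAPE = re.compile(r"=([0-9A-Fa-f]{2})")


def needless_escapes(text: str) -> List[str]:
    """Return the QP escapes whose decoded octet could have been sent literally."""
    found = []
    for match in _ESCAPE.finditer(text):
        value = int(match.group(1), 16)
        # Octets 33-60 and 62-126 are printable and safe; 61 ("=") must be escaped.
        if 33 <= value <= 126 and value != 61:
            found.append(match.group(0))
    return found


print(needless_escapes('<a href=3d"http://advinc-ma=2enetfirms=2ecom/">'))
```

On the sample from this thread it reports the two "=2e" escapes and skips
"=3d", since an "=" sign always has to be encoded in quoted-printable.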