Posted to users@spamassassin.apache.org by Philip Prindeville <ph...@redfish-solutions.com> on 2014/06/25 22:21:33 UTC

Funky HARP Spam

I was surprised that my SPAM filters didn’t find this.

Not sure what code page it’s using… whatever 0x04xx is in… what?  Is this UTF-8?

There’s no explicit charset given.

Also, I noticed that a lot of these types of SPAMs have ‘b’ replaced by the Cyrillic soft sign, e.g. the word “about” is written as &#x0430;&#x042C;&#x043E;ut instead.
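
To make the substitution concrete, here is a minimal sketch (in Python, not SpamAssassin rule syntax) that decodes the numeric character references and flags words mixing more than one script. The helper names scripts_in and mixed_script_words are hypothetical, and using the first word of the Unicode character name as a script label is a rough assumption, not an official script property.

```python
import html
import unicodedata

def scripts_in(word):
    """Return coarse script labels for the letters in word (hypothetical helper)."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # Names look like "CYRILLIC SMALL LETTER A"; the first token is
            # used as a rough script label (an assumption, see lead-in).
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def mixed_script_words(text):
    """Decode HTML character references, then list words that mix scripts."""
    decoded = html.unescape(text)
    return [w for w in decoded.split() if len(scripts_in(w)) > 1]

sample = "read &#x0430;&#x042C;&#x043E;ut it"
suspicious = mixed_script_words(sample)
```

For the sample above, only the disguised "about" is returned, since it combines Cyrillic and Latin letters in one word.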

Here’s the entire message.

http://pastebin.com/qLyKx40b

Here’s what I’m showing it matched:

Jun 25 11:16:07 mail mimedefang.pl[18682]: s5PHFqsC019802: s5PHFqsC019802: 4.889 (****) BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VERIFIED,HTML_MESSAGE,L_BLOCK_ISP,SPF_HELO_PASS,SPF_PASS,T_RP_MATCHES_RCVD

Odd that it didn’t match MIME_CHARSET_FARAWAY or CHARSET_FARAWAY_HEADER or any rules about excessive redundant encoding.

Would it be a lot of work to count the number of ETH (D w/ STROKE, whatever) followed by 3/4 character pairs?
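
One reading of this question, and it is only an assumption, is that the ETH and 3/4 glyphs are UTF-8 Cyrillic bytes displayed as Latin-1: U+043E encodes as 0xD0 0xBE, which Latin-1 shows as ETH followed by the three-quarters sign. Under that assumption, counting the pairs amounts to counting 0xD0 or 0xD1 lead bytes followed by a continuation byte. The name count_cyrillic_pairs is hypothetical, and this is a Python sketch rather than a SpamAssassin rule.

```python
import re

# Assumption: the pairs in question are two-byte UTF-8 sequences.
# Lead bytes 0xD0 and 0xD1 cover U+0400 through U+047F.
CYRILLIC_UTF8_PAIR = re.compile(rb"[\xD0\xD1][\x80-\xBF]")

def count_cyrillic_pairs(raw: bytes) -> int:
    """Count two-byte UTF-8 sequences in the basic Cyrillic range."""
    return len(CYRILLIC_UTF8_PAIR.findall(raw))

# The disguised "about": three Cyrillic letters followed by Latin "ut".
body = "\u0430\u042c\u043eut".encode("utf-8")
pairs = count_cyrillic_pairs(body)
```

A ratio of such pairs to total body length, rather than a raw count, would probably be less likely to penalize legitimate Cyrillic mail, but that is a tuning choice this sketch does not make.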

Here’s the other thing I don’t get.

The message claims to be 7-bit and text/plain, yet it uses encoded characters which exceed 7-bit widths yet this doesn’t seem to be firing any rules either.

&#x042C would seem to be at least an 11-bit wide character.

Are we being “too liberal in what we accept”?



Re: Funky HARP Spam

Posted by Richard Doyle <li...@islandnetworks.com>.
On 06/25/2014 02:12 PM, Philip Prindeville wrote:
> On Jun 25, 2014, at 2:58 PM, Axb <ax...@gmail.com> wrote:
>
>> On 06/25/2014 10:21 PM, Philip Prindeville wrote:
>>
>>> http://pastebin.com/qLyKx40b
>> "This paste has been removed!" :(
> I’ve temporarily posted it on ftp://ftp.redfish-solutions.com/pub/harp.eml
I'm seeing quite a bit of spam from this and other domains that are
registered with GoDaddy and recently updated. This one (hjackman.com)
was created on 2003-05-10 but updated on 2014-06-21. I don't know of any
way to use this pattern in SpamAssassin, though.



Re: Funky HARP Spam

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
On Jun 25, 2014, at 2:58 PM, Axb <ax...@gmail.com> wrote:

> On 06/25/2014 10:21 PM, Philip Prindeville wrote:
> 
>> http://pastebin.com/qLyKx40b
> 
> "This paste has been removed!" :(

I’ve temporarily posted it on ftp://ftp.redfish-solutions.com/pub/harp.eml


> 
>> Here’s what I’m showing it matched:
>> 
>> Jun 25 11:16:07 mail mimedefang.pl[18682]: s5PHFqsC019802:
>> s5PHFqsC019802: 4.889 (****)
>> BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VERIFIED,HTML_MESSAGE,L_BLOCK_ISP,SPF_HELO_PASS,SPF_PASS,T_RP_MATCHES_RCVD
>> 
>> Odd that it didn’t match MIME_CHARSET_FARAWAY or
>> CHARSET_FARAWAY_HEADER or any rules about excessive redundant
>> encoding.
> 
> FARAWAY will only match if you enabled
> ok_locales en
> or some other country - not always reliable
> 
> interesting that you got a BAYES_00 - meaning your Bayes may need more learning/ auto learning


I don’t think I’ve enabled learning.


> 
>> Would it be a lot of work to count the number of ETH (D w/ STROKE,
>> whatever) followed by 3/4 character pairs?
>> 
>> Here’s the other thing I don’t get.
>> 
>> The message claims to be 7-bit and text/plain, yet it uses encoded
>> characters which exceed 7-bit widths yet this doesn’t seem to be
>> firing any rules either.
>> 
>> &#x042C would seem to be at least an 11-bit wide character.
>> 
>> Are we being “too liberal in what we accept”?
> 
> or trying to be too restrictive? ESPish sort of mail can often contain such wonderful combinations. The travel business senders are experts in producing the wildest of mixes.
> 
> 

ESPish?  That’s a sports channel, right?





Re: Funky HARP Spam

Posted by Axb <ax...@gmail.com>.
On 06/25/2014 10:21 PM, Philip Prindeville wrote:

> http://pastebin.com/qLyKx40b

"This paste has been removed!" :(

> Here’s what I’m showing it matched:
>
> Jun 25 11:16:07 mail mimedefang.pl[18682]: s5PHFqsC019802:
> s5PHFqsC019802: 4.889 (****)
> BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VERIFIED,HTML_MESSAGE,L_BLOCK_ISP,SPF_HELO_PASS,SPF_PASS,T_RP_MATCHES_RCVD
>
>  Odd that it didn’t match MIME_CHARSET_FARAWAY or
> CHARSET_FARAWAY_HEADER or any rules about excessive redundant
> encoding.

FARAWAY will only match if you enabled
ok_locales en
or some other country - not always reliable

interesting that you got a BAYES_00 - meaning your Bayes may need more 
learning/ auto learning

> Would it be a lot of work to count the number of ETH (D w/ STROKE,
> whatever) followed by 3/4 character pairs?
>
> Here’s the other thing I don’t get.
>
> The message claims to be 7-bit and text/plain, yet it uses encoded
> characters which exceed 7-bit widths yet this doesn’t seem to be
> firing any rules either.
>
> &#x042C would seem to be at least an 11-bit wide character.
>
> Are we being “too liberal in what we accept”?

or trying to be too restrictive? ESPish sort of mail can often
contain such wonderful combinations. The travel business senders are
experts in producing the wildest of mixes.




Re: Funky HARP Spam

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
And on a totally unrelated note, is there any way to enforce that a rule is only true if it applies within an individual MIME body part?

For instance, I might test for a mimeheader of Content-Transfer-Encoding being “7bit”, but also having seen BODY_8BITS… but I need them to both be true in an individual mime part.

It doesn’t do me any good if there’s one text/plain section that is 7bit, followed by another text/html section that’s “base64” which fires the BODY_8BITS rule too.
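
Outside of SpamAssassin rule syntax, the per-part check described here can be sketched with Python's standard email package. The function name parts_violating_7bit is hypothetical. The sketch treats a missing Content-Transfer-Encoding header as 7bit (the MIME default) and reports only those leaf parts that declare 7bit yet carry octets with the high bit set, so a base64 part elsewhere in the message cannot trigger it.

```python
from email import message_from_bytes

def parts_violating_7bit(raw_message: bytes):
    """Return content types of leaf parts that claim 7bit but hold 8-bit octets."""
    msg = message_from_bytes(raw_message)
    bad = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own to inspect
        cte = str(part.get("Content-Transfer-Encoding", "7bit")).strip().lower()
        if cte != "7bit":
            continue  # base64, quoted-printable, 8bit etc. are out of scope
        payload = part.get_payload(decode=True) or b""
        if any(octet > 0x7F for octet in payload):
            bad.append(part.get_content_type())
    return bad

raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    b"\xd0\xb0bout\r\n"          # raw 8-bit octets in a part declared 7bit
    b"--b\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"0LDQrNC+dXQ=\r\n"          # 8-bit content, but legitimately base64
    b"--b--\r\n"
)
violations = parts_violating_7bit(raw)
```

In this sample only the text/plain part is reported; the base64 text/html part is skipped because its declared encoding is not 7bit.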


On Jun 25, 2014, at 2:21 PM, Philip Prindeville <ph...@redfish-solutions.com> wrote:

> I was surprised that my SPAM filters didn’t find this.
> 
> Not sure what code page it’s using… whatever 0x04xx is in… what?  Is this UTF-8?
> 
> There’s no explicit charset given.
> 
> Also, I noticed that a lot of these types of SPAMs have ‘b’ replaced by the Cyrillic soft sign, e.g. the word “about” is written as &#x0430;&#x042C;&#x043E;ut instead.
> 
> Here’s the entire message.
> 
> http://pastebin.com/qLyKx40b
> 
> Here’s what I’m showing it matched:
> 
> Jun 25 11:16:07 mail mimedefang.pl[18682]: s5PHFqsC019802: s5PHFqsC019802: 4.889 (****) BAYES_00,BODY_8BITS,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VERIFIED,HTML_MESSAGE,L_BLOCK_ISP,SPF_HELO_PASS,SPF_PASS,T_RP_MATCHES_RCVD
> 
> Odd that it didn’t match MIME_CHARSET_FARAWAY or CHARSET_FARAWAY_HEADER or any rules about excessive redundant encoding.
> 
> Would it be a lot of work to count the number of ETH (D w/ STROKE, whatever) followed by 3/4 character pairs?
> 
> Here’s the other thing I don’t get.
> 
> The message claims to be 7-bit and text/plain, yet it uses encoded characters which exceed 7-bit widths yet this doesn’t seem to be firing any rules either.
> 
> &#x042C would seem to be at least an 11-bit wide character.
> 
> Are we being “too liberal in what we accept”?
> 
> 


Re: Funky HARP Spam

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
On Jun 26, 2014, at 7:02 PM, Philip Prindeville <ph...@redfish-solutions.com> wrote:

> 
> On Jun 25, 2014, at 5:29 PM, RW <rw...@googlemail.com> wrote:
> 
>> On Wed, 25 Jun 2014 14:21:33 -0600
>> Philip Prindeville wrote:
>> 
>> 
>>> Here’s the other thing I don’t get.
>>> 
>>> The message claims to be 7-bit and text/plain, yet it uses encoded
>>> characters which exceed 7-bit widths yet this doesn’t seem to be
>>> firing any rules either.
>>> 
>>> &#x042C would seem to be at least an 11-bit wide character.
>> 
>> You are mixing-up different levels of encoding. The characters
>> &,#,x,0,4,2 and C are all 7-bit ASCII, and so are consistent with
>> Content-Transfer-Encoding: 7bit.
> 
> You’re correct… That is consistent with the CTE.
> 
> But the Content-Type omitted a ;charset=“XXX” attribute, which means it defaults to “US-ASCII”.
> 
> Quoting RFC-2046:
> 
> 4.1.2.  Charset Parameter
> 
>   A critical parameter that may be specified in the Content-Type field
>   for "text/plain" data is the character set.  This is specified with a
>   "charset" parameter, as in:
> 
>     Content-type: text/plain; charset=iso-8859-1
> 
>   Unlike some other parameter values, the values of the charset
>   parameter are NOT case sensitive.  The default character set, which
>   must be assumed in the absence of a charset parameter, is US-ASCII.
> 
> 
> Since &#x042C is outside the US-ASCII character set, this would be an encoding violation.
> 
> -Philip
> 


Can anyone point me at how to write a test that confirms that the actual encoded text will fit into the named (or implicit) charset?

I.e. what’s a good template or example to go by?

Thanks.
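
As a starting point for such a test, here is a minimal sketch in Python (not a SpamAssassin plugin) that decodes each text part with its declared charset, falling back to US-ASCII when none is given, as RFC 2046 requires. The function name parts_not_fitting_charset is hypothetical. It undoes the transfer encoding first, so it checks the charset claim rather than the Content-Transfer-Encoding claim.

```python
from email import message_from_bytes

def parts_not_fitting_charset(raw_message: bytes):
    """Return (content_type, charset) for text parts whose bytes do not decode."""
    msg = message_from_bytes(raw_message)
    bad = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # RFC 2046: absent a charset parameter, assume US-ASCII.
        charset = part.get_content_charset() or "us-ascii"
        payload = part.get_payload(decode=True) or b""
        try:
            payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not recognize.
            bad.append((part.get_content_type(), charset))
    return bad

raw_octets = b"Content-Type: text/plain\r\n\r\n\xd0\xacout\r\n"
entity_ref = b"Content-Type: text/html\r\n\r\n&#x042C;out\r\n"
```

With these two samples, raw_octets is reported because 0xD0 0xAC is not US-ASCII, while entity_ref passes because the character reference is spelled entirely in ASCII characters.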


Re: Funky HARP Spam

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
On Jun 27, 2014, at 12:34 PM, Philip Prindeville <ph...@redfish-solutions.com> wrote:

> 
> On Jun 27, 2014, at 7:30 AM, RW <rw...@googlemail.com> wrote:
> 
>> 
>> As I mentioned before, the real violation is in the previous mime
>> section, which claims 7bit, but contains octets with the high-bit set. 
> 
> 
> Yup.  Just submitted a patch for this:
> 
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063
> 

Loving this filter!  It’s catching 50% or more of our SPAM!!!!


Re: Funky HARP Spam

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
On Jun 27, 2014, at 7:30 AM, RW <rw...@googlemail.com> wrote:

> 
> As I mentioned before, the real violation is in the previous mime
> section, which claims 7bit, but contains octets with the high-bit set. 


Yup.  Just submitted a patch for this:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063


Re: Funky HARP Spam

Posted by RW <rw...@googlemail.com>.
On Thu, 26 Jun 2014 19:02:42 -0600
Philip Prindeville wrote:


> 
> Since &#x042C is outside the US-ASCII character set, this would be an
> encoding violation.

It's not.

In HTML &#x042C is an ASCII representation of a Unicode character. It
represents a character within HTML, but as far as MIME is concerned
it's 7 characters - that's the whole point of allowing Unicode to be
represented this way. Actually the MIME section it's in is text/html,
not text/plain, but it's legal either way.
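
The two levels can be checked directly. This short Python sketch shows that the reference, written here with its terminating semicolon (eight characters rather than the seven counted above), is pure 7-bit ASCII on the wire, and only becomes an 11-bit code point after an HTML parser decodes it.

```python
import html

ref = "&#x042C;"                 # the reference with its terminating semicolon
raw = ref.encode("ascii")        # succeeds: every character is 7-bit ASCII
high_bit_free = all(octet < 0x80 for octet in raw)

decoded = html.unescape(ref)     # what an HTML renderer would display
code_point = ord(decoded)

print(len(raw), high_bit_free, hex(code_point), code_point.bit_length())
# prints: 8 True 0x42c 11
```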


As I mentioned before, the real violation is in the previous mime
section, which claims 7bit, but contains octets with the high-bit set. 

Re: Funky HARP Spam

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
On Jun 25, 2014, at 5:29 PM, RW <rw...@googlemail.com> wrote:

> On Wed, 25 Jun 2014 14:21:33 -0600
> Philip Prindeville wrote:
> 
> 
>> Here’s the other thing I don’t get.
>> 
>> The message claims to be 7-bit and text/plain, yet it uses encoded
>> characters which exceed 7-bit widths yet this doesn’t seem to be
>> firing any rules either.
>> 
>> &#x042C would seem to be at least an 11-bit wide character.
> 
> You are mixing-up different levels of encoding. The characters
> &,#,x,0,4,2 and C are all 7-bit ASCII, and so are consistent with
> Content-Transfer-Encoding: 7bit.

You’re correct… That is consistent with the CTE.

But the Content-Type omitted a ;charset=“XXX” attribute, which means it defaults to “US-ASCII”.

Quoting RFC-2046:

4.1.2.  Charset Parameter

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set.  This is specified with a
   "charset" parameter, as in:

     Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive.  The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.


Since &#x042C is outside the US-ASCII character set, this would be an encoding violation.

-Philip



> 
> The previous mime section is more problematic since it appears to
> contain 8-bit data. 


Re: Funky HARP Spam

Posted by RW <rw...@googlemail.com>.
On Wed, 25 Jun 2014 14:21:33 -0600
Philip Prindeville wrote:


> Here’s the other thing I don’t get.
> 
> The message claims to be 7-bit and text/plain, yet it uses encoded
> characters which exceed 7-bit widths yet this doesn’t seem to be
> firing any rules either.
> 
> &#x042C would seem to be at least an 11-bit wide character.

You are mixing-up different levels of encoding. The characters
&,#,x,0,4,2 and C are all 7-bit ASCII, and so are consistent with
Content-Transfer-Encoding: 7bit.

The previous mime section is more problematic since it appears to
contain 8-bit data.