Posted to users@spamassassin.apache.org by Greg Troxel <gd...@lexort.com> on 2022/02/08 01:27:52 UTC

CONTENT_AFTER_HTML: better not discuss formatting!!

(Instances of html have been changed to htnl in this message to
avoid tripping the rule I'm talking about.)

A legit message arrived at my server, for me and another user, and it
scored 8 for them and I think about 11 for me.  This is really unusual.
The big issues were:

  Sent by sendgrid: points from KAM and from URIBL_GREY both, each
  reasonable separately and I think URIBL_GREY newly lists sendgrid.

  From: was someone's (class teacher) gmail address, but it got sent out
  via sendgrid via a school, and there was no DKIM, so it lit up all
  sorts of FREEMAIL_FORGED, From:/env mismatch with freemail, ought to
  have DKIM from google and doesn't.

So I wrote to the person because they probably had no idea, and explained
the above and added some other "deliverability hygiene" :-) comments:

> with more minor issues:
>
>    The message is html only, rather than also having text/plain.
>
>    The message body doesn't have enclosing <htnl> </htnl> tags, so it is
>    malformed.

and then I got a reply back with the content he was trying to send etc.
But, it had:

	*  2.5 CONTENT_AFTER_HTML More content after HTML close tag

but one was only text/plain and I could see nothing wrong.   reading
72_active.cf I found:

  rawbody    __CONTENT_AFTER_HTML        /<\/htnl>\s*[a-z0-9]/i
which fires on a text/plain part that discusses html formatting!

So I'll be reducing that score...
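
A minimal sketch of why this misfires, written in Python rather than
SpamAssassin's own Perl. It uses the pattern exactly as quoted above,
deliberate "htnl" spelling included, and the sample strings and the
fires() helper are made up for illustration. The pattern only looks for a
close tag followed by a letter or digit, with no notion of which MIME
part the text came from:

```python
import re

# Pattern as quoted in this message (with the "htnl" spelling kept).
# The "/" needs no escaping in Python; the "i" flag becomes re.IGNORECASE.
CONTENT_AFTER = re.compile(r"</htnl>\s*[a-z0-9]", re.IGNORECASE)


def fires(body: str) -> bool:
    """Return True if the pattern matches anywhere in the given text."""
    return CONTENT_AFTER.search(body) is not None


# Plain-text prose that merely talks about the tags.
plain_text_reply = (
    "The message body doesn't have enclosing <htnl> </htnl> tags, so it is\n"
    "malformed.\n"
)

# A document that really does end at the close tag.
clean_ending = "<htnl><body>Hello</body></htnl>\n"

print(fires(plain_text_reply))  # the close tag is followed by " tags"
print(fires(clean_ending))      # nothing but a newline after the close tag
```

The first call matches because the prose continues after the close tag,
which is all the pattern checks for.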

Re: FROM header obfuscation

Posted by "Laurent S." <11...@protonmail.ch>.

On Thursday, February 10th, 2022 at 16:33, Kris Deugau <kd...@vianet.ca> wrote:

> (Please keep mail on-list)

Oops, replied too quick without checking this. Sorry.

> > Out of curiosity, I've tested it with a replace_tag rule (/<P><O><S><T>/) without luck. Shouldn't those UTF8 range be added to the ReplaceTags plugin?
> 

> Probably. However, the rules as above and the other similar ones I've
> set up locally are detecting the abstracted use of certain subsets of
> these variant characters seen in local FNs (often different variant sets
> for different cases, FN corpus depending), not variations of a
> particular character as used for ReplaceTags.
> To put it another way, I explicitly do not care about what these
> characters are spelling out, just the fact that they're present at all
> in certain places where I consider them to be inherently invalid. I
> also don't want to match the ASCII version - ReplaceTags substitutions
> usually include the base ASCII character, so your final rule has to have
> some exclusion component on its own, eg:
> /(?!Post)<P><O><S><T>/
> or
> /(?!P)<P>(?!o)<O>(?!s)<S>(?!t)<T>/
> etc.
> TBH for specific phishing cases like yours, I would tend to just
> copy-paste the spoofed From: name into a rule directly - text editor
> depending, this should work fine. Perl will happily match the literal
> pasted character or the hex sequence equally well unless your editor
> mangles the character.
> -kgd

I think both are valid. Your way of counting those special characters is great. But I also want to be able to block specific strings like the usual suspects (paypal, dhl, volksbank, post, ...) where a single unexpected char is enough. I've been using a meta for this, with the same idea that you just gave.

I guess a few people have created their own ReplaceTags, for instance for their own company name. Including the letters in \xd0[\xa0-\xbf] in ReplaceTags would be good, I think.

Re: FROM header obfuscation

Posted by Kris Deugau <kd...@vianet.ca>.

(Please keep mail on-list)

Laurent S. wrote:
> On Tuesday, February 8th, 2022 at 16:41, Kris Deugau <kd...@vianet.ca> wrote:
> 
>> I have a longish list of rule groups similar to below for different
>> extended UTF8 ASCII-lookalike characters and words. Some are derived
>> from rules discussed on this list within the past year or so.
>> header __SUSP_NAME_CHAR_01 From:name =~ /(?:\xd0[\xa0-\xbf])/
>> tflags __SUSP_NAME_CHAR_01 multiple maxhits 10
>> header __SUSP_NAME_CHAR_02 From:name =~ /(?:\xef\xbc[\x80-\xbf]|\xef\xbd[\x80-\xa0])/
>> tflags __SUSP_NAME_CHAR_02 multiple maxhits 10
>> meta __SUSP_NAME_CHAR __SUSP_NAME_CHAR_01 + __SUSP_NAME_CHAR_02
>> meta SUSP_NAME_CHAR_5 __SUSP_NAME_CHAR >= 5
>> describe SUSP_NAME_CHAR_5 5 or more lookalike characters in the From: name
>> score SUSP_NAME_CHAR_5 1.5
>> meta SUSP_NAME_CHAR_10 __SUSP_NAME_CHAR >= 10
>> describe SUSP_NAME_CHAR_10 10 or more lookalike characters in the From: name
>> score SUSP_NAME_CHAR_10 1.75
>> I've used this tool:
>> https://www.utf8-chartable.de/
>> with a bit of effort to take an example character and locate the full
>> a-z list of entries for these rules. (Convert individual characters to
>> hex, then flip pages until you've found the fakes. There are many groups.)
>> Single characters are trickier; depending on context I've added rules
>> for individual lookalike characters, or whole words with mixed variants
>> (and an exclusion for pure ASCII) as I see new runs of FNs.
>>
> 
>> -kgd
> 
> Out of curiosity, I've tested it with a replace_tag rule (/<P><O><S><T>/) without luck. Shouldn't those UTF8 range be added to the ReplaceTags plugin?

Probably.  However, the rules as above and the other similar ones I've 
set up locally are detecting the abstracted use of certain subsets of 
these variant characters seen in local FNs (often different variant sets 
for different cases, FN corpus depending), not variations of a 
particular character as used for ReplaceTags.

To put it another way, I explicitly do not care about *what* these 
characters are spelling out, just the fact that they're present at all 
in certain places where I consider them to be inherently invalid.  I 
also *don't* want to match the ASCII version - ReplaceTags substitutions 
usually include the base ASCII character, so your final rule has to have 
some exclusion component on its own, eg:

/(?!Post)<P><O><S><T>/

or

/(?!P)<P>(?!o)<O>(?!s)<S>(?!t)<T>/

etc.
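
The exclusion idea can be sketched outside SpamAssassin like this, in
Python. The character classes P, O, S and T are hypothetical stand-ins
for what a ReplaceTags-style substitution might expand to (the ASCII
letter plus one variant each); they are not the plugin's real
definitions and not a vetted lookalike list. The negative lookahead
rejects the pure-ASCII spelling, so only mixed or substituted spellings
match:

```python
import re

# Hypothetical expansions: ASCII letter plus one illustrative variant.
P = "[P\u0420]"  # Latin P, Cyrillic capital Er
O = "[o\u043e]"  # Latin o, Cyrillic small o
S = "[s\u0455]"  # Latin s, Cyrillic small dze
T = "[t\u0442]"  # Latin t, Cyrillic small te (stand-in variant)

# (?!Post) fails the match wherever the plain ASCII word starts,
# so only spellings with at least one variant character get through.
SPOOFED_POST = re.compile(f"(?!Post){P}{O}{S}{T}")

for name in ("Postnl.nl Pakket", "\u0420ostnl.nl \u0420akket", "P\u043estnl.nl"):
    print(repr(name), bool(SPOOFED_POST.search(name)))
```

Only the first, pure-ASCII name is rejected; the other two contain a
variant character and match.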

TBH for specific phishing cases like yours, I would tend to just 
copy-paste the spoofed From: name into a rule directly - text editor 
depending, this should work fine.  Perl will happily match the literal 
pasted character or the hex sequence equally well unless your editor 
mangles the character.

-kgd

Re: FROM header obfuscation

Posted by Kris Deugau <kd...@vianet.ca>.

Frido Otten wrote:
> Hi All,
> 
> Recently we're seeing more spam passing our spamfilters using text 
> obfuscating in the FROM header. The problem mainly targets users which 
> are using mail clients like iPhone Mail which are only displaying the 
> display name of the FROM header and not the actual email address which 
> was used, bypassing DKIM measures. For example:
> 
> From: =?UTF-8?B?0KBvc3RubC5ubCDQoGFra2V0?= <ar...@qbocel.com>
> 
> This is base64 encoded "Рostnl.nl Рakket" and pretends to come from 
> Postnl, a dutch snailmail company. However the hexadecimal 
> representation of this base64 decoded text differs from that of normal 
> ASCII:
> 
> Obfuscated:
> 
> $ printf "Рostnl.nl Рakket" | od -A n -t x1
>   d0 a0 6f 73 74 6e 6c 2e 6e 6c 20 d0 a0 61 6b 6b
>   65 74
> 
> Plain ASCII:
> 
> $ printf "Postnl.nl Pakket" | od -A n -t x1
>   50 6f 73 74 6e 6c 2e 6e 6c 20 50 61 6b 6b 65 74
> 
> There is no way to tell the difference with the naked eye.

That depends on the font.  Many variations do in fact look different, 
and from some of the FP-approaching "ham" I've seen that abuses this I 
can only conclude that some marketing....  person has decided that this 
is Necessary and Required and the tech folks can Go Suck It.

As far as I'm concerned, formatting outside of language accents on 
characters absolutely does NOT belong in either the From: name or 
Subject.  An "a" in the From: name or Subject absolutely MUST be 
presented as a US-ASCII "a", and not some extended UTF8 lookalike 
that's...   oooooo!  in *italics*!

Naturally the spammers go to various amounts of effort to avoid the ones 
that are clearly different.

> Is there any way to detect this type of obfuscation with a spamassassin 
> rule?

I have a longish list of rule groups similar to below for different 
extended UTF8 ASCII-lookalike characters and words.  Some are derived 
from rules discussed on this list within the past year or so.

header  __SUSP_NAME_CHAR_01     From:name =~ /(?:\xd0[\xa0-\xbf])/
tflags  __SUSP_NAME_CHAR_01     multiple maxhits 10
header  __SUSP_NAME_CHAR_02     From:name =~ /(?:\xef\xbc[\x80-\xbf]|\xef\xbd[\x80-\xa0])/
tflags  __SUSP_NAME_CHAR_02     multiple maxhits 10
meta    __SUSP_NAME_CHAR        __SUSP_NAME_CHAR_01 + __SUSP_NAME_CHAR_02
meta    SUSP_NAME_CHAR_5        __SUSP_NAME_CHAR >= 5
describe SUSP_NAME_CHAR_5       5 or more lookalike characters in the From: name
score   SUSP_NAME_CHAR_5        1.5
meta    SUSP_NAME_CHAR_10       __SUSP_NAME_CHAR >= 10
describe SUSP_NAME_CHAR_10      10 or more lookalike characters in the From: name
score   SUSP_NAME_CHAR_10       1.75
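
For anyone wanting to try the byte ranges without a SpamAssassin
install, here is a rough Python sketch of the counting logic in the
rules above. It assumes the decoded display name reaches the rule as
UTF-8 bytes, which depends on local configuration, and the function
names are made up for illustration. Each pattern's hit count is capped
at 10 (mirroring "maxhits 10"), the two counts are summed, and the sum
is compared against the two thresholds:

```python
import re

# Same byte ranges as the two header rules, as Python bytes patterns.
CHAR_01 = re.compile(rb"\xd0[\xa0-\xbf]")
CHAR_02 = re.compile(rb"\xef\xbc[\x80-\xbf]|\xef\xbd[\x80-\xa0]")
MAXHITS = 10  # mirrors "tflags ... multiple maxhits 10"


def lookalike_hits(from_name: str) -> int:
    """Sum of capped per-pattern hit counts over the UTF-8 bytes."""
    raw = from_name.encode("utf-8")
    hits_01 = min(len(CHAR_01.findall(raw)), MAXHITS)
    hits_02 = min(len(CHAR_02.findall(raw)), MAXHITS)
    return hits_01 + hits_02


def meta_flags(from_name: str) -> dict:
    """Evaluate the two threshold metas for one display name."""
    total = lookalike_hits(from_name)
    return {"SUSP_NAME_CHAR_5": total >= 5, "SUSP_NAME_CHAR_10": total >= 10}


cyrillic_p = "\u0420ostnl.nl \u0420akket"             # two Cyrillic capitals
fullwidth = "\uff30\uff41\uff59\uff30\uff41\uff4c"    # six fullwidth letters
print(lookalike_hits(cyrillic_p), meta_flags(cyrillic_p))
print(lookalike_hits(fullwidth), meta_flags(fullwidth))
```

Note that the example From: name from the original post only scores two
hits here, which is why single-character cases need separate handling.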

I've used this tool:

https://www.utf8-chartable.de/

with a bit of effort to take an example character and locate the full 
a-z list of entries for these rules.  (Convert individual characters to 
hex, then flip pages until you've found the fakes.  There are many groups.)

Single characters are trickier;  depending on context I've added rules 
for individual lookalike characters, or whole words with mixed variants 
(and an exclusion for pure ASCII) as I see new runs of FNs.

-kgd

FROM header obfuscation

Posted by Frido Otten <fr...@0tten.nl>.

Hi All,

Recently we're seeing more spam pass our spam filters by obfuscating text
in the From header. The problem mainly affects users of mail clients like
iPhone Mail, which show only the display name of the From header and not
the actual email address that was used, bypassing DKIM measures. For example:

From: =?UTF-8?B?0KBvc3RubC5ubCDQoGFra2V0?= <ar...@qbocel.com>

This is base64-encoded "Рostnl.nl Рakket" and pretends to come from
Postnl, a Dutch snail-mail company. However, the hexadecimal
representation of the decoded text differs from that of normal
ASCII:

Obfuscated:

$ printf "Рostnl.nl Рakket" | od -A n -t x1
  d0 a0 6f 73 74 6e 6c 2e 6e 6c 20 d0 a0 61 6b 6b
  65 74

Plain ASCII:

$ printf "Postnl.nl Pakket" | od -A n -t x1
  50 6f 73 74 6e 6c 2e 6e 6c 20 50 61 6b 6b 65 74

There is no way to tell the difference with the naked eye. You can 
obfuscate text using this online tool: https://obfuscator.uo1.net/
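
The same comparison can be made directly from the encoded header, here
as a small Python sketch using only the standard library. Only the
encoded display name from the example is decoded (the address part is
left out), and the helper names are made up for illustration. It prints
the raw bytes in hex and then names every non-ASCII character found:

```python
import unicodedata
from email.header import decode_header

# The encoded display name from the example From: header above.
ENCODED_NAME = "=?UTF-8?B?0KBvc3RubC5ubCDQoGFra2V0?="


def decode_name(encoded: str) -> bytes:
    """Return the raw bytes of the first RFC 2047 encoded word."""
    raw, _charset = decode_header(encoded)[0]
    return raw


def non_ascii_chars(raw: bytes) -> list:
    """List (character, Unicode name) for everything outside ASCII."""
    text = raw.decode("utf-8")
    return [(ch, unicodedata.name(ch)) for ch in text if ord(ch) > 0x7F]


raw = decode_name(ENCODED_NAME)
print(raw.hex(" "))          # same bytes as the od output above
print(non_ascii_chars(raw))  # the two capitals that are not Latin "P"
```

The character names make the substitution visible even when the glyphs
look identical on screen.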

Is there any way to detect this type of obfuscation with a spamassassin 
rule?

Best regards,
Frido Otten

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by John Hardin <jh...@impsec.org>.

On Tue, 8 Feb 2022, Loren Wilton wrote:

>>>  Are you talking about the use of m'' as the regex delimiter?
>>
>>  Yes.
>>
>>  It will probably work just fine for the foreseeable future, as long as the
>>  input validation of rules files is lenient.
>
> I think you may have a very hard time removing the m<char> matching 
> delimiters from SA. I suspect there are at least hundreds of rules like that 
> in the release database. I have about a hundred local rules of my own that 
> use that.

Indeed.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Journalism is about covering important stories.
   With a pillow, until they stop moving.               -- David Burge
-----------------------------------------------------------------------
  74 more days working to pay your (average) annual US tax bill
  before you're finally working for yourself.

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Loren Wilton <lw...@earthlink.net>.

>> Are you talking about the use of m'' as the regex delimiter?
>
> Yes.
>
> It will probably work just fine for the foreseeable future, as long as the 
> input validation of rules files is lenient.

I think you may have a very hard time removing the m<char> matching 
delimiters from SA. I suspect there are at least hundreds of rules like that 
in the release database. I have about a hundred local rules of my own that 
use that.

Any time I have more than one backslash in a pattern, I use an alternate 
delimiter (usually single quote) so that I don't have to escape all the 
backslashes in the rule body. I'm not a fan of obfuscated rule bodies where 
it is impossible to tell what it is intended to match. My experience is that 
any time you have to write \\\\ or \\\\\\ multiple times in a rule body, you 
are almost guaranteed to get the number of backslashes wrong, and the rule 
won't work. But of course it may work in some cases (like the one you used 
to test it) while not working in general.

I don't have time in my life to deal with that sort of thing. It caused me 
enough grief when I started writing rules 20 years ago, which is why I 
started using m'.

BTW, that particular rule dates from RulesEmporium days, which was what, 
2005 or so?

        Loren

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Bill Cole <sa...@billmail.scconsult.com>.

On 2022-02-08 at 13:14:06 UTC-0500 (Tue, 8 Feb 2022 13:14:06 -0500)
Kris Deugau <kd...@vianet.ca>
is rumored to have said:
[...]
> Are you talking about the use of m'' as the regex delimiter?

Yes.

It will probably work just fine for the foreseeable future, as long as the input validation of rules files is lenient.

It isn't beyond the realm of possibility that someday we'll tighten up syntax checking. We've had security issues in the past which involved the hypothetical potential to sneak in malicious code via rules. I don't expect that we'll have another one bad enough to make a rewrite of the config parser justified, but it could happen, and I don't think we'd design it today as it was done 20 years ago.


-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Kris Deugau <kd...@vianet.ca>.

Bill Cole wrote:
> On 2022-02-08 at 04:28:16 UTC-0500 (Tue, 8 Feb 2022 01:28:16 -0800)
> Loren Wilton <lw...@earthlink.net>
> is rumored to have said:
> 
>>> No, I added that after observing multiple spams with random garbage after the closing HTML tag in the HTML body part. Presumably it was an attempt at Bayes poison, checksum avoidance, or some other filter evasion technique.
>>>
>>> I'll tighten it up.
>>
>> FWIW, here is the rule I use. It obviously could be better, but I haven't noticed that it misfires.
>>
>> full __GOODEHTML1 m'</html>'i
>>
>> full __GOODEHTML2 m'</html>(?:\s|=0A){0,50}(?:$|--|=)'is # stop on mime ending boundary
> 
> TANGENTIAL:
> 
> I would advise against using such alternative regex syntax in rules. As you obviously figured out, you CAN (for now...) use any valid Perl syntax for writing a regex match, but I do not believe that we want to bless that as something which will never break.

Maybe it's just inexperience with deep regex voodoo, but I'm not seeing 
anything odd in those.

Are you talking about the use of m'' as the regex delimiter?

-kgd

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Bill Cole <sa...@billmail.scconsult.com>.

On 2022-02-08 at 04:28:16 UTC-0500 (Tue, 8 Feb 2022 01:28:16 -0800)
Loren Wilton <lw...@earthlink.net>
is rumored to have said:

>> No, I added that after observing multiple spams with random garbage after the closing HTML tag in the HTML body part. Presumably it was an attempt at Bayes poison, checksum avoidance, or some other filter evasion technique.
>>
>> I'll tighten it up.
>
> FWIW, here is the rule I use. It obviously could be better, but I haven't noticed that it misfires.
>
> full __GOODEHTML1 m'</html>'i
>
> full __GOODEHTML2 m'</html>(?:\s|=0A){0,50}(?:$|--|=)'is # stop on mime ending boundary

TANGENTIAL:

I would advise against using such alternative regex syntax in rules. As you obviously figured out, you CAN (for now...) use any valid Perl syntax for writing a regex match, but I do not believe that we want to bless that as something which will never break.



Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Loren Wilton <lw...@earthlink.net>.

> No, I added that after observing multiple spams with random garbage after 
> the closing HTML tag in the HTML body part. Presumably it was an attempt 
> at Bayes poison, checksum avoidance, or some other filter evasion 
> technique.
>
> I'll tighten it up.

FWIW, here is the rule I use. It obviously could be better, but I haven't 
noticed that it misfires.

full __GOODEHTML1 m'</html>'i

full __GOODEHTML2 m'</html>(?:\s|=0A){0,50}(?:$|--|=)'is # stop on mime ending boundary

meta LW_BADEHTML1 (__GOODEHTML1 && !__GOODEHTML2)

describe LW_BADEHTML1 Bad ending - something after </HTML>

score LW_BADEHTML1 1
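
A rough Python rendering of the two-pattern logic above, for
experimenting outside SpamAssassin. The regexes are carried over as
written; treating the whole raw message as one string is an assumption
about how a "full" rule sees the mail, and the sample messages are made
up. The meta is true only when a close tag exists and no close tag is
followed solely by whitespace (or =0A) and then end of text, "--", or
"=":

```python
import re

# Direct translations of the two "full" rules (flags "i" and "is").
GOOD_EHTML1 = re.compile(r"</html>", re.IGNORECASE)
GOOD_EHTML2 = re.compile(
    r"</html>(?:\s|=0A){0,50}(?:$|--|=)", re.IGNORECASE | re.DOTALL
)


def bad_ending(raw_message: str) -> bool:
    """Mirror of the meta: __GOODEHTML1 && !__GOODEHTML2."""
    has_close = GOOD_EHTML1.search(raw_message) is not None
    clean_close = GOOD_EHTML2.search(raw_message) is not None
    return has_close and not clean_close


# Close tag followed by blank lines and a MIME boundary: acceptable.
clean = "<html><body>hi</body></html>\n\n--boundary--\n"
# Close tag followed by filler text: flagged.
trailing = "<html><body>hi</body></html>\nrandom filler words\n"
# No close tag at all: the meta stays false.
no_html = "just a plain note\n"

for sample in (clean, trailing, no_html):
    print(bad_ending(sample))
```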

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by John Hardin <jh...@impsec.org>.

On Mon, 7 Feb 2022, Loren Wilton wrote:

>>  But, it had:
>>
>>   *  2.5 CONTENT_AFTER_HTML More content after HTML close tag
>>
>>  but one was only text/plain and I could see nothing wrong.   reading
>>  72_active.cf I found:
>>
>>    rawbody    __CONTENT_AFTER_HTML        /<\/htnl>\s*[a-z0-9]/i
>>  which fires on a text/plain part that discusses html formatting!
>
> Note you show __CONTENT_AFTER_HTML and CONTENT_AFTER_HTML, which are not the 
> same rule. I suspect the meta for CONTENT_AFTER_HTML  contains some other 
> things that should in theory make it not hit in this case.
>
> I've personally never seen this rule hit, and didn't know it existed. Are you 
> sure it isn't a local rule? I have a rule of my own that gives 1 point for 
> extra trash after the /html end tag. I see it frequently on spam and UCE that 
> has a tracking tag in the HTML section after the official end of the html.

No, I added that after observing multiple spams with random garbage after 
the closing HTML tag in the HTML body part. Presumably it was an attempt 
at Bayes poison, checksum avoidance, or some other filter evasion 
technique.

I'll tighten it up.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   You do not examine legislation in the light of the benefits it
   will convey if properly administered, but in the light of the
   wrongs it would do and the harms it would cause if improperly
   administered.                                  -- Lyndon B. Johnson
-----------------------------------------------------------------------
  5 days until Abraham Lincoln's and Charles Darwin's 213th Birthdays

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Loren Wilton <lw...@earthlink.net>.

> But, it had:
>
>  *  2.5 CONTENT_AFTER_HTML More content after HTML close tag
>
> but one was only text/plain and I could see nothing wrong.   reading
> 72_active.cf I found:
>
>   rawbody    __CONTENT_AFTER_HTML        /<\/htnl>\s*[a-z0-9]/i
> which fires on a text/plain part that discusses html formatting!

Note you show __CONTENT_AFTER_HTML and CONTENT_AFTER_HTML, which are not the 
same rule. I suspect the meta for CONTENT_AFTER_HTML  contains some other 
things that should in theory make it not hit in this case.

I've personally never seen this rule hit, and didn't know it existed. Are 
you sure it isn't a local rule? I have a rule of my own that gives 1 point 
for extra trash after the /html end tag. I see it frequently on spam and UCE 
that has a tracking tag in the HTML section after the official end of the 
html.

        Loren

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by Greg Troxel <gd...@lexort.com>.

John Hardin <jh...@impsec.org> writes:

> On Mon, 7 Feb 2022, Greg Troxel wrote:
>
>> and then I got a reply back with the content he was trying to send etc.
>> But, it had:
>>
>> 	*  2.5 CONTENT_AFTER_HTML More content after HTML close tag
>>
>> but one was only text/plain and I could see nothing wrong.   reading
>> 72_active.cf I found:
>>
>>  rawbody    __CONTENT_AFTER_HTML        /<\/htnl>\s*[a-z0-9]/i
>> which fires on a text/plain part that discusses html formatting!
>
> Ah, I'll see if I can add something to that so it only fires when
> there's an actual HTML body part. Thanks for the report.
>
> Pity there's not an "htmlbody" rule type...

Agreed - I think the way you are trying to tighten is correct.

Re: CONTENT_AFTER_HTML: better not discuss formatting!!

Posted by John Hardin <jh...@impsec.org>.

On Mon, 7 Feb 2022, Greg Troxel wrote:

> and then I got a reply back with the content he was trying to send etc.
> But, it had:
>
> 	*  2.5 CONTENT_AFTER_HTML More content after HTML close tag
>
> but one was only text/plain and I could see nothing wrong.   reading
> 72_active.cf I found:
>
>  rawbody    __CONTENT_AFTER_HTML        /<\/htnl>\s*[a-z0-9]/i
> which fires on a text/plain part that discusses html formatting!

Ah, I'll see if I can add something to that so it only fires when there's 
an actual HTML body part. Thanks for the report.

Pity there's not an "htmlbody" rule type...
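
Outside SpamAssassin, the "only when there's an actual HTML body part"
idea looks roughly like the Python sketch below. This is not how the
rule is or will be implemented; it just walks the MIME parts with the
standard email package and applies the pattern to text/html parts only.
The function name and both sample messages are made up for illustration:

```python
import re
from email import message_from_string

CONTENT_AFTER_CLOSE = re.compile(r"</html>\s*[a-z0-9]", re.IGNORECASE)


def content_after_html(raw_message: str) -> bool:
    """True if any text/html part has text after its closing tag."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue  # skip text/plain and everything else
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        if CONTENT_AFTER_CLOSE.search(text):
            return True
    return False


# Plain-text mail that only discusses the tags.
PLAIN = (
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "The body lacks enclosing <html> </html> tags, so it is malformed.\n"
)

# HTML mail with filler after the close tag.
HTML = (
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body>Hello</body></html>\nxk3 qzv filler\n"
)

print(content_after_html(PLAIN))
print(content_after_html(HTML))
```

The plain-text message contains text the bare pattern would match, but
it is never examined because its part is not text/html.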


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   USMC Rules of Gunfighting #2: Anything worth shooting
   is worth shooting twice. Ammo is cheap. Your life is expensive.
-----------------------------------------------------------------------
  5 days until Abraham Lincoln's and Charles Darwin's 213th Birthdays