Posted to users@spamassassin.apache.org by ja...@mail-central.com on 2016/06/15 16:27:11 UTC

how to write body rules to match 'tortured html' variations of text phrases?

I've installed SA 3.4.1.

I'm writing body rules to deal with some persistent spam I'm getting.

Plain-text match rules are simple enough.

Much of the spam contains 'tortured html'.  I just want to get clear about how to correctly match it.

For example, here's a body snippet from one of those 'tortured' spams

-----
#hearthrugs-tablecloths-dishcovers-coalscuttles-a {
    pl=
ay-during: auto;
    page-break-before: auto
    }</style><title>Succes=
sful women join us and become even more successful...</title><meta content=
=3D"IE=3Dedge" http-equiv=3D"X-UA-Compatible"/><meta content=3D"width=3Ddev=
ice-width, initial-scale=3D1" name=3D"viewport"/>
-----

Notice that the phrase "Successful women" is (1) line-broken, and (2) contains a "=" separator

How would I write a body rule to match on

"
Succes=
sful women
"
and all the possible line-broken and "="-delimited variations?  There's obviously a lot of them.

Does SA *already* do some sort of fuzzy matching?

Jason

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by David B Funk <db...@engineering.uiowa.edu>.

On Thu, 16 Jun 2016, RW wrote:

> On Wed, 15 Jun 2016 13:40:25 -0700 (PDT)
> John Hardin wrote:
>
>> On Wed, 15 Jun 2016, jasonsu@mail-central.com wrote:
>
>>> and all the possible line-broken and "="-delimited variations?
>>> There's obviously a lot of them.
>>
>> That would have to be a rawbody rule
>
> AFAIK QP is decoded even in the rawbody.
>

That is correct, you need to use 'full' rules (which come before "rawbody") to
get the undecoded (really-raw) message.


-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by RW <rw...@googlemail.com>.

On Wed, 15 Jun 2016 13:40:25 -0700 (PDT)
John Hardin wrote:

> On Wed, 15 Jun 2016, jasonsu@mail-central.com wrote:

> > and all the possible line-broken and "="-delimited variations?
> > There's obviously a lot of them.  
> 
> That would have to be a rawbody rule

AFAIK QP is decoded even in the rawbody.

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by Groach <gr...@yahoo.com>.

On 15/06/2016 22:42, Dianne Skoll wrote:
> On Wed, 15 Jun 2016 13:40:25 -0700 (PDT)
> John Hardin <jh...@impsec.org> wrote:
>
>> That's (more or less) "Quoted Printable" encoding.
> AFAIK, SpamAssassin "body" rules are applied after the
> Content-Transfer-Encoding: has been decoded.  So the QP equal signs
> are a red herring.
>
> Regards,
>
> Dianne.

Yes, I thought that too.

I have written my own rules occasionally and, being a total novice, I just 
set about it by trial and error without understanding all this 
encoding stuff.  In doing so I found that 'line-wrapped' words 
(delimited with the equals sign) are decoded before the rule is applied, 
so the rule matches them anyway.

Here is a real example:

body     __MY_PHISH_CIRCUMVENT_ATTEMPT3 /((?!account)(\xD0\xB0|a)(\xD1\x81|c){2}(\xD0\xBE|o)u(\xD5\xB8|n)t|(?!customer)(\xE1\xB4\x84|c)u(\xD1\x95|S)t(\xD0\xBE|o)mer|(?!verif(y|i))ver(\xD1\x96|i)f((\xD1\x83|y)|(\xD1\x96|i)))/i

(Effectively, it looks for sneaky look-alike characters substituted for 
real letters in words such as "account", "customer" and 
"verify"/"verifi"; definitely phishing and dodgy if this exists.)

And this is REAL body text from an email:

-- SNIP ---------------------------------------
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

.
.
.
<TD style=3D"FONT-FAMILY: Helvetica, Arial, sans-serif; COLOR: rgb(102,102,1=
02); PADDING-BOTTOM: 15px; PADDING-TOP: 15px; PADDING-LEFT: 0px; PADDING-RIG=
HT: 0px" width=3D471 align=3Dleft><FONT size=3D2 face=3D"Arial,elv Hetica, s=
ans-serif"><STRONG>derek@mycompany.com</STRONG>&nbsp;- =D0=85=D0=B5=D1=81=
ur=D1=96t=D1=83 m=D0=B5=D0=B0=D1=95ur=D0=B5=D1=95 h=D0=B0=D1=95 b=D0=B5=D0=
=B5n =D0=B0=D1=80=D1=80=D3=8F=D1=96=D0=B5d t=D0=BE =D1=83=D0=BEur =D0=B0=D1=
=81=D1=81=D0=BEu=D5=B8t.</FONT></TD>
-----------------------------------------

I can tell you that the very last word/sequence of characters:

=D0=B0=D1=
=81=D1=81=D0=BEu=D5=B8t

gets caught despite being separated and line-wrapped with an equals sign 
(FYI it looks like "\u0430\u0441\u0441\u043eu\u0578t." - account).
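
For anyone who wants to check this outside SpamAssassin, here is a rough Python analogue of the observation above. It is only a sketch: it uses Python's standard quopri module to undo the quoted-printable encoding and a bytes regex that mirrors the "account" branch of the rule. The names ENCODED and LOOKALIKE_ACCOUNT are made up for the illustration, and Python's decoder is standing in for whatever SpamAssassin does internally.

```python
import quopri
import re

# The last word of the sample above, as it appears in the encoded message
# (including the "=" soft line break in the middle).
ENCODED = b"=D0=B0=D1=\n=81=D1=81=D0=BEu=D5=B8t"

# Bytes-regex translation of the "account" branch of the rule above.
# The negative lookahead skips the honest ASCII spelling.
LOOKALIKE_ACCOUNT = re.compile(
    rb"(?!account)(\xD0\xB0|a)(\xD1\x81|c){2}(\xD0\xBE|o)u(\xD5\xB8|n)t",
    re.IGNORECASE,
)

# Undo the quoted-printable encoding; the soft line break disappears.
decoded = quopri.decodestring(ENCODED)

print(decoded.decode("utf-8"))                          # the look-alike word
print(bool(LOOKALIKE_ACCOUNT.search(ENCODED)))          # False: still encoded
print(bool(LOOKALIKE_ACCOUNT.search(decoded)))          # True: decoded bytes match
print(bool(LOOKALIKE_ACCOUNT.search(b"your account")))  # False: plain ASCII is skipped
```

The pattern only matches once the encoding has been undone, which fits the behaviour described above: the rule is evidently applied to decoded text.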

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by Bill Cole <sa...@billmail.scconsult.com>.

On 15 Jun 2016, at 16:42, Dianne Skoll wrote:

> On Wed, 15 Jun 2016 13:40:25 -0700 (PDT)
> John Hardin <jh...@impsec.org> wrote:
>
>> That's (more or less) "Quoted Printable" encoding.
>
> AFAIK, SpamAssassin "body" rules are applied after the
> Content-Transfer-Encoding: has been decoded.  So the QP equal signs
> are a red herring.

Yes, and 'rawbody' rules are also CTE-decoded but you can get the REAL 
pristine non-decoded message (including headers) with 'full' rules.

Think about how versatile a double-edged straight razor could be...
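
To make the difference concrete, here is a small Python sketch using the standard email package. It is an approximation, not SpamAssassin code: the message in RAW is a hypothetical cut-down version of the snippet from the original post, full_view stands in for what a 'full' rule would be handed, and decoded_view stands in for the CTE-decoded text that 'rawbody' (and, after further rendering, 'body') rules work from.

```python
import email

# Hypothetical minimal message built around the snippet from the original post.
RAW = (
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/html; charset=\"utf-8\"\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"<title>Succes=\n"
    b"sful women join us</title><meta content=\n"
    b"=3D\"IE=3Dedge\" http-equiv=3D\"X-UA-Compatible\"/>\n"
)

msg = email.message_from_bytes(RAW)

# Roughly the 'full' view: pristine bytes, headers included, nothing decoded.
full_view = RAW

# Roughly the 'rawbody' view: transfer encoding undone, HTML still present.
decoded_view = msg.get_payload(decode=True)

print(b"Successful women" in full_view)      # False: the soft break splits the word
print(b"Successful women" in decoded_view)   # True: soft break removed by decoding
print(b'content="IE=edge"' in decoded_view)  # True: =3D has become a plain "="
```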

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by John Hardin <jh...@impsec.org>.

On Wed, 15 Jun 2016, Groach wrote:

> Here is a real example:
>
> body     __MY_PHISH_CIRCUMVENT_ATTEMPT3 
> /((?!account)(\xD0\xB0|a)(\xD1\x81|c){2}(\xD0\xBE|o)u(\xD5\xB8|n)t|(?!customer)(\xE1\xB4\x84|c)u(\xD1\x95|S)t(\xD0\xBE|o)mer|(?!verif(y|i))ver(\xD1\x96|i)f((\xD1\x83|y)|
> (\xD1\x96|i)))/i
>
> (effectively looking for sneaky encrypted characters to look-like real 
> letters to make words such as "account", "customer" and "verify"/"verifi") - 
> definitely phishing and dodgy if this exists).
>
> And this is REAL body text from an email:
>
> -- SNIP ---------------------------------------
> Content-Type: text/html; charset="utf-8"
> Content-Transfer-Encoding: quoted-printable
>
> .
> .
> .
> <TD style=3D"FONT-FAMILY: Helvetica, Arial, sans-serif; COLOR: rgb(102,102,1=
> 02); PADDING-BOTTOM: 15px; PADDING-TOP: 15px; PADDING-LEFT: 0px; PADDING-RIG=
> HT: 0px" width=3D471 align=3Dleft><FONT size=3D2 face=3D"Arial,elv Hetica, s=
> ans-serif"><STRONG>...</STRONG>&nbsp;- =D0=85=D0=B5=D1=81=
> ur=D1=96t=D1=83 m=D0=B5=D0=B0=D1=95ur=D0=B5=D1=95 h=D0=B0=D1=95 b=D0=B5=D0=
> =B5n =D0=B0=D1=80=D1=80=D3=8F=D1=96=D0=B5d t=D0=BE =D1=83=D0=BEur =D0=B0=D1=
> =81=D1=81=D0=BEu=D5=B8t.</FONT></TD>
> -----------------------------------------
>
>
> I can tell you that the very last word/sequence of characters:
>
> =D0=B0=D1=
> =81=D1=81=D0=BEu=D5=B8t
>
>
> get caught despite being separated and line-wrapped with an equals sign (FYI 
> they look like "\u0430\u0441\u0441\u043eu\u0578t." - account).

This is a hugely common obfuscation technique.

Take a look at 
https://svn.apache.org/viewvc/spamassassin/trunk/rules/25_replace.cf

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The social contract exists so that everyone doesn't have to squat
   in the dust holding a spear to protect his woman and his meat all
   day every day. It does not exist so that the government can take
   your spear, your meat, and your woman because it knows better what
   to do with them.                           -- Dagny @ Ace of Spades
-----------------------------------------------------------------------
  3 days until SWMBO's Birthday

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Wed, 15 Jun 2016 13:40:25 -0700 (PDT)
John Hardin <jh...@impsec.org> wrote:

> That's (more or less) "Quoted Printable" encoding.

AFAIK, SpamAssassin "body" rules are applied after the
Content-Transfer-Encoding: has been decoded.  So the QP equal signs
are a red herring.

Regards,

Dianne.

Re: how to write body rules to match 'tortured html' variations of text phrases?

Posted by John Hardin <jh...@impsec.org>.

On Wed, 15 Jun 2016, jasonsu@mail-central.com wrote:

> For example, here's a body snippet from one of those 'tortured' spams
>
> -----
> #hearthrugs-tablecloths-dishcovers-coalscuttles-a {
>    pl=
> ay-during: auto;
>    page-break-before: auto
>    }</style><title>Succes=
> sful women join us and become even more successful...</title><DEFANGED_meta content=
> =3D"IE=3Dedge" http-equiv=3D"X-UA-Compatible"/><DEFANGED_meta content=3D"width=3Ddev=
> ice-width, initial-scale=3D1" name=3D"viewport"/>
> -----
>
> Notice that the phrase "Successful women" is (1) line-broken, and (2) contains a "=" separator

That's (more or less) "Quoted Printable" encoding. I don't think that by 
itself will be at all useful as a spam sign unless you're looking for QP 
line breaks at something less than the QP spec line length, and ISTR 
there's already a rule for that.

> How would I write a body rule to match on
>
> "
> Succes=
> sful women
> "
> and all the possible line-broken and "="-delimited variations?  There's obviously a lot of them.

That would have to be a rawbody rule and would be hugely inefficient 
because (1) you can't predict some small set of words that will be broken 
that way and (2) you would have to cover all the possible break locations 
in all those words.

Strongly discouraged.
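
To illustrate the cost, here is a Python sketch of what a break-tolerant pattern for a single phrase would look like if it were matched against the undecoded text. The helper name soft_break_tolerant is made up for this example, and the sketch assumes the only variation is a quoted-printable soft line break ("=" followed by a newline) between characters; it ignores characters encoded as =XX.

```python
import re

# An optional quoted-printable soft line break: "=" then LF or CRLF.
SOFT_BREAK = r"(?:=\r?\n)?"


def soft_break_tolerant(phrase: str) -> str:
    """Return regex source matching `phrase` with an optional soft line
    break allowed between every pair of adjacent characters."""
    return SOFT_BREAK.join(re.escape(ch) for ch in phrase)


phrase = "Successful women"
pattern = soft_break_tolerant(phrase)
raw = "}</style><title>Succes=\nsful women join us"

# The pattern is many times longer than the phrase it stands for.
print(len(phrase), "characters of text ->", len(pattern), "characters of regex")
print(re.search(pattern, raw) is not None)  # True
```

Every phrase needs an optional group at each character boundary, and that has to be repeated for every phrase of interest, which is the inefficiency described above.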

> Does SA *already* do some sort of fuzzy matching?

No, for the reasons noted above.

There is something else in that sample that *may* be a somewhat useful 
spam sign, the style name:

> #hearthrugs-tablecloths-dishcovers-coalscuttles-a {

A long style name consisting of long dash-broken subwords *might* be 
unusual enough for a while to give a point.
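
A rough Python sketch of such a pattern follows. The thresholds (at least three subwords of six or more letters each) are assumptions picked to fit this one sample, not tested values, and LONG_DASHED_ID is a made-up name. In SpamAssassin this would presumably have to run against text that still contains the HTML.

```python
import re

# "#" + three or more long lowercase subwords, each followed by a dash,
# then a final word and the opening brace of the CSS block.
# The 6-letter and 3-subword thresholds are guesses, not tuned values.
LONG_DASHED_ID = re.compile(r"#(?:[a-z]{6,}-){3,}[a-z]+\s*\{")

spam_style = "#hearthrugs-tablecloths-dishcovers-coalscuttles-a {"
normal_style = "#main-nav {"

print(bool(LONG_DASHED_ID.search(spam_style)))    # True
print(bool(LONG_DASHED_ID.search(normal_style)))  # False
```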

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   USMC Rules of Gunfighting #9: Accuracy is relative: most combat
   shooting standards will be more dependent on "pucker factor" than
   the inherent accuracy of the gun.
-----------------------------------------------------------------------
  3 days until SWMBO's Birthday