
Posted to users@spamassassin.apache.org by da...@chaosreigns.com on 2011/10/14 23:21:19 UTC

SPOOFED_URL Re: antiphishing

On 10/14, darxus@chaosreigns.com wrote:
> rawbody  __SPOOFED_URL	m/<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:[^>"'\# ]{8,29}[^>"'\# :\/?&=])[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i

> I agree it seems like we should be able to improve it.  Maybe make
> exceptions for known marketing trackers, as Adam Katz mentioned it has
> problems with.  

I dug some of the hits out of my own corpora.  Of the 9 emails I looked at,
*all* the cases where it looked like this rule could have hit matched at the
host name level.  So I think there is definite room for improvement there -
just check for a matching host name, ignore all the extra gunk after it.
Although I'm not certain it doesn't already try to do that, maybe I should
take more time to try to read it.  Okay, it's starting to sink in, and
looks like it's trying to match the whole url.  

Several examples were cases where somebody with a gmail account replied to
an email of mine and gmail converted the url in my plain text signature
to html:

throats.&quot;<br>=A0- Henry Louis Mencken (1880-1956)<br><a href=3D"http:/=
/www.chaosreigns.com/" target=3D"_blank">http://www.ChaosReigns.com</a><br>

And I did get to see lots of gross html.  Particularly from yahoo groups.
So maybe it would help to do some more html parsing (un-escaping) before
this rule.  I don't know how much work that would take.
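
As a rough illustration of the un-escaping idea, here is a minimal Python
sketch that decodes the quoted-printable sample above and then compares
hosts.  The pattern name HOST_CHANGED and the loosened "<a\s[^>]*" prefix are
assumptions made for this example, not the rule from the thread, and the
sketch makes no claim about what decoding SpamAssassin has already done
before rawbody rules run.

```python
import quopri
import re

# The gmail sample from above, with its soft line break ("=\n") and "=3D".
raw = (b'<a href=3D"http:/=\n'
       b'/www.chaosreigns.com/" target=3D"_blank">http://www.ChaosReigns.com</a><br>')

# Undo the quoted-printable encoding: "=3D" becomes "=", "=\n" is dropped.
decoded = quopri.decodestring(raw).decode("ascii")

# Hypothetical host-only comparison; re.IGNORECASE also makes the \1
# backreference case-insensitive, so "ChaosReigns" equals "chaosreigns".
HOST_CHANGED = re.compile(
    r'<a\s[^>]*\bhref="(https?://[^/">]+)[^>]*>(?!\1)http', re.IGNORECASE)

print(decoded)
print("hit" if HOST_CHANGED.search(decoded) else "no hit")
```

On the decoded text the href host and the visible host are the same apart
from case, so a host-only comparison does not hit.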

But I didn't find any of the marketing trackers Adam mentioned.  

-- 
"Think, or I will set you on fire."
http://www.ChaosReigns.com

Re: SPOOFED_URL Re: antiphishing

Posted by da...@chaosreigns.com.

On 10/18, Matus UHLAR - fantomas wrote:
> Very nice, however due to these and other circumstances mentioned I
> think that a plugin would be better, since it could define where to

Thanks.  It didn't work out; the results were worse than the older rule:

http://ruleqa.spamassassin.org/?daterev=20111018-r1185533-n&rule=%2Fspoofed_url

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME   WHO/AGE
      0   1.6825   1.0301   0.620    0.55    0.01  T_SPOOFED_URL  
      0   1.2441   0.9989   0.555    0.53    0.01  T_SPOOFED_URL_HOST  
      0   2.1419   7.9151   0.213    0.42   (n/a)  __SPOOFED_URL  
      0   1.6915   7.7045   0.180    0.41   (n/a)  __SPOOFED_URL_HOST  
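
A quick arithmetic check on the table, as a Python sketch.  It assumes that
S/O is spam% / (spam% + ham%); that formula is inferred from the numbers
above, not stated in the thread.  The function name s_over_o is made up for
the example.

```python
# (SPAM%, HAM%, reported S/O) copied from the ruleqa table above.
rows = {
    "T_SPOOFED_URL":      (1.6825, 1.0301, 0.620),
    "T_SPOOFED_URL_HOST": (1.2441, 0.9989, 0.555),
    "__SPOOFED_URL":      (2.1419, 7.9151, 0.213),
    "__SPOOFED_URL_HOST": (1.6915, 7.7045, 0.180),
}

def s_over_o(spam_pct, ham_pct):
    """Assumed definition: share of all hits that landed on spam."""
    return spam_pct / (spam_pct + ham_pct)

for name, (spam, ham, reported) in rows.items():
    print(f"{name:20s} computed={s_over_o(spam, ham):.3f} reported={reported:.3f}")
```

Under that assumption the computed values agree with the reported column, and
the host-only variants come out lower than their whole-URL counterparts in
both pairs.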

And yes, a plugin might be a good way to use
Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain(), so that the
comparison is on the domain instead of the host.  But I doubt that's the
biggest problem.
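
To make the host versus domain distinction concrete, here is a small Python
sketch.  It is not the trim_domain() implementation; the function name
trim_domain_sketch and the two-entry suffix set are hypothetical and exist
only for illustration.  A real version would need a full list of registrar
boundaries.

```python
# Hypothetical, deliberately tiny list of two-level public suffixes.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au"}

def trim_domain_sketch(host):
    """Reduce a host name to its registrable domain (illustration only)."""
    labels = host.lower().rstrip(".").split(".")
    # Keep three labels when the last two form a known two-level suffix.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    # Otherwise keep the last two labels (or the single label, if only one).
    return ".".join(labels[-2:])

print(trim_domain_sketch("www.ChaosReigns.com"))
print(trim_domain_sketch("click.mail.example.co.uk"))
```

With something like this, "click.mail.example.co.uk" in the href and
"www.example.co.uk" in the anchor text would compare as the same domain even
though the host names differ.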

And I need to find out why my corpora aren't being included in the nightly
non-net ruleqa runs.

-- 
"Blades don't need reloading." - The Zombie Survival Guide by Max Brooks
http://www.ChaosReigns.com

Re: SPOOFED_URL Re: antiphishing

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 14.10.11 18:07, darxus@chaosreigns.com wrote:
>Existing rule:
>
>rawbody  __SPOOFED_URL	m/<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:[^>"'\# ]{8,29}[^>"'\# :\/?&=])[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i
>
>
>How about this, to only check for a changed domain part instead?
>
>rawbody SPOOFED_URL_DOMAIN /<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:\/\/?[^\/>"'\# ]{8,29})[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i
>
>It matches this:
>
>  <a href="http://www.chaosreigns.com/">http://www.example.com</a>
>
>But does not match this (example from actual non-spam):
>
>  <a href="http://www.jr.com/tracking?ord_q_num=105725494&ord_q_zip=03076">http://www.jr.com/tracking</a>
>
>
>A very simplified form of this new one:
>
>rawbody SPOOFED_URL_DOMAIN /<a href="(https?:\/\/[^\/">]+)[^>]*>(?!\1)http/i
>
>That "(?!\1)" bit is nice and fancy.  It means "not what was in the first
>set of parentheses".  In the perlre man page: "A zero-width negative
>look-ahead assertion."

Very nice, however due to these and other circumstances mentioned I 
think that a plugin would be better, since it could define where to 
skip host name (and up to which level) and e.g. it could define whitelists
- who can spoof who, e.g. which mail company may "spoof" which bank.
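
A minimal Python sketch of the "who can spoof who" idea, just to show the
shape of the check.  The table name MAY_LINK_FOR, the function is_suspicious,
and the example domains are all hypothetical; no such plugin or whitelist is
described in this thread beyond the idea itself.

```python
# Hypothetical whitelist: domain in the href -> domains it may display.
MAY_LINK_FOR = {
    "mailer.example": {"bank.example"},
}

def is_suspicious(href_domain, shown_domain):
    """True when the displayed domain differs and the pair is not whitelisted."""
    href_domain = href_domain.lower()
    shown_domain = shown_domain.lower()
    if href_domain == shown_domain:
        return False
    return shown_domain not in MAY_LINK_FOR.get(href_domain, set())

print(is_suspicious("mailer.example", "bank.example"))
print(is_suspicious("ilikespam.example", "bank.example"))
```

The inputs here are assumed to be already reduced to domains, which is the
part a regex alone has trouble with.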

However until then, this should still be worth trying.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
You have the right to remain silent. Anything you say will be misquoted,
then used against you.

Re: SPOOFED_URL Re: antiphishing

Posted by da...@chaosreigns.com.

Not relevant to the subject.  We're talking about where somebody is
maliciously making you think you're clicking on "www.youtube.com" when in
fact you're clicking on "www.ILikeSpam.com".

Somebody linking to one domain with an image hosted on another domain has
plenty of possibility to be legit.

You could do it.  You're welcome to try.  Maybe it'll even hit a usefully
larger percentage of spam than ham.  But it's not what we've been talking
about.

On 10/14, Christian Grunfeld wrote:
> you should be able to check against img src content, right?

-- 
"Every normal man must be tempted at times to spit upon his hands,
hoist the black flag, and begin slitting throats."
 - Henry Louis Mencken (1880-1956)
http://www.ChaosReigns.com

Re: SPOOFED_URL Re: antiphishing

Posted by Christian Grunfeld <ch...@gmail.com>.

you should be able to check against img src content, right?


2011/10/14 Christian Grunfeld <ch...@gmail.com>:
> and what about when there is no anchor text in the link ? eg. paypal
> image button

Re: SPOOFED_URL Re: antiphishing

Posted by da...@chaosreigns.com.

None of these rules will hit that.  That's what the second "http" is for.
"Hit the host name part of the href value of an anchor tag, then do *not*
match the same host name in the text of the anchor, then hit 'http'".
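
To illustrate with the simplified form of the rule, here is a Python sketch.
Perl's /i is translated to re.IGNORECASE, and the sample anchors and their
".example" host names are made up for the example.

```python
import re

# Simplified rule from this thread, written for Python's re module.
SIMPLIFIED = re.compile(
    r'<a href="(https?://[^/">]+)[^>]*>(?!\1)http', re.IGNORECASE)

# Anchor whose only content is an image: no visible "http" text follows ">".
image_button = ('<a href="http://www.ilikespam.example/">'
                '<img src="http://www.paypal.example/button.gif"></a>')

# Anchor whose visible text is a URL on a different host.
text_spoof = '<a href="http://www.ilikespam.example/">http://www.youtube.com</a>'

for sample in (image_button, text_spoof):
    print("hit" if SIMPLIFIED.search(sample) else "no hit")
```

The image-only anchor gives the pattern nothing to match after the opening
tag, so only the second sample hits.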

I should've called it SPOOFED_URL_HOST, because this one is matching the
full host name, not just the domain.  I don't even know if we can get the
TLD logic for domain matching into a regex without a modification to the
perl interpreter.

On 10/14, Christian Grunfeld wrote:
> and what about when there is no anchor text in the link ? eg. paypal
> image button

-- 
"I finally figured out the only reason to be alive is to enjoy it."
- Rita Mae Brown
http://www.ChaosReigns.com

Re: SPOOFED_URL Re: antiphishing

Posted by Christian Grunfeld <ch...@gmail.com>.

and what about when there is no anchor text in the link ? eg. paypal
image button



Re: SPOOFED_URL Re: antiphishing

Posted by da...@chaosreigns.com.

Existing rule:

rawbody  __SPOOFED_URL	m/<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:[^>"'\# ]{8,29}[^>"'\# :\/?&=])[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i


How about this, to only check for a changed domain part instead?

rawbody SPOOFED_URL_DOMAIN /<a\s[^>]{0,2048}\bhref=(?:3D)?.?(https?:\/\/?[^\/>"'\# ]{8,29})[^>]{0,2048}>(?:[^<]{0,1024}<(?!\/a)[^>]{1,1024}>){0,99}\s{0,10}(?!\1)https?[^\w<]{1,3}[^<]{5}/i

It matches this:

  <a href="http://www.chaosreigns.com/">http://www.example.com</a>

But does not match this (example from actual non-spam):

  <a href="http://www.jr.com/tracking?ord_q_num=105725494&ord_q_zip=03076">http://www.jr.com/tracking</a>


A very simplified form of this new one:

rawbody SPOOFED_URL_DOMAIN /<a href="(https?:\/\/[^\/">]+)[^>]*>(?!\1)http/i

That "(?!\1)" bit is nice and fancy.  It means "not what was in the first
set of parentheses".  In the perlre man page: "A zero-width negative
look-ahead assertion."
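
For anyone who wants to poke at the look-ahead outside of SpamAssassin, here
is a small Python sketch of the simplified form.  It assumes Python's re
module behaves like Perl for this pattern (it supports a backreference inside
a negative look-ahead); the helper name hits is made up for the example.

```python
import re

# Simplified rule from above; Perl's /i becomes re.IGNORECASE.
SPOOFED = re.compile(
    r'<a href="(https?://[^/">]+)[^>]*>(?!\1)http', re.IGNORECASE)

spoofed = '<a href="http://www.chaosreigns.com/">http://www.example.com</a>'
tracking = ('<a href="http://www.jr.com/tracking?ord_q_num=105725494&ord_q_zip=03076">'
            'http://www.jr.com/tracking</a>')

def hits(html):
    """True if the anchor text starts with "http" but not with the
    scheme and host captured from the href."""
    return SPOOFED.search(html) is not None

print(hits(spoofed))   # different host in the anchor text
print(hits(tracking))  # same host, only the path and query differ
```

The first sample hits and the second does not, because "(?!\1)" refuses to
continue when the anchor text begins with the captured "http://www.jr.com".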

-- 
"Every normal man must be tempted at times to spit upon his hands,
hoist the black flag, and begin slitting throats."
 - Henry Louis Mencken (1880-1956)
http://www.ChaosReigns.com