Posted to users@spamassassin.apache.org by Marc Perkel <su...@junkemailfilter.com> on 2012/11/10 15:57:48 UTC

Regex Help

Need a rule to catch this:

HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm

Mixed case links


-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400


Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
Actually - I think that will do as is. I'm going to test it.

Thanks for your help.

On 11/10/2012 8:57 AM, John Hardin wrote:
> uri  URI_PROTO_MC  /^(?!(?-i:https?:))https?:/i 



Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
That should have been:

uri  URI_PROTO_MC  /^[Hh](?!(?-i:ttps?:))ttps?:/i
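
One way to check that this form behaves like the version that keeps the whole scheme inside the lookahead is to run every upper/lower-case spelling of http and https through both. What follows is a rough Python sketch, not SpamAssassin code. It assumes Python's re module handles the scoped (?-i:...) group the same way Perl does here (Python 3.6 or later), and the example.com host and the names V1, V2 and case_variants are invented for illustration.

```python
import itertools
import re

# V1: whole scheme inside the case-sensitive negative lookahead.
V1 = re.compile(r"^(?!(?-i:[Hh]ttps?:))https?:", re.IGNORECASE)
# V2: leading h/H matched first, lookahead covers only the remainder.
V2 = re.compile(r"^[Hh](?!(?-i:ttps?:))ttps?:", re.IGNORECASE)

def case_variants(word):
    """Yield every upper/lower-case spelling of an alphabetic word."""
    for combo in itertools.product(*((c.lower(), c.upper()) for c in word)):
        yield "".join(combo)

# All 16 spellings of "http" plus all 32 spellings of "https".
schemes = [v + "://example.com"
           for w in ("http", "https")
           for v in case_variants(w)]

# URIs on which the two patterns give different answers (expected: none).
disagreements = [s for s in schemes
                 if bool(V1.search(s)) != bool(V2.search(s))]
hits = [s for s in schemes if V1.search(s)]

print(len(schemes), len(hits), disagreements)
```

On this sample set the two patterns agree everywhere; the only spellings that escape are the all-lowercase ones and those with just the leading H capitalized.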



Re: Regex Help

Posted by John Hardin <jh...@impsec.org>.
On Tue, 13 Nov 2012, Marc Perkel wrote:

> So far it's working well. Caught 4620 spams since Sunday morning with these
> mixed-case rules.

Cool.

> I added this as a separate rule.
>
> /^(?!(?-i:[Hh]ttps?:\/\/www))https?:\/\/www/i
>
> Found some cases where the HTTP was lower case but the WWW was mixed.

I will add a rule for that to my sandbox. Thanks for letting me know.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Windows Genuine Advantage (WGA) means that now you use your
   computer at the sufferance of Microsoft Corporation. They can
   kill it remotely without your consent at any time for any reason;
   it also shuts down in sympathy when the servers at Microsoft crash.
-----------------------------------------------------------------------
  166 days since the first successful private support mission to ISS (SpaceX)

Re: Regex Help

Posted by John Hardin <jh...@impsec.org>.
On Tue, 13 Nov 2012, Alex wrote:

>> So far it's working well. Caught 4620 spams since Sunday morning with these
>> mixed-case rules.
>
> Can you really make scoring decisions based on a mixed-case URI? Do
> you have it as part of a meta with the other rules that John provided?
>
> I'm looking at John's sandbox entries, and wondering if there is a
> rule to be made from those URIs he's created, or are you just probing
> to see if they are tagged at this point?

At the moment the rules should just be exploratory. The masscheck shows 
_very_ weak S/O and disproportionate FPs; I am tuning them but at the 
moment they don't look too good. I'd put effort into figuring out likely 
metas, but they are barely hitting any spam in the masscheck corpora.

Marc, do you have any stats on how much of your _ham_ those rules are 
hitting?

And it would be helpful if you could contribute masscheck results since 
you seem to be seeing a lot more of these than any other contributors.


Re: Regex Help

Posted by Alex <my...@gmail.com>.
Hi,

>> This is what you want:
>>
>>   uri  URI_PROTO_MC  /^(?!(?-i:[Hh]ttps?:))https?:/i
>>
>> The string inside the parentheses is what you want to _not_ hit, and that
>> part is _not_ case-insensitive, even though the rest of the expression _is_
>> case-insensitive.
>>
>> Also, for the TLD rule: after a bit of thought I realized it would be very
>> unlikely a spammer would be doing this to a .gov URI, so I substituted .biz:
>>
>>   uri  __URI_TLD_MC  /\.(?!(?-i:com|net|org|biz|info))(?:com|net|org|biz|info)\b/i
...
>
> So far it's working well. Caught 4620 spams since Sunday morning with these
> mixed-case rules. I added this as a separate rule.
>
> /^(?!(?-i:[Hh]ttps?:\/\/www))https?:\/\/www/i
>
> Found some cases where the HTTP was lower case but the WWW was mixed.

Can you really make scoring decisions based on a mixed-case URI? Do
you have it as part of a meta with the other rules that John provided?

I'm looking at John's sandbox entries, and wondering if there is a
rule to be made from those URIs he's created, or are you just probing
to see if they are tagged at this point?

Thanks,
Alex

Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
On 11/10/2012 11:13 AM, John Hardin wrote:
> On Sat, 10 Nov 2012, Marc Perkel wrote:
>
>> Just a thought, I changed this:
>>
>> uri  URI_PROTO_MC  /^(?!(?-i:https?:))https?:/i
>>
>> into this:
>>
>> uri  URI_PROTO_MC  /^(?!(?-i:ttps?:))ttps?:/i
>>
>> Some people capitalize the H - but the rest of it being mixed case 
>> should be 100% accurate.
>
> That breaks it. Note the RE is anchored at the beginning of the URI.
>
> This is what you want:
>
>   uri  URI_PROTO_MC  /^(?!(?-i:[Hh]ttps?:))https?:/i
>
> The string inside the parentheses is what you want to _not_ hit, and 
> that part is _not_ case-insensitive, even though the rest of the 
> expression _is_ case-insensitive.
>
> Also, for the TLD rule: after a bit of thought I realized it would be 
> very unlikely a spammer would be doing this to a .gov URI, so I 
> substituted .biz:
>
>   uri  __URI_TLD_MC  /\.(?!(?-i:com|net|org|biz|info))(?:com|net|org|biz|info)\b/i

So far it's working well. Caught 4620 spams since Sunday morning with these 
mixed-case rules. I added this as a separate rule.

/^(?!(?-i:[Hh]ttps?:\/\/www))https?:\/\/www/i

Found some cases where the HTTP was lower case but the WWW was mixed.
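
To see which URIs the www rule above would and would not hit, here is a small Python sketch. It is only an approximation of the SpamAssassin rule: it assumes Python's scoped (?-i:...) group matches Perl's behaviour for this pattern, the slashes are left unescaped because Python has no /.../ delimiters, and WWW_MC, www_is_mixed_case and the example.com URIs are illustrative names, not anything from the rule set.

```python
import re

# Python port of the www rule; only "[Hh]ttp(s)://www" in lowercase is exempt.
WWW_MC = re.compile(r"^(?!(?-i:[Hh]ttps?://www))https?://www", re.IGNORECASE)

def www_is_mixed_case(uri):
    """True when scheme + "://www" contains unexpected capitals."""
    return WWW_MC.search(uri) is not None

samples = [
    "http://www.example.com",   # all lowercase: no hit
    "Http://www.example.com",   # capital H only: tolerated, no hit
    "http://WwW.example.com",   # lowercase scheme, mixed www: hit
    "hTTp://www.example.com",   # mixed scheme also hits (overlaps URI_PROTO_MC)
    "http://example.com",       # no www at all: no hit
]
for uri in samples:
    print(uri, www_is_mixed_case(uri))
```

Because the scheme is part of the pattern, this rule also fires on mixed-case schemes, so it overlaps with URI_PROTO_MC rather than covering only the www case.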





Re: Regex Help

Posted by John Hardin <jh...@impsec.org>.
On Sat, 10 Nov 2012, Marc Perkel wrote:

> Just a thought, I changed this:
>
> uri  URI_PROTO_MC  /^(?!(?-i:https?:))https?:/i
>
> into this:
>
> uri  URI_PROTO_MC  /^(?!(?-i:ttps?:))ttps?:/i
>
> Some people capitalize the H - but the rest of it being mixed case should be 
> 100% accurate.

That breaks it. Note the RE is anchored at the beginning of the URI.

This is what you want:

   uri  URI_PROTO_MC  /^(?!(?-i:[Hh]ttps?:))https?:/i

The string inside the parentheses is what you want to _not_ hit, and that 
part is _not_ case-insensitive, even though the rest of the expression 
_is_ case-insensitive.
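
The difference between the two versions can be tried out with a short Python sketch. This is an illustration under assumptions, not the rule as SpamAssassin runs it: it relies on Python's re module supporting the scoped (?-i:...) group (Python 3.6 or later), and BROKEN, FIXED, hits and the example.com URIs are names made up here.

```python
import re

# The edited rule: still anchored at ^, but a URI starts with "h", not "t",
# so "ttps?:" can never match at position 0.
BROKEN = re.compile(r"^(?!(?-i:ttps?:))ttps?:", re.IGNORECASE)

# The corrected rule: h or H is accepted, and only a lowercase remainder
# ("ttp:" or "ttps:") is exempt from the hit.
FIXED = re.compile(r"^(?!(?-i:[Hh]ttps?:))https?:", re.IGNORECASE)

def hits(pattern, uri):
    """True when the pattern matches at the start of the URI."""
    return pattern.search(uri) is not None

samples = ["http://example.com", "Http://example.com",
           "HtTp://example.com", "HTTP://example.com"]
for uri in samples:
    print(uri, hits(BROKEN, uri), hits(FIXED, uri))
```

In this port the broken pattern hits none of the samples, while the fixed one skips http and Http and hits HtTp and HTTP.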

Also, for the TLD rule: after a bit of thought I realized it would be very 
unlikely a spammer would be doing this to a .gov URI, so I substituted 
.biz:

   uri  __URI_TLD_MC  /\.(?!(?-i:com|net|org|biz|info))(?:com|net|org|biz|info)\b/i



Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
On 11/10/2012 10:51 AM, John Hardin wrote:
> On Sat, 10 Nov 2012, Marc Perkel wrote:
>
>> What would you have to do to show the URI in the description?
>
> ...it would have to be a plugin. There's no general-purpose model for 
> putting a capturing expression into a rule and having the captured 
> match appear in the description, and if there were, there would be no 
> easy way for something like that to propagate up through metas.
>
> Mixed-case URIs like that shouldn't be able to avoid URIBL lookups, so 
> if you are doing URIBL lookups and they hit then the domain name would 
> be shown in the description for that hit.
>

Yeah - I don't know what the spammers are trying to do but it sure makes 
it easy to catch them.

Just a thought, I changed this:

uri  URI_PROTO_MC  /^(?!(?-i:https?:))https?:/i


into this:

uri  URI_PROTO_MC  /^(?!(?-i:ttps?:))ttps?:/i

Some people capitalize the H - but the rest of it being mixed case 
should be 100% accurate.




Re: Regex Help

Posted by John Hardin <jh...@impsec.org>.
On Sat, 10 Nov 2012, Marc Perkel wrote:

> What would you have to do to show the URI in the description?

...it would have to be a plugin. There's no general-purpose model for 
putting a capturing expression into a rule and having the captured match 
appear in the description, and if there were, there would be no easy way for 
something like that to propagate up through metas.

Mixed-case URIs like that shouldn't be able to avoid URIBL lookups, so 
if you are doing URIBL lookups and they hit then the domain name would be 
shown in the description for that hit.
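
The reason case scrambling should not defeat a blocklist lookup is that host names compare case-insensitively, so the key that gets queried is the same however the URI was typed. A minimal Python sketch of that normalization follows. It is not SpamAssassin's implementation: lookup_key is a hypothetical helper, and a real URIBL query would reduce the host further to its registered domain, which this sketch does not attempt.

```python
from urllib.parse import urlsplit

def lookup_key(uri):
    """Return the lowercased host name of a URI (hypothetical helper)."""
    # urlsplit lowercases the scheme, and .hostname lowercases the host.
    return urlsplit(uri).hostname

scrambled = "HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm"
plain = scrambled.lower()

# Both spellings collapse to the same key.
print(lookup_key(scrambled))
print(lookup_key(scrambled) == lookup_key(plain))
```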


Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
I think your original solution is good enough. I'm testing it now. What 
would you have to do to show the URI in the description?




On 11/10/2012 10:36 AM, John Hardin wrote:
> On Sat, 10 Nov 2012, Marc Perkel wrote:
>> On 11/10/2012 8:57 AM, John Hardin wrote:
>>>
>>>  How much are you seeing these in real traffic?
>>
>> I'm seeing a lot of these. They are coming from stolen Yahoo accounts 
>> from back when Yahoo leaked their database. They appear to come from 
>> friends of mine.
>
> Oh, good (for certain values of "good"). I've added those rules to my 
> sandbox so maybe they will perform well enough to be published.
>
>> Can you refine it so that there has to be something like at least 4 
>> upper case characters in the URI to avoid false positives? For example:
>>
>> http://WellsFargo.com ok
>> HttP://WeLlSfaRgo.cOm  not OK
>
> Hrm. I'll have to think about that, that's fairly nontrivial.
>
> If you are seeing specific domain names a lot then more rules like 
> URI_GOOG_MC could be written to catch them. Do they seem to 
> concentrate on some limited list of domain names (or variants like 
> stuff containing "google"), or are they all over the place? Feel free 
> to contact me offlist with a list of domain names and examples if they 
> seem to be limited...
>



Re: Regex Help

Posted by John Hardin <jh...@impsec.org>.
On Sat, 10 Nov 2012, Marc Perkel wrote:
> On 11/10/2012 8:57 AM, John Hardin wrote:
>>
>>  How much are you seeing these in real traffic?
>
> I'm seeing a lot of these. They are coming from stolen Yahoo accounts from 
> back when Yahoo leaked their database. They appear to come from friends of 
> mine.

Oh, good (for certain values of "good"). I've added those rules to my 
sandbox so maybe they will perform well enough to be published.

> Can you refine it so that there has to be something like at least 4 upper 
> case characters in the URI to avoid false positives? For example:
>
> http://WellsFargo.com ok
> HttP://WeLlSfaRgo.cOm  not OK

Hrm. I'll have to think about that, that's fairly nontrivial.
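
Outside of a single regex, the "at least 4 upper case characters" idea is easy to prototype, which may help in judging whether the threshold separates the two examples. The Python sketch below is only that, a prototype. The threshold of 4 comes from the request above; counting capitals in the scheme and host only (ignoring the path) is an assumption made here, and uppercase_count and looks_scrambled are hypothetical names.

```python
import re

# Scheme plus authority, i.e. everything before the first "/", "?" or "#"
# that follows "://".
SCHEME_AND_HOST = re.compile(r"^[a-z][a-z0-9+.-]*://[^/?#]*", re.IGNORECASE)

def uppercase_count(uri):
    """Count ASCII capitals in the scheme and host part of a URI."""
    m = SCHEME_AND_HOST.match(uri)
    head = m.group(0) if m else uri  # fall back to the whole string
    return sum(1 for ch in head if "A" <= ch <= "Z")

def looks_scrambled(uri, threshold=4):
    """True when the scheme/host has at least `threshold` capitals."""
    return uppercase_count(uri) >= threshold

for uri in ("http://WellsFargo.com", "HttP://WeLlSfaRgo.cOm"):
    print(uri, uppercase_count(uri), looks_scrambled(uri))
```

A plain count cannot tell a scrambled host from an all-uppercase one such as HTTP://WWW.EXAMPLE.COM, so a threshold like this would still need testing against ham before it is used for scoring.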

If you are seeing specific domain names a lot then more rules like 
URI_GOOG_MC could be written to catch them. Do they seem to concentrate on 
some limited list of domain names (or variants like stuff containing 
"google"), or are they all over the place? Feel free to contact me offlist 
with a list of domain names and examples if they seem to be limited...


Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
On 11/10/2012 8:57 AM, John Hardin wrote:
> On Sat, 10 Nov 2012, Marc Perkel wrote:
>
>> Need a rule to catch this:
>>
>> HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm
>>
>> Mixed case links
>
> Mixed-case protocol:
>
>    uri  URI_PROTO_MC  /^(?!(?-i:https?:))https?:/i
>
> Note: this _will_ trigger on HTTP and HTTPS but I expect they are rare 
> in legitimate URIs.
>
> Mixed case TLD:
>
>    uri  URI_TLD_MC  /\.(?!(?-i:com|net|org|gov|info))(?:com|net|org|gov|info)\b/i
>
> Add TLDs as needed. Again, this _will_ trigger on totally UC TLDs. If 
> that's a problem just add the fully-uppercase TLD to the first TLD 
> list (the case-sensitive zero-width negative lookahead assertion).
>
> Common domain name parts or subparts:
>
>    uri  URI_GOOG_MC   /(?!(?-i:google))google/i
>
> HTH.
>
> How much are you seeing these in real traffic?
>


I'm seeing a lot of these. They are coming from stolen Yahoo accounts 
from back when Yahoo leaked their database. They appear to come from 
friends of mine.

Can you refine it so that there has to be something like at least 4 
upper case characters in the URI to avoid false positives? For example:

http://WellsFargo.com ok
HttP://WeLlSfaRgo.cOm  not OK




Re: Regex Help

Posted by John Hardin <jh...@impsec.org>.
On Sat, 10 Nov 2012, Marc Perkel wrote:

> Need a rule to catch this:
>
> HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm
>
> Mixed case links

Mixed-case protocol:

    uri  URI_PROTO_MC  /^(?!(?-i:https?:))https?:/i

Note: this _will_ trigger on HTTP and HTTPS but I expect they are rare in 
legitimate URIs.
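
The protocol rule can be exercised outside SpamAssassin with a few lines of Python. Treat this as a sketch under assumptions: Python's re module is assumed to handle the scoped (?-i:...) group as Perl does for this pattern (Python 3.6 or later), and PROTO_MC, proto_is_mixed_case and the example.com URIs are names chosen for illustration.

```python
import re

# Case-insensitive match on http:/https:, vetoed by a case-SENSITIVE
# negative lookahead when the scheme is written in plain lowercase.
PROTO_MC = re.compile(r"^(?!(?-i:https?:))https?:", re.IGNORECASE)

def proto_is_mixed_case(uri):
    """True when the scheme is http/https in anything but all-lowercase."""
    return PROTO_MC.search(uri) is not None

samples = [
    "http://example.com",                        # lowercase: no hit
    "HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm",   # the reported spam URI: hit
    "HTTPS://example.com",                       # all uppercase: also a hit
]
for uri in samples:
    print(uri, proto_is_mixed_case(uri))
```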

Mixed case TLD:

    uri  URI_TLD_MC    /\.(?!(?-i:com|net|org|gov|info))(?:com|net|org|gov|info)\b/i

Add TLDs as needed. Again, this _will_ trigger on totally UC TLDs. If 
that's a problem just add the fully-uppercase TLD to the first TLD list 
(the case-sensitive zero-width negative lookahead assertion).
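
The effect of adding the fully-uppercase TLDs to the first list can be compared side by side in Python. This is a sketch, not the published rule: it assumes Python's scoped (?-i:...) group behaves like Perl's for these patterns, and TLD_MC, TLD_MC_STRICT and the example.com URIs are illustrative names.

```python
import re

TLDS = "com|net|org|gov|info"

# As posted: any spelling of a listed TLD other than all-lowercase hits,
# including ".COM".
TLD_MC = re.compile(r"\.(?!(?-i:%s))(?:%s)\b" % (TLDS, TLDS), re.IGNORECASE)

# Variant: the all-uppercase spellings are also listed in the
# case-sensitive lookahead, so only genuinely mixed-case TLDs hit.
TLD_MC_STRICT = re.compile(
    r"\.(?!(?-i:%s|%s))(?:%s)\b" % (TLDS, TLDS.upper(), TLDS),
    re.IGNORECASE,
)

for uri in ("http://example.com", "http://EXAMPLE.COM", "http://example.cOm"):
    print(uri, bool(TLD_MC.search(uri)), bool(TLD_MC_STRICT.search(uri)))
```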

Common domain name parts or subparts:

    uri  URI_GOOG_MC   /(?!(?-i:google))google/i
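
The same construction works for any domain-name fragment, not just "google". A small Python sketch of a pattern builder is shown here as an illustration only: mixed_case_word is a hypothetical helper, and it assumes Python's scoped (?-i:...) group matches Perl's behaviour for this pattern.

```python
import re

def mixed_case_word(word):
    """Build a pattern that hits `word` in any spelling except all-lowercase."""
    w = re.escape(word.lower())
    # Case-insensitive match, vetoed when the text is exactly the lowercase word.
    return re.compile(r"(?!(?-i:%s))%s" % (w, w), re.IGNORECASE)

GOOG_MC = mixed_case_word("google")

samples = [
    "HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm",  # scrambled: hit
    "http://www.google.com/",                   # lowercase: no hit
    "http://Google.com",                        # capitalized: also a hit
]
for uri in samples:
    print(uri, bool(GOOG_MC.search(uri)))
```

Note that a simply capitalized "Google" hits as well, which is one plausible source of false positives for rules of this shape.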

HTH.

How much are you seeing these in real traffic?


Re: Regex Help

Posted by Marc Perkel <su...@junkemailfilter.com>.
I meant a rule to catch mixed case URIs in general. That was just one 
example.

On 11/10/2012 7:44 AM, darxus@chaosreigns.com wrote:
> On 11/10, Marc Perkel wrote:
>> Need a rule to catch this:
>>
>> HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm
> body GOOGLEMIXED /HtTp:\/\/goOGleplAcESSEOopTimiZaTIonx\.cOm/
>
> Untested, because I kind of expect that's not actually what you want.  If
> you want something to match things that look similar to this, you need to
> provide multiple examples.
>



Re: Regex Help

Posted by da...@chaosreigns.com.
On 11/10, Marc Perkel wrote:
> Need a rule to catch this:
> 
> HtTp://goOGleplAcESSEOopTimiZaTIonx.cOm

body GOOGLEMIXED /HtTp:\/\/goOGleplAcESSEOopTimiZaTIonx\.cOm/

Untested, because I kind of expect that's not actually what you want.  If
you want something to match things that look similar to this, you need to
provide multiple examples.

-- 
"it's not how good you are, it's how bad you want it" - no fear
http://www.ChaosReigns.com