Posted to users@spamassassin.apache.org by Skip <sk...@pelorus.org> on 2008/08/31 13:32:47 UTC

Another "this should have triggered more rules" post

Got this one today.  Never seen anything like this before.
http://pelorus.org/mix

(I couldn't even paste into pastebin--their spam catcher caught it)  
This one only scored a 2.9 on my installation, as you can see.  I do 
have some custom rules (Sought and SARE) but no hits there.

Skip

-- 
Get my PGP Public key here:
http://pelorus.org/skip@pelorus.org_public_key.asc


Re: Another "this should have triggered more rules" post

Posted by John Hardin <jh...@impsec.org>.
On Sun, 2008-08-31 at 14:33 -0400, Skip wrote:

> >> describe TO_HARVESTED To: obviously harvested
> >> header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
> >> some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
> >> \.machine)\b/
>
> Can you tell me how this rule works?  Or give a more realistic example 
> (in my case I would use pelorus.org, so feel free to demonstrate with that)

It checks for any of the following domains in the To: list of addresses:

@example.domain
@your.domain
@some.domain
@example.com
@your.domain.com
@your.favorite.machine (or @your.favourite.machine)

It's essentially a set of nested OR'd substring comparisons. An
equivalent flattened RE would be:

/\@(?:example\.domain|your\.domain|some\.domain|example\.com|your\.domain\.com|your\.favou?rite\.machine)\b/

The rule as posted is the one you'd actually use. You wouldn't need to change it
based on your own domain, as all of those domains are bogus. They either
refer to nonexistent domains commonly used in examples, or real domains
(e.g. example.com) explicitly registered only for use in examples. If
you see one of those domains in a recipient list, it's a pretty clear
indication of automatic address harvesting and sloppy list cleaning.
That's the spam sign this rule is checking for.
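
A quick way to see what the rule does and does not match is to run the
same pattern over a few To: values. This is only a sketch in Python (not
how SpamAssassin itself evaluates rules); the function name and the
sample values are made up for illustration, and the pattern is the
TO_HARVESTED regex joined onto one line.

```python
import re

# The TO_HARVESTED pattern from the rule, joined onto one line.
# Python does not need the backslash before "@" that the Perl rule uses.
TO_HARVESTED = re.compile(
    r"@(?:(?:(?:example|your|some)\.domain)"
    r"|(?:(?:example|your\.domain)\.com)"
    r"|your\.favou?rite\.machine)\b"
)

def is_harvested(to_value):
    """True if the To: value contains one of the placeholder domains."""
    return TO_HARVESTED.search(to_value) is not None

# Made-up To: values for illustration.
for value in (
    "alan@example.domain, bob@pelorus.org",
    "someone@your.favourite.machine",
    "skip@pelorus.org",
    "joe@example.computer",
):
    print(f"{value!r}: {is_harvested(value)}")
```

The last value is there to show the effect of the trailing \b: the
pattern needs a word boundary after ".com", so a longer label such as
".computer" is not treated as a hit.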


-- 
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 Obama is a three-year senator without a single important
 legislative achievement to his name, a former Illinois state
 senator who voted "present" nearly 130 times. As president of the
 Harvard Law Review, as law professor and as legislator, has he ever
 produced a single notable piece of scholarship? Written a single
 memorable article? His most memorable work is a biography of his
 favorite subject: himself.                    -- Charles Krauthammer
-----------------------------------------------------------------------
 65 days until the Presidential Election


Re: Another "this should have triggered more rules" post

Posted by Sahil Tandon <sa...@tandon.net>.
mouss <mo...@netoyen.net> wrote:
 
> You can check the DKIM signature if you have an unaltered copy of the 
> message. but whether it's good or not, the IP belongs to google.

Yep.  Google has become a consistent source of spam, much like (though to 
a lesser degree) Yahoo.  Their postmasters and other administrative 
contacts have not been responsive.

-- 
Sahil Tandon <sa...@tandon.net>

Re: Another "this should have triggered more rules" post

Posted by mouss <mo...@netoyen.net>.
Skip wrote:
> 
>> How about these rules? (watch the line wrap)
>>>
>>> describe TO_HARVESTED To: obviously harvested
>>> header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
>>> some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
>>> \.machine)\b/
>>>
>>>
>>
> Can you tell me how this rule works? 

it catches mail with a To header containing invalid email addresses that 
were obviously harvested, such as "foo@your.domain" (literally, do not 
replace with your own domain name) or "bar@your.favourite.machine". 
These addresses are invalid because there is no "domain" or "machine" TLD.

> Or give a more realistic example 

it is realistic. copy-paste without edit.

> (in my case I would use pelorus.org, 

No. use the rule literally.

> so feel free to demonstrate with that)


> 
>>
>> How can google let this go out?
>>
>>
> I was wondering that too.  Did it really come from gmail?

if it doesn't, you have a serious problem. your Received header says it 
comes from 72.14.204.173, and
$ host 72.14.204.173
173.204.14.72.in-addr.arpa domain name pointer qb-out-1314.google.com.
$ host qb-out-1314.google.com
...
qb-out-1314.google.com has address 72.14.204.173
...

$ whois 72.14.204.173

OrgName:    Google Inc.
...


so the IP "belongs" to google.

You can check the DKIM signature if you have an unaltered copy of the 
message. but whether it's good or not, the IP belongs to google.
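
The two "host" lookups above amount to a forward-confirmed reverse DNS
check: take the PTR name for the IP, then confirm that name resolves
back to the same IP. A minimal sketch of that check in Python, assuming
the standard socket resolver calls; the function name and the two
offline stand-in lookups are hypothetical and only replay the answers
shown above. It says nothing about who owns the IP (that is what the
whois step is for).

```python
import socket

def forward_confirmed_name(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the PTR name for ip if that name resolves back to ip, else None."""
    try:
        name = reverse(ip)[0]          # like: host 72.14.204.173
        addresses = forward(name)[2]   # like: host <name>
    except OSError:                    # covers socket.herror and socket.gaierror
        return None
    return name if ip in addresses else None

# Offline stand-ins that replay the answers quoted in this message, so the
# example runs without network access. Omit them to query real DNS.
def fake_reverse(ip):
    return ("qb-out-1314.google.com", [], [ip])

def fake_forward(name):
    return (name, [], ["72.14.204.173"])

print(forward_confirmed_name("72.14.204.173", fake_reverse, fake_forward))
```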



Re: Another "this should have triggered more rules" post

Posted by Skip <sk...@pelorus.org>.
> How about these rules? (watch the line wrap)
>>
>> describe TO_HARVESTED To: obviously harvested
>> header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
>> some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
>> \.machine)\b/
>>
>>
>
Can you tell me how this rule works?  Or give a more realistic example 
(in my case I would use pelorus.org, so feel free to demonstrate with that)

>
> How can google let this go out?
>
>
I was wondering that too.  Did it really come from gmail?

Skip



Re: Another "this should have triggered more rules" post

Posted by mouss <mo...@netoyen.net>.
Skip wrote:
> 
>> can you be more explicit. you got FPs with how many ','? did you have 
>> an FP with 100?
>>
> [snip] ... Funny thing 
> is, when I ran the script against my spam folder, it had exactly ONE 
> hit--just this email in question.  I have never seen a spam like that 
> before.
> 

I only saw very few. which is why I believe the rule isn't a good spam 
detector. it detects "bad practices" (using an addr book list instead of 
a "real" mailing list).

>>> Just thinking aloud here: wouldn't it be a good idea to also check the 
>>> CC headers for the same conditions?
>>>
> When I asked this question, my intention was to stimulate discussion as 
> to the worth of adding rules to my SA setup to also check the CC 
> header.  This thread has been focused on the To: header, but I think I 
> will also include the CC rules.  Thanks for the updated code though.
> 

yes, in general, you check both. do that by using ToCc instead of To (in 
SA rules I mean).

> 
> 
> describe TO_HARVESTED To: obviously harvested
> header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
> some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
> \.machine)\b/

which becomes:

describe TO_HARVESTED To or Cc: obviously harvested
header   TO_HARVESTED ToCc =~ /\@(?:(?:(?:example|your|some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite\.machine)\b/

> 
> The more I think about it, the "HARVESTED" rule really seems quite safe, 
> and I think it could be made more robust.  Anyone sending mail to you 
> along with obvious made up email addresses like that is certainly up to 
> no good.
> 

I don't think it will catch a lot of spam. so it's not worth the pain IMHO.



Re: Another "this should have triggered more rules" post

Posted by Skip <sk...@pelorus.org>.

Skip wrote:
>
>> can you be more explicit. you got FPs with how many ','? did you have 
>> an FP with 100?
>>
> Sure.  When I ran it against my inbox, with 4587 "good" emails, I had 
> 130 hits on MATCH20 and 2 hits on MATCH50, or 2.877% (0 with 
> MATCH100).  The interesting thing is, if you think about it, people 
> who routinely send emails to lots of people (jokes, family updates, 
> whatever--you know who I mean), well, I think they will be on most 
> people's whitelists in the first place.  A complete stranger, or even 
> someone who you do know, probably isn't going to send you an email 
> along with 49 of his/her closest friends as his first email to you.  
> Although, it is not beyond the realm of possibility.  For instance, I 
> am starting a new job tomorrow (true--I just retired from the military 
> after 20 years of service).  Let's say there's a person who sends out 
> a certain report and it goes to 100+ people.  Normally, I will get 
> this at my work address.  Now, a few weeks from now, I need him to 
> send it to my home address, just that once.  Now, he has never sent me 
> anything and this comes in.  Bang.  So there is definitely risk.  I 
> would assign it a relatively low score, probably no more than 1/3 of 
> your spam threshold.  Funny thing is, when I ran the script against my 
> spam folder, it had exactly ONE hit--just this email in question.  I 
> have never seen a spam like that before.
>
I just realized I forgot to add the data for CC headers:
I had a total of 5 hits on the MATCH20 out of 4587 good emails for 
0.109% and that's it--no other hits.  The above data (2.877%) was for 
the To: header only.



Re: Another "this should have triggered more rules" post

Posted by Skip <sk...@pelorus.org>.
> can you be more explicit. you got FPs with how many ','? did you have 
> an FP with 100?
>
Sure.  When I ran it against my inbox, with 4587 "good" emails, I had 
130 hits on MATCH20 and 2 hits on MATCH50, or 2.877% (0 with MATCH100).  
The interesting thing is, if you think about it, people who routinely 
send emails to lots of people (jokes, family updates, whatever--you know 
who I mean), well, I think they will be on most people's whitelists in 
the first place.  A complete stranger, or even someone who you do know, 
probably isn't going to send you an email along with 49 of his/her 
closest friends as his first email to you.  Although, it is not beyond 
the realm of possibility.  For instance, I am starting a new job 
tomorrow (true--I just retired from the military after 20 years of 
service).  Let's say there's a person who sends out a certain report and 
it goes to 100+ people.  Normally, I will get this at my work address.  
Now, a few weeks from now, I need him to send it to my home address, 
just that once.  Now, he has never sent me anything and this comes in.  
Bang.  So there is definitely risk.  I would assign it a relatively low 
score, probably no more than 1/3 of your spam threshold.  Funny thing 
is, when I ran the script against my spam folder, it had exactly ONE 
hit--just this email in question.  I have never seen a spam like that 
before.

>> Just thinking aloud here: wouldn't it be a good idea to also check the 
>> CC headers for the same conditions?
>>
When I asked this question, my intention was to stimulate discussion as 
to the worth of adding rules to my SA setup to also check the CC 
header.  This thread has been focused on the To: header, but I think I 
will also include the CC rules.  Thanks for the updated code though.



describe TO_HARVESTED To: obviously harvested
header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
\.machine)\b/

The more I think about it, the "HARVESTED" rule really seems quite safe, 
and I think it could be made more robust.  Anyone sending mail to you 
along with obvious made up email addresses like that is certainly up to 
no good.



Re: Another "this should have triggered more rules" post

Posted by mouss <mo...@netoyen.net>.
Skip wrote:
> 
>>
>> perl script.pl *
>>
> That did it!  Thanks!  I would definitely have had some FPs now that I 
> have checked.
> 

can you be more explicit. you got FPs with how many ','? did you have an 
FP with 100?

> Just thinking aloud here: wouldn't it be a good idea to also check the CC 
> headers for the same conditions?
> 

yes. just replace To with ToCc.

and in the script, replace the
foreach my $_file (@files) {
     my %header = read_headers($_file) ;

     if ($header{"to"} =~ /(?:,[^,]{1,80}){100}/) {
         print "$_file: MATCH100\n";
     } elsif ($header{"to"} =~ /(?:,[^,]{1,80}){50}/) {
         print "$_file: MATCH50\n";
     } elsif ($header{"to"} =~ /(?:,[^,]{1,80}){20}/) {
         print "$_file: MATCH20\n";
     }


}

with


my %stats = ();

foreach my $_file (@files) {
     my %header = read_headers($_file) ;

     my $tocc = $header{"to"} . ", " . $header{"cc"};

     $tocc =~s/\,\s*$//;
     $tocc =~s/^\s*\,//;

     my $commas = $tocc;

     $commas =~ s/[^\,]//g;
     my $comma_count = length($commas);

     $stats{$comma_count}++;

     if ($comma_count >= 100) {
         print STDERR "$_file: MATCH100\n";
     } elsif ($comma_count >= 50) {
         print STDERR "$_file: MATCH50\n";
     } elsif ($comma_count >= 20) {
         print STDERR "$_file: MATCH20\n";
     }
}


foreach my $_count (sort {$a <=> $b } keys %stats) {
     print "::$_count: $stats{$_count}\n";
}


you can share the output of the lines starting with '::' (at least for 
large values).

I have a few folders with messages sent to a large number of addresses (up 
to 400) but these are "mailing-lists in disguise" and I could easily 
identify them.

so yes, the rule is not for everybody. and given that it won't catch a 
lot of spam, it's not worth the trouble.
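
For anyone who would rather not count commas, here is a rough sketch of
the same kind of histogram in Python. It is a different technique from
the script above: it counts parsed addresses in To: and Cc: using the
standard email package, so a count of 21 addresses corresponds to 20
commas. The function names and the two sample messages are made up; a
real run would read the files of a mail folder instead.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

def recipient_count(raw_message):
    """Count the addresses found in the To: and Cc: headers of one message."""
    msg = message_from_string(raw_message)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    return len(getaddresses(fields))

def bucket(count):
    """Map a recipient count to the labels used in the thread, or None."""
    for threshold in (100, 50, 20):
        if count >= threshold:
            return f"MATCH{threshold}"
    return None

# Two made-up messages; the second has a folded To: header with 25 addresses.
samples = [
    "To: a@example.net, b@example.net\nCc: c@example.net\n\nbody\n",
    "To: " + ",\n  ".join(f"u{i}@example.net" for i in range(25)) + "\n\nbody\n",
]

# Histogram of recipient counts, printed in the same "::count: n" style.
stats = Counter(recipient_count(m) for m in samples)
for count in sorted(stats):
    print(f"::{count}: {stats[count]}  {bucket(count) or ''}")
```

Parsing the addresses avoids miscounting a display name that contains a
comma, at the cost of relying on the parser to cope with sloppy headers.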





Re: Another "this should have triggered more rules" post

Posted by Skip <sk...@pelorus.org>.
>
> perl script.pl *
>
That did it!  Thanks!  I would definitely have had some FPs now that I 
have checked.

Just thinking aloud here: wouldn't it be a good idea to also check the CC 
headers for the same conditions?



Re: Another "this should have triggered more rules" post

Posted by mouss <mo...@netoyen.net>.
Skip wrote:
>>   
> What would be a command line equivalent that I can test this expression against 
> my current inbox in order to see if I would have had any FPs?  Something like
> for file in *; do egrep ^To:.*(?:,[^,]{1,80}){20} $file;done
> but this will only check one line (the To: header is obviously many, many lines 
> long) and generates a syntax error as is.
> 


perl script.pl *

==== script.pl
#! perl

use strict;


my @files = @ARGV;


foreach my $_file (@files) {
     my %header = read_headers($_file) ;

     if ($header{"to"} =~ /(?:,[^,]{1,80}){100}/) {
         print "$_file: MATCH100\n";
     } elsif ($header{"to"} =~ /(?:,[^,]{1,80}){50}/) {
         print "$_file: MATCH50\n";
     } elsif ($header{"to"} =~ /(?:,[^,]{1,80}){20}/) {
         print "$_file: MATCH20\n";
     }


}


# This reads all the headers, should we need them
sub read_headers
{
     my $_file = $_[0];

     my $cur_hdr = ":"; # invalid
     my %header = ();

     # three-argument open with a lexical handle; $! holds the OS error
     open (my $in, '<', $_file) or die "Cannot open $_file: $!\n";
     while (<$in>) {
         # compress whitespace (also remove trailing space)
         $_ =~ s/\s+$//;
         $_ =~ s/\s+/ /g;

         # blank line: end of header
         if (! /\S/) {
             last;
         }

         # new header
         if (/^(\S+):/) {
             $cur_hdr = lc($1);
             $header{$cur_hdr} .= $';
             next;
         }

         # header continuation
         if (/^\s+/) {
             $header{$cur_hdr} .= $';
             next;
         }

         # missing blank line.
         last;
     }
     close($in);

     return %header;
}

Re: Another "this should have triggered more rules" post

Posted by John Hardin <jh...@impsec.org>.
On Sun, 2008-08-31 at 19:50 +0200, mouss wrote:
> John Hardin wrote:
> >
> > How about these rules? (watch the line wrap)
> > 
> > describe TO_HARVESTED To: obviously harvested
> > header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
> > some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
> > \.machine)\b/
> > 
> > describe TO_TOO_MANY To: too many recipients
> > header   TO_TOO_MANY To =~ /(?:,[^,]{1,80}){20}/
> > 
> > describe TO_WAY_TOO_MANY To: way too many recipients
> > header   TO_WAY_TOO_MANY To =~ /(?:,[^,]{1,80}){50}/
> 
> The {20} variant will cause "normal" FPs. I don't think the {50} would 
> really cause FPs. but then
> 
> header   TO_WAY_TOO_MANY To =~ /(?:,[^,]{1,80}){100}/
> 
> should be more than conservative.

Of course. The threshold for "too many" is naturally something that will
vary for different people and situations.

> Anyway, this is worth an MTA reject

Good point - I added some tests to my milter-regex.

However, not everyone can do MTA rejects on this, so SA rules do have
utility.


-- 
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 Those in the media have donated to Obama at a 100:1 ratio compared
 to McCain. Are we to believe that this bias does not in any way
 taint their coverage of the campaign?
-----------------------------------------------------------------------
 65 days until the Presidential Election


Re: Another "this should have triggered more rules" post

Posted by mouss <mo...@netoyen.net>.
John Hardin wrote:
> On Sun, 2008-08-31 at 07:32 -0400, Skip wrote:
>> Got this one today.  Never seen anything like this before.
>> http://pelorus.org/mix
>>
>> (I couldn't even paste into pastebin--their spam catcher caught it)  

I've noticed that too. it's annoying. time to set up a post-bin...

>> This one only scored a 2.9 on my installation, as you can see.  I do 
>> have some custom rules (Sought and SARE) but no hits there.
> 
> I've noticed more spams lately coming in with huge TO: lists that
> haven't been washed for even obviously bogus addresses; yours is an
> example of such.
> 
> How about these rules? (watch the line wrap)
> 
> describe TO_HARVESTED To: obviously harvested
> header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|
> some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite
> \.machine)\b/
> 
> describe TO_TOO_MANY To: too many recipients
> header   TO_TOO_MANY To =~ /(?:,[^,]{1,80}){20}/
> 
> describe TO_WAY_TOO_MANY To: way too many recipients
> header   TO_WAY_TOO_MANY To =~ /(?:,[^,]{1,80}){50}/
> 
> The latter two may have FPs if you're prone to getting infinitely
> forwarded jokes and such from relatives and friends - but that might
> actually be viewed as a benefit. :)

The {20} variant will cause "normal" FPs. I don't think the {50} would 
really cause FPs. but then

header   TO_WAY_TOO_MANY To =~ /(?:,[^,]{1,80}){100}/

should be more than conservative.

Anyway, this is worth an MTA reject for more than one reason. not only 
does it have too many To: addresses, but some of these addresses don't 
deserve any scanning time:

	-request@informatik.rwth-aachen.de
	please.write.a.new.mail.instead.of.replying@first.word.archive
	your_login_name@your.favourite.machine
	someone@somewhere.com
	alan@example.domain
	pizza@your.domain
	thelist-request@some.domain
	me@somewhere.tld
	someone.i.dont.like@somewhere.org
	myaddress@myhost.mydomain.org

How can google let this go out?


Re: Another "this should have triggered more rules" post

Posted by John Hardin <jh...@impsec.org>.
On Sun, 2008-08-31 at 07:32 -0400, Skip wrote:
> Got this one today.  Never seen anything like this before.
> http://pelorus.org/mix
> 
> (I couldn't even paste into pastebin--their spam catcher caught it)  
> This one only scored a 2.9 on my installation, as you can see.  I do 
> have some custom rules (Sought and SARE) but no hits there.

I've noticed more spams lately coming in with huge TO: lists that
haven't been washed for even obviously bogus addresses; yours is an
example of such.

How about these rules? (watch the line wrap)

describe TO_HARVESTED To: obviously harvested
header   TO_HARVESTED To =~ /\@(?:(?:(?:example|your|some)\.domain)|(?:(?:example|your\.domain)\.com)|your\.favou?rite\.machine)\b/

describe TO_TOO_MANY To: too many recipients
header   TO_TOO_MANY To =~ /(?:,[^,]{1,80}){20}/

describe TO_WAY_TOO_MANY To: way too many recipients
header   TO_WAY_TOO_MANY To =~ /(?:,[^,]{1,80}){50}/

The latter two may have FPs if you're prone to getting infinitely
forwarded jokes and such from relatives and friends - but that might
actually be viewed as a benefit. :)
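
To get a feel for where the two "too many" rules start firing, here is a
small sketch in Python using the same patterns. The helper names and the
generated addresses are made up for illustration. Each repetition in the
pattern consumes one comma plus the 1 to 80 non-comma characters after
it, so {20} needs 20 commas in a row, i.e. at least 21 recipients.

```python
import re

# Patterns copied from the rules; the rule names are used as dict keys.
RULES = {
    "TO_TOO_MANY": re.compile(r"(?:,[^,]{1,80}){20}"),
    "TO_WAY_TOO_MANY": re.compile(r"(?:,[^,]{1,80}){50}"),
}

def make_to_header(n):
    """Build a hypothetical To: value with n made-up recipients."""
    return ", ".join(f"user{i}@example.net" for i in range(n))

def rules_hit(to_value):
    """Return the names of the rules whose pattern matches to_value."""
    return [name for name, rx in RULES.items() if rx.search(to_value)]

# 20 recipients give 19 commas, 21 give 20 commas, 51 give 50 commas.
for n in (20, 21, 51):
    print(n, rules_hit(make_to_header(n)))
```

Note that a comma inside a quoted display name also counts toward the
total, and a stretch of more than 80 characters without a comma breaks
the run, so the recipient count the rule sees is only approximate.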



Re: Another "this should have triggered more rules" post

Posted by Chris <cp...@embarqmail.com>.
On Sunday 31 August 2008 7:18 am, Skip wrote:
> > This one only scored a 2.9 on my installation, as you can see.  I do
> > have some custom rules (Sought and SARE) but no hits there.
> >
> > Skip
>
> Oops... I meant to include this the first time.  These were the rules
> that it triggered on my installation:
>
> X-Spam-Report:
> 	*  2.5 HEAD_LONG Message headers are very long
> 	*  0.0 DKIM_SIGNED Domain Keys Identified Mail: message has a signature
> 	* -0.0 SPF_PASS SPF: sender matches SPF record
> 	*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
> 	*  0.0 HTML_MESSAGE BODY: HTML included in message

It scored as below on my setup. I don't see any Bayes score on yours.

Content analysis details:   (13.0 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.0 FREEMAIL_FROM          From-address is freemail domain
-0.0 SPF_PASS               SPF: sender matches SPF record
 2.5 HEAD_LONG              Message headers are very long
 0.0 DK_SIGNED              Domain Keys: message has a signature
 0.4 URI_HEX                URI: URI hostname has long hexadecimal sequence
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5000]
 0.5 RAZOR2_CHECK           Listed in Razor2 (http://razor.sf.net/)
 1.5 RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level
                            above 50%
                            [cf: 100]
 0.5 RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50%
                            [cf: 100]
 2.2 DCC_CHECK              listed in DCC (http://rhyolite.com/anti-spam/dcc/)
                            [cpollock 1085; Body=1 Fuz1=292]
                            [Fuz2=many]
 0.0 DIGEST_MULTIPLE        Message hits more than one network digest check
 2.5 L_UNVERIFIED_GMAIL     L_UNVERIFIED_GMAIL
 1.0 SAGREY                 Adds 1.0 to spam from first-time senders

-- 
Chris
KeyID 0xE372A7DA98E6705C

Re: Another "this should have triggered more rules" post

Posted by Skip <sk...@pelorus.org>.
> This one only scored a 2.9 on my installation, as you can see.  I do 
> have some custom rules (Sought and SARE) but no hits there.
>
> Skip
>
Oops... I meant to include this the first time.  These were the rules 
that it triggered on my installation:

X-Spam-Report: 
	*  2.5 HEAD_LONG Message headers are very long
	*  0.0 DKIM_SIGNED Domain Keys Identified Mail: message has a signature
	* -0.0 SPF_PASS SPF: sender matches SPF record
	*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
	*  0.0 HTML_MESSAGE BODY: HTML included in message
