Posted to users@spamassassin.apache.org by Mabry Tyson <Ty...@AI.SRI.COM> on 2004/12/04 03:34:45 UTC

3 suggested rules regarding forged local addresses

(We use SA (currently 2.64) called from procmail-delivered sendmail on 
Solaris systems.  We get something over 100K msgs/day.  Most of our mail 
is addressed using @ our local domain.)

Three suggested rules:
   1)  Detect mail allegedly from a local address that is invalid    
(should get a high score)
   2)  Detect mail that has multiple invalid local addresses in the To: 
and CC: fields  (should get a medium score for 2 or more)
   3)  Detect mail for which the From:, To:, and CC: fields contain 
known or unknown display-names corresponding to local addresses.

We are seeing a flood of forged mail of various flavors.   Some of the 
forgers are foolish enough to try to forge the host name or address that 
shows up in the RECEIVED lines (which I had already been detecting in 
procmail, but I see SA 3.0 detects as well).   Others deliver the forged 
mail without any hanky-panky in the headers.

I am bothered by the mail that has a FROM:  (not the SMTP envelope 
address) that is allegedly an address in our local domain (example.com) 
but is not a legal address in our domain.

For instance, (for these examples, I'm using "example.com" as though it 
were our local domain) a message may claim to be
   From: NoSuchUser@example.com
which is not a legal address (which we know because we know all legal 
addresses for our local domain).

I'd like to detect such mail.  I would expect such mail would deserve a 
high score (but because it is site-specific, it can't easily be adjusted 
as other SA rules are tested).

This can not be done properly by whitelists or blacklists.  It can not 
be done in a reasonable fashion by user-added rules.   I (or someone 
more familiar with SA) would need to write Perl to support this.   
Before I do this, I wanted to check to see if anyone else has worked on 
this.   A quick glance at the code, the mailing lists, and Bugzilla 
didn't have any hits, but I'm not confident that my search is complete.

The concept is that the site would supply its local domains (eg, 
example.com or perhaps *.example.com) and a file (or db) for each domain 
with the valid local parts (eg, NoSuchUser) for that domain.  (This only 
works where all valid addresses for each supplied domain are known.)    
When mail is detected as having a From address using one of those 
domains, then it would check to see if the local part of the From 
address was legal.   (I would want this file/db to be updatable while SA 
is running.)

In the sendmail world, this db would be populated by the 
/etc/mail/aliases file.
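
A minimal sketch of the rule #1 check, written in Python for brevity (an SA 
implementation would be in Perl).  It assumes aliases-style text with one 
"name: target" entry per line is the only source of valid local parts; the 
names LOCAL_DOMAINS, load_local_parts, and invalid_local_from are made up 
for illustration and are not part of SA.

```python
from email.utils import getaddresses

# Hypothetical site configuration: domains whose full address list is known.
LOCAL_DOMAINS = {"example.com"}


def load_local_parts(aliases_text):
    """Collect the alias names (left of the colon) from aliases-style text."""
    parts = set()
    for line in aliases_text.splitlines():
        # Skip blanks, comments, and indented continuation lines.
        if not line.strip() or line.startswith("#") or line[0].isspace():
            continue
        name, sep, _ = line.partition(":")
        if sep:
            parts.add(name.strip().lower())
    return parts


def invalid_local_from(from_header, valid_parts):
    """Return the From: addresses that claim a local domain but are unknown."""
    bad = []
    # getaddresses copes with a list of mailboxes in a single From: field.
    for _name, addr in getaddresses([from_header]):
        local, _, domain = addr.rpartition("@")
        if domain.lower() in LOCAL_DOMAINS and local.lower() not in valid_parts:
            bad.append(addr)
    return bad


valid = load_local_parts("# sample aliases\npostmaster: root\njs: jsmith\n")
print(invalid_local_from("NoSuchUser@example.com", valid))
print(invalid_local_from("John Smith <js@example.com>, x@elsewhere.org", valid))
```

Re-reading the aliases text whenever the file changes would be one way to 
pick up updates while the filter is running.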

[I can imagine an MTA that detects and rejects such mail, but see the 
next section for something related but less appropriate for an MTA.   I 
am currently detecting such mail by a procmail rule.]

[By the way, RFC2822 allows a *list* of mailboxes in the From: and 
Reply-To: fields.  Does SA properly handle that?]

====================

A significant amount of the spam that we get has invalid local addresses 
(e.g., NoSuchUser@example.com) in the To: or CC: lists.    Some spam is 
delivered to a mailbox (as though by a BCC) and has only invalid local 
addresses in its To: and CC: lists.   Some spam has several addresses in 
its To: and CC: lists, some of which are invalid and some are valid.

I would like to detect such mail and adjust its score appropriately.   
Because of the possibility of typos by legitimate senders, I would 
expect this will require some thought.  It may be that there would be 
rules for  (1) Some invalid local addresses and no valid local 
addresses, and (2)  Two invalid local addresses, and (3) Three or more 
invalid local addresses.
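
The three cases could be sketched roughly as follows (Python for brevity).  
The rule names, the single LOCAL_DOMAIN, and the VALID_PARTS set are all 
assumptions made for the example, not existing SA rules or settings.

```python
from email.utils import getaddresses

LOCAL_DOMAIN = "example.com"          # assumed single local domain
VALID_PARTS = {"someuser", "js"}      # assumed list of valid local parts


def recipient_rules(to_header, cc_header=""):
    """Return the names of the (hypothetical) rules a message would hit."""
    valid = invalid = 0
    headers = [h for h in (to_header, cc_header) if h]   # skip empty fields
    for _name, addr in getaddresses(headers):
        local, _, domain = addr.rpartition("@")
        if domain.lower() != LOCAL_DOMAIN:
            continue                   # only local addresses can be checked
        if local.lower() in VALID_PARTS:
            valid += 1
        else:
            invalid += 1

    hits = []
    if invalid and not valid:
        hits.append("LOCAL_RCPT_ALL_INVALID")      # case (1)
    if invalid == 2:
        hits.append("LOCAL_RCPT_2_INVALID")        # case (2)
    elif invalid >= 3:
        hits.append("LOCAL_RCPT_3PLUS_INVALID")    # case (3)
    return hits


print(recipient_rules("omeUser@example.com, SomeUse@example.com"))
print(recipient_rules("SomeUser@example.com", "jon@example.com"))
```

A single invalid address next to a valid one hits nothing here, which is 
meant to leave room for the occasional typo by a legitimate sender.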

(It appears that spammers, with their disregard for how much mail they 
send, will take a valid address such as SomeUser and try variants on it, 
such as omeUser or SomeUse.   Other types of invalid local addresses 
include common names (eg, john@example.com) or formerly valid addresses.)

====================

Finally, I notice that a number of the spammers are adding bogus 
display-names to addresses.  Suppose we have a user John Smith who has 
an address of  js@example.com and he has his mailer(s) set up to send 
mail from
   John Smith <js...@example.com>
and  John Q. Smith <js...@example.com>

Some spammers will send mail from or to   "Jane Doe <js...@example.com>", 
where the display-name is completely bogus.

If the site creates a database with entries such as
    js -> "John Smith", "John Q. Smith"
then when mail arrives from "Jane Doe" <js...@example.com>, SA should be 
able to give it a moderate hit on its spam score.
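
As a rough illustration of that lookup (Python for brevity), assuming the 
database is just an in-memory mapping from local part to known display 
names.  DISPLAY_NAMES, normalize, and classify_display_name are 
hypothetical names, and the whitespace/case folding is my own guess at a 
reasonable comparison.

```python
from email.utils import parseaddr

# Hypothetical site database: local part -> display names seen on real mail.
DISPLAY_NAMES = {"js": {"John Smith", "John Q. Smith"}}


def normalize(name):
    """Fold case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.split()).lower()


def classify_display_name(mailbox, domain="example.com"):
    """Return 'known', 'unknown', or None when nothing can be said."""
    name, addr = parseaddr(mailbox)
    local, _, dom = addr.rpartition("@")
    known = DISPLAY_NAMES.get(local.lower())
    # No verdict for non-local addresses, missing names, or unlisted users.
    if dom.lower() != domain or not name or known is None:
        return None
    return "known" if normalize(name) in {normalize(k) for k in known} else "unknown"


print(classify_display_name('"Jane Doe" <js@example.com>'))
print(classify_display_name("John Smith <js@example.com>"))
print(classify_display_name("js@example.com"))
```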

As before, the From: field is the most sensitive for this.   Mail from 
"Jane Doe" <js...@example.com>  or even "Smith, John Q." <js...@example.com>  
should earn a moderate positive score.   Mail from "John Smith" 
<js...@example.com> should earn a slightly negative score.

However, the To: and CC: fields could also be scored, but with lower 
scores.    After all, someone might legitimately send mail to  "Smith, 
Mr. John Q." <js...@example.com>.    I would hope that bad mail (such as 
To: "Jane Doe" <js...@example.com>) might get a small positive score while 
good mail (such as  To: "John Smith" <js...@example.com>) would get a 
negative score with a somewhat larger absolute value, summing the scores 
across all recipient addresses.

This database requires more maintenance and more user participation than 
the database in the first two parts of this message.  I could certainly 
understand a site implementing the first database but not this one.  I 
would probably implement this at our site, perhaps automatically pulling 
valid display-names out of mail that is delivered from logged-in SMTP users.
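
One way the automatic collection might look (a Python sketch).  It assumes 
the caller only passes in messages already known to come from 
authenticated local senders; harvest_display_names is a made-up name and 
the in-memory dict stands in for whatever database a site would use.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import getaddresses


def harvest_display_names(raw_messages, domain="example.com"):
    """Map local part -> set of display names used in From: on trusted mail."""
    names = defaultdict(set)
    for raw in raw_messages:
        msg = message_from_string(raw)
        for name, addr in getaddresses(msg.get_all("From", [])):
            local, _, dom = addr.rpartition("@")
            # Only record local senders that actually supplied a display name.
            if name and dom.lower() == domain:
                names[local.lower()].add(name)
    return dict(names)


outgoing = [
    'From: John Smith <js@example.com>\nSubject: one\n\nbody\n',
    'From: "John Q. Smith" <js@example.com>\nSubject: two\n\nbody\n',
]
print(harvest_display_names(outgoing))
```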




Re: 3 suggested rules regarding forged local addresses

Posted by Mabry Tyson <Ty...@AI.SRI.COM>.
There were 3 replies to my suggested rules.  I am responding to those 
comments, plus giving statistics on what I've found in the spam & ham 
I've collected.

Combining these techniques may distinguish about 27% of the spam in my 
sample (100% - (100% - 18%) * (100% - 11%) = 27%) or maybe 36% of spam 
scoring 9 or less.  ("Distinguish" is not meant to mean "reject", but is 
to indicate it would contribute to the score.)  Of course it does little 
good to detect mail that is already known as spam, but the 
characteristics of these tests should help separate the spam from the 
ham, when the spam may be relatively low scoring or the ham is 
relatively high scoring.
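
The arithmetic behind those two figures, as a small Python check.  It 
treats the two tests as independent, which is what the formula above 
assumes; the 36% figure appears to come from the 24.4% and 15.6% numbers 
reported below for spam scoring 5-9, though that pairing is my reading.

```python
def combined_rate(*rates):
    """Chance that at least one of several independent tests fires."""
    miss = 1.0
    for r in rates:
        miss *= 1.0 - r          # probability that every test so far misses
    return 1.0 - miss


# All spam: 18% (invalid local addresses) and 11% (bad display names).
print(round(combined_rate(0.18, 0.11) * 100))     # 27
# Spam scoring 5-9: 24.4% and 15.6%.
print(round(combined_rate(0.244, 0.156) * 100))   # 36
```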


=========
Invalid local addresses in mail.  (This is to catch dictionary attack or 
spammers with lists of formerly valid, now invalid, addresses.  Only 
local addresses are considered as those are exactly the ones that a site 
can know exhaustively.)

Statistics:  SPAM:
For 131,101 spam that I've received since August:
  1431 (1.1%)  had a From: field with an invalid local address
  13596 (10.4%) had exactly one To/CC: entry that was a local address, 
and it was invalid
  8921 (6.8%) had 2 or more invalid local addresses in the To:/CC: fields.
That represents about 17-18% of the total spam.  (24.4% in the spam that 
scored from 5 to 9.)

Ham:
While my spam gets shoveled into just 3 mailboxes (scores 5-9, 9-15, 
15+), my real mail gets distributed over a bunch of mailboxes.  Those 
mailboxes tend to have quite old mail in them sometimes.   So, for a ham 
reference, I used only one mailbox that catches all the miscellaneous 
mail (manually confirmed as ham).  It has 2088 messages in it dating 
back to the beginning of the year.  Some of these are CC's of mail that 
I have sent.   I don't claim this is a representative sample of the ham 
I receive, but it is the best sample I had that was easy to run 
statistics on.  

In the ham, I did find some valid mail that was from invalid 
addresses.   Two utilities use invalid From: addresses in the email they 
generate:  our SAVSMTP (virus scanner for incoming email) and our scanner 
that emails PDFs of documents.  (I'm fixing both of those, by the way.)   
These would have triggered my suggested rule #1 (but won't in the future).

In the To/CC fields, I found one user sent 4 emails to the same invalid 
local address.  In each of these, the invalid local address was one of 
several addresses, so they do not fall into any of the categories I suggested. 

There were no other issues in the ham.

=========
Display name statistics:

Since I haven't collected the data on the other local users for their 
display names, I looked only at the mail that had display names for my 
email address.  "Standard display name" means the display name(s) that I 
use on outgoing mail.  Non-standard display name means anything else 
where a display name is given.  So, as you see by this mail, my standard 
display name is "Mabry Tyson".   Mail without a display name is not 
considered.

Note:  SA has the NO_REAL_NAME test (score 0.1 - 0.3).   This encourages 
spammers to put some display name in the from field, even if it is 
completely bogus.  
Likewise, the TO_ADDRESS_EQ_REAL rule (score 0 - 0.5)  encourages 
spammers to put some display name in the To: field (or none), even if it 
is completely bogus.
My suggested rules try to catch the use of bogus display names, which 
helps keep those rules from losing their value.

A number of the sources of email addresses that are reaped by spammers 
may not have display names associated with them.  It appears that the 
spammers have not been collecting (or at least have not used) display 
names for the addresses they use.   Of course there are no valid display 
names for dictionary-based spam (to smith, jones, john, tom, aaa, aab, 
...).   Having a rule that uses display names would further minimize the 
value of old lists of email addresses.


SPAM (131,101 messages):
Non-standard display name in From: field:  223  (0.2%)  (out of 261 that 
were from my email address)
Standard display name in From: field:   0   (that surprised me)
Non-Standard display name in To/CC field: 6648 (5% of messages, assuming 
that each message has my address at most once, 17.9% of the spam with 
scores 5-9)
Standard display name in To/CC field:  0
Local addresses (for anyone, valid or invalid) in To/CC field: 254,999  
(1.94 per message)
Local addresses (for anyone, valid or invalid) in To/CC field with some 
display name:  29161 (11.44% of addresses,  15.6% for spam with scores 5-9)


HAM (2088 messages):
Non-standard display name in From: field:  0
Standard display name in From: field:   173  (8.3%) (the CC's or 
messages to myself)
Non-Standard display name in To/CC field: 30 (1.4%)
Standard display name in To/CC field:  695 (33.3% of messages)  (the 
messages to me, of which 173 were from me)
Local addresses (for anyone) in To/CC field: 2720  (1.3 per message)
Local addresses (for anyone) in To/CC field with some display name:  
1500 (55% of addresses)

This was actually better than I expected.    1/3 of the good mail is 
marked by having the correct display name.  5% of the spam is marked by 
having the incorrect display name for me.   If the results for other 
local users' display names are the same as mine, I could expect about 
11% of the spam to have a bad display name.

The 30 non-standard display names in the ham were things like "Dad", or 
"Tyson, Mr. Mabry", etc.  All perfectly reasonable.

I wouldn't recommend it, but for this sample, having the correct display 
name could be a whitelist.
Clearly, having the wrong display name shouldn't be a blacklist, but I 
could imagine that it could have a positive score.

As a rough guide, I could see having all correct display names for the 
local addresses (for which display names were given) in To/CC might 
contribute a score of -1 or -2, while having at least one incorrect 
display name for any local address might contribute a score of 1.  
(Those are maximum scores, not scores per correct/incorrect display 
name.)  I would set the score for having a non-standard display name in 
the From: field as being at least 3.
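
A sketch of how those capped scores might combine (Python for brevity).  
It assumes each address has already been classified as 'known' (correct 
display name), 'unknown' (incorrect one), or None (no display name or not 
a local address).  The -1.5 is simply a value picked from the -1 to -2 
range, and display_name_score is a hypothetical name.

```python
def display_name_score(from_result, rcpt_results):
    """Combine per-address results ('known', 'unknown', or None) into a score."""
    score = 0.0
    if from_result == "unknown":
        score += 3.0                       # non-standard name in From:
    named = [r for r in rcpt_results if r is not None]
    if any(r == "unknown" for r in named):
        score += 1.0                       # at least one incorrect name, counted once
    elif named:
        score -= 1.5                       # every given name correct, counted once
    return score


print(display_name_score(None, ["known", "known"]))
print(display_name_score("unknown", ["unknown", "known"]))
print(display_name_score(None, [None]))
```

The To/CC contribution is applied once per message rather than once per 
address, matching the note that these are maximum scores.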

Your mileage may vary....


Note:  a user can easily add rules to check that uses of his own address 
carry his standard display name, as below.  I'd also like to detect 
someone sending mail to me with a display name of "John Smith" or "Jane 
Doe", but that is more cumbersome to do.
(THESE ARE UNTESTED -- This is just to note that this can be done for 
a single address or a few addresses.)

# Each rule sits on one line; @ and . are escaped so they match literally.
header SOME_DISPLAY_NAME_FOR_ME  ToCC =~ /[a-z0-9][^>]*<my...\@example\.com>/i
score SOME_DISPLAY_NAME_FOR_ME 1
header PROPER_DISPLAY_NAME_FOR_ME  ToCC =~ /My Name[^>]*<my...\@example\.com>/i
score PROPER_DISPLAY_NAME_FOR_ME -3
header DAD_DISPLAY_NAME_FOR_ME  ToCC =~ /dad[^>]*<my...\@example\.com>/i
score DAD_DISPLAY_NAME_FOR_ME -5
I *think* these will add 1 to the score if there is any (good or bad) 
display name, and will subtract 3 if the display name matches "My Name" 
(resulting in a net of -2 for a good display name).

header SOME_DISPLAY_NAME_FOR_ME_FROM  From =~ /[a-z0-9][^>]*<my...\@example\.com>/i
score SOME_DISPLAY_NAME_FOR_ME_FROM 3
header PROPER_DISPLAY_NAME_FOR_ME_FROM  From =~ /My Name[^>]*<my...\@example\.com>/i
score PROPER_DISPLAY_NAME_FOR_ME_FROM -5
header PROPER_DISPLAY_NAME_FOR_ME_FROM2  From =~ /My Full Name[^>]*<my...\@example\.com>/i
score PROPER_DISPLAY_NAME_FOR_ME_FROM2 -5

I *think* these will add 3 to the score if there is a bad display name 
for me in the From: field, and will subtract 2 (net) if the display name 
matches "My Name" or "My Full Name" (if I use both forms in my outgoing 
mail).
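
Since the rules are untested, here is a quick way to sanity-check the 
From: patterns outside SA, using Python's re module.  The address 
me@example.com is a stand-in I made up, and Python's regex engine is only 
an approximation of how SA would evaluate the header.

```python
import re

ADDR = re.escape("me@example.com")   # hypothetical stand-in for the real address

RULES = [  # (name, pattern, score), mirroring the From: rules above
    ("SOME_DISPLAY_NAME_FOR_ME_FROM", rf"[a-z0-9][^>]*<{ADDR}>", 3),
    ("PROPER_DISPLAY_NAME_FOR_ME_FROM", rf"My Name[^>]*<{ADDR}>", -5),
    ("PROPER_DISPLAY_NAME_FOR_ME_FROM2", rf"My Full Name[^>]*<{ADDR}>", -5),
]


def from_score(from_header):
    """Sum the scores of every rule whose pattern matches the From: value."""
    return sum(score for _name, pattern, score in RULES
               if re.search(pattern, from_header, re.IGNORECASE))


print(from_score("Jane Doe <me@example.com>"))   # 3
print(from_score("My Name <me@example.com>"))    # -2
print(from_score("<me@example.com>"))            # 0
```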



What I want to do is to do this across our entire site.


===========

Comments on responses:
My general comment:  None of these rules were meant to be forced on the 
SA user.  This capability *could* be utilized by a site if it works for 
them.

Jerry Bell:  He states that rule #1 could be handled by SPF.    I don't 
really know SPF, but its web page indicates it fights "return-path 
address forger".   My rule is about the From: field in the header, not 
the MAIL FROM address used in the SMTP envelope.   It may be that SPF 
also looks inside the message itself at the From: header, but in a 
cursory glance at the site I didn't notice any claim that it does.
He wrote that rule #2 "sounds pretty involved".  Actually not really.  
All you need is some way to check whether a mail address is valid or 
not.   He is correctly concerned about a domain where all addresses map 
to a single address.  I will add that any wild-card mappings (such as 
mail to "foo+bar" being delivered to "foo") must be accounted for in the 
matching of addresses.
He was concerned about rule #3 having a problem with arbitrary display 
names.  That would be a problem if it were a blacklist-type rule, but I 
think of it as more of a negative scoring rule.
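
The wild-card point above (mail to "foo+bar" being delivered to "foo") 
could be handled by reducing each local part to a canonical form before 
the validity lookup.  A Python sketch, assuming "+" is the only detail 
delimiter in use and that local parts are case-insensitive at the site; 
canonical_local_part and VALID_PARTS are hypothetical names.

```python
def canonical_local_part(local, delimiter="+"):
    """Strip a '+detail' suffix and fold case before the validity lookup."""
    base, _, _detail = local.partition(delimiter)
    return base.lower()


VALID_PARTS = {"foo"}   # hypothetical list of valid local parts

for candidate in ("foo", "foo+bar", "Foo+lists", "fo"):
    print(candidate, canonical_local_part(candidate) in VALID_PARTS)
```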

Loren Wilton:  Felt that rule #2 could work at a large ISP (such as 
Earthlink), but questionable at a business where many people may be on a 
CC list.   I don't understand the reasoning for that qualification.
Felt that rule #3 is probably more trouble than it is worth.   I would 
say that if the display names can be automatically collected from valid 
outgoing mail, then it shouldn't be much trouble.  It appears that the 
use of proper display names is a good indicator of valid mail.

Wolfgang Hamann:  He has his MTA "refuse unauthenticated mails from 
local senders" (with exceptions for trusted relays).   He states that 
the rejection of such mails does not cost bandwidth.   Based on his 
statement about bandwidth, it appears he is only looking at the SMTP 
envelope address (MAIL FROM ...) rather than the From: line in the 
header (for which he'd have to get the entire message first).    What he 
has done seems good and appropriate.  However, it is not what my 
suggested rule #1 (and part of #3) covers.   It also doesn't speak to 
rule #2.


Since two of the responses dealt with blocking the forgery of the SMTP 
envelope MAIL FROM address (aka Return-Path), I feel I should address 
why this isn't sufficient.    There are a number of reasons why the MAIL 
FROM address is different from the From: address.   If you get this 
message as an individual message (vs in a digest), its MAIL FROM address 
is different than the From: address.  Also, the From: address may be 
different because the user may be using, for instance, his home ISP but 
wants to send the mail with his work email address.   etc.    Thus you 
can't depend upon the MAIL FROM address matching the From: address.   
Filtering on the MAIL FROM address is not the same as filtering on the 
From: address.  

I am seeing mail with the From: address forged as a local address and 
the MAIL FROM address as something completely valid (but not local).




Re: 3 suggested rules regarding forged local addresses

Posted by Loren Wilton <lw...@earthlink.net>.
> > Three suggested rules:
> >    2)  Detect mail that has multiple invalid local addresses in the To:
> > and CC: fields  (should get a medium score for 2 or more)

This one can be made to work at a large ISP, at least in many cases.  It is
highly questionable at a business where many people may be on a cc list.  I
have three rules to catch increasing numbers of recipients at earthlink,
giving increasing scores.  The chance that there would be a real mail to me
with more than two other recipients on earthlink is nil.  (Note that mailing
lists don't usually include the whole distribution list in the addresses.)

> >    3)  Detect mail for which the From:, To:, and CC: fields contain
> > known or unknown display-names corresponding to local addresses.
> >

Again, this can generally be made to work, but there will certainly be
exceptions.  And this is probably not something where a general rule could
be set up, but would require rules tailored by the individual recipients -
which many won't (or won't be able to) do.  In general it is probably more
trouble than it is worth.


Re: 3 suggested rules regarding forged local addresses

Posted by Jerry Bell <jb...@stelesys.com>.
1. This can be done really effectively using SPF.  I believe spamassassin
can use spf, and most MTA's can too.  I highly recommend it.  You would
not believe the number of viruses that get turned away by using SPF.  It
seems that many of the recent ones send mails to a target domain with a
from address of the target domain.
2. Sounds pretty involved, and for many domains where all addresses are
routed to a single address, it can't work.  If you are using ldap or
active directory or something like that, you may be able to get your MTA
to check the destination address as the mail is coming in, and reject
those to invalid addresses.  Even if you aren't using AD, it sounds like
you may be willing to set up something like a database or ldap directory.
3. I think this one may cause you more trouble than you anticipate. 
Internally, you set up all of your display names to adhere to some sort of
policy, but someone externally who adds you as a contact in outlook with
"My friend Mabry" as the name will potentially be picked up as spam
because the display name on the email is going to be "My Friend Mabry".

Just my thoughts.

Jerry
http://www.syslog.org
