Posted to users@spamassassin.apache.org by LuKreme <kr...@kreme.com> on 2009/04/28 16:43:46 UTC

'anti' AWL

OK, working on my first cup of coffee this morning, so maybe this has  
potential.

The way the AWL works is by keeping track of the origin of emails,
both the address and the server (the top line Received header?) that
sent the email.  So, let's say that I have a lot of email from
foo@example.com and that foo's email is sent to me via mail.example.com.

Now, I get an email claiming to be from foo@example.com but sent to me  
from suspiciousserver.tld, so the AWL is not applied.

But if I've gotten 50 emails from foo@example.com and all came through
mail.example.com, it seems that it would be beneficial to have an 'anti'
AWL score applied to this particular email, since it claims to
be from one place but doesn't match the AWL entry. This, naturally,
would start off a new AWL entry, but with a slightly higher score than
otherwise.

This would even be useful if the original AWL entry is spammish, since
multiple servers might be a sign of a botnet or host hopping, so
applying a little spammish nudge to these messages is probably going
to help out a lot. In particular, if spammer@fakedomain.tld is sending
mail from, say, 10 different servers, then all those AWL mismatches are
going to feed each other, moving that AWL score up very fast.

-- 
The Germans wore gray, you wore blue.

Re: 'anti' AWL

Posted by Jeff Mincy <je...@delphioutpost.com>.

   From: LuKreme <kr...@kreme.com>
   Date: Tue, 28 Apr 2009 08:43:46 -0600

   OK, working on my first cup of coffee this morning, so maybe this has  
   potential.

   The way the AWL works is by keeping track of the origin of emails,  
   both the address and the server (the top line Received header?) that  
   sent the email.  So, let's say that I have a lot of email from
   foo@example.com and that foo's email is sent to me via mail.example.com.

   Now, I get an email claiming to be from foo@example.com but sent to me  
   from suspiciousserver.tld, so the AWL is not applied.

Your idea will FP any time anybody adds a new email device or their ISP
changes (etc).

You could use the sagrey plugin to add a point to email from new
email address+IP pairs.

-jeff

Re: 'anti' AWL

Posted by James Wilkinson <sa...@aprilcottage.co.uk>.

Charles Gregory wrote:
> Though again, legit senders that average negative are relatively rare  
> (well, on my system, anyways).

For what it’s worth, I’ve set up SA to identify replies to the
organisation’s email. It looks at the In-Reply-To and References headers
(our Message-IDs have a distinctive domain that’s not in public DNS and
isn’t easily guessable) and looks for the organisation standard
signature (again, this is very unlikely to come up in spam).

Most replies have one or the other, and it’s fairly common for a
correspondent to have an average score of less than -10.

It means that AWL really does work as an auto-white-list for us.

James.

-- 
E-mail:     james@ | ... clueless he is not. He's just selective about which
aprilcottage.co.uk | clues to pay attention to.
                   |     -- Shmuel (Seymour J.) Metz

Re: 'anti' AWL

Posted by Charles Gregory <cg...@hwcn.org>.

On Thu, 30 Apr 2009, LuKreme wrote:
> No, the sender's AWL HURTS new spam.  If the score is -2 from the AWL
> then -2 * -0.2 = 0.4

Ah. Missed the negative. Then this particular piece of the logic is good.
The odds of any AWL(perIP) other than the legit sender having a negative 
average are vanishingly small. So you would gain the benefit of positive 
adjusting spam with almost no chance of an FP....

Though again, legit senders that average negative are relatively rare 
(well, on my system, anyways).

>> So in the unlikely event that spam (from a different server) precedes
>> legitimate mail, the legit sender gets a positive adjustment before they
>> have a chance to score negative...
> As I understand it the AWL is added after all others, but yes, the FIRST 
> legitimate mail will be penalized.

Why only the first? Unless the user's message (and continuing average) 
scores negative, all messages will continue to be affected....

>> Note that this logic will also be problematic when sender has multiple mail 
>> servers. Many senders get a few points positive...
> This will only be an issue if those multiple servers have positive AWL 
> scores.

Which is very likely. Spamassassin is constructed on the premise that all 
mail has a 'few' spam signs, but does not score high enough to be 
considered 'spam'.

>> Now let's presume that the sender is spoofed by spammers on ten different
>> IP's, producing ten different AWL entries. How will you distinguish the 
>> legit sender's IP (except by hoping they have scored negative?)... You will 
>> simply add up ALL the IP AWL's and score *any* mail from the sender
>> with a significant positive adjustment....
> As far as I can tell, though it's not easy to be sure, legitimate senders 
> have negative AWL scores.

No, the *effect* of their average may be a negative adjustment to messages 
that otherwise score high, but the stored 'average' is most likely 
positive. And for me, it's easy to be sure because I have the score 
printed on the subject line of all my mail. Less than half my ham scores 
zero, and very few (other than the messages from this list which are 
helped by a DNS whitelist) score negative.

>> But how often does that really happen? As I said, most people get a *few* 
>> points on legit mail.
> But it's not the points on the mail, it is only the AWL listing that we're 
> looking at.

And the AWL listing is an average of the points on all mail. Yes?

> OK, how do we parse out the AWL numbers then so we can see what sorts of
> AWL numbers exist for legit senders?  As I understand it, if an email
> comes in from a known sender whose average was 0.8 and this email scores
> 3.0, a negative AWL will be applied to normalize the email closer to
> 0.8, right? The AWL score is not 0.8, but 3.0 - (AWL value)?

As I understand it, if the AWL has recorded 20 messages (arbitrary number, 
always increasing) with an average of 0.8, and a new message scores 3.0 
then the AWL function does a bit of math and the new average (now on 
21 messages) will be something like 0.9 while the AWL's effect on that 
one message will be to apply a negative adjustment. But the average stored
in the database would be the average of all scores.
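
To put numbers on that example, here is a minimal sketch in Python. It
assumes the adjustment is (stored mean - message score) times a factor,
and that the stored mean is updated with the unadjusted score. The
function name and the 0.5 factor are illustrative assumptions, not taken
from the SpamAssassin source.

```python
def awl_step(count, mean, score, factor=0.5):
    """Return (adjustment, new_count, new_mean) for one incoming message.

    Assumption: the adjustment pulls the score part of the way toward the
    stored mean, and the stored mean is updated with the unadjusted score.
    """
    adjustment = (mean - score) * factor
    new_count = count + 1
    new_mean = (mean * count + score) / new_count
    return adjustment, new_count, new_mean


# 20 recorded messages averaging 0.8, then a message that scores 3.0.
adjustment, count, mean = awl_step(count=20, mean=0.8, score=3.0)
print(f"adjustment={adjustment:+.2f} count={count} new mean={mean:.2f}")
```

Under those assumptions the one message is adjusted downward while the
stored average creeps up to roughly 0.9, which matches the description
above.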

> Er.. ok.  Perhaps I am misunderstanding the AWL.  As I understand it, if a
> bunch of spam comes in from a server with average scores of 7.0 and a new
> message comes in with a score of 4, it will have a POSITIVE AWL applied to
> normalize at 7.0.  If a message comes from a known sender with an average
> score of 2, and this email scores 4, it will get a NEGATIVE AWL score to
> normalize closer to 2.0, right?  Since this is a negative AWL 2.a.ii would
> not apply because the AWL is negative, so section 2 is skipped entirely and
> we are at 3. AWL is negative => {crickets}.

But in the long term, a user's messages will be distributed around the 
average, and so half their mail will score 'positive AWL' using your above 
terminology. Still not a good way to determine how/when to apply an 
adjustment.

Also, please keep in mind that the whole reason we are discussing this
addition to the rules is because we are looking for a way to deal with
messages that otherwise score very low. So for the 'target class' of mail,
we are MORE likely to have the spam score equal to or lower than the legit
sender's mail.... not a pretty picture.... :(

> OK, if the value is 0.1 then it would take up to 50 outbound servers with 
> even distribution to add 5.0 points.

But they are adding it to an existing score that may already be slightly 
spammy. So that mail may only need another 2 points to exceed someone's 
chosen threshold.

> That's quite possible.  As I said initially, it's just an idea I had to
> make the AWL penalize botnets much more.  If it can't be done, that's
> fine.  I think there's some promise here though.

While it's easy to think of rules that fit 'most cases', the exceptions 
really make it difficult. Like the user who sends mail normally via 
Outlook via a primary server, but occasionally uses an alternate provider 
or webmail (with the same address). They would not 'expect' to have their 
mail penalized because they used a different server.

The idea of AWL is actually that it protects a frequent correspondent from
an 'accidental' high score. But as per another argument recently seen on
this list, NOW I know why I get 'blank' spams sometimes. They are setting
up my spam filter to AWL the score downwards when they send their real
spam..... :-P

So I think perhaps that AWL should not record an 'average' for a new IP 
until it has at least 5 or 10 messages to work with. Or perhaps it should 
not permit negative adjustments on the first ten messages.....

And so on..... :)

- Charles

Re: 'anti' AWL

Posted by LuKreme <kr...@kreme.com>.

On 30-Apr-2009, at 11:50, Charles Gregory wrote:
> On Thu, 30 Apr 2009, LuKreme wrote:
>> First off, I suppose that if you get real mail from someone who has  
>> only ever been seen as a spam sender, then yes, the first mail  
>> would be penalized.  But is this ever the case?
>
> (nod) Any time someone's address has been used as a spoofed sender  
> before that legitimate sender makes first contact with a new  
> correspondent. But as I understand your logic, there is no 'rule' to  
> distinguish the 'first' AWL entry as 'special' from all the rest...  
> just that 'others' exist...

Right.

>> Let's lay out the logic here:
>> 2 AWL is positive or does not exist
>> a Check for other AWL entries using same address but different hosts.
>>   i   If there is an AWL with a negative score, then multiply by  
>> -0.2 and
>>   add to score
>
> So any AWL with a negative score still helps the new mail be negative?
> The sender's legit mail helps new spam?

No, the sender's AWL HURTS new spam.  If the score is -2 from the AWL
then -2 * -0.2 = 0.4

>>   ii  If there is an AWL with a positive score, under 5.0, then  
>> multiply by
>>   0.1 and add
>>   iii If there is an AWL with a positive score over 5.0, then  
>> multiply it
>>   by 0.4 and add
>
> So in the unlikely event that spam (from a different server)
> precedes legitimate mail, the legit sender gets a positive
> adjustment before they have a chance to score negative...

As I understand it the AWL is added after all others, but yes, the  
FIRST legitimate mail will be penalized.

> Note that this logic will also be problematic when sender has  
> multiple mail servers. Many senders get a few points positive...

This will only be an issue if those multiple servers have positive AWL  
scores.

>> c if total amount added is over some threshold, normalize on that  
>> threshold
>> (3 points? 5? 8?)
>
> Now let's presume that the sender is spoofed by spammers on ten  
> different
> IP's, producing ten different AWL entries. How will you distinguish  
> the legit sender's IP (except by hoping they have scored  
> negative?)... You will simply add up ALL the IP AWL's and score  
> *any* mail from the sender
> with a significant positive adjustment....

As far as I can tell, though it's not easy to be sure, legitimate  
senders have negative AWL scores.

>> 3 AWL is negative
>> { crickets }
>
> But how often does that really happen? As I said, most people get a  
> *few* points on legit mail.

But it's not the points on the mail, it is only the AWL listing that  
we're looking at.

> The idea being that an average score of 0.8 will 'average' with a  
> fluke spammy mail and keep the score lower.... But your way is  
> adding those small scores to essentially ALL mail unless the lucky  
> sender never mentioned viag.... ooops. There goes *my* score.... LOL

OK, how do we parse out the AWL numbers then so we can see what sorts
of AWL numbers exist for legit senders?  As I understand it, if an
email comes in from a known sender whose average was 0.8 and this email
scores 3.0, a negative AWL will be applied to normalize the email
closer to 0.8, right? The AWL score is not 0.8, but 3.0 - (AWL value)?

>> Maybe it makes sense to only do this check if the message has at  
>> least scored positive?
>
> Again, a significant proportion of ham gets a few points.
>
>> So yes, if bob@example.com has never emailed me except for a bunch  
>> of spam, then yeah, the message is going to get bumped up in its  
>> score, but how often does that happen?  Does that ever happen?
>
> Happens for me all the time. I get dictionary spam with a random  
> client's address as sender, and then I get an inquiry from the  
> client about all these 'bounces' they are receiving. Naturally, they  
> quote the bounce, which includes some spam sign, and the client is  
> off to a good start with a moderately spammy mail to me. (smile)
>
> But bob could also e-mail you three or four times, getting a small
> positive score, then you get spammed "from Bob" with high scores
> from a botnet (and I usually get several copies of a spam like
> that), and the next time bob e-mails, he gets logic 2.a.ii applied
> above for each and every AWL for his address. Could be hefty....

Er.. ok.  Perhaps I am misunderstanding the AWL.  As I understand it,
if a bunch of spam comes in from a server with average scores of 7.0
and a new message comes in with a score of 4, it will have a POSITIVE
AWL applied to normalize at 7.0.  If a message comes from a known
sender with an average score of 2, and this email scores 4, it will
get a NEGATIVE AWL score to normalize closer to 2.0, right?  Since
this is a negative AWL 2.a.ii would not apply because the AWL is
negative, so section 2 is skipped entirely and we are at 3. AWL is
negative => {crickets}.

>> Also, lets say bob@example.com sends a message after a bunch of  
>> spams have been sent, and say that message scores -1.0, plus an AWL  
>> adjustment of 5.0 based on the above.
>
> I'm sure there are some people who *would* 'fit your model' and have  
> negative scores on their legit mail and not be hurt by the proposed  
> rule.

I think we are talking at cross purposes, and that's likely my fault.  
I am talking about the AWL adjustment being either positive or  
negative.  Mail that is more spammy than usual will get penalized up.  
Mail that is less spammy than usual will not be affected.

> Which, for any yahoo mailing list will be a different server many  
> times.
> And so if your yahoo list scores slightly positive, all those  
> different yahoo servers will all add to the score. Ditto hotmail,  
> gmail, etc.

OK, if the value is 0.1 then it would take up to 50 outbound servers  
with even distribution to add 5.0 points.

> I can see what you *want* to do. I just don't see a practical way to  
> do it.

That's quite possible.  As I said initially, it's just an idea I had to
make the AWL penalize botnets much more.  If it can't be done, that's
fine.  I think there's some promise here though.

I'm not married to this idea, I just think there's something here that  
might be worth trying.

-- 
These budget numbers are not just estimates, these are the actual
	results for the fiscal year that ended February the 30th.
	- GWB

Re: 'anti' AWL

Posted by Charles Gregory <cg...@hwcn.org>.

On Thu, 30 Apr 2009, LuKreme wrote:
> First off, I suppose that if you get real mail from someone who has only 
> ever been seen as a spam sender, then yes, the first mail would be 
> penalized.  But is this ever the case?

(nod) Any time someone's address has been used as a spoofed sender before 
that legitimate sender makes first contact with a new correspondent. But 
as I understand your logic, there is no 'rule' to distinguish the 'first' 
AWL entry as 'special' from all the rest... just that 'others' exist...

> Let's lay out the logic here:
> 2 AWL is positive or does not exist
>  a Check for other AWL entries using same address but different hosts.
>    i   If there is an AWL with a negative score, then multiply by -0.2 and
>    add to score

So any AWL with a negative score still helps the new mail be negative?
The sender's legit mail helps new spam?

>    ii  If there is an AWL with a positive score, under 5.0, then multiply by
>    0.1 and add
>    iii If there is an AWL with a positive score over 5.0, then multiply it
>    by 0.4 and add

So in the unlikely event that spam (from a different server) precedes
legitimate mail, the legit sender gets a positive adjustment before
they have a chance to score negative...

Note that this logic will also be problematic when sender has multiple 
mail servers. Many senders get a few points positive...

>  c if total amount added is over some threshold, normalize on that threshold
>  (3 points? 5? 8?)

Now let's presume that the sender is spoofed by spammers on ten different
IP's, producing ten different AWL entries. How will you distinguish the 
legit sender's IP (except by hoping they have scored negative?)... You 
will simply add up ALL the IP AWL's and score *any* mail from the sender
with a significant positive adjustment....

> 3 AWL is negative
>  { crickets }

But how often does that really happen? As I said, most people get a *few* 
points on legit mail. The idea being that an average score of 0.8 will 
'average' with a fluke spammy mail and keep the score lower.... But your 
way is adding those small scores to essentially ALL mail unless the lucky 
sender never mentioned viag.... ooops. There goes *my* score.... LOL

> Maybe it makes sense to only do this check if the message has at least 
> scored positive?

Again, a significant proportion of ham gets a few points.

> So yes, if bob@example.com has never emailed me except for a bunch of 
> spam, then yeah, the message is going to get bumped up in its score, but 
> how often does that happen?  Does that ever happen?

Happens for me all the time. I get dictionary spam with a random client's 
address as sender, and then I get an inquiry from the client about all 
these 'bounces' they are receiving. Naturally, they quote the bounce, 
which includes some spam sign, and the client is off to a good start with a 
moderately spammy mail to me. (smile)

But bob could also e-mail you three or four times, getting a small
positive score, then you get spammed "from Bob" with high scores from a
botnet (and I usually get several copies of a spam like that), and the
next time bob e-mails, he gets logic 2.a.ii applied above for each and
every AWL for his address. Could be hefty....

> Also, lets say bob@example.com sends a message after a bunch of spams 
> have been sent, and say that message scores -1.0, plus an AWL adjustment 
> of 5.0 based on the above.

I'm sure there are some people who *would* 'fit your model' and have 
negative scores on their legit mail and not be hurt by the proposed rule.
But there would be too many with positive scores that would be hurt....

> The point is (as it seems to me) that people who send mail from 
> 'accounts@bankofamerica.com' from their botnets will very quickly scale up 
> the AWL modification to the maximal threshold.

And the people who get legit mail from bank of america will also very 
quickly scale up - I doubt BoA mail scores negative. :)

> This all assumes that the server that is checked is the last non-local 
> server (that is, the first one listed in the headers in typical order)

Which, for any yahoo mailing list will be a different server many times.
And so if your yahoo list scores slightly positive, all those different 
yahoo servers will all add to the score. Ditto hotmail, gmail, etc.

I can see what you *want* to do. I just don't see a practical way to do 
it.

Though I'm toying with a few ideas... I'll start a separate thread.

Thanks.

- Charles

Re: 'anti' AWL

Posted by LuKreme <kr...@kreme.com>.

On 30-Apr-2009, at 09:40, Charles Gregory wrote:
> On Wed, 29 Apr 2009, LuKreme wrote:
>> On 29-Apr-2009, at 15:31, Charles Gregory wrote:
>>> Apologies for original brevity, but my comment was a criticism of  
>>> the proposal to start weighing *all* mail from a specific sender  
>>> according to whether the IP was the 'most common' used for that  
>>> address.... Essentially changing it from what you state above.
>> But the only way it would really penalize a legitimate sender is if
>> their mail is quite spammy to begin with.
>
> This statement is untrue in the context of the OP's suggestion to  
> START
> weighting one server IP be based upon the scores from OTHER server  
> IP's. Please read the original post if this is unclear. I quoted the  
> relevant portion in my critique.

I am the OP.

First off, I suppose that if you get real mail from someone who has  
only ever been seen as a spam sender, then yes, the first mail would  
be penalized.  But is this ever the case?

Let's lay out the logic here:

1 Check AWL

2 AWL is positive or does not exist
   a Check for other AWL entries using same address but different hosts.
     i   If there is an AWL with a negative score, then multiply by
         -0.2 and add to score
     ii  If there is an AWL with a positive score, under 5.0, then
         multiply by 0.1 and add
     iii If there is an AWL with a positive score over 5.0, then
         multiply it by 0.4 and add
   b go to a
   c if total amount added is over some threshold, normalize on that
     threshold (3 points? 5? 8?)
3 AWL is negative
   { crickets }
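
The outline above can be sketched in Python as follows. This is only an
illustration of the proposal, not existing SpamAssassin behaviour: the
function names are hypothetical, and the outline leaves two edge cases
open that the sketch has to pick a side on (an own entry of exactly 0 is
treated as "positive", and another entry of exactly 5.0 falls under
step iii).

```python
def anti_awl_penalty(other_entry_means, cap=5.0):
    """Sum the proposed penalties over AWL entries that share the sender
    address but were recorded for a different host.

    The multipliers and the cap are the made-up numbers from the outline.
    """
    total = 0.0
    for mean in other_entry_means:      # step b: repeat for every entry
        if mean < 0:
            total += mean * -0.2        # step i: ham-like entry elsewhere
        elif mean < 5.0:
            total += mean * 0.1         # step ii: mildly positive entry
        else:
            total += mean * 0.4         # step iii: spammy entry
    return min(total, cap)              # step c: clamp at the threshold


def adjusted_score(score, own_mean, other_entry_means, cap=5.0):
    """Apply the penalty when this host has no entry (None) or a
    non-negative one (step 2); a negative entry is left alone (step 3)."""
    if own_mean is not None and own_mean < 0:
        return score
    return score + anti_awl_penalty(other_entry_means, cap)


# A message scoring 7.0 from an unknown host, where the same address
# already has an entry averaging -2.1 on another host.
print(round(adjusted_score(7.0, None, [-2.1]), 2))
```

Keeping the penalty in its own function makes it easy to try different
multipliers and caps against a real AWL dump before committing to any of
them.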

Maybe it makes sense to only do this check if the message has at least  
scored positive?

So yes, if bob@example.com has never emailed me except for a bunch of  
spam, then yeah, the message is going to get bumped up in its score,  
but how often does that happen?  Does that ever happen?

Also, let's say bob@example.com sends a message after a bunch of spams
have been sent, and say that message scores -1.0, plus an AWL
adjustment of 5.0 based on the above.

Well, now bob@example.com has his own AWL entry, and it's at -1.0  
since AWL scores are not counted toward the AWL, right?

(of course -0.2 and 0.1 and 0.4 are just numbers I made up, and I'm  
not suggesting these are the appropriate numbers).

The point is (as it seems to me) that people who send mail from
'accounts@bankofamerica.com' from their botnets will very quickly
scale up the AWL modification to the maximal threshold.

This all assumes that the server that is checked is the last non-local  
server (that is, the first one listed in the headers in typical order)

Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mail.covisp.net (Postfix) with SMTP id 27D5A118B989
	for <kr...@kreme.com>; Tue, 28 Apr 2009 11:14:30 -0600 (MDT)

140.211.11.3 should be the server checked because it is the only
non-local server whose address I am sure of.
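
As a rough illustration of picking that address out of the header, here
is a minimal Python sketch. It assumes the top Received header was
written by a trusted local server and that the relay's IPv4 address
appears in square brackets, as in the example above; the function name
is hypothetical and real header formats vary more than this regex
allows for.

```python
import re

# Matches a dotted-quad IPv4 address inside square brackets.
BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def relay_ip(received_header):
    """Return the bracketed IPv4 address from one Received header, or None."""
    match = BRACKETED_IPV4.search(received_header)
    return match.group(1) if match else None


header = (
    "Received: from mail.apache.org (hermes.apache.org [140.211.11.3])\n"
    "\tby mail.covisp.net (Postfix) with SMTP id 27D5A118B989"
)
print(relay_ip(header))  # prints 140.211.11.3
```

Only the header added by your own server should be fed to something like
this; anything further down the chain could have been written by the
sender.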

-- 
Imagine all the people
Sharing all the world

Re: [-4.0] Re: 'anti' AWL

Posted by Charles Gregory <cg...@hwcn.org>.

On Wed, 29 Apr 2009, LuKreme wrote:
> On 29-Apr-2009, at 15:31, Charles Gregory wrote:
>> Apologies for original brevity, but my comment was a criticism of the 
>> proposal to start weighing *all* mail from a specific sender according to 
>> whether the IP was the 'most common' used for that address.... Essentially 
>> changing it from what you state above.
> But the only way it would really penalize a legitimate sender is if
> their mail is quite spammy to begin with.

This statement is untrue in the context of the OP's suggestion to START
weighting one server IP be based upon the scores from OTHER server 
IP's. Please read the original post if this is unclear. I quoted the 
relevant portion in my critique.

- Charles

Re: 'anti' AWL

Posted by LuKreme <kr...@kreme.com>.

On 29-Apr-2009, at 15:31, Charles Gregory wrote:
> Apologies for original brevity, but my comment was a criticism of  
> the proposal to start weighing *all* mail from a specific sender  
> according to whether the IP was the 'most common' used for that  
> address.... Essentially changing it from what you state above.


But the only way it would really penalize a legitimate sender is if
their mail is quite spammy to begin with.

-- 
Well boys, we got three engines out, we got more holes in us than a
	horse trader's mule, the radio is gone and we're leaking fuel
	and if we was flying any lower why we'd need sleigh bells on
	this thing... but we got one little budge on those Roosskies.
	At this height why they might harpoon us but they dang sure
	ain't gonna spot us on no radar screen!

Re: [0.0] Re: 'anti' AWL

Posted by Charles Gregory <cg...@hwcn.org>.

On Wed, 29 Apr 2009, Jeff Mincy wrote:
>  .... *someone* is getting their AWL reputation trashed every time a
>   spammer forges their e-mail.
> AWL stores the IP/16 address with the email address.   So your awl
> reputation is not being trashed by forged e-mail that comes from a
> different IP address.

Apologies for original brevity, but my comment was a criticism of the 
proposal to start weighing *all* mail from a specific sender according to 
whether the IP was the 'most common' used for that address.... Essentially 
changing it from what you state above.

>   .... In this case, as I only send a half dozen messages per month from
>   that account, the spammer would get the favored rating?
> Only if the spammer uses the same server that you do.

Again, you are describing "things as they are now", whereas I was citing a 
negative consequence of doing things differently....

- C

Re: 'anti' AWL

Posted by Jeff Mincy <je...@delphioutpost.com>.

   From: Charles Gregory <cg...@hwcn.org>
   Date: Wed, 29 Apr 2009 14:31:22 -0400 (EDT)

   I just turned off my AWL today, because of FP issues.... but....

   > foo@example.com sends me lots of mail.  Say it's over 100.  It's all ham and
   > it all comes from mail.example.com. The AWL for this email couplet is, say,
   > -2.1.  An email comes in from foo@example.com but sent from spam.spammer.tld
   > and scores 7.0.  It gets an additional, say, .42 (20% of the AWL) to score
   > 7.42 instead. Now, another mail from foo@example.com comes in from
   > mail.spam2.tld, this one scores 4.3. It gets a +.42 for missing the match on
   > mail.example.com, and gets a +.288 for missing the match on spam.spammer.tld

   This sounds like an attempt to mimic the effects of SPF records by noting 
   which servers send "most" of the mail for a given address. Sadly, this 
   logic breaks down when the spammers 'get there first' and/or send a 
   greater volume of mail than the genuine sender. Admittedly the latter 
   situation is a low probability for any single sender, but in the big 
   picture, *someone* is getting their AWL reputation trashed every time a 
   spammer forges their e-mail.

AWL stores the IP/16 address with the email address.   So your awl
reputation is not being trashed by forged e-mail that comes from a
different IP address.
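
A minimal Python sketch of what keying on address plus IP/16 means in
practice. The key layout and the function name are assumptions made for
illustration; only the idea of pairing the address with the first two
octets comes from the description above.

```python
def awl_key(address, ip):
    """Build a lookup key from the sender address and the first two octets
    of the relay's IPv4 address (a /16). The key layout is an assumption."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return f"{address.lower()}|ip={octets[0]}.{octets[1]}"


# Two relays in the same /16 share one entry; a relay elsewhere does not.
print(awl_key("foo@example.com", "140.211.11.3"))
print(awl_key("foo@example.com", "140.211.200.9"))
print(awl_key("foo@example.com", "203.0.113.7"))
```

The first two calls produce the same key, so they would share a history,
while the forged message arriving from the third address starts a
separate entry.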

   Just this Monday I had a phishing attack against my clients, with *dozens*
   of e-mails, all purporting to come from ME that came from the *same*
   server! In this case, as I only send a half dozen messages per month from
   that account, the spammer would get the favored rating?

Only if the spammer uses the same server that you do.
-jeff

Re: 'anti' AWL

Posted by Charles Gregory <cg...@hwcn.org>.

I just turned off my AWL today, because of FP issues.... but....

> foo@example.com sends me lots of mail.  Say it's over 100.  It's all ham and
> it all comes from mail.example.com. The AWL for this email couplet is, say,
> -2.1.  An email comes in from foo@example.com but sent from spam.spammer.tld
> and scores 7.0.  It gets an additional, say, .42 (20% of the AWL) to score
> 7.42 instead. Now, another mail from foo@example.com comes in from
> mail.spam2.tld, this one scores 4.3. It gets a +.42 for missing the match on
> mail.example.com, and gets a +.288 for missing the match on spam.spammer.tld

This sounds like an attempt to mimic the effects of SPF records by noting 
which servers send "most" of the mail for a given address. Sadly, this 
logic breaks down when the spammers 'get there first' and/or send a 
greater volume of mail than the genuine sender. Admittedly the latter 
situation is a low probability for any single sender, but in the big 
picture, *someone* is getting their AWL reputation trashed every time a 
spammer forges their e-mail.

Just this Monday I had a phishing attack against my clients, with *dozens*
of e-mails, all purporting to come from ME that came from the *same*
server! In this case, as I only send a half dozen messages per month from
that account, the spammer would get the favored rating?

No, I think I will return to my earlier request/question and suggest that 
perhaps whitelisting should do just that: It should only be allowed to
*reduce* a score for a sender/server that suddenly sends a 'spammy' 
message. It should not be allowed to *raise* scores. Thus, FP's will at 
worst cancel out an existing negative adjustment.

Given that historically this is a very different behaviour, I would 
ask/suggest that this be added as an 'option' that could be enabled
by people experiencing false positives because of the AWL....

AWL_reduce_scores_only 1
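
A minimal Python sketch of how such an option might behave. The option
itself is only the proposal above, not an existing setting, and the
adjustment formula and 0.5 factor are assumptions used for illustration.

```python
def awl_adjustment(mean, score, factor=0.5, reduce_scores_only=False):
    """Return the AWL-style adjustment for one message.

    With reduce_scores_only set, positive adjustments are dropped, so the
    history can lower a score but never raise one.
    """
    adjustment = (mean - score) * factor
    if reduce_scores_only:
        adjustment = min(adjustment, 0.0)
    return adjustment


# History averages 7.0, new message scores 4.0: normally pushed up by 1.5.
print(awl_adjustment(7.0, 4.0))
# With the proposed option the upward push is suppressed.
print(awl_adjustment(7.0, 4.0, reduce_scores_only=True))
```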

- Charles

Re: 'anti' AWL

Posted by mouss <mo...@ml.netoyen.net>.

RW a écrit :
> On Wed, 29 Apr 2009 20:49:29 +0200
> mouss <mo...@ml.netoyen.net> wrote:
> 
> 
>> on the other hand, a spammer can forge Received headers. and this is a
>> serious problem. Using "untrusted" received headers is broken.
> 
> The point of AWL is to tweak ham scores towards the mean to avoid
> outlying high-scores causing FPs. 

The "W" in AWL is a (historical) misnomer. ARL (automatic reputation
list) is probably a better name. in short, it works in both directions.

> The AWL score arithmetic doesn't
> involve BAYES scores or whitelisting scores, so a spammer that
> spoofs an existing AWL entry isn't going to pickup all that much
> advantage.

if you check the archives, you'll find that sometimes, some entries in
AWL get a very significant score, enough to move the message to the
wrong class.

and since Mark named it, AWL poisoning is not hard if using untrusted
headers.

> Most spam either wouldn't be protected by spoofing an
> entry, or scores low-enough without it. And spammers don't know
> much about your AWL database in the first place.
> 

while it's not trivial, the risk is there, and I personally don't feel
comfortable. maybe someone can do a better assessment and qualify the
real risk, but I don't see the benefit of using an untrusted header.
yes, I understand the issue with large *SPs but this can be fixed, and I
believe it should be anyway: currently the trust path parsing is
(almost) binary. it could be either extended (by adding more layers than
internal and trusted) or made "dynamic" (adding code that handles
different situations).

> [snip]

Re: 'anti' AWL

Posted by Matt Kettler <mk...@verizon.net>.

RW wrote:
> On Wed, 29 Apr 2009 20:49:29 +0200
> mouss <mo...@ml.netoyen.net> wrote:
>
>
>   
>> on the other hand, a spammer can forge Received headers. and this is a
>> serious problem. Using "untrusted" received headers is broken.
>>     
>
> The point of AWL is to tweak ham scores towards the mean to avoid
> outlying high-scores causing FPs. The AWL score arithmetic doesn't
> involve BAYES scores or whitelisting scores, so a spammer that
> spoofs an existing AWL entry isn't going to pickup all that much
> advantage. Most spam either wouldn't be protected by spoofing an
> entry, or scores low-enough without it. And spammers don't know
> much about your AWL database in the first place.
>
> If a spammer wants to exploit AWL the easiest way is to send some
> low-scoring dummy spams ahead of the real one - this doesn't require
> forging headers.
>   
Yes, the existing algorithm may fix gmail, but it also breaks road warriors.

The AWL could be re-designed to use the trust boundary, AND work
correctly for gmail.

See some of my discussion of this topic in bug 6015, particularly
points 6 and 7, which would fix gmail problems.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6105

Re: 'anti' AWL

Posted by RW <rw...@googlemail.com>.

On Wed, 29 Apr 2009 20:49:29 +0200
mouss <mo...@ml.netoyen.net> wrote:

> on the other hand, a spammer can forge Received headers. and this is a
> serious problem. Using "untrusted" received headers is broken.

The point of AWL is to tweak ham scores towards the mean to avoid
outlying high-scores causing FPs. The AWL score arithmetic doesn't
involve BAYES scores or whitelisting scores, so a spammer that
spoofs an existing AWL entry isn't going to pick up all that much
advantage. Most spam either wouldn't be protected by spoofing an
entry, or scores low-enough without it. And spammers don't know
much about your AWL database in the first place.

If a spammer wants to exploit AWL the easiest way is to send some
low-scoring dummy spams ahead of the real one - this doesn't require
forging headers.
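
To make the "dummy spam" point concrete, here is a minimal Python sketch
(not SpamAssassin's Perl). It assumes the adjustment is a fixed fraction
of (mean - score), as described elsewhere in this thread, and that the
fraction is 0.5, which I believe is the default auto_whitelist_factor.
The function name awl_adjust and the score values are made up for
illustration.

```python
def awl_adjust(history, score, factor=0.5):
    """Return the score after an AWL-style pull towards the sender's mean.

    history: earlier pre-AWL scores for this address/IP pair (hypothetical data)
    factor:  fraction of the gap to the mean that is applied (assumed 0.5)
    """
    if not history:
        return score  # no entry yet, so there is nothing to pull towards
    mean = sum(history) / len(history)
    return score + factor * (mean - score)


# Three low-scoring dummy messages prime the entry ...
primed_history = [1.0, 1.0, 1.0]
# ... and then the real spam arrives with a raw score of 8.0.
print(awl_adjust(primed_history, 8.0))  # 4.5
```

With those invented numbers the real spam drops from 8.0 to 4.5, under
the usual 5.0 threshold, and no Received header had to be forged.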

> another approach would be to check both (the last external hop and the
> first possibly-fake "out relay") and use "the worst" result. but this
> is easier to say than to assess...

RE: 'anti' AWL

Posted by Mark <ad...@asarian-host.net>.

-----Original Message-----
From: mouss [mailto:mouss@ml.netoyen.net] 
Sent: Wednesday 29 April 2009 20:53
To: users@spamassassin.apache.org
Subject: Re: 'anti' AWL

> on the other hand, a spammer can forge Received headers. and this is
> a serious problem. Using "untrusted" received headers is broken.

I've been following this discussion for a while; and while I can go along
with some of the latest rationale 'pro' AWL, I really do have to agree
with mouss (and others) here, that trusting any Received header other
than that of your own mail server(s) is inherently broken behavior, and
asking for trouble. It opens a whole can of worms, like
'Received-header-poisoning' (not so much to get oneself whitelisted, but
to give a legit mail server a bad rep over time). At least with DNS
poisoning you'd have to be reasonably knowledgeable to exploit it:
'Received-header-poisoning', on the other hand, requires as little as ill
will.

> another approach would be to check both (the last external hop and the
> first possibly-fake "out relay") and use "the worst" result. but this
> is easier to say than to assess...

I'm sure some meaningful statistical correlation between the two could be
established over time (meaningful enough to predict fakes and all). But
somehow I feel that's still like adding a bad element to otherwise clean
waters, and then adding lots of extra water to dilute the end result
again; in other words: let's just not poison the well to begin with.

- Mark

Re: 'anti' AWL

Posted by mouss <mo...@ml.netoyen.net>.

RW wrote:
> On Tue, 28 Apr 2009 22:14:21 -0400
> Matt Kettler <mk...@verizon.net> wrote:
> 
>> Matt Kettler wrote:
>>> LuKreme wrote:
>>>   
> 
>>> Of course, first, or last depends on your perspective. I assume RW
>>> was thinking of "first" from a "starting at the inside, working
>>> backwards in time" approach. This is backwards, if you think about
>>> the chronology of the headers, like SA does. However, it makes
>>> sense from a "I'm at my server looking outward at the world" point
>>> of view that most folks work from when thinking about network
>>> topologies. 
>> Darnit, I should have checked before sending.
>>
>> The AWL uses the LAST non-private..
> 
> Maybe one of us is reading the perl wrong (and it could well be me), or
> we are talking at cross purposes. As I see it, it's going through the
> list of IP address, starting with the mail client and working its way
> towards the SA Server. When it finds a routable IP address it sets
> origip and breaks-out of the loop.
> 
> By your chronological definition of first and last (which is the same as
> mine), that's the FIRST non-private address.
> 
> It makes sense to me, if I send you an email, the AWL entry should use
> my IP address not a random gmail server.
> 

gmail and the like are special cases and could be handled via DNSWL or
the like.

on the other hand, a spammer can forge Received headers. and this is a
serious problem. Using "untrusted" received headers is broken.

another approach would be to check both (the last external hop and the
first possibly-fake "out relay") and use "the worst" result. but this is
easier to say than to assess...
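
A rough Python sketch of the "check both, use the worst" idea, purely as
an illustration. Nothing here is existing SpamAssassin behaviour: the
dictionary layout, the function name worst_awl_mean, and the choice of
"worst" meaning "highest mean score" are all assumptions.

```python
def worst_awl_mean(awl_means, sender, last_external_ip, claimed_origin_ip):
    """Look up both candidate AWL entries and return the spammier mean.

    awl_means maps (sender, ip) -> mean score (hypothetical storage).
    Returns None when neither entry exists.
    """
    candidates = [
        awl_means.get((sender, last_external_ip)),   # hop that handed us the mail
        awl_means.get((sender, claimed_origin_ip)),  # possibly forged origin
    ]
    known = [mean for mean in candidates if mean is not None]
    return max(known) if known else None


# Invented example data, using documentation address ranges.
means = {
    ("foo@example.com", "198.51.100.7"): -2.1,  # trusted-looking history
    ("foo@example.com", "203.0.113.45"): 6.3,   # spammy history
}
print(worst_awl_mean(means, "foo@example.com", "198.51.100.7", "203.0.113.45"))
```

Whether the two entries should be combined this way is exactly the part
that is hard to assess; the sketch only shows the lookup.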


>> This is, IMO, completely broken. Why are we allowing folks to declare
>> internal_networks if we're not going to use it, and assume the last
>> non-private is "external". (which, mind you, is different from what
>> the trust-path guesser does. It assumes that IP is your MX.)
> 
>

Re: 'anti' AWL

Posted by Matt Kettler <mk...@verizon.net>.

RW wrote:
>
> Maybe one of us is reading the perl wrong (and it could well be me), or
> we are talking at cross purposes. As I see it, it's going through the
> list of IP address, starting with the mail client and working its way
> towards the SA Server. When it finds a routable IP address it sets
> origip and breaks-out of the loop.
>   
You're right.. I've been working it out with Theo overnight... It's the
first public (ie: closest to the client).
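
For readers who don't read Perl, here is a Python re-sketch of the loop
posted earlier in this thread. It assumes both relay lists are ordered
newest hop first (the order the Received headers appear in), which is
what the conclusion above implies. The dictionaries and the function
name pick_origin_ip are illustrative, not SpamAssassin's data
structures.

```python
def pick_origin_ip(relays_trusted, relays_untrusted):
    """Return the first routable relay IP, starting from the sending client."""
    # Concatenate trusted + untrusted, then walk from the oldest hop
    # (closest to the client) towards the receiving server.
    for relay in reversed(relays_trusted + relays_untrusted):
        if relay.get("ip_private"):
            continue            # skip private (non-routable) addresses
        if relay.get("ip"):
            return relay["ip"]  # first public address found: stop here
    return None


trusted = [{"ip": "192.0.2.10", "ip_private": False}]    # our own MX
untrusted = [
    {"ip": "198.51.100.7", "ip_private": False},  # sender's outgoing SMTP server
    {"ip": "203.0.113.45", "ip_private": False},  # client's public (hotel) address
    {"ip": "10.0.0.5", "ip_private": True},       # client's LAN address
]
print(pick_origin_ip(trusted, untrusted))  # 203.0.113.45
```

Because the walk starts at the oldest header, an address placed in a
forged bottom Received header would be the one selected.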

This is contradictory to my earlier impressions, and much of what I've
written in the wiki about trust paths is wrong as a result.

> By your chronological definition of first and last (which is the same as
> mine), that's the FIRST non-private address.
>
> It makes sense to me, if I send you an email, the AWL entry should use
> my IP address not a random gmail server.
That only makes sense if your IP is stable.. if you're a roadwarrior
working from hotels, it changes constantly and you'd be better off using
the server.

I've got some ideas for making the entries more useful, even in the
gmail case though.. however, I'm going to have to go learn perl to
implement them :)

Re: 'anti' AWL

Posted by Jonas Eckerman <jo...@frukt.org>.

RW wrote:

> By your chronological definition of first and last (which is the same as
> mine), that's the FIRST non-private address.

Or the address in the fake Received header the spambot put in the mail?

I hope this is not how it works...

> It makes sense to me, if I send you an email, the AWL entry should use
> my IP address not a random gmail server.

Considering that lots of people have dynamic routable addresses, this 
seems like a bad idea for a big group of people not using WebMail.

Regards
/Jonas
-- 
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: 'anti' AWL

Posted by RW <rw...@googlemail.com>.

On Tue, 28 Apr 2009 22:14:21 -0400
Matt Kettler <mk...@verizon.net> wrote:

> Matt Kettler wrote:
> > LuKreme wrote:
> >   

> > Of course, first, or last depends on your perspective. I assume RW
> > was thinking of "first" from a "starting at the inside, working
> > backwards in time" approach. This is backwards, if you think about
> > the chronology of the headers, like SA does. However, it makes
> > sense from a "I'm at my server looking outward at the world" point
> > of view that most folks work from when thinking about network
> > topologies. 
> 
> Darnit, I should have checked before sending.
> 
> The AWL uses the LAST non-private..

Maybe one of us is reading the perl wrong (and it could well be me), or
we are talking at cross purposes. As I see it, it's going through the
list of IP address, starting with the mail client and working its way
towards the SA Server. When it finds a routable IP address it sets
origip and breaks-out of the loop.

By your chronological definition of first and last (which is the same as
mine), that's the FIRST non-private address.

It makes sense to me, if I send you an email, the AWL entry should use
my IP address not a random gmail server.

> This is, IMO, completely broken. Why are we allowing folks to declare
> internal_networks if we're not going to use it, and assume the last
> non-private is "external". (which, mind you, is different from what
> the trust-path guesser does. It assumes that IP is your MX.)

Re: 'anti' AWL

Posted by LuKreme <kr...@kreme.com>.

On 28-Apr-2009, at 20:14, Matt Kettler wrote:
> The AWL uses the LAST non-private..
>
> This is, IMO, completely broken.


Yep, have to agree.  This is seriously retarded.


-- 
I love as only I can, with all my heart

Re: 'anti' AWL

Posted by Matt Kettler <mk...@verizon.net>.

Matt Kettler wrote:
> LuKreme wrote:
>   
>> On 28-Apr-2009, at 15:38, RW wrote:
>>     
>>> It's based on the first routable IP address,
>>>       
>> Well, that's a very silly thing for it to be looking at.  It should be
>> looking at the LAST routable IP address outside of the trusted
>> network. Looking at the first routable address is completely worthless.
>>     
> It's actually based on the last IP not matching your internal_networks.
> If you haven't declared internal_networks or trusted_networks manually,
> then the auto-guesser is going to set it to be the second-to-last
> routable IP (it assumes the last routable is your MX, which may or may
> not be correct depending on how you route/firewall your DMZ.)
>
> Of course, first, or last depends on your perspective. I assume RW was
> thinking of "first" from a "starting at the inside, working backwards in
> time" approach. This is backwards, if you think about the chronology of
> the headers, like SA does. However, it makes sense from a "I'm at my
> server looking outward at the world" point of view that most folks work
> from when thinking about network topologies.
>   

Darnit, I should have checked before sending.

The AWL uses the LAST non-private..

This is, IMO, completely broken. Why are we allowing folks to declare
internal_networks if we're not going to use it, and assume the last
non-private is "external". (which, mind you, is different from what the
trust-path guesser does. It assumes that IP is your MX.)


Relevant code:

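    # Walk the combined relay list in reverse order.
    foreach my $rly (reverse (@{$pms->{relays_trusted}}, @{$pms->{relays_untrusted}}))
    {
      # Skip relays whose address is private (non-routable).
      next if ($rly->{ip_private});
      # Take the first relay that has an IP and stop looking.
      if ($rly->{ip}) {
        $origip = $rly->{ip}; last;
      }
    }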

Re: 'anti' AWL

Posted by Matt Kettler <mk...@verizon.net>.

LuKreme wrote:
> On 28-Apr-2009, at 15:38, RW wrote:
>> It's based on the first routable IP address,
>
>
> Well, that's a very silly thing for it to be looking at.  It should be
> looking at the LAST routable IP address outside of the trusted
> network. Looking at the first routable address is completely worthless.
It's actually based on the last IP not matching your internal_networks.
If you haven't declared internal_networks or trusted_networks manually,
then the auto-guesser is going to set it to be the second-to-last
routable IP (it assumes the last routable is your MX, which may or may
not be correct depending on how you route/firewall your DMZ.)

Of course, first, or last depends on your perspective. I assume RW was
thinking of "first" from a "starting at the inside, working backwards in
time" approach. This is backwards, if you think about the chronology of
the headers, like SA does. However, it makes sense from a "I'm at my
server looking outward at the world" point of view that most folks work
from when thinking about network topologies.

Re: 'anti' AWL

Posted by LuKreme <kr...@kreme.com>.

On 28-Apr-2009, at 15:38, RW wrote:
> It's based on the first routable IP address,


Well, that's a very silly thing for it to be looking at.  It should be  
looking at the LAST routable IP address outside of the trusted  
network. Looking at the first routable address is completely worthless.


-- 
Adolescence is the period between childhood and adultery

Re: 'anti' AWL

Posted by RW <rw...@googlemail.com>.

On Tue, 28 Apr 2009 11:13:56 -0600
LuKreme <kr...@kreme.com> wrote:

> On 28-Apr-2009, at 08:56, Matus UHLAR - fantomas wrote:
> > We have more servers users send mail through. Users can't choose
> > which server they will connect to.
> 
> That already happens now.

I think his point is that that doesn't currently cause a problem, but
would with your scheme. 

>  The AWL has a confidence based on number of
> messages received, right? If I get messages from bar@example.com that
> come from a variety of servers, the confidence is much lower than if
> they all come from the same server, so the adjustment is lower.

I'm not aware that it has any such concept. AFAIK the AWL score is a
configurable fraction of (average-score - current-score).
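
A small Python sketch of that arithmetic, to show that the message count
only enters through the average. The class AwlEntry and its fields are
invented for illustration; the 0.5 default is what I believe
auto_whitelist_factor defaults to, so treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass
class AwlEntry:
    total: float = 0.0  # sum of pre-AWL scores seen for this address/IP pair
    count: int = 0      # number of messages seen

    def delta(self, score: float, factor: float = 0.5) -> float:
        """Adjustment = factor * (average score - current score)."""
        if self.count == 0:
            return 0.0  # no history, no adjustment
        mean = self.total / self.count
        return factor * (mean - score)

    def add(self, score: float) -> None:
        """Record the pre-AWL score of the message just seen."""
        self.total += score
        self.count += 1


few = AwlEntry(total=-4.0, count=2)       # average -2.0 from 2 messages
many = AwlEntry(total=-400.0, count=200)  # average -2.0 from 200 messages
print(few.delta(3.0), many.delta(3.0))    # -2.5 -2.5
```

Two messages or two hundred, the same average gives the same
adjustment, so there is no separate "confidence" term in this model.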

> No, if they get spam from the SAME senders on DIFFERENT servers, the  
> AWL would go up even faster.

It's based on the first routable IP address, not the last-hop into the
trusted network, so someone using other people's wireless networks could
go through a huge number of addresses even with the same
outgoing smtp-server.

Note also that the email address and IP address used by AWL are
both forgeable by spammers.

Re: 'anti' AWL

Posted by LuKreme <kr...@kreme.com>.

On 28-Apr-2009, at 08:56, Matus UHLAR - fantomas wrote:
> We have more servers users send mail through. Users can't choose which
> server they will connect to.

That already happens now.

> It can also happen when a user switches ISP or mail provider, or the mail
> provider changes IP addresses, DNS names, or whatever is used there.
> This would require much more logic than is currently in AWL.

No it wouldn't.  The AWL has a confidence based on number of messages  
received, right? If I get messages from bar@example.com that come from  
a variety of servers, the confidence is much lower than if they all  
come from the same server, so the adjustment is lower.

>> This would even be useful if the original AWL entry is spammish since
>> multiple servers might be a sign of a botnet or host hopping, so
>> applying a little spammish nudge to these messages is probably  
>> going to
>> help out a lot, especially if spammer@fakedoamin.tld is sending mails
>> from, say, 10 different server then all those AWL mismatches are  
>> going to
>> feed each other into moving that AWL up very very fast.
>
> The question is if users tend to repeatedly get spam from the same  
> sender
> through the same servers.

No, if they get spam from the SAME senders on DIFFERENT servers, the  
AWL would go up even faster.

On 28-Apr-2009, at 09:07, Jeff Mincy wrote:

> Your idea will FP anytime anybody adds a new email device or the ISP
> changes (etc).

That's why the adjustment would be, initially, small.

foo@example.com sends me lots of mail.  Say it's over 100.  It's all
ham and it all comes from mail.example.com. The AWL for this email
couplet is, say, -2.1.  An email comes in from foo@example.com but
sent from spam.spammer.tld and scores 7.0.  It gets an additional,
say, .42 (20% of the AWL) to score 7.42 instead. Now, another mail
from foo@example.com comes in from mail.spam2.tld, this one scores
4.3. It gets a +.42 for missing the match on mail.example.com, and
gets a +.288 for missing the match on spam.spammer.tld (1% of the AWL,
doubled for being positive, doubled again for being over 5), for a
total score of 4.3+.288+.42 = 5.008, pushing it over the spam threshold.

Now, say example.com adds a second mail server, mail2.example.com. It  
will start off with a 'penalty' of +0.708 for being an unknown  
sender.  But, if the message scores under 0, we don't adjust the AWL  
at all. If the message is over 0, yes it will have an initial penalty  
but the AWL is pretty darn good at adjusting.

Now, say another AWL entry is based on only 20 emails. Instead of
adjusting by 20% of the AWL, we adjust by only 4%.  (Or something.  The
point is, the more emails the AWL is based on, the more confident it
is, and that confidence should count AGAINST messages that don't match
the AWL.)
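
One way the confidence-scaled penalty could be sketched in Python. This
is only an illustration of the proposal, not anything the AWL does
today. The linear scaling between the two quoted points (20 emails gives
4%, 100 emails gives 20%), the cap at 100 messages, and the function
names are all assumptions. The extra rule behind the +.288 term is not
modelled.

```python
def mismatch_fraction(count, full_count=100, max_fraction=0.20):
    """Fraction of the existing AWL mean to apply as a mismatch penalty.

    Grows linearly with the number of messages behind the entry and is
    capped at max_fraction once full_count messages have been seen.
    """
    return max_fraction * min(count, full_count) / full_count


def anti_awl_penalty(mean, count):
    """Penalty added to a message whose sender matches an existing entry
    but whose server does not."""
    return mismatch_fraction(count) * abs(mean)


# Entry built from 100 hams averaging -2.1: penalty of about 0.42.
print(round(anti_awl_penalty(-2.1, 100), 3))
# Same average but only 20 messages behind it: penalty of about 0.084.
print(round(anti_awl_penalty(-2.1, 20), 3))
```

The first case reproduces the .42 from the example above; the second
shows how a thinner history would count for less.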

-- 
When we woke up that morning we had no way of knowing that in a
	matter of hours we'd changed the way we were going.  Where would
	I be now? Where would I be now if we'd never met?  Would I be
	singing this song to someone else instead?

Re: 'anti' AWL

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 28.04.09 08:43, LuKreme wrote:
> OK, working on my first cup of coffee this morning, so maybe this has  
> potential.
>
> The way the AWL works is by keeping track of the origin of emails, both 
> the address and the server (the top line Received header?) that send the 
> email.  So, lets say that I have a lot of email from foo@example.com and 
> that foo's email is sent to me via mail.example.com.
>
> Now, I get an email claiming to be from foo@example.com but sent to me  
> from suspiciousserver.tld, so the AWL is not applied.
>
> But if I've gotten 50 emails from foo@example.com and all came through  
> mail.example.com it seems that it would be beneficial to have a 'anti'  
> AWL score score applied to this particular email, since it claims to be 
> from one place, but doesn't match the AWL entry. This, naturally would 
> start of a new AWL entry, but with a slightly higher score than  
> otherwise.

We have more servers users send mail through. Users can't choose which
server they will connect to.
It can also happen when a user switches ISP or mail provider, or the mail
provider changes IP addresses, DNS names, or whatever is used there.
This would require much more logic than is currently in AWL.

> This would even be useful if the original AWL entry is spammish since  
> multiple servers might be a sign of a botnet or host hopping, so  
> applying a little spammish nudge to these messages is probably going to 
> help out a lot, especially if spammer@fakedoamin.tld is sending mails 
> from, say, 10 different server then all those AWL mismatches are going to 
> feed each other into moving that AWL up very very fast.

The question is if users tend to repeatedly get spam from the same sender
through the same servers. 
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends?