Posted to users@spamassassin.apache.org by Warren Togami <wt...@redhat.com> on 2009/12/17 00:10:12 UTC

Whitelists, not directly useful to spamassassin...

I made a discovery today that surprised even myself.  Using the rescore 
masscheck and weekly masscheck logs while working on Bug #6247 I found 
some interesting details that throw a wrench into this lively debate.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
It turns out that the ReturnPath and DNSWL whitelists have a 
statistically insignificant impact on spamassassin's ability to 
determine ham vs. spam.  Meanwhile, both whitelists have high levels of 
accuracy.

How can both of these statements be true?  I suspect this is because the 
scores are balanced by the rescoring algorithm to be "safe" in the 
majority case where no whitelist rule has triggered.  Thus whitelists 
are not needed or relied upon to prevent false positive classification.

While whitelists are not directly effective (statistically, when 
averaged across a large corpus), whitelists are powerful tools in 
indirect ways including:

* Pushing the score beyond the auto-learn threshold for things like 
Bayes to function without manual intervention.
* The albeit controversial method where some automated spam trap 
blacklists use whitelists to help determine if they really should list 
an IP address.
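
To illustrate the auto-learn point in the first bullet, here is a rough sketch in Python, not spamassassin's actual code.  It assumes the stock thresholds of 0.1 (bayes_auto_learn_threshold_nonspam) and 12.0 (bayes_auto_learn_threshold_spam).  The function name and the example scores are made up, and the real auto-learn logic applies further conditions that are ignored here.

```python
# Simplified sketch of the auto-learn decision; not SpamAssassin's real logic.
# Assumed defaults: bayes_auto_learn_threshold_nonspam = 0.1,
#                   bayes_auto_learn_threshold_spam    = 12.0
HAM_THRESHOLD = 0.1
SPAM_THRESHOLD = 12.0


def autolearn_decision(score: float) -> str:
    """Return 'ham', 'spam', or 'no' for a given auto-learn score."""
    if score < HAM_THRESHOLD:
        return "ham"
    if score >= SPAM_THRESHOLD:
        return "spam"
    return "no"


# Hypothetical ham message that scores 1.5 from all other rules.
other_rules = 1.5
whitelist_score = -5.0  # largest whitelist credit in 3.3.0

without_whitelist = autolearn_decision(other_rules)
with_whitelist = autolearn_decision(other_rules + whitelist_score)
print(without_whitelist, with_whitelist)
```

In this made-up case the message is not learned at all without the whitelist hit, and is learned as ham with it.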

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
spamassassin-3.3.0 has reduced the score impact of these whitelists to 
more modest levels, maxing out at -5 points.  -5 is PLENTY for 
spamassassin, as 5 points is the threshold the scoreset is tuned for. 
Mail from a whitelisted host would need more than 10 points from other 
rules to be blocked, which is statistically very rare for ham.  I believe 
that we are striking the right balance with these modest whitelist scores 
in this release.
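
The arithmetic behind the 10-point figure can be sketched as follows.  This is a back-of-the-envelope illustration only: the function name and sample scores are hypothetical, and it assumes a message is flagged once its total reaches the required score of 5.

```python
# Back-of-the-envelope check of the "10 points" claim.
# Assumption: a message is flagged once its total score reaches REQUIRED_SCORE.
REQUIRED_SCORE = 5.0    # threshold the scoreset is tuned for
WHITELIST_SCORE = -5.0  # largest whitelist credit in 3.3.0


def is_flagged(other_rules_score: float, whitelisted: bool) -> bool:
    """Apply the whitelist credit (if any) and compare against the threshold."""
    total = other_rules_score + (WHITELIST_SCORE if whitelisted else 0.0)
    return total >= REQUIRED_SCORE


# Hypothetical scores contributed by all non-whitelist rules.
for other in (4.0, 6.0, 9.9, 10.5):
    print(other, is_flagged(other, whitelisted=False), is_flagged(other, whitelisted=True))
```

A message scoring 6.0 from other rules is flagged without a whitelist hit but passes with one.  Only once the other rules contribute about 10 points does the whitelisted message get flagged as well.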

That being said, whitelists should be constantly policed to maintain 
their reputation and trust levels.  For example, while I currently am 
impressed by DNSWL's performance, I am not pleased that they seem to 
lack automated trap-based enforcement.  Relying only on manual reports 
and manual intervention requires too much effort in the long term for 
any organization, be it company-run or volunteer-run.

Warren Togami
wtogami@redhat.com

Re: Whitelists, not directly useful to spamassassin...

Posted by Charles Gregory <cg...@hwcn.org>.
Thank you, Warren. That (finally) gives some real perspective to this 
mess, and gets some of the 'real' questions answered.

- C

On Wed, 16 Dec 2009, Warren Togami wrote:
> I made a discovery today that surprised even myself.  Using the rescore 
> masscheck and weekly masscheck logs while working on Bug #6247 I found some 
> interesting details that throws a wrench into this lively debate.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
> It turns out that the ReturnPath and DNSWL whitelists have a statistically 
> insignificant impact on spamassassin's ability to determine ham vs. spam. 
> Meanwhile, both whitelists have high levels of accuracy.
>
> How can both of these statements be true?  I suspect this is because the 
> scores are balanced by the rescoring algorithm to be "safe" in the majority 
> case where no whitelist rule has triggered.  Thus whitelists are not needed 
> or relied upon to prevent false positive classification.
>
> While whitelists are not directly effective (statistically, when averaged 
> across a large corpus), whitelists are powerful tools in indirect ways 
> including:
>
> * Pushing the score beyond the auto-learn threshold for things like Bayes to 
> function without manual intervention.
> * The albeit controversial method where some automated spam trap blacklists 
> use whitelists to help determine if they really should list an IP address.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
> spamassassin-3.3.0 has reduced the score impact of these whitelists to more 
> modest levels, maxing out at -5 points.  -5 is PLENTY for spamassassin, as 5 
> points is the level which the scoreset is tuned. Mail from a whitelisted host 
> would need greater than 10 points to be blocked, which is statistically very 
> rare for ham.  I believe that we are striking the right balance with these 
> modest whitelist scores in this release.
>
> That being said, whitelists should be constantly policed to maintain their 
> reputation and trust levels.  For example, while I currently am impressed by 
> DNSWL's performance, I am not pleased that they seem to lack automated 
> trap-based enforcement.  Relying only on manual reports and manual 
> intervention requires too much effort in the long-term for any organization, 
> be it company or volunteer run.
>
> Warren Togami
> wtogami@redhat.com
>
>

Re: Whitelists, not directly useful to spamassassin...

Posted by Per Jessen <pe...@computer.org>.
Warren Togami wrote:

> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
> It turns out that the ReturnPath and DNSWL whitelists have a
> statistically insignificant impact on spamassassin's ability to
> determine ham vs. spam.  Meanwhile, both whitelists have high levels
> of accuracy.
> 
> How can both of these statements be true?  I suspect this is because
> the scores are balanced by the rescoring algorithm to be "safe" in the
> majority case where no whitelist rule has triggered.  Thus whitelists
> are not needed or relied upon to prevent false positive
> classification.

I concur, that is what my analysis of HABEAS hits over the last four
months showed too. 


/Per Jessen, Zürich


Re: Whitelists, not directly useful to spamassassin...

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> Warren Togami wrote:
>> While whitelists are not directly effective (statistically, when  
>> averaged across a large corpus), whitelists are powerful tools in  
>> indirect ways including:
>>
>> * Pushing the score beyond the auto-learn threshold for things like  
>> Bayes to function without manual intervention.

On 17.12.09 11:27, Jason Bertoch wrote:
> This does not sound like a positive thing to me.  E-mail from any sender  
> that is malformed enough to skip auto-learning should not be forced into  
> Bayes as ham simply because some 3rd party promises, for their own  
> monetary benefit, that the sender is a nice guy.  Why should any sender  
> that I have not intentionally added to my local whitelist get a break?

If you _want_ the mail and whitelist the sender, I think its characteristics
should be pushed into Bayes.
If you don't want the mail, then autolearning it as spam is the least of your
problems.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux IS user friendly, it's just selective who its friends are...

Re: Whitelists, not directly useful to spamassassin...

Posted by Warren Togami <wt...@redhat.com>.
On 12/17/2009 11:27 AM, Jason Bertoch wrote:
>
> If whitelists are to be enabled by default, I believe their score should
> be moved considerably more toward zero.
>
> /Jason

I don't necessarily disagree with this desire, as now we know the 
whitelists actually are making almost zero difference to spamassassin's 
results.

We did at least reduce the scores from their default values that were in 
spamassassin-3.2.x as a reasonable compromise.

Warren

Re: Whitelists, not directly useful to spamassassin...

Posted by Jason Bertoch <ja...@i6ix.com>.
Warren Togami wrote:

> 
> While whitelists are not directly effective (statistically, when 
> averaged across a large corpus), whitelists are powerful tools in 
> indirect ways including:
> 
> * Pushing the score beyond the auto-learn threshold for things like 
> Bayes to function without manual intervention.

This does not sound like a positive thing to me.  E-mail from any sender 
that is malformed enough to skip auto-learning should not be forced into 
Bayes as ham simply because some 3rd party promises, for their own 
monetary benefit, that the sender is a nice guy.  Why should any sender 
that I have not intentionally added to my local whitelist get a break?

I've had enough problems with DNSWL, HABEAS, and JMF that they have all 
been disabled here.  Unfortunately, that also means I have no recent 
data to add to the debate.  Although I believe that whitelists should be 
included in the default install for those that want them, I also believe 
they should be disabled by default so that an admin must knowingly 
enable them after reading the manual and considering the consequences.

The argument has also been made that whitelists should be included 
simply because blacklists are.  I think that argument is flawed. 
Blacklists are part of the spam fighting community while whitelists are 
part of the bulk delivery community.  Their goals and motives are 
completely different.  For one, blacklists will normally have evidence 
of abuse to support their listing.  Whitelists only have policies and 
promises.  Second, the scoring of whitelists is currently favored over 
blacklists, and will continue to be at the proposed settings for 3.3.0. 
Why can a whitelist override the score of a blacklist when it is the 
blacklist that has evidence of abuse?


After reading up on Bug #6247, I found that ReturnPath included 
interesting stats on their lists:

Certified
Active: 4407
Suspended: 1300
Total: 5707

Safe
Active: 6561
Suspended: 283
Total: 6844


The Certified list is supposedly difficult to get on so I'm not sure how 
to interpret these results.  Is 1/5 of the list suspended because of due 
diligence on the part of ReturnPath?  If so, how did they get certified 
in the first place?
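
For reference, the suspended share of each list can be worked out directly from the totals above.  This is only arithmetic on the quoted figures; the variable and function names are illustrative.

```python
# Suspension rates computed from the ReturnPath figures quoted above.
lists = {
    "Certified": {"active": 4407, "suspended": 1300},
    "Safe": {"active": 6561, "suspended": 283},
}


def suspended_share(entry: dict) -> float:
    """Fraction of a list's total entries that are currently suspended."""
    total = entry["active"] + entry["suspended"]
    return entry["suspended"] / total


rates = {name: suspended_share(entry) for name, entry in lists.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} suspended")
```

That works out to roughly 22.8% of the Certified list and 4.1% of the Safe list.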

If whitelists are to be enabled by default, I believe their score should 
be moved considerably more toward zero.

/Jason

Re: Whitelists, not directly useful to spamassassin...

Posted by "J.D. Falk" <jd...@cybernothing.org>.
Very interesting data indeed -- and a testament to the accuracy of the SpamAssassin rules weighting process.

On Dec 16, 2009, at 4:10 PM, Warren Togami wrote:

> While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
> 
> * Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.
> * The albeit controversial method where some automated spam trap blacklists use whitelists to help determine if they really should list an IP address.

Another indirect benefit (according to other users of our whitelists) is that when they implement a new spam-blocking method, the whitelists serve as a kind of safety valve to let legitimate mail through even when the new rule turns out to have false positives.

Site-specific whitelists are important for this, too.

> That being said, whitelists should be constantly policed to maintain their reputation and trust levels.

Agreed.

--
J.D. Falk <jd...@returnpath.net>
Return Path Inc