Posted to users@spamassassin.apache.org by Warren Togami <wt...@redhat.com> on 2009/12/17 00:10:12 UTC
Whitelists, not directly useful to spamassassin...
I made a discovery today that surprised even myself. Using the rescore
masscheck and weekly masscheck logs while working on Bug #6247 I found
some interesting details that throw a wrench into this lively debate.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
It turns out that the ReturnPath and DNSWL whitelists have a
statistically insignificant impact on spamassassin's ability to
determine ham vs. spam. Meanwhile, both whitelists have high levels of
accuracy.
How can both of these statements be true? I suspect this is because the
scores are balanced by the rescoring algorithm to be "safe" in the
majority case where no whitelist rule has triggered. Thus whitelists
are not needed or relied upon to prevent false positive classification.
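As an editorial illustration (not part of the original post), here is a toy sketch of how a whitelist can be accurate and still make no difference to the outcome. The corpus, the per-message scores, and the function names are all made up for the example; only the 5-point threshold and the -5 whitelist score come from this thread.

```python
# Toy illustration only: these messages and scores are invented, not masscheck data.
THRESHOLD = 5.0          # spam threshold mentioned in the post
WHITELIST_SCORE = -5.0   # maximum whitelist adjustment in 3.3.0, per the post

# Each entry: (is_spam, score_from_other_rules, whitelist_hit)
corpus = [
    (False, 0.3, True),
    (False, 1.2, True),
    (False, 2.8, False),
    (False, 4.1, True),
    (True, 7.5, False),
    (True, 12.0, False),
    (True, 6.2, False),
    (True, 9.9, False),
]

def classify(base_score, whitelisted, use_whitelist):
    """Return True if the message would be flagged as spam."""
    adjustment = WHITELIST_SCORE if (whitelisted and use_whitelist) else 0.0
    return base_score + adjustment >= THRESHOLD

def error_count(use_whitelist):
    """Count misclassified messages with the whitelist rule on or off."""
    return sum(classify(s, w, use_whitelist) != is_spam for is_spam, s, w in corpus)

# Accuracy of the whitelist itself: share of its hits that are ham.
hits = [is_spam for is_spam, _, w in corpus if w]
whitelist_accuracy = hits.count(False) / len(hits)

print(whitelist_accuracy)                     # 1.0
print(error_count(False), error_count(True))  # 0 0
```

In this toy corpus the whitelist only ever hits ham, yet the error count is the same with or without it, because the other rules already keep that ham below the threshold.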
While whitelists are not directly effective (statistically, when
averaged across a large corpus), whitelists are powerful tools in
indirect ways including:
* Pushing the score beyond the auto-learn threshold for things like
Bayes to function without manual intervention.
* The albeit controversial method where some automated spam trap
blacklists use whitelists to help determine if they really should list
an IP address.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
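As an editorial illustration of the auto-learn point above, here is a minimal sketch. It assumes a ham auto-learn threshold of 0.1 (my understanding of the default for bayes_auto_learn_threshold_nonspam) and ignores the other conditions real auto-learning applies; the base score and the function name are hypothetical.

```python
# Assumed value: believed to be the default for bayes_auto_learn_threshold_nonspam.
AUTOLEARN_HAM_THRESHOLD = 0.1

def autolearns_as_ham(score):
    """Simplified check; real auto-learning applies further conditions."""
    return score < AUTOLEARN_HAM_THRESHOLD

base_score = 1.5         # hypothetical ham that trips a few minor rules
whitelist_score = -5.0   # maximum whitelist adjustment, per the post

print(autolearns_as_ham(base_score))                    # False
print(autolearns_as_ham(base_score + whitelist_score))  # True
```

Under these assumptions the whitelist hit is what moves the message below the ham auto-learn threshold, even though it was never at risk of being classified as spam.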
spamassassin-3.3.0 has reduced the score impact of these whitelists to
more modest levels, maxing out at -5 points. -5 is PLENTY for
spamassassin, as 5 points is the level for which the scoreset is tuned.
Mail from a whitelisted host would need greater than 10 points to be
blocked, which is statistically very rare for ham. I believe that we
are striking the right balance with these modest whitelist scores in
this release.
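As an editorial illustration, the arithmetic behind the 10-point figure can be sketched as follows. The function name is hypothetical, and the sketch assumes a message is flagged once its total reaches the threshold.

```python
SPAM_THRESHOLD = 5.0        # threshold the scoreset is tuned for, per the post
MAX_WHITELIST_SCORE = -5.0  # largest whitelist adjustment in 3.3.0, per the post

def is_blocked(other_points, whitelisted):
    """Return True if the total score reaches the spam threshold."""
    total = other_points + (MAX_WHITELIST_SCORE if whitelisted else 0.0)
    # Assumption: mail is flagged once the total reaches the threshold.
    return total >= SPAM_THRESHOLD

# Points the other rules must contribute before whitelisted mail is blocked.
points_needed = SPAM_THRESHOLD - MAX_WHITELIST_SCORE

print(points_needed)           # 10.0
print(is_blocked(7.0, False))  # True
print(is_blocked(7.0, True))   # False
```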
That being said, whitelists should be constantly policed to maintain
their reputation and trust levels. For example, while I currently am
impressed by DNSWL's performance, I am not pleased that they seem to
lack automated trap-based enforcement. Relying only on manual reports
and manual intervention requires too much effort in the long term for
any organization, be it company- or volunteer-run.
Warren Togami
wtogami@redhat.com
Re: Whitelists, not directly useful to spamassassin...
Posted by Charles Gregory <cg...@hwcn.org>.
Thank you, Warren. That (finally) gives some real perspective to this
mess, and gets some of the 'real' questions answered.
- C
On Wed, 16 Dec 2009, Warren Togami wrote:
> I made a discovery today that surprised even myself. Using the rescore
> masscheck and weekly masscheck logs while working on Bug #6247 I found some
> interesting details that throw a wrench into this lively debate.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
> It turns out that the ReturnPath and DNSWL whitelists have a statistically
> insignificant impact on spamassassin's ability to determine ham vs. spam.
> Meanwhile, both whitelists have high levels of accuracy.
>
> How can both of these statements be true? I suspect this is because the
> scores are balanced by the rescoring algorithm to be "safe" in the majority
> case where no whitelist rule has triggered. Thus whitelists are not needed
> or relied upon to prevent false positive classification.
>
> While whitelists are not directly effective (statistically, when averaged
> across a large corpus), whitelists are powerful tools in indirect ways
> including:
>
> * Pushing the score beyond the auto-learn threshold for things like Bayes to
> function without manual intervention.
> * The albeit controversial method where some automated spam trap blacklists
> use whitelists to help determine if they really should list an IP address.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
> spamassassin-3.3.0 has reduced the score impact of these whitelists to more
> modest levels, maxing out at -5 points. -5 is PLENTY for spamassassin, as 5
> points is the level for which the scoreset is tuned. Mail from a whitelisted host
> would need greater than 10 points to be blocked, which is statistically very
> rare for ham. I believe that we are striking the right balance with these
> modest whitelist scores in this release.
>
> That being said, whitelists should be constantly policed to maintain their
> reputation and trust levels. For example, while I currently am impressed by
> DNSWL's performance, I am not pleased that they seem to lack automated
> trap-based enforcement. Relying only on manual reports and manual
> intervention requires too much effort in the long term for any organization,
> be it company- or volunteer-run.
>
> Warren Togami
> wtogami@redhat.com
>
>
Re: Whitelists, not directly useful to spamassassin...
Posted by Per Jessen <pe...@computer.org>.
Warren Togami wrote:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
> It turns out that the ReturnPath and DNSWL whitelists have a
> statistically insignificant impact on spamassassin's ability to
> determine ham vs. spam. Meanwhile, both whitelists have high levels
> of accuracy.
>
> How can both of these statements be true? I suspect this is because
> the scores are balanced by the rescoring algorithm to be "safe" in the
> majority case where no whitelist rule has triggered. Thus whitelists
> are not needed or relied upon to prevent false positive
> classification.
I concur, that is what my analysis of HABEAS hits over the last four
months showed too.
/Per Jessen, Zürich
Re: Whitelists, not directly useful to spamassassin...
Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> Warren Togami wrote:
>> While whitelists are not directly effective (statistically, when
>> averaged across a large corpus), whitelists are powerful tools in
>> indirect ways including:
>>
>> * Pushing the score beyond the auto-learn threshold for things like
>> Bayes to function without manual intervention.
On 17.12.09 11:27, Jason Bertoch wrote:
> This does not sound like a positive thing to me. E-mail from any sender
> that is malformed enough to skip auto-learning should not be forced into
> Bayes as ham simply because some 3rd party promises, for their own
> monetary benefit, that the sender is a nice guy. Why should any sender
> that I have not intentionally added to my local whitelist get a break?
If you _want_ the mail and whitelist the sender, I think its characteristics
should be pushed into the bayes.
If you don't want the mail, then autolearning it as spam is the least of
your problems.
--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux IS user friendly, it's just selective who its friends are...
Re: Whitelists, not directly useful to spamassassin...
Posted by Warren Togami <wt...@redhat.com>.
On 12/17/2009 11:27 AM, Jason Bertoch wrote:
>
> If whitelists are to be enabled by default, I believe their score should
> be moved considerably more toward zero.
>
> /Jason
I don't necessarily disagree with this desire, as now we know the
whitelists actually are making almost zero difference to spamassassin's
results.
We did at least reduce the scores from their default values that were in
spamassassin-3.2.x as a reasonable compromise.
Warren
Re: Whitelists, not directly useful to spamassassin...
Posted by Jason Bertoch <ja...@i6ix.com>.
Warren Togami wrote:
>
> While whitelists are not directly effective (statistically, when
> averaged across a large corpus), whitelists are powerful tools in
> indirect ways including:
>
> * Pushing the score beyond the auto-learn threshold for things like
> Bayes to function without manual intervention.
This does not sound like a positive thing to me. E-mail from any sender
that is malformed enough to skip auto-learning should not be forced into
Bayes as ham simply because some 3rd party promises, for their own
monetary benefit, that the sender is a nice guy. Why should any sender
that I have not intentionally added to my local whitelist get a break?
I've had enough problems with DNSWL, HABEAS, and JMF that they have all
been disabled here. Unfortunately, that also means I have no recent
data to add to the debate. Although I believe that whitelists should be
included in the default install for those that want them, I also believe
they should be disabled by default so that an admin must knowingly
enable them after reading the manual and considering the consequences.
The argument has also been made that whitelists should be included
simply because blacklists are. I think that argument is flawed.
Blacklists are part of the spam fighting community while whitelists are
part of the bulk delivery community. Their goals and motives are
completely different. For one, blacklists will normally have evidence
of abuse to support their listing. Whitelists only have policies and
promises. Second, the scoring of whitelists is currently favored over
blacklists, and will continue to be at the proposed settings for 3.3.0.
Why can a whitelist override the score of a blacklist when it is the
blacklist that has evidence of abuse?
After reading up on Bug 6247, I found that ReturnPath included
interesting stats on their lists:
Certified
  Active:    4407
  Suspended: 1300
  Total:     5707
Safe
  Active:    6561
  Suspended: 283
  Total:     6844
The Certified list is supposedly difficult to get on so I'm not sure how
to interpret these results. Is 1/5 of the list suspended because of due
diligence on the part of ReturnPath? If so, how did they get certified
in the first place?
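As an editorial illustration, the suspended share of each list can be worked out from the figures quoted above. The function and variable names are hypothetical.

```python
# Figures as quoted above from the ReturnPath stats.
lists = {
    "Certified": {"active": 4407, "suspended": 1300},
    "Safe": {"active": 6561, "suspended": 283},
}

def suspended_share(active, suspended):
    """Fraction of all listed entries that are currently suspended."""
    return suspended / (active + suspended)

for name, counts in lists.items():
    total = counts["active"] + counts["suspended"]
    share = suspended_share(counts["active"], counts["suspended"])
    print(f"{name}: total={total}, suspended={share:.1%}")
# Certified: total=5707, suspended=22.8%
# Safe: total=6844, suspended=4.1%
```

So the Certified list has a little over one fifth of its entries suspended, while the Safe list has roughly one in twenty-five.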
If whitelists are to be enabled by default, I believe their score should
be moved considerably more toward zero.
/Jason
Re: Whitelists, not directly useful to spamassassin...
Posted by "J.D. Falk" <jd...@cybernothing.org>.
Very interesting data indeed -- and a testament to the accuracy of the SpamAssassin rules weighting process.
On Dec 16, 2009, at 4:10 PM, Warren Togami wrote:
> While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
>
> * Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.
> * The albeit controversial method where some automated spam trap blacklists use whitelists to help determine if they really should list an IP address.
Another indirect benefit (according to other users of our whitelists) is that when they implement a new spam-blocking method, the whitelists serve as kind of a safety valve to let legitimate mail through even when the new rule turns out to have false positives.
Site-specific whitelists are important for this, too.
> That being said, whitelists should be constantly policed to maintain their reputation and trust levels.
Agreed.
--
J.D. Falk <jd...@returnpath.net>
Return Path Inc