Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2010/02/23 16:18:20 UTC

[Bug 6344] New: ReturnPath and DNSWL rules should not autolearn

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

           Summary: ReturnPath and DNSWL rules should not autolearn
           Product: Spamassassin
           Version: 3.3.0
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jason@electronet.net


Due to the risk of false positives poisoning Bayes, and the fact that the other
SA whitelist and blacklist rules already skip autolearning, the noautolearn
tflag should be appended to the ReturnPath (RP) and DNSWL rules.

Suggest replacing the following tflags lines:

tflags RCVD_IN_RP_CERTIFIED     net nice
tflags RCVD_IN_RP_SAFE          net nice
tflags RCVD_IN_DNSWL_HI         nice net
tflags RCVD_IN_DNSWL_LOW        nice net
tflags RCVD_IN_DNSWL_MED        nice net

with:

tflags RCVD_IN_RP_CERTIFIED     net nice noautolearn
tflags RCVD_IN_RP_SAFE          net nice noautolearn
tflags RCVD_IN_DNSWL_HI         nice net noautolearn
tflags RCVD_IN_DNSWL_LOW        nice net noautolearn
tflags RCVD_IN_DNSWL_MED        nice net noautolearn

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #2 from Jason Bertoch <ja...@i6ix.com> 2010-12-15 13:53:40 UTC ---
While I still believe this bug is legitimate, you should understand that adding
"noautolearn" to the rules' tflags doesn't prevent a message from being
auto-learned.  Instead, it only means this test is ignored when calculating
scores for the learning system.  While it may help prevent messages from being
auto-learned, it doesn't guarantee it.
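
To illustrate the distinction, here is a minimal Python sketch, not
SpamAssassin's actual code. It assumes autolearning simply sums the scores of
the rules that hit, skips rules whose tflags include learn, userconf or
noautolearn, and compares the result with the default thresholds (0.1 for ham,
12.0 for spam); the real check has further conditions. The HYPO_* rule names
and all scores are made up for the example.

```python
# Simplified model of the autolearn decision; NOT the real implementation.
IGNORED_TFLAGS = {"learn", "userconf", "noautolearn"}
HAM_THRESHOLD = 0.1    # default bayes_auto_learn_threshold_nonspam
SPAM_THRESHOLD = 12.0  # default bayes_auto_learn_threshold_spam

# Hypothetical scores, chosen only to make the arithmetic easy to follow.
SCORES = {"RCVD_IN_DNSWL_HI": -5.0, "HYPO_SPAMMY_BODY": 2.0, "HYPO_TINY": 0.001}


def autolearn_points(hits, tflags):
    """Sum the scores of the rules that hit, skipping ignored tflags."""
    return sum(
        SCORES[rule]
        for rule in hits
        if not (tflags.get(rule, set()) & IGNORED_TFLAGS)
    )


def autolearn_decision(points):
    """Return 'ham', 'spam' or 'no' for a given learning score."""
    if points < HAM_THRESHOLD:
        return "ham"
    if points >= SPAM_THRESHOLD:
        return "spam"
    return "no"


before = {"RCVD_IN_DNSWL_HI": {"nice", "net"}}
after = {"RCVD_IN_DNSWL_HI": {"nice", "net", "noautolearn"}}

spammy = ["RCVD_IN_DNSWL_HI", "HYPO_SPAMMY_BODY"]
bland = ["RCVD_IN_DNSWL_HI", "HYPO_TINY"]

print(autolearn_decision(autolearn_points(spammy, before)))  # DNSWL counted
print(autolearn_decision(autolearn_points(spammy, after)))   # DNSWL ignored
print(autolearn_decision(autolearn_points(bland, after)))    # still learned
```

In this model the flag stops the whitelist hit from dragging the first message
under the ham threshold, but the last message is still learned as ham because
its remaining score is below 0.1 on its own.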

In the meantime, feel free to add the suggestions above to your local.cf, or
even disable the RP rules by setting their score to zero.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #6 from RW <rw...@googlemail.com> ---
(In reply to comment #4)
> This thread is mainly about Bayes and that DNSWL may be decreasing the score
> somewhat, but DNSWL is not the culprit.

The bayes score is a symptom.

I think it's very likely that DNSWL is the reason BAYES is failing in the
first place. If you ignore bayes and look at the other rules hit, the messages
would all have had scores well above the threshold if it weren't for DNSWL.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #9 from RW <rw...@googlemail.com> ---
(In reply to comment #8)
> If this is a discussion on the efficacy and scoring of RP, DNSWL or other
> rules, so be it.  But a discussion of not autolearning specific rules, that
> sounds flawed and unmaintainable to me. Here are my thoughts:
> 
> First, to my understanding, the noautolearn setting in question is a
> masscheck setting.  It doesn't change production systems.

No, autolearning uses a non-Bayes score set and additionally ignores rules
marked as noautolearn or userconf.
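
As a rough sketch of the score-set part (Python, illustrative only): a score
line can carry four values, for score sets 0 (no Bayes, no network tests),
1 (network tests only), 2 (Bayes only) and 3 (both), and the non-Bayes
counterpart of a set is obtained by clearing the Bayes bit. The function name
and the rule scores below are hypothetical.

```python
BAYES_BIT = 2  # score sets 2 and 3 are the ones with Bayes enabled


def autolearn_score_set(active_set):
    """Return the non-Bayes counterpart of the active score set (0..3)."""
    return active_set & ~BAYES_BIT


# Hypothetical four-value score line, indexed by score set 0, 1, 2, 3.
hypo_rule_scores = (1.0, 1.5, 0.8, 1.2)

active = 3  # Bayes and network tests both enabled
learning_score = hypo_rule_scores[autolearn_score_set(active)]
print(autolearn_score_set(active), learning_score)
```

So with Bayes and network tests enabled, the rule would count for its set-1
score when deciding whether to autolearn, not for the set-3 score used for the
message's final result.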


> Second, It would seem to me that if you don't trust the set of rules to
> score very high, you change the scores.  

The scores are assigned to distinguish spam from what is not proven to be spam.

> Third, If you think the scores are not accurate, we get more people
> assisting with rule QA and improve the scores.

That works for spam because we optimize for a threshold and then add a safety
margin. It won't work for ham because we don't have a three-way classification.

Even if we did have a three-way classification, we don't have enough "nice"
rules to positively identify ham.

> Finally, the concept of not learning for the bayesian system based on
> certain rules hitting/not-hitting for production systems seems to have
> little merit to me.  

It's more the DNS whitelist rules that are the anomaly. If I add an
authenticated address to a whitelist it's ignored for autolearning, but if a
direct marketer pays money to Return-Path that does contribute.

The DNS whitelists should be seen as a way of avoiding FPs, not as a way of
positively identifying ham.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com

--- Comment #8 from Kevin A. McGrail <km...@pccc.com> ---
If this is a discussion on the efficacy and scoring of RP, DNSWL or other
rules, so be it.  But a discussion of not autolearning specific rules, that
sounds flawed and unmaintainable to me. Here are my thoughts:

First, to my understanding, the noautolearn setting in question is a masscheck
setting.  It doesn't change production systems.

Second, It would seem to me that if you don't trust the set of rules to score
very high, you change the scores.  

Third, If you think the scores are not accurate, we get more people assisting
with rule QA and improve the scores.

Finally, the concept of not learning for the bayesian system based on certain
rules hitting/not-hitting for production systems seems to have little merit to
me.  


Regards,
KAM


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #13 from RW <rw...@googlemail.com> ---
(In reply to comment #12)

>  I continually keep circling back to the fact that
> rules should score appropriately with minimal false hits.  That includes
> hammy rules.

As I said before, we don't have any meaningful QA mechanism for this.

It's not possible to optimize for two things simultaneously. The score-set that
optimizes the TP rate at 5.0 with an FP constraint isn't going to be an
optimal score-set for maximizing ham learning at 0.1 with a mislearning
constraint.

In theory it is possible to do it with a single optimization if you close the
loop and allow mistraining to affect the scores at 5.0, but that means that all
the BAYES results would need to be dynamically recomputed from a fresh database
for each set of rule scores, and that's simply impractical.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

Jason Bertoch <ja...@i6ix.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jason@i6ix.com
            Version|3.3.0                       |unspecified


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #14 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to comment #13)
> (In reply to comment #12)
> 
> >  I continually keep circling back to the fact that
> > rules should score appropriately with minimal false hits.  That includes
> > hammy rules.
> 
> As I said before, we don't have any meaningful QA mechanism for this.

Barring an automated mechanism, I think someone needs to perform SOME analysis.
I don't think anyone disagrees that the scores are perhaps weighted too highly,
but I see some issues with modifying bayes to accommodate scores.

At the very worst, making this noautolearn change and/or the score change, and
reporting back on the impact you think it has had, would be better than where
we are now.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #12 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to comment #10)
> I'd suggest that as a general practice _any_ DNS-based rule having a
> negative score should have the "noautolearn" tflag set. It's not so much a
> matter of mistrust as a recognition that a temporary mistake by the DNS
> service could cause Bayes to go off the rails.

Thanks btw for checking that noautolearn impacts Bayes learning. I missed that.

I disagree with this.  I continually keep circling back to the fact that rules
should score appropriately with minimal false hits.  That includes hammy rules.

You are saying that negative DNS based tests should not impact bayes and I
agree that this is more of a symptom.  We should look at lowering the scores of
those rules if they are rippling that badly.

> 
> > Finally, the concept of not learning for the bayesian system based on
> > certain rules hitting/not-hitting for production systems seems to have
> > little merit to me.  
> 
> It's not so much that a DNSWL rule hit would suppress autolearning as, if
> the message is _still hammy_ when DNSWL is not considered, it should be
> autolearned.

To me this implies a lack of trust in the rule efficacy and scoring, and it is
that which needs to be adjusted, not the bayesian system.

> So, +1 from me on the initial suggestion, plus review of other DNS-based
> standard rules for the same change (which will be quick, I don't think many
> reduce the score). I agree with Jason, "users can easily implement the rule
> modifications in their site config" is not an appropriate response to this
> particular case.

Sorry, at best I'm 0 and I'm not going to stand in your way if you do the work,
submit the code and follow-up on it with some analysis.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #11 from AXB <ax...@gmail.com> ---
I'm not convinced this will solve what most ppl are seeing:
few rule hits = low scores and one of the rules includes DNSWL.

From the reports, bayes alone would seldom have raised the score above threshold
either, unless they're constantly feeding bayes from traps or some other
automated method.

Imo, we're all focusing on DNSWL and RCVD_IN_RP_* but the problem is somewhere
else, and unless we see more samples of the messages which cause these false
negatives we're pretty much guessing what could help.

I'd prefer to question the trust & scores we give DNSWL and RCVD_IN_RP_*


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

ahayes <ah...@polkaroo.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ahayes@polkaroo.net,
                   |                            |warren@togami.com

--- Comment #1 from ahayes <ah...@polkaroo.net> 2010-12-15 10:18:03 UTC ---
I have just set up SpamAssassin and have had several pieces of spam get through
with "autolearn=ham" thanks to the default Return Path whitelist rules.

I can not find anywhere to report these offending messages to Return Path and
have struggled to make myself a system for easily telling spamassassin to
forget them. I'm also struggling to get the rules disabled in my install while
also benefitting from sa-update.

So +1 from me (a new user) for this bug.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #10 from John Hardin <jh...@impsec.org> ---
(In reply to comment #8)
> First, to my understanding, the noautolearn setting in question is a
> masscheck setting.  It doesn't change production systems.

Apparently that's not true. Per the documentation:

    $score = $status->get_autolearn_points()
        Return the message's score as computed for auto-learning. Certain
        tests are ignored:

          - rules with tflags set to 'learn' (the Bayesian rules)

          - rules with tflags set to 'userconf' (user white/black-listing
            rules, etc)

          - rules with tflags set to 'noautolearn'

I'd suggest that as a general practice _any_ DNS-based rule having a negative
score should have the "noautolearn" tflag set. It's not so much a matter of
mistrust as a recognition that a temporary mistake by the DNS service could
cause Bayes to go off the rails.

> Finally, the concept of not learning for the bayesian system based on
> certain rules hitting/not-hitting for production systems seems to have
> little merit to me.  

It's not so much that a DNSWL rule hit would suppress autolearning as, if the
message is _still hammy_ when DNSWL is not considered, it should be
autolearned.


So, +1 from me on the initial suggestion, plus review of other DNS-based
standard rules for the same change (which will be quick, I don't think many
reduce the score). I agree with Jason, "users can easily implement the rule
modifications in their site config" is not an appropriate response to this
particular case.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

John Hardin <jh...@impsec.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhardin@impsec.org

--- Comment #5 from John Hardin <jh...@impsec.org> ---
See also Bug 6828


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

RW <rw...@googlemail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rwmaillists@googlemail.com

--- Comment #3 from RW <rw...@googlemail.com> ---
There is an apparent case of this in the users list "Very spammy messages yield
BAYES_00". A lot of people are reporting problems with DNSWL. I think it would
be a good idea to implement this.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #7 from Jason Bertoch <ja...@i6ix.com> ---
> 
> This thread is mainly about Bayes and that DNSWL may be decreasing the score
> somewhat, but DNSWL is not the culprit.
> 
> Users can easily implement the rule modifications in their site config.
> 
> -1 for such a change.


I've seen this argument numerous times throughout the development of SA, but
it's extremely arrogant.  It assumes that all SA users follow the dev process
from beginning to end and are also subscribed to all mailing lists.  The truth
is that this product is more far reaching than some people here seem to
respect.  Just because it may be trivial for someone on the list to implement
(or adjust) some feature, doesn't mean it's trivial everywhere SA may be
deployed.  Even though I've followed this project from the beginning, I still
think we have a duty to make sane decisions on default configs.  Just because
you may want the defaults to fit your situation, that doesn't mean those
defaults are appropriate for the project as a whole.  In fact, since you are
clearly able to modify the settings, the defaults should likely differ greatly
from your situation.


[Bug 6344] ReturnPath and DNSWL rules should not autolearn

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--- Comment #4 from AXB <ax...@gmail.com> ---
(In reply to comment #3)
> There is an apparent case of this in the users list "Very spammy messages
> yield BAYES_00". A lot of people are reporting problems with DNSWL. I think
> it would be a good idea to implement this.

This thread is mainly about Bayes and that DNSWL may be decreasing the score
somewhat, but DNSWL is not the culprit.

Users can easily implement the rule modifications in their site config.

-1 for such a change.
