Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/11/05 00:52:27 UTC

[Bug 6155] generate new scores for 3.3.0 release

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz <an...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4561|0                           |1
        is obsolete|                            |

--- Comment #145 from Adam Katz <an...@khopis.com> 2009-11-04 15:52:15 UTC ---
Created an attachment (id=4564)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4564)
Checker for rules that match more ham than spam

Updated my checker to use S/O (now that I understand that stat).  It also
supports specifying the DateRev for the specific masscheck run.  Since today's
run was sparse, here are yesterday's results.

$ ./sa33badrules.pl 20091103-r832343-n
 S/O RANK HAM%    SPAM%   Score in attachment 4558 Rule
.008 .12  1.2401  0.0105  0.001                    MSGID_MULTIPLE_AT
.011 .22  0.3066  0.0035  0                        OBSCURED_EMAIL
.012 .25  0.2058  0.0025  0.000 2.099 0.001 1.212  MISSING_MIME_HB_SEP
.014 .17  0.5822  0.0080  0.001 0.001 0.699 0.699  TVD_RCVD_SPACE_BRACKET
.028 .20  0.4339  0.0125  unknown                  TVD_FUZZY_SECTOR
.042 .28  0.1732  0.0075  0                        SUBJECT_FUZZY_TION
.048 .77  4.4862  0.2279  -0.001                   SPF_HELO_PASS
.052 .29  0.1476  0.0080  1.494 1.699 1.591 1.516  X_IP
.055 .22  0.3914  0.0226  2.205 0.174 1.299 1.806  FRT_SOMA2
.062 .74  5.1484  0.3424  -0.001                   SPF_PASS
.077 .25  0.2643  0.0221  0.987 0.750 0.943 1.318  CTYPE_001C_B
.079 .36  0.0640  0.0055  0.001 0.001 0.605 0.378  HTML_NONELEMENT_30_40
.080 .28  0.1742  0.0151  0.001 2.499 0.268 0.516  DRUGS_MUSCLE
.084 .36  0.0660  0.0060  0                        FORGED_IMS_TAGS
.090 .32  0.1114  0.0110  0.033 0.001 0.365 0.413  WEIRD_PORT
.092 .21  0.8712  0.0878  1.499 0.419 0.904 0.798  MIME_BASE64_BLANKS
.102 .37  0.0577  0.0065  0                        HTML_IFRAME_SRC
.123 .34  0.0821  0.0115  0.003 0.978 0.100 1.515  TVD_FW_GRAPHIC_NAME_LONG
.128 .37  0.0614  0.0090  0                        RCVD_BAD_ID
.130 .29  0.1851  0.0276  0.001 0.020 0.001 1.799  MIME_BASE64_TEXT
.178 .28  0.4948  0.1069  0 1.200 0 2.514          SPF_HELO_FAIL
.202 .32  0.1590  0.0402  0.1                      ANY_BOUNCE_MESSAGE
.205 .35  0.0817  0.0211  2.199 1.622 2.199 1.086  LONGWORDS
.213 .34  0.1186  0.0321  0                        BLANK_LINES_80_90
.216 .32  0.1474  0.0407  2.199 2.199 1.246 2.090  WEIRD_QUOTING
.218 .32  0.1445  0.0402  0.1                      BOUNCE_MESSAGE
.223 .30  0.7605  0.2179  1.799 0.572 1.182 1.138  HTML_IMAGE_RATIO_06
.241 .34  1.3973  0.4438  1.0                      EXTRA_MPART_TYPE
.254 .34  0.1222  0.0417  0.001 2.185 1.936 0.476  FRT_SOMA
.283 .33  0.6883  0.2711  0.539 0.001 0.332 0.488  MIME_HTML_MOSTLY
.299 .36  0.0908  0.0387  0.799 0.001 0.711 0.026  TVD_FW_GRAPHIC_NAME_MID
.303 .34  0.4938  0.2143  1.899 0.496 0.950 0.445  HTML_IMAGE_RATIO_08
.367 .40  1.2775  0.7409  0.001                    TVD_SPACE_RATIO
.379 .37  0.3182  0.1943  0.023 0.887 0.000 0.417  UPPERCASE_50_75
.434 .39  0.3261  0.2505  3.099 1.823 1.802 1.998  BAD_ENC_HEADER
.436 .46 15.3798 11.8920  0.001                    FREEMAIL_FROM
.454 .41  0.5503  0.4573  2.260 0.742 1.199 0.640  MPART_ALT_DIFF
.516 .47  3.6581  3.9024  0.001                    MIME_QP_LONG_LINE
.655 .51  1.9537  3.7036  1.154 1.677 1.198 1.453  SUBJ_ALL_CAPS
.665 .49 42.2269 83.7383  0.001                    HTML_MESSAGE
.692 .52  1.1850  2.6580  0.001                    UNPARSEABLE_RELAY
.922 .58  1.1584 13.7423  0 1.322 0 1.237          RCVD_IN_BL_SPAMCOP_NET
.935 .57  3.5421 50.6034  2.199 0.955 1.215 0.549  MIME_HTML_ONLY
.970 .52  1.5729 51.1430  0 1.1 0 0.7              RDNS_NONE
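
For readers unfamiliar with the S/O column: it appears to be the share of a
rule's hits that were spam, i.e. SPAM% / (SPAM% + HAM%).  The sketch below is
not taken from sa33badrules.pl; the function name s_over_o is hypothetical, and
the formula is an assumption that happens to reproduce the rounded values in
the table above for the three rows shown.

```python
def s_over_o(ham_pct: float, spam_pct: float) -> float:
    """Share of a rule's hits that were spam: SPAM% / (SPAM% + HAM%).

    Assumed formula; it matches the table's S/O column to three decimals.
    """
    total = ham_pct + spam_pct
    if total == 0:
        raise ValueError("rule has no hits, S/O is undefined")
    return spam_pct / total


# (rule name, HAM%, SPAM%) copied from the table above
rows = [
    ("MSGID_MULTIPLE_AT", 1.2401, 0.0105),
    ("HTML_MESSAGE", 42.2269, 83.7383),
    ("RDNS_NONE", 1.5729, 51.1430),
]

for name, ham, spam in rows:
    print(f"{s_over_o(ham, spam):.3f}  {name}")
```

An S/O near 0 means the rule hits almost only ham, near 1 almost only spam,
and near 0.5 means it does not discriminate either way.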

Note, I hacked RDNS_NONE so that it removes the Enron hits.

"Problem" rules this week include X_IP, EXTRA_MPART_TYPE, FRT_SOMA2, and
BAD_ENC_HEADER (scored 3.099?!).
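
One way to surface candidates like these mechanically is to flag rules whose
S/O is below 0.5 (more ham than spam) but whose highest score across the score
sets is still substantial.  This is only a sketch: the thresholds (0.5 and 1.0)
and the names Rule and flag_suspect are hypothetical, it is not the logic of
sa33badrules.pl, and it uses only a handful of rows from the table above.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str
    s_o: float                 # S/O column from the table
    scores: Tuple[float, ...]  # scores from attachment 4558 (1 to 4 values)


def flag_suspect(rules: List[Rule],
                 max_s_o: float = 0.5,
                 min_score: float = 1.0) -> List[str]:
    """Return names of rules that mostly hit ham yet carry a high score.

    Both thresholds are assumptions chosen for illustration.
    """
    return [
        r.name
        for r in rules
        if r.scores and r.s_o < max_s_o and max(r.scores) >= min_score
    ]


# A subset of rows copied from the table above
sample = [
    Rule("SPF_PASS", 0.062, (-0.001,)),
    Rule("X_IP", 0.052, (1.494, 1.699, 1.591, 1.516)),
    Rule("FRT_SOMA2", 0.055, (2.205, 0.174, 1.299, 1.806)),
    Rule("EXTRA_MPART_TYPE", 0.241, (1.0,)),
    Rule("BAD_ENC_HEADER", 0.434, (3.099, 1.823, 1.802, 1.998)),
    Rule("MIME_HTML_ONLY", 0.935, (2.199, 0.955, 1.215, 0.549)),
]

print(flag_suspect(sample))
```

With these thresholds other rows of the full table would be flagged as well,
so the cut-offs would need tuning before relying on the output.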

Food for thought:  while it's good to create workarounds for the problematic
outcomes from the genetic algorithm, I think that these should also serve as
examples with which to troubleshoot the algorithm itself.  This might just be
an early sign of over-fitting (which is largely fine as long as we comb through
the results with scripts like this), but it might also indicate a problem in
the system's prioritization.
