Posted to dev@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2011/09/30 06:22:11 UTC

Re: [SA-dev] bernie-it_batt ham 61% DKIM_ADSP_ALL and other fun in the corpora

On 09/28/2011 01:29 PM, darxus@chaosreigns.com wrote:
> I wrote a script to read in scores of all the rules from 
> /var/lib/spamassassin/3.003002/updates_spamassassin_org/*.cf, then
> read in the corpora from the last mass-check.  It adds up the score
> of each of the emails, and outputs the hits for emails that scored on
> the wrong side of a threshold of 5.

% ... |grep -Eo '[A-Z_]\w{2,}' |sort |uniq -c |sort -n |sed '/ 1 /d'

      2 HTML_MESSAGE
      2 HTTP_ESCAPED_HOST
      2 MILLION_USD
      2 MIME_HTML_ONLY
      2 NUMERIC_HTTP_ADDR
      2 RCVD_IN_BRBL_LASTEXT
      2 URIBL_RHS_DOB
      2 URIBL_SBL
      3 LOTS_OF_MONEY
      3 URI_OBFU_WWW
      4 FRT_APPROV
      4 SPF_PASS
      4 SPOOF_COM2OTH
      4 URIBL_SC_SURBL
      4 URI_NOVOWEL
      5 DRUGS_ERECTILE
      5 DRUGS_ERECTILE_OBFU
      5 FH_HELO_EQ_D_D_D_D
      5 HELO_DYNAMIC_IPADDR2
      5 RDNS_DYNAMIC
      7 URI_HEX
      9 RP_MATCHES_RCVD
      9 URIBL_DBL_SPAM
     10 URIBL_AB_SURBL
     11 DOS_RCVD_IP_TWICE_C
     11 URIBL_JP_SURBL
     11 URIBL_WS_SURBL
     12 NORMAL_HTTP_TO_IP
     13 RCVD_IN_DNSWL_MED
     14 RDNS_NONE
     14 URIBL_BLACK
     16 DOS_RCVD_IP_TWICE_B
     16 FORGED_RELAY_MUA_TO_MX
     16 RCVD_IN_PBL
     16 DKIM_ADSP_ALL
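
For anyone who would rather not chain shell tools, the counting pipeline above can be roughly sketched in Python. This is only an illustration: the tally() helper and the sample input are made up here, and the real input is whatever the elided script prints.

```python
import re
from collections import Counter

# Same pattern as the grep above: a capital letter or underscore,
# then at least two more word characters.
RULE_NAME = re.compile(r'[A-Z_]\w{2,}')

def tally(lines, min_count=2):
    """Count rule-name hits and return (count, name) pairs, ascending."""
    counts = Counter()
    for line in lines:
        counts.update(RULE_NAME.findall(line))
    # Dropping names seen fewer than min_count times mirrors the sed filter.
    return sorted((n, name) for name, n in counts.items() if n >= min_count)

# Made-up input standing in for the elided script output.
sample = [
    "DKIM_ADSP_ALL,RDNS_NONE,SPF_PASS",
    "DKIM_ADSP_ALL,RDNS_NONE",
    "DKIM_ADSP_ALL",
]
for n, name in tally(sample):
    print(f"{n:7d} {name}")
```

Like the grep, this picks up any capitalized word of three or more characters, so it assumes the input is mostly rule names.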

16x isn't screamingly problematic: out of 208473 hams, it's .0077%
(though I suspect your subset of the ham corpus is smaller).  Still, FP
reduction is always a Good Thing.


I've been sitting on a fix to HELO_DYNAMIC_IPADDR2 for a bit.  Checking
that in now.  It changes a match in last-external HELO

from
\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+

to
\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+

I also added some examples of what this hits.  I can't find too many
exotics at the moment though.
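
As a rough way to see the difference, the two fragments above can be run side by side with Python's re module. This is a sketch under assumptions: it tests the bare fragments only (the real rule wraps them in more context), compare() is a made-up helper, and the HELO strings are invented rather than taken from any corpus.

```python
import re

# The two fragments quoted above, compiled on their own. The real rule
# wraps them in more context, so this only compares the fragments.
OLD = re.compile(r'\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+')
NEW = re.compile(r'\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+')

def compare(helo):
    """Return (old_hits, new_hits) for one HELO string."""
    return (OLD.search(helo) is not None, NEW.search(helo) is not None)

# Invented HELO strings, not taken from any corpus.
samples = [
    "192-168-1-10.dsl.example.net",
    "10x20x30x40.pool.example.com",
    "1a2b3c4d.mail.example.com",
    "1234-5678-9012-3456.host.example.com",
    "mail.example.com",
]
for helo in samples:
    print(helo, compare(helo))
```

The first two samples should hit both fragments.  The third (letters other than x as separators) and the fourth (runs of more than three digits) should hit only the old one, since the new fragment limits each group to 1-3 digits and the separators to [\Wx_].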

One of the FPs I saw in my ham corpus included a space in the text
matched by   [^\.]*  which you can see I have corrected (it is now
[^\s.]*).  While I'm nitpicking on this front, note that [\Wx_] does
allow a space, but it must be followed by a digit.  Since no attribute
of the SA-generated X-Spam-Relays-External pseudoheader begins with a
digit, there is no risk of it matching a space.
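
The space behaviour can be checked with a small Python sketch.  Assumptions are flagged loudly: the relay entry below is hypothetical, its field layout only approximates the pseudoheader, every value in it is invented, and only the bare fragments are tested rather than the full rule.

```python
import re

# The old and new fragments, compiled without the rest of the rule.
OLD = re.compile(r'\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+')
NEW = re.compile(r'\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+')

# Hypothetical relay entry; the field layout is an approximation of the
# pseudoheader, and every value in it is invented.
relay = ("[ ip=198.51.100.7 rdns= helo=host-1-2-3-4x by=mx.example.org "
         "ident= envfrom= intl=0 id=ABC123 auth= msa=0 ]")

old_hit = OLD.search(relay)
new_hit = NEW.search(relay)

# The old fragment's [^\.]* can run across the space into the by= field;
# the new [^\s.]* stops at the space.
print("old:", old_hit.group(0) if old_hit else None)
print("new:", new_hit.group(0) if new_hit else None)

# A space does satisfy [\Wx_], but only with a digit right after it.
octets = re.compile(r'\d{1,3}(?:[\Wx_]\d{1,3}){3}')
print(bool(octets.fullmatch("1 2 3 4")))
print(bool(octets.search("7 rdns= helo=host")))
```

In this made-up entry the HELO has no dots at all, yet the old fragment still finds its two dots in the by= value.  The new one cannot, and the space after the IP is followed by a letter, so the digit groups never span two attributes.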


I also kept ccTLDs from breaking the exclusion for hosts such as
foo.com.au.s3.amazonaws.com in SPOOF_COM2OTH and friends.