Posted to dev@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2011/09/30 06:22:11 UTC
Re: [SA-dev] bernie-it_batt ham 61% DKIM_ADSP_ALL and other fun in
the corpora
On 09/28/2011 01:29 PM, darxus@chaosreigns.com wrote:
> I wrote a script to read in scores of all the rules from
> /var/lib/spamassassin/3.003002/updates_spamassassin_org/*.cf, then
> read in the corpora from the last mass-check. It adds up the score
> of each of the emails, and outputs the hits for emails that scored on
> the wrong side of a threshold of 5.
% ... |grep -Eo '[A-Z_]\w{2,}' |sort |uniq -c |sort -n |sed '/ 1 /d'
2 HTML_MESSAGE
2 HTTP_ESCAPED_HOST
2 MILLION_USD
2 MIME_HTML_ONLY
2 NUMERIC_HTTP_ADDR
2 RCVD_IN_BRBL_LASTEXT
2 URIBL_RHS_DOB
2 URIBL_SBL
3 LOTS_OF_MONEY
3 URI_OBFU_WWW
4 FRT_APPROV
4 SPF_PASS
4 SPOOF_COM2OTH
4 URIBL_SC_SURBL
4 URI_NOVOWEL
5 DRUGS_ERECTILE
5 DRUGS_ERECTILE_OBFU
5 FH_HELO_EQ_D_D_D_D
5 HELO_DYNAMIC_IPADDR2
5 RDNS_DYNAMIC
7 URI_HEX
9 RP_MATCHES_RCVD
9 URIBL_DBL_SPAM
10 URIBL_AB_SURBL
11 DOS_RCVD_IP_TWICE_C
11 URIBL_JP_SURBL
11 URIBL_WS_SURBL
12 NORMAL_HTTP_TO_IP
13 RCVD_IN_DNSWL_MED
14 RDNS_NONE
14 URIBL_BLACK
16 DOS_RCVD_IP_TWICE_B
16 FORGED_RELAY_MUA_TO_MX
16 RCVD_IN_PBL
16 DKIM_ADSP_ALL
16 hits isn't screamingly problematic: out of 208473 hams that's 0.0077%,
though I suspect your subset of the ham corpus is smaller. Still, FP
reduction is always a Good Thing.
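For anyone who wants to redo the tally and the percentage without the shell tools, here is a rough Python sketch of the same idea. The function names and the sample hit lines are made up for illustration; only the name pattern and the 16-of-208473 figure come from this thread.

```python
import re
from collections import Counter

# Same pattern as the grep above: an uppercase letter or underscore,
# followed by at least two more word characters.
RULE_NAME = re.compile(r'[A-Z_]\w{2,}', re.ASCII)

def tally_rules(text, min_count=2):
    """Count rule names in text and drop those seen fewer than min_count times.

    Returns (count, name) pairs sorted by ascending count, roughly what
    'sort | uniq -c | sort -n' followed by the sed filter produces.
    """
    counts = Counter(RULE_NAME.findall(text))
    return sorted((n, name) for name, n in counts.items() if n >= min_count)

def fp_percent(hits, total_ham):
    """Share of ham messages hit by a rule, as a percentage."""
    return 100.0 * hits / total_ham

if __name__ == "__main__":
    # Hypothetical hit lines, not real mass-check output.
    sample = "RDNS_NONE,URIBL_BLACK\nRDNS_NONE,SPF_PASS\nURIBL_BLACK,RDNS_NONE\n"
    for count, name in tally_rules(sample):
        print(count, name)
    print(f"{fp_percent(16, 208473):.4f}%")
```

The last line reproduces the 0.0077% figure quoted above.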
I've been sitting on a fix for HELO_DYNAMIC_IPADDR2 for a bit. Checking
that in now. It changes the pattern matched against the last-external HELO
from
\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+
to
\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+
I also added some examples of what this hits. I can't find too many
exotics at the moment though.
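To see the difference between the two fragments, they can be tried side by side in Python, whose re module accepts both as written. This is only a sketch: the fragments are tested in isolation rather than inside the full rule, and the three sample strings are hypothetical, not taken from any corpus.

```python
import re

# Fragments copied from this message; the surrounding rule is not reproduced.
OLD = re.compile(r'\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+')
NEW = re.compile(r'\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+')

# Hypothetical inputs chosen to exercise each change.
SAMPLES = [
    "1-2-3-4.dsl.example.net",      # plain dynamic-looking HELO
    "1-2-3-4x by=mx.example.net",   # old [^\.]* runs across the space
    "10a20b30c40.dyn.example.org",  # letter separators other than 'x'
]

def compare(text):
    """Return (old_matches, new_matches) for one input string."""
    return bool(OLD.search(text)), bool(NEW.search(text))

if __name__ == "__main__":
    for s in SAMPLES:
        old_hit, new_hit = compare(s)
        print(f"{s!r}: old={old_hit} new={new_hit}")
```

Both fragments match the first sample. Only the old one matches the second, because its [^\.]* is allowed to swallow the space and continue into the next attribute. Only the old one matches the third, since the new separator class is limited to non-word characters, "x" and "_".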
One of the FPs I saw in my ham corpus included a space in the text
matched by [^\.]*, which the new [^\s.]* no longer allows. While I'm
picking nits on this front, note that [\Wx_] can still match a space, but
only one that is immediately followed by a digit. Since no attribute of
the SA-generated X-Spam-Relays-External pseudoheader begins with a digit,
there is no risk of it matching the space between attributes.
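The point about the space is easy to check with the new fragment on its own. A small Python sketch, with two made-up strings that differ only in what follows the space:

```python
import re

# New fragment from this message, tested outside the full rule.
NEW = re.compile(r'\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+')

def hits(text):
    """True if the new fragment matches anywhere in text."""
    return NEW.search(text) is not None

if __name__ == "__main__":
    # Hypothetical strings, not real pseudoheader content.
    print(hits("1-2-3 4x.a.example.com"))     # space followed by a digit
    print(hits("1-2-3 by=4x.a.example.com"))  # space followed by an attribute name
```

The first string matches because the space sits in a separator position and a digit follows it. The second does not, since "by=" starts with a letter, which is the situation in the pseudoheader.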
I also kept ccTLDs from breaking the exclusion, e.g. in
foo.com.au.s3.amazonaws.com, in SPOOF_COM2OTH and friends.
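The ccTLD problem can be illustrated with a toy pattern. To be clear, the two regexes below are NOT the SPOOF_COM2OTH rule; they are a hypothetical, simplified stand-in that flags hostnames with more labels after ".com." and excludes the Amazon S3 suffix through a negative lookahead. The only thing taken from this message is the example hostname.

```python
import re

# Hypothetical stand-ins, not the real rule.
# NAIVE excludes only '<something>.com.s3.amazonaws.com'.
NAIVE = re.compile(r'^(?:[\w-]+\.)+?com\.(?!s3\.amazonaws\.com$)[\w.-]+$')
# FIXED also tolerates one two-letter ccTLD label before the excluded suffix.
FIXED = re.compile(
    r'^(?:[\w-]+\.)+?com\.(?!(?:[a-z]{2}\.)?s3\.amazonaws\.com$)[\w.-]+$'
)

def flagged(pattern, host):
    """True if the toy pattern would flag host as a '.com.' spoof."""
    return pattern.match(host) is not None

if __name__ == "__main__":
    for host in (
        "foo.com.s3.amazonaws.com",     # bucket named after a .com domain
        "foo.com.au.s3.amazonaws.com",  # bucket named after a .com.au domain
        "paypal.com.evil.example",      # made-up spoof-style hostname
    ):
        print(host, flagged(NAIVE, host), flagged(FIXED, host))
```

With the naive lookahead the ".au" label sits between ".com." and the excluded suffix, so the exclusion never fires and the bucket name is flagged. Allowing an optional two-letter label inside the lookahead restores the exclusion while the spoof-style hostname is still flagged.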