Posted to dev@spamassassin.apache.org by da...@chaosreigns.com on 2011/10/10 22:28:59 UTC

Interesting rule pairs Re: New Bayes like paradigm

I wrote up a couple scripts to calculate the ratio of the percentage of
rule pair hits between false-negatives (missed spam) and correct-negatives
(correct non-spam), inspired by Marc Perkel's thread that doesn't actually
have anything to do with bayes.

The thing I found most interesting was good ADVANCE_FEE rules that aren't
mutable, with a score of 1.  Why aren't these mutable?  Looks like they
would do us more good if they were included in re-scoring.

        *  1.0 ADVANCE_FEE_3_NEW Appears to be advance fee fraud (Nigerian 419)
        *  1.0 ADVANCE_FEE_4_NEW_MONEY Advance Fee fraud and lots of money
        *  1.0 ADVANCE_FEE_5_NEW_MONEY Advance Fee fraud and lots of money
        *  1.0 ADVANCE_FEE_3_NEW_MONEY Advance Fee fraud and lots of money
        *  1.0 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
        *  1.0 MONEY_FRAUD_5 Lots of money and many fraud phrases
        *  1.0 MONEY_FRAUD_3 Lots of money and several fraud phrases

Doing a not nice rule of (RCVD_IN_DNSWL_HI && SPF_FAIL) might be fun, or
putting !SPF_FAIL in the DNSWL rules.  Ick... *every* hit for that is in
the dos corpora, so probably not good to add.  (Daryl, what did you do?)

I used masscheck net corpora marked "Date: 20110924T124459Z",
excluding zmi due to his 98% hit rate on ALL_TRUSTED in his *spam*.  He says
he doesn't think this is a misconfiguration.  And I didn't filter for
recent emails, as score generation does (I should have).  In this run I
excluded __* rules.


Taking the first line as an example, ADVANCE_FEE_3_NEW together with
LOTS_OF_MONEY hit 6% of false negatives (missed spam), and about 0.002% of
correct non-spam.  6.007 divided by 0.0016926 is about 3549, the first column.
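
For anyone who wants to reproduce the calculation, here is a minimal Python
sketch of the ratio described above.  It is not the script used for this
post: the function names are hypothetical, parsing of the masscheck logs is
left out (each message is assumed to already be a set of rule names), and
skipping pairs that never hit correct non-spam is an assumption made to
avoid dividing by zero.

```python
from collections import Counter
from itertools import combinations


def pair_percentages(messages):
    """Return {(rule_a, rule_b): percent of messages hitting both rules}."""
    counts = Counter()
    for rules in messages:
        # Drop sub-rules (names starting with "__"), as in this run.
        visible = sorted(r for r in set(rules) if not r.startswith("__"))
        counts.update(combinations(visible, 2))
    total = len(messages)
    return {pair: 100.0 * n / total for pair, n in counts.items()}


def pair_ratios(false_negatives, true_negatives):
    """Rank rule pairs by (% of missed spam) / (% of correct non-spam)."""
    wrong = pair_percentages(false_negatives)
    right = pair_percentages(true_negatives)
    rows = []
    for pair, wrong_pct in wrong.items():
        right_pct = right.get(pair, 0.0)
        if right_pct == 0.0:
            continue  # assumption: ignore pairs never seen in correct non-spam
        rows.append((wrong_pct / right_pct, pair, wrong_pct, right_pct))
    return sorted(rows, reverse=True)


# Worked check against the first line of the list, using its own percentages.
first_line_ratio = 6.00706713780919 / 0.00169264882614804
```

Each returned row has the same shape as a line in the list below: ratio,
rule pair, "wrong" percentage, "right" percentage.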

So you might say creating a rule to combine them would be good, except
that exactly this has already been done, as ADVANCE_FEE_3_NEW_MONEY, which
has a score of 1, which I'm pretty sure should be increased.

3548.91519434629 ADVANCE_FEE_3_NEW LOTS_OF_MONEY (wrong 6.00706713780919% right 0.00169264882614804%)
3548.91519434629 ADVANCE_FEE_3_NEW ADVANCE_FEE_3_NEW_MONEY (wrong 6.00706713780919% right 0.00169264882614804%)
2713.87632508834 RCVD_IN_DNSWL_HI SPF_FAIL (wrong 4.59363957597173% right 0.00169264882614804%)
2713.87632508834 DKIM_VALID_AU HTML_MIME_NO_HTML_TAG (wrong 4.59363957597173% right 0.00169264882614804%)
2087.59717314488 ADVANCE_FEE_2_NEW_MONEY HTML_MESSAGE (wrong 3.53356890459364% right 0.00169264882614804%)
1670.0777385159 DOS_RCVD_IP_TWICE_B RDNS_NONE (wrong 2.82685512367491% right 0.00169264882614804%)
1670.0777385159 DOS_RCVD_IP_TWICE_B HTML_MESSAGE (wrong 2.82685512367491% right 0.00169264882614804%)
1356.93816254417 HTML_MIME_NO_HTML_TAG RCVD_IN_DNSWL_HI (wrong 4.59363957597173% right 0.00338529765229608%)
1356.93816254417 FREEMAIL_FROM MIME_HTML_ONLY (wrong 4.59363957597173% right 0.00338529765229608%)
1356.93816254417 DKIM_VALID HTML_MIME_NO_HTML_TAG (wrong 4.59363957597173% right 0.00338529765229608%)
1356.93816254417 DKIM_SIGNED HTML_MIME_NO_HTML_TAG (wrong 4.59363957597173% right 0.00338529765229608%)
1252.55830388693 ADVANCE_FEE_4_NEW RP_MATCHES_RCVD (wrong 2.12014134275618% right 0.00169264882614804%)
1252.55830388693 ADVANCE_FEE_4_NEW RCVD_IN_DNSWL_NONE (wrong 2.12014134275618% right 0.00169264882614804%)
1182.97173144876 ADVANCE_FEE_4_NEW LOTS_OF_MONEY (wrong 6.00706713780919% right 0.00507794647844412%)
1182.97173144876 ADVANCE_FEE_3_NEW_MONEY LOTS_OF_MONEY (wrong 6.00706713780919% right 0.00507794647844412%)
1043.79858657244 FREEMAIL_FROM HTML_FONT_SIZE_HUGE (wrong 1.76678445229682% right 0.00169264882614804%)
939.418727915194 HTML_FONT_LOW_CONTRAST RCVD_IN_DNSWL_MED (wrong 3.18021201413428% right 0.00338529765229608%)
904.625441696113 FREEMAIL_FROM SPF_FAIL (wrong 4.59363957597173% right 0.00507794647844412%)
835.038869257951 RCVD_IN_DNSWL_NONE TVD_SPACE_RATIO (wrong 1.41342756183746% right 0.00169264882614804%)
709.783038869258 ADVANCE_FEE_2_NEW_MONEY LOTS_OF_MONEY (wrong 6.00706713780919% right 0.00846324413074019%)
695.865724381625 ADVANCE_FEE_4_NEW HTML_MESSAGE (wrong 3.53356890459364% right 0.00507794647844412%)
626.279151943463 FREEMAIL_REPLYTO RP_MATCHES_RCVD (wrong 1.06007067137809% right 0.00169264882614804%)
626.279151943463 FREEMAIL_REPLYTO RCVD_IN_DNSWL_NONE (wrong 1.06007067137809% right 0.00169264882614804%)
626.279151943463 ADVANCE_FEE_3_NEW_MONEY RP_MATCHES_RCVD (wrong 2.12014134275618% right 0.00338529765229608%)
626.279151943463 ADVANCE_FEE_2_NEW_MONEY RCVD_IN_DNSWL_MED (wrong 1.06007067137809% right 0.00169264882614804%)
584.527208480565 HTML_MIME_NO_HTML_TAG MIME_HTML_ONLY (wrong 4.9469964664311% right 0.00846324413074019%)
584.527208480565 HTML_MESSAGE HTML_MIME_NO_HTML_TAG (wrong 4.9469964664311% right 0.00846324413074019%)
521.899293286219 HTML_MESSAGE RCVD_IN_XBL (wrong 1.76678445229682% right 0.00338529765229608%)
417.519434628975 TVD_SPACE_RATIO UNPARSEABLE_RELAY (wrong 1.41342756183746% right 0.00338529765229608%)
417.519434628975 TVD_RCVD_SPACE_BRACKET TVD_SPACE_RATIO (wrong 1.41342756183746% right 0.00338529765229608%)
417.519434628975 MISSING_MID TO_NO_BRKTS_HTML_ONLY (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 MIME_HTML_ONLY TO_NO_BRKTS_HTML_ONLY (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 MIME_BASE64_BLANKS SPF_HELO_PASS (wrong 1.41342756183746% right 0.00338529765229608%)
417.519434628975 HTML_MESSAGE UPPERCASE_50_75 (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 HTML_MESSAGE TO_NO_BRKTS_HTML_ONLY (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 HTML_COMMENT_SAVED_URL RP_MATCHES_RCVD (wrong 1.41342756183746% right 0.00338529765229608%)
417.519434628975 HTML_COMMENT_SAVED_URL HTML_MESSAGE (wrong 1.41342756183746% right 0.00338529765229608%)
417.519434628975 FSL_UA FSL_XM_419 (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 FREEMAIL_REPLYTO HTML_MESSAGE (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 DKIM_VALID FREEMAIL_REPLYTO (wrong 0.706713780918728% right 0.00169264882614804%)
417.519434628975 DKIM_SIGNED FREEMAIL_REPLYTO (wrong 0.706713780918728% right 0.00169264882614804%)
382.726148409894 FREEMAIL_FROM SPF_SOFTFAIL (wrong 3.886925795053% right 0.0101558929568882%)
365.329505300353 HTML_MESSAGE MIME_BASE64_BLANKS (wrong 2.47349823321555% right 0.00677059530459216%)
347.932862190813 ADVANCE_FEE_4_NEW DKIM_SIGNED (wrong 1.76678445229682% right 0.00507794647844412%)
313.139575971731 MIME_HTML_ONLY MISSING_MID (wrong 1.06007067137809% right 0.00338529765229608%)
313.139575971731 ADVANCE_FEE_2_NEW_MONEY RCVD_IN_DNSWL_NONE (wrong 2.12014134275618% right 0.00677059530459216%)
303.650497911982 HTML_MESSAGE SPF_FAIL (wrong 5.65371024734982% right 0.0186191370876284%)
284.672341792483 DKIM_VALID SPF_FAIL (wrong 5.30035335689046% right 0.0186191370876284%)
284.672341792483 DKIM_VALID_AU SPF_FAIL (wrong 5.30035335689046% right 0.0186191370876284%)
278.34628975265 HTML_MESSAGE TO_NO_BRKTS_PCNT (wrong 1.41342756183746% right 0.00507794647844412%)
250.511660777385 ADVANCE_FEE_2_NEW_MONEY RP_MATCHES_RCVD (wrong 2.12014134275618% right 0.00846324413074019%)
208.759717314488 SPF_HELO_PASS SUBJ_ILLEGAL_CHARS (wrong 0.706713780918728% right 0.00338529765229608%)
208.759717314488 RP_MATCHES_RCVD USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 RP_MATCHES_RCVD SPF_SOFTFAIL (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 RDNS_NONE SUBJ_YOUR_DEBT (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 RDNS_NONE SPF_SOFTFAIL (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 RCVD_IN_DNSWL_NONE USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 RCVD_IN_DNSWL_MED WEIRD_QUOTING (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 RCVD_IN_DNSWL_HI TO_NO_BRKTS_HTML_ONLY (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 NML_ADSP_CUSTOM_MED SPF_PASS (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_MID USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_MID RCVD_IN_DNSWL_HI (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_HEADERS SPF_HELO_PASS (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_DATE SPF_PASS (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_DATE RP_MATCHES_RCVD (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_DATE RCVD_IN_RP_SAFE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MISSING_DATE RCVD_IN_DNSWL_LOW (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_QP_LONG_LINE SPF_NEUTRAL (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_QP_LONG_LINE MISSING_DATE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_HTML_ONLY USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_HTML_ONLY PYZOR_CHECK (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_HTML_ONLY MISSING_DATE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_BASE64_TEXT USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_BASE64_TEXT RCVD_IN_DNSWL_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 MIME_BASE64_TEXT MISSING_MID (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 LOTS_OF_MONEY RCVD_IN_XBL (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 LOTS_OF_MONEY MISSING_MID (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 INVALID_MSGID RDNS_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 INVALID_MSGID RCVD_IN_RP_SAFE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 INVALID_MSGID RCVD_IN_RP_CERTIFIED (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 INVALID_DATE RP_MATCHES_RCVD (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_OBFUSCATE_05_10 RCVD_IN_DNSWL_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_MESSAGE USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_MESSAGE URIBL_SBL (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_MESSAGE MISSING_DATE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_IMAGE_ONLY_20 RCVD_IN_DNSWL_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_IMAGE_ONLY_16 RCVD_IN_DNSWL_HI (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HTML_FONT_SIZE_HUGE LOTS_OF_MONEY (wrong 0.706713780918728% right 0.00338529765229608%)
208.759717314488 HTML_FONT_LOW_CONTRAST SPF_PASS (wrong 1.41342756183746% right 0.00677059530459216%)
208.759717314488 HTML_FONT_LOW_CONTRAST SPF_HELO_PASS (wrong 0.706713780918728% right 0.00338529765229608%)
208.759717314488 HK_NAME_FM_DR RP_MATCHES_RCVD (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 HK_NAME_FM_DR HTML_MESSAGE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_XM_419 RDNS_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_XM_419 NSL_RCVD_FROM_USER (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_UA RDNS_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_UA NSL_RCVD_FROM_USER (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_CTYPE_WIN1251 NSL_RCVD_FROM_USER (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_CTYPE_WIN1251 FSL_XM_419 (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FSL_CTYPE_WIN1251 FSL_UA (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FROM_MISSP_DKIM RDNS_NONE (wrong 0.353356890459364% right 0.00169264882614804%)
208.759717314488 FROM_EXCESS_BASE64 USER_IN_DEF_WHITELIST (wrong 0.353356890459364% right 0.00169264882614804%)


-- 
"Wash daily from nose-tip to tail-tip; drink deeply, but never too deep;
And remember the night is for hunting, and forget not the day is for sleep."
- The Law of the Jungle, Rudyard Kipling
http://www.ChaosReigns.com

Re: Interesting rule pairs Re: New Bayes like paradigm

Posted by Axb <ax...@gmail.com>.
On 2011-10-10 22:28, darxus@chaosreigns.com wrote:
> The thing I found most interesting was good ADVANCE_FEE rules that aren't
> mutable, with a score of 1.  Why aren't these mutable?  Looks like they
> would do us more good if they were included in re-scoring.
>
>          *  1.0 ADVANCE_FEE_3_NEW Appears to be advance fee fraud (Nigerian 419)
>          *  1.0 ADVANCE_FEE_4_NEW_MONEY Advance Fee fraud and lots of money
>          *  1.0 ADVANCE_FEE_5_NEW_MONEY Advance Fee fraud and lots of money
>          *  1.0 ADVANCE_FEE_3_NEW_MONEY Advance Fee fraud and lots of money
>          *  1.0 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
>          *  1.0 MONEY_FRAUD_5 Lots of money and many fraud phrases
>          *  1.0 MONEY_FRAUD_3 Lots of money and several fraud phrases
>

This is a bug in trunk's sa-update score procedure and has been reported
to DOS.
The scores aren't getting added to the relevant scores file, so they get
the default 1.0.


Re: Interesting rule pairs Re: New Bayes like paradigm

Posted by John Hardin <jh...@impsec.org>.
On Mon, 10 Oct 2011, darxus@chaosreigns.com wrote:

> The thing I found most interesting was good ADVANCE_FEE rules that aren't
> mutable, with a score of 1.  Why aren't these mutable?  Looks like they
> would do us more good if they were included in re-scoring.

What do you mean by "not mutable"? They are subject to nightly masscheck 
and their scores do change over time.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
    "A well educated Electorate, being necessary to the liberty of a
     free State, the Right of the People to Keep and Read Books,
     shall not be infringed."
   ...means only registered voters can read books, and only those books
   obtained with State permission from State-controlled bookstores?
-----------------------------------------------------------------------
  305 days since the first successful private orbital launch (SpaceX)

Re: [SA-dev] Interesting rule pairs Re: New Bayes like paradigm

Posted by Adam Katz <an...@khopis.com>.
On 10/10/2011 01:28 PM, darxus@chaosreigns.com wrote:
> Doing a not nice rule of (RCVD_IN_DNSWL_HI && SPF_FAIL) might be fun,
> or putting !SPF_FAIL in the DNSWL rules.  Ick... *every* hit for that
> is in the dos corpora, so probably not good to add.  (Daryl, what did
> you do?)

khop-bl has a version of this.  Basically, it downplays white-DNSBL hits
that fail to match __NOT_SPOOFED (that is, messages with none of the
following: some sort of SPF pass, verified DKIM, last-external relay
authentication, or all relays being trusted).  It boosts the negative
score of white-DNSBL hits that also hit __NOT_SPOOFED (unless there are
already lots of negative points from white-DNSBLs).  This prevents overlap
issues from combining e.g. DNSWL and HOSTKARMA-White.


Re: Interesting rule pairs Re: New Bayes like paradigm

Posted by da...@chaosreigns.com.
68% of ham-net-dos.log hits __UNUSABLE_MSGID (2nd highest is 10%, wt-en3):
http://ruleqa.spamassassin.org/20111008-r1180336-n/__UNUSABLE_MSGID/detail

It would be nice to have a way to detect corpora that are an outlier on
rules like this.
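
One possible shape for such a check, as a rough Python sketch only: compare
each corpus's hit rate for a rule against the median of the other corpora
and flag the ones far above it.  The function name, the factor of 5 and the
1% floor are all hypothetical choices, not anything ruleqa does today.

```python
from statistics import median


def outlier_corpora(hit_rates, factor=5.0, floor=1.0):
    """Flag corpora whose hit rate for one rule is far above the rest.

    hit_rates maps corpus name -> percent of that corpus hitting the rule.
    factor and floor are arbitrary illustrative thresholds.
    """
    flagged = []
    for name, rate in hit_rates.items():
        others = [r for n, r in hit_rates.items() if n != name]
        if not others:
            continue  # nothing to compare a single corpus against
        # The floor keeps near-zero medians from flagging everything.
        baseline = max(median(others), floor)
        if rate > factor * baseline:
            flagged.append(name)
    return flagged


# 68% (dos) and 10% (wt-en3) are from the ruleqa page above; the other two
# corpora and their rates are made up for illustration.
example = {"dos": 68.0, "wt-en3": 10.0, "corpus-a": 3.0, "corpus-b": 1.0}
flagged = outlier_corpora(example)
```

With those example numbers only "dos" is flagged.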

-- 
"theres a lot more to life than chicks
none of it matters but theres a lot of it"
- LeRoy, #motorcycles, #EFNet, 7/18/06
http://www.ChaosReigns.com