Posted to users@spamassassin.apache.org by Alex <my...@gmail.com> on 2010/04/17 21:30:51 UTC

Re: cleanup for DNSBLs

Hi Adam,

Some time ago you posted that you were investigating the stats and
effectiveness of a few rules in your masschecks sandbox, and thought I
would see if you had made any progress, and found anything helpful?

Posted below...

Thanks,
Alex

On Mon, Nov 23, 2009 at 8:34 PM, Adam Katz <an...@khopis.com> wrote:
> Unless there are objections, I'm going to add two tests to my sandbox:
>
> RCVD_IN_NIX_SPAM, a new (to us) DNSBL populated by the same source as
> the original [N]iXhash zone, with results on intra2net that look quite
> promising:  72.98:0.12 spam:ham (PSBL has 48.69:0.36),
> http://www.intra2net.com/en/support/antispam/blacklist.php_dnsbl=RCVD_IN_NIX_SPAM.html
>
> RCVD_IN_SPAMCOP, a fix-up of SpamCop to limit it to the last external
> relay (just like every other DNSBL used by SpamAssassin).
>
> While digging around there, I noticed that SpamCop and ham rule
> RCVD_IN_BSP_TRUSTED are the only rules to use check_rbl_txt(), which
> affords it a nicer explanation of what triggered the spam.  For a
> fully apples-to-apples comparison, my fix-up reverts back to plain-old
> check_rbl() ... which unfortunately means a second DNS lookup (since
> we're looking for an A record rather than a TXT record).
>
> Both will be marked "nopublish" until we have stats to motivate us.
>
>
> check_rbl_txt() gives quite informative data, and it's supported by
> every DNSBL I've tried (all below).  RCVD_IN_NIX_SPAM supports it
> (though my test will avoid it until we can determine there isn't a bug
> in lookups here), as do BRBL and others.  Assuming a lack of bugs or
> efficiency problems, we should probably use it for any index that
> doesn't contain multiple indices (like zen).
>
> Examples:
>
> $ host -t txt 11.70.132.91.ix.dnsbl.manitu.net.
> 11.70.132.91.ix.dnsbl.manitu.net descriptive text "Spam sent to the
> mailhost mx.selfip.biz was detected by NiX Spam at Mon, 23 Nov 2009
> 23:31:24 +0100, see
> http://www.dnsbl.manitu.net/lookup.php?value=91.132.70.11"
> $ host -t txt 11.70.132.91.bb.barracudacentral.org
> 11.70.132.91.bb.barracudacentral.org descriptive text
> "http://www.barracudanetworks.com/reputation/?pr=1&ip=91.132.70.11"
> $ host -t txt 11.70.132.91.bl.spamcop.net.
> 11.70.132.91.bl.spamcop.net descriptive text "Blocked - see
> http://www.spamcop.net/bl.shtml?91.132.70.11"
> $ host -t txt 11.70.132.91.psbl.surriel.com.
> 11.70.132.91.psbl.surriel.com descriptive text "Listed in PSBL, see
> http://psbl.surriel.com/listing?ip=91.132.70.11"
> $ host -t txt 11.70.132.91.bl.spameatingmonkey.net.
> 11.70.132.91.bl.spameatingmonkey.net descriptive text "listed, see
> http://spameatingmonkey.com/lookup/91.132.70.11"
>
> (If you're wondering, that IP is listed as the #1 offender by spamcop,
> so it hits all of them.  127.0.0.2 gives unrepresentative responses
> since it is a test entry and is often labeled as such.)
>

Re: cleanup for DNSBLs

Posted by Alex <my...@gmail.com>.
Hi Adam,

>> Some time ago you posted that you were investigating the stats and
>> effectiveness of a few rules in your masschecks sandbox, and thought
>> I would see if you had made any progress, and found anything
>> helpful?
>
> Yeah, analysis (and writing it up) is time-consuming and I was putting
> it off.  Here it is.

Thanks for the info. Hope to see further analysis of your efforts in the future.

Best,
Alex

Re: cleanup for DNSBLs

Posted by Adam Katz <an...@khopis.com>.
On 04/17/2010 03:30 PM, Alex wrote:
> Some time ago you posted that you were investigating the stats and 
> effectiveness of a few rules in your masschecks sandbox, and thought
> I would see if you had made any progress, and found anything
> helpful?

Yeah, analysis (and writing it up) is time-consuming and I was putting
it off.  Here it is.

> On Mon, Nov 23, 2009 at 8:34 PM, Adam Katz <an...@khopis.com> wrote:
>> Unless there are objections, I'm going to add two tests to my sandbox:
>>
>> RCVD_IN_NIX_SPAM, a new (to us) DNSBL populated by the same source as
>> the original [N]iXhash zone, with results on intra2net that look quite
>> promising:  72.98:0.12 spam:ham (PSBL has 48.69:0.36),
>> http://www.intra2net.com/ [...]

 DateRev    SPAM%     HAM%     S/O   RANK   NAME
20091219   6.0855   0.0158   0.997   0.91   T_RCVD_IN_NIX_SPAM
20091226   6.6822   0.0171   0.997   0.91   T_RCVD_IN_NIX_SPAM
20100116   8.8194   0.0079   0.999   0.93   T_RCVD_IN_NIX_SPAM
20100123   9.6367   0.0060   0.999   0.94   T_RCVD_IN_NIX_SPAM

Here are all the results ruleqa was willing to yield.  I've removed the
runs with fewer than about a million spams, since data from smaller runs
is non-representative for most rules.  After January, ruleqa stopped
evaluating the rule (and RCVD_IN_SPAMCOP) altogether, so I'm not
confident in the results as they never leveled out.
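
For anyone unfamiliar with the S/O column: it appears to be the share of a
rule's hits that land on spam, computed from the two percentage columns as
SPAM% / (SPAM% + HAM%).  That formula is an assumption on my part rather
than ruleqa's documented definition, but it reproduces the S/O values in
these tables.  A minimal Python sketch (the function name is made up for
illustration):

```python
def s_o_ratio(spam_pct: float, ham_pct: float) -> float:
    """Fraction of a rule's hits that are spam, from hit percentages.

    Assumed formula: SPAM% / (SPAM% + HAM%).
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit, S/O is undefined")
    return spam_pct / total


# (DateRev, SPAM%, HAM%) rows copied from the T_RCVD_IN_NIX_SPAM table.
rows = [
    ("20091219", 6.0855, 0.0158),
    ("20091226", 6.6822, 0.0171),
    ("20100116", 8.8194, 0.0079),
    ("20100123", 9.6367, 0.0060),
]

for date_rev, spam_pct, ham_pct in rows:
    print(date_rev, f"{s_o_ratio(spam_pct, ham_pct):.3f}")
```

Because it works on percentages rather than raw counts, the ratio does not
depend on how much spam versus ham a given corpus happens to contain.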

Based on those numbers, NiX performs quite well, but not well enough to
justify including it in SA proper, as it just creates too much DNS traffic.

Jari Fredricksson's recent "Top Ten Rules" post to the list has
RCVD_IN_NIX_SPAM ranked 11th (he posted 20 rules; "Ten" was in the
thread name) with 72.29% spam versus 0.16% ham at 0.998 S/O (total
ham+spam corpus = 20293).  Jari is in NE Europe, like this DNSBL's
spamtrap fodder.  My company gets over 17.6% spam on NiX as well.

>> RCVD_IN_SPAMCOP, a fix-up of SpamCop to limit it to the last
>> external relay (just like every other DNSBL used by SpamAssassin).

This again only found four useful trials.  The results show that SpamCop
is indeed a well-maintained DNSBL with a very low FP rate, but it
doesn't have the sheer volume of the others.

 DateRev    SPAM%     HAM%     S/O   RANK   NAME
20091219  11.9204   0.0390   0.997   0.89   T_RCVD_IN_SPAMCOP
20091226  10.4777   0.0367   0.997   0.88   T_RCVD_IN_SPAMCOP
20100116  12.2375   0.0953   0.992   0.81   T_RCVD_IN_SPAMCOP
20100123  13.7493   0.0324   0.998   0.90   T_RCVD_IN_SPAMCOP

Compared to the full parsing of headers:

 DateRev    SPAM%     HAM%     S/O   RANK   NAME
20091219  57.4236   1.8637   0.969   0.62   RCVD_IN_BL_SPAMCOP_NET
20091226  57.1671   1.7706   0.970   0.62   RCVD_IN_BL_SPAMCOP_NET
20100116  58.6552   1.7156   0.972   0.62   RCVD_IN_BL_SPAMCOP_NET
20100123  59.0184   1.6012   0.974   0.62   RCVD_IN_BL_SPAMCOP_NET

... it would be a shame to strike spamcop, but it doesn't really seem
like much of a player (because it doesn't use spamtraps).  In fact, its
lack of spamtraps suggests keeping it because it's capable of listing
spammers that successfully avoid spamtraps.  Maybe I'll open a bug to
use the lastexternal version instead of the current one.
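
To illustrate why the two variants differ so much in both SPAM% and HAM%,
here is a toy Python sketch of "check every relay" versus "check only the
last external relay".  It is not SpamAssassin's implementation: the
function names are invented, relays are assumed to be ordered oldest hop
first, and every address except the one from the examples is a
documentation placeholder.

```python
from typing import Iterable, Sequence


def hits_any_relay(relays: Sequence[str], listed: Iterable[str]) -> bool:
    """Full header parsing: hit if any relay in the chain is listed."""
    listed_set = set(listed)
    return any(ip in listed_set for ip in relays)


def hits_last_external(relays: Sequence[str], listed: Iterable[str]) -> bool:
    """lastexternal: hit only if the relay that handed us the mail is listed."""
    return bool(relays) and relays[-1] in set(listed)


LISTED = {"91.132.70.11"}  # the address used in the host examples

# Listed host delivers straight to us: both variants hit.
direct = ["203.0.113.5", "91.132.70.11"]
# Listed host appears earlier in the chain, then a clean relay forwards it:
# only the full-header variant hits.
forwarded = ["91.132.70.11", "198.51.100.7"]

for name, chain in (("direct", direct), ("forwarded", forwarded)):
    print(name, hits_any_relay(chain, LISTED), hits_last_external(chain, LISTED))
```

The full-header variant catches more spam but also fires on ham that merely
passed through a listed host earlier in its path, which matches the
direction of the numbers above.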

>> While digging around there, I noticed that SpamCop and ham rule 
>> RCVD_IN_BSP_TRUSTED are the only rules to use check_rbl_txt(),
>> which affords it a nicer explanation of what triggered the spam.
>> For a fully apples-to-apples comparison, my fix-up reverts back to
>> plain-old check_rbl() ... which unfortunately means a second DNS
>> lookup (since we're looking for an A record rather than a TXT
>> record).
>> 
>> Both will be marked "nopublish" until we have stats to motivate
>> us.
>> 
>> check_rbl_txt() gives quite informative data, and it's supported
>> by every DNSBL I've tried (all below).  RCVD_IN_NIX_SPAM supports
>> it (though my test will avoid it until we can determine there isn't
>> a bug in lookups here), as do BRBL and others.  Assuming a lack of
>> bugs or efficiency problems, we should probably use it for any index
>> that doesn't contain multiple indices (like zen).

I have no news on this front.  That was meant more as a question for
the other developers.  I suppose the TXT data is more verbose and
therefore eats more bandwidth, which may be why SA doesn't use it?
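
For reference, the A and TXT lookups discussed here use the same query
name: the IPv4 octets reversed, followed by the DNSBL zone.  A minimal
Python sketch of that construction, plus an A-record check using only the
standard library, follows.  The function names are made up for
illustration, and the TXT side is left out because it needs a resolver
library outside the standard library.

```python
import ipaddress
import socket
from typing import Optional


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Reverse the IPv4 octets and append the DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone.rstrip('.')}"


def a_record_listed(ip: str, zone: str) -> Optional[str]:
    """Return the A-record answer if the IP is listed, else None.

    Performs a live DNS query, so it is not called in this sketch.
    """
    try:
        return socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return None  # NXDOMAIN or lookup failure: treat as not listed


print(dnsbl_query_name("91.132.70.11", "ix.dnsbl.manitu.net."))
```

The printed name is the one queried with "host -t txt" in the examples; the
A lookup asks for the same name with a different record type, which is why
doing both costs a second query.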