Posted to users@spamassassin.apache.org by "Warren Togami Jr." <wt...@gmail.com> on 2011/04/12 10:39:11 UTC

Mailspike Performance

We haven't had working statistics viewing for a few weeks, but now it is 
fixed and I'm amazed by the performance of RCVD_IN_MSPIKE_BL.

http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_MSPIKE_BL/detail

RCVD_IN_MSPIKE_BL has nearly the highest spam detection ratio of all the 
DNSBL's, second only to RCVD_IN_XBL.  But our measurements also indicate 
it is detecting this huge amount of spam with a very good ham safety rating.

* 84% overlap with RCVD_IN_XBL - redundancy isn't a huge problem here 
because XBL is a tiny score.  But 84% is a surprisingly low overlap 
ratio for such a high spam-detecting rule.  This confirms that Mailspike 
is doing an excellent job of building its IP reputation database in a 
truly independent fashion.
* 67% overlap with RCVD_IN_PBL - overlap with PBL is concerning because 
PBL is a high score.  But 67% isn't too bad compared to other production 
DNSBL's.
* 58% overlap with RCVD_IN_PSBL - pretty good

Given Mailspike's sustained decent performance since late 2009, it seems 
clear that it is a great candidate for addition to spamassassin-3.4 by 
default.  It would be very interesting to see what it does to the scores 
during an automatic rescoring of the network rules.

Thoughts about Future Rescoring
===============================
Before that rescoring, we may want to have a serious discussion about 
reducing score pile-up in the case where multiple production DNSBL's all 
hit at the same time.  Adam Katz's approach is one possibility, albeit 
confusing to users because they see subtractions in the score reports. 
There may be other, better approaches to this.


In related news...
==================
http://www.spamtips.org/2011/01/dnsbl-safety-report-1232011.html
The January DNSBL Safety report found RCVD_IN_SEMBLACK to be reasonably 
safe, but at the time it overlapped with RCVD_IN_PBL 91% of the time, 
making it dangerously redundant due to PBL's high production score.

http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_SEMBLACK/detail
Our most recent measurements indicate that SEMBLACK has returned to its 
previous pattern of an extremely poor safety rating, with false 
positives on ~7% of ham from recent weeks.

It was a bad idea to use SEMBLACK earlier this year due to the high 
overlap with RCVD_IN_PBL, but this significant decline in safety rating 
is a clear indication that you should not be using RCVD_IN_SEMBLACK.

http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_HOSTKARMA_BL/detail
HOSTKARMA_BL overlaps with MSPIKE_BL 88% of the time, but detects far 
less spam and with slightly more FP's.  Compared to last year, 
HOSTKARMA_BL's safety rating has improved considerably on a sustained 
basis, and if we were evaluating it alone it wouldn't be too bad.  But 
now that we see the overlaps, HOSTKARMA_BL at this very moment is pretty 
close to a redundant and slightly less safe subset of RCVD_IN_MSPIKE_BL. 
Given these measurements, it probably isn't helpful to use HOSTKARMA_BL.

Warren Togami
warren@togami.com

Re: Mailspike Performance

Posted by Justin Mason <jm...@jmason.org>.
On Thu, Apr 14, 2011 at 22:51, Adam Katz <an...@khopis.com> wrote:
> RCVD_IN_MSPIKE_BL has 99% overlap with the SA3.3 set and 98% with the
> SA3.2 set.  That leaves 0.6758% of spam uniquely hitting this DNSBL (1%
> of its 67.5822%).  RCVD_IN_SEMBLACK has the same story, resulting in
> 0.5138% unique spam from its 1% non-overlap (though note its lower s/o).

Good point.  But what about the ham?  If it hits the same spam, but
less ham, it's a better rule.

--j.

Re: Mailspike Performance

Posted by Adam Katz <an...@khopis.com>.
On 04/12/2011 01:39 AM, Warren Togami Jr. wrote:
> We haven't had working statistics viewing for a few weeks, but now it
> is fixed and I'm amazed by the performance of RCVD_IN_MSPIKE_BL.
> 
> http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_MSPIKE_BL/detail
> 
> 
> RCVD_IN_MSPIKE_BL has nearly the highest spam detection ratio of all
> the DNSBL's, second only to RCVD_IN_XBL. But our measurements also
> indicate it is detecting this huge amount of spam with a very good
> ham safety rating.
> 
> * 84% overlap with RCVD_IN_XBL - redundancy isn't a huge problem
> here because XBL is a tiny score.  But 84% is surprisingly low
> overlap ratio for such high spam detecting rule.  This confirms that
> Mailspike is doing an excellent job of building their IP reputation
> database in a truly independent fashion.
> * 67% overlap with RCVD_IN_PBL - overlap with PBL is concerning
> because PBL is a high score.  But 67% isn't too bad compared to other
> production DNSBL's.
> * 58% overlap with RCVD_IN_PSBL - pretty good

I created a meta for testing new DNSBLs a short while ago and didn't say
anything about it:

meta	 PUBLISHED_DNSBLS	RCVD_IN_XBL || RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SORBS_DUL || RCVD_IN_SORBS_WEB || RCVD_IN_BL_SPAMCOP_NET || RCVD_IN_RP_RNBL
tflags	 PUBLISHED_DNSBLS	net nopublish	# 20110127

meta	 PUBLISHED_DNSBLS_BRBL	PUBLISHED_DNSBLS || RCVD_IN_BRBL_LASTEXT
tflags	 PUBLISHED_DNSBLS_BRBL	net nopublish	# 20110127

RCVD_IN_MSPIKE_BL has 99% overlap with the SA3.3 set and 98% with the
SA3.2 set.  That leaves 0.6758% of spam uniquely hitting this DNSBL (1%
of its 67.5822%).  RCVD_IN_SEMBLACK has the same story, resulting in
0.5138% unique spam from its 1% non-overlap (though note its lower s/o).

I'm guessing we have enough lists that they're all around this ballpark,
though we can't prove that without adding seven more meta rules (or
merely grepping the spam.log files).
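The "grepping the spam.log files" route can be sketched without any extra meta rules.  The snippet below is purely illustrative, not the official mass-check tooling: it assumes you've already parsed each spam.log line into the set of rule names that hit that message (the parsing itself depends on your log format).

```python
# Sketch: estimate the fraction of spam a candidate DNSBL catches that no
# already-published DNSBL caught.  Input is one set of rule names per
# spam message, e.g. parsed from a mass-check spam.log.

PUBLISHED = {
    "RCVD_IN_XBL", "RCVD_IN_PBL", "RCVD_IN_PSBL",
    "RCVD_IN_SORBS_DUL", "RCVD_IN_SORBS_WEB",
    "RCVD_IN_BL_SPAMCOP_NET", "RCVD_IN_RP_RNBL",
}

def unique_hit_fraction(hit_sets, candidate, published=PUBLISHED):
    """Fraction of all spam where `candidate` hits but no published rule does."""
    if not hit_sets:
        return 0.0
    unique = sum(1 for hits in hit_sets
                 if candidate in hits and not (hits & published))
    return unique / len(hit_sets)
```

Running this once per candidate list avoids adding a meta rule per comparison.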


Re: Mailspike Performance

Posted by RW <rw...@googlemail.com>.
On Mon, 11 Apr 2011 22:39:11 -1000
"Warren Togami Jr." <wt...@gmail.com> wrote:

> We haven't had working statistics viewing for a few weeks, but now it
> is fixed and I'm amazed by the performance of RCVD_IN_MSPIKE_BL.
> 
> http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_MSPIKE_BL/detail
> 
> RCVD_IN_MSPIKE_BL has nearly the highest spam detection ratio of all
> the DNSBL's, second only to RCVD_IN_XBL.  But our measurements also
> indicate it is detecting this huge amount of spam with a very good
> ham safety rating.
> 
> * 84% overlap with RCVD_IN_XBL - 
> ...
> * 58% overlap with RCVD_IN_PSBL - pretty good
> 

IMO these overlaps aren't very important. Two lists could have a high
overlap because they both incorporate XBL, or they may just be
independently very good in the same niche.

What is important is the overlap in FPs.  If a list has a weak FP
correlation with other lists, it's worthy of separate scoring even
if there's 99% spam overlap.  If a list's FPs are strongly correlated
with those of other lists, it may be better to exclude it, or to
incorporate or mitigate it through metarules.

Getting FP overlap figures is a little more difficult, but they are
much more objective.
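As a rough sketch of that measurement: FP overlap can be computed over ham mass-check results the same way spam overlap is.  The helper below is illustrative only (the name and the parsed-hit-set input are assumptions, not existing tooling):

```python
# Sketch: of the ham messages rule_a falsely hits, what fraction are also
# hit by rule_b?  High values suggest strongly correlated FPs.
# Input: one set of rule names per ham message, e.g. parsed from ham.log.

def fp_overlap(ham_hit_sets, rule_a, rule_b):
    a_fps = [hits for hits in ham_hit_sets if rule_a in hits]
    if not a_fps:
        return 0.0  # rule_a has no FPs in this corpus
    return sum(1 for hits in a_fps if rule_b in hits) / len(a_fps)
```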

Re: Overlapping blacklists Re: Mailspike Performance

Posted by da...@chaosreigns.com.
Here's one possibility, which would need to be rearranged whenever the
scores assigned to the tests change order.

Basically, only hit a blacklist rule if you're not hitting another,
higher-scoring blacklist rule.  That should pretty much fix the problems
of blacklist overlap: false positives from higher rates of overlap than
we're seeing in mass-checks, and generated scores for blacklist rules
being reduced to avoid false positives on overlaps, resulting in
unnecessarily low scores on spams where only one of the blacklists hits.

Same for whitelists.  

There are certainly more elegant ways to implement this (that wouldn't
require rebuilding if scores change order), but they would require
modification to the scoring algorithm and rescorer.


Current scores for all RCVD_IN_* rules with a positive score (set 1,
net, no bayes), sorted by score:

RCVD_IN_PBL 3.558
RCVD_IN_PSBL 2.700
RCVD_IN_SBL 2.596
RCVD_IN_SORBS_HTTP 2.499
RCVD_IN_SORBS_SOCKS 2.443
RCVD_IN_NJABL_RELAY 1.881
RCVD_IN_BRBL_LASTEXT 1.644
RCVD_IN_NJABL_SPAM 1.466
RCVD_IN_RP_RNBL 1.284
RCVD_IN_BL_SPAMCOP_NET 1.246
RCVD_IN_CSS 1.0
RCVD_IN_XBL 0.724
RCVD_IN_SORBS_WEB 0.614
RCVD_IN_NJABL_PROXY 0.208
RCVD_IN_SORBS_DUL 0.001


Change all of those to 0.001, and add, in descending order of score:

meta RCVD_IN_WORST_PBL RCVD_IN_PBL

meta RCVD_IN_WORST_PSBL RCVD_IN_PSBL && !(RCVD_IN_PBL)

meta RCVD_IN_WORST_SBL RCVD_IN_SBL && !(RCVD_IN_PBL || RCVD_IN_PSBL)

meta RCVD_IN_WORST_SORBS_HTTP RCVD_IN_SORBS_HTTP && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL)

meta RCVD_IN_WORST_SORBS_SOCKS RCVD_IN_SORBS_SOCKS && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP)

meta RCVD_IN_WORST_NJABL_RELAY RCVD_IN_NJABL_RELAY && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS)

meta RCVD_IN_WORST_BRBL_LASTEXT RCVD_IN_BRBL_LASTEXT && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY)

meta RCVD_IN_WORST_NJABL_SPAM RCVD_IN_NJABL_SPAM && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT)

meta RCVD_IN_WORST_RP_RNBL RCVD_IN_RP_RNBL && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM)

meta RCVD_IN_WORST_BL_SPAMCOP_NET RCVD_IN_BL_SPAMCOP_NET && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM || RCVD_IN_RP_RNBL)

meta RCVD_IN_WORST_CSS RCVD_IN_CSS && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM || RCVD_IN_RP_RNBL || RCVD_IN_BL_SPAMCOP_NET)

meta RCVD_IN_WORST_XBL RCVD_IN_XBL && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM || RCVD_IN_RP_RNBL || RCVD_IN_BL_SPAMCOP_NET || RCVD_IN_CSS)

meta RCVD_IN_WORST_SORBS_WEB RCVD_IN_SORBS_WEB && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM || RCVD_IN_RP_RNBL || RCVD_IN_BL_SPAMCOP_NET || RCVD_IN_CSS || RCVD_IN_XBL)

meta RCVD_IN_WORST_NJABL_PROXY RCVD_IN_NJABL_PROXY && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM || RCVD_IN_RP_RNBL || RCVD_IN_BL_SPAMCOP_NET || RCVD_IN_CSS || RCVD_IN_XBL || RCVD_IN_SORBS_WEB)

meta RCVD_IN_WORST_SORBS_DUL RCVD_IN_SORBS_DUL && !(RCVD_IN_PBL || RCVD_IN_PSBL || RCVD_IN_SBL || RCVD_IN_SORBS_HTTP || RCVD_IN_SORBS_SOCKS || RCVD_IN_NJABL_RELAY || RCVD_IN_BRBL_LASTEXT || RCVD_IN_NJABL_SPAM || RCVD_IN_RP_RNBL || RCVD_IN_BL_SPAMCOP_NET || RCVD_IN_CSS || RCVD_IN_XBL || RCVD_IN_SORBS_WEB || RCVD_IN_NJABL_PROXY)



I would expect rescoring to give all of these higher scores.
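Since that chain has to be rebuilt by hand whenever the scores change order, a small generator could emit it from the score table instead.  This is only a sketch of that idea (the function name is made up; the output mirrors the hand-written rules above, not any shipping tool):

```python
# Sketch: regenerate the RCVD_IN_WORST_* meta-rule chain from a score
# table, so a score reordering only requires re-running this script.

def worst_meta_rules(scores):
    lines = []
    higher = []  # rules already emitted, i.e. those with higher scores
    for rule in sorted(scores, key=scores.get, reverse=True):
        name = rule.replace("RCVD_IN_", "RCVD_IN_WORST_")
        guard = " && !(%s)" % " || ".join(higher) if higher else ""
        lines.append("meta %s %s%s" % (name, rule, guard))
        higher.append(rule)
    return lines

# Feed it the full score table above, e.g.:
# worst_meta_rules({"RCVD_IN_PBL": 3.558, "RCVD_IN_PSBL": 2.700, ...})
```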

-- 
"It is the first responsibility of every citizen to question authority."
- Benjamin Franklin
http://www.ChaosReigns.com

One meta to rule them all (Overlapping blacklists)

Posted by Adam Katz <an...@khopis.com>.
On 04/11, Warren Togami Jr. wrote:
>>>> Before that rescoring, we may want to have a serious
>>>> discussion about reducing score pile-up in the case where
>>>> multiple production DNSBL's all hit at the same time.  Adam
>>>> Katz' approach is one possibility, albeit confusing to users
>>>> because users see subtractions in the score reports.  There may
>>>> be other better approaches to this.

On 04/12/2011 12:59 PM, darxus@chaosreigns.com wrote:
>>> What was Adam Katz's approach?  Not using black or white lists 
>>> just because they overlap is unfortunate.  So is the reduction of
>>> generated scores that overlap probably causes.

I have two proposals, both of which have been mentioned here in the
past.  Warren was referring to the first:


1. One meta to rule them all

This is very simple.  All it should require is removing the 'nopublish'
flag from PUBLISHED_DNSBLS (and probably renaming it to something like
"RCVD_IN_DNSBL" or "DNSBL" to avoid confusion).

This should result in a large score for the meta and therefore reduced
scores for the individual rules.  However, as the GA isn't always
rational and could miss the overlap and create dangerous scores, we
might have to manually score the meta (and/or the lookups).

This was mentioned (not for the first time) at
http://old.nabble.com/I-want-MORE-SPAM---MORE-SPAM-tt23599323.html#a23602101

It should be noted that KHOP_DNSBL_ADJ and KHOP_DNSBL_BUMP (from my
khop-bl sa-update channel) implement this as a third-party hack.  The
former's purpose is to detect when the score has been brought too high
and reduce it, while the latter focuses on when the score isn't high
enough.  Such a hack is very messy, and happily completely unnecessary
upstream given a rule like PUBLISHED_DNSBLS.

Also note that this process is replicated in KHOP_URIBL_ADJ and there is
a similar trick for whitelists in KHOP_RCVD_TRUST.

Since I've kept these rules out of subversion, you'll have to view the
channel itself.  I have a copy of the relevant rule file at:

http://khopis.com/sa/khop-bl/khop-bl.cf

On 04/14/2011 06:26 AM, Greg Troxel wrote:
>> I suggest adding a metarule to combine two blacklists or two
>> whitelists, and see what the existing score-generation procedure
>> gives it.  If my idea is confused, then most such metarules might
>> have near-zero scores. If one ends up with A=2 B=4 and A_and_B
>> getting -1, that validates the concept.
>> 
>> This is sort of like KHOP_DNSBL_BUMP, but letting the GA set the
>> value.

Yes, exactly my intent.  I couldn't do that on the channel without
re-scoring upstream rules, which I really didn't want to do.

On 04/14/2011 07:58 AM, John Hardin wrote:
> I'd first verify the assumption that the score generator will
> generate negative scores. I don't know that it does not, but there
> are only 56 rules with negative scores and almost all look manually
> assigned. I suspect that automatic generation of negative scores is
> intentionally suppressed to inadvertently avoid opening up "magical
> bypass" rules for spammers.

We shouldn't need negative scores.  With the adjuster in the picture, it
should get the big score and the RCVD_IN_* dependencies will have
reduced scores.  ... BIG POTENTIAL HURDLE:  users who have tweaked the
existing rules will have a very high FP risk.  The best solution is
therefore to rename everything (yuck!).

Regarding desirable negative rules ... tflags nice is a really bad idea
since this isn't a nice rule.  KHOP_DNSBL_ADJ is (probably) a unique
type of case in which a spam rule needs a negative score.

>> Perhaps Adam can explain where those scores come from - I certainly 
>> think they are a good manual guess, but it would be interesting if
>> it's more than that.

The multipliers in KHOP_DNSBL_ADJ are generated from the scores of the
rules they modify so as to approximate the total score coming from the
rules in question.  I don't keep them in perfect sync (it doesn't matter
too much unless they change dramatically).  As to the score for
KHOP_DNSBL_ADJ: that came from the calculated average of the messages it
was hitting (some math is present in the comments), with the aspiration
of reducing the total DNSBL score below five.

KHOP_DNSBL_BUMP follows a similar philosophy: if a highly trustworthy
DNSBL is hit AND the combined DNSBL score isn't already too high, it's
safe to add a few points.  Its two-point score is from my own judgment.



(That was long enough for one email.  My second proposal, regarding a
new breed of short-circuiting that would prevent frivolous rule checks
including DNSBLs, will be sent in its own email.)


Overlapping blacklists Re: Mailspike Performance

Posted by da...@chaosreigns.com.
On 04/11, Warren Togami Jr. wrote:
> Before that rescoring, we may want to have a serious discussion
> about reducing score pile-up in the case where multiple production
> DNSBL's all hit at the same time.  Adam Katz' approach is one
> possibility, albeit confusing to users because users see
> subtractions in the score reports.  There may be other better
> approaches to this.

What was Adam Katz's approach?  Not using black or white lists just because
they overlap is unfortunate.  So is the reduction of generated scores that
overlap probably causes.

If we had enough people participating in mass-checks, it would probably be
best to have a separate test for each possible combination of blacklists
(and whitelists).

Might be best to, say, create rule categories (blacklist, whitelist), and
if more than one rule hits from a given category, only use the one with the
largest (absolute) value?

Which would complicate the rescorer.  Or might be possible to do by
modifying the tests, but seems messy.  
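The "only use the strongest hit per category" tallying could look roughly like this (purely illustrative: the names and category map are made up, and this is not how SpamAssassin's scorer actually works):

```python
# Sketch: when several rules from the same category (e.g. "blacklist")
# hit one message, count only the one with the largest absolute score;
# rules with no category are summed as usual.

def tally(hits, scores, categories):
    total = 0.0
    best = {}  # category -> score with the largest magnitude seen so far
    for rule in hits:
        cat = categories.get(rule)
        if cat is None:
            total += scores[rule]
        elif abs(scores[rule]) > abs(best.get(cat, 0.0)):
            best[cat] = scores[rule]
    return total + sum(best.values())
```

With A=2 and B=4 both in the "blacklist" category, a message hitting both contributes 4, not 6.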

-- 
"Democracy is the theory that the common people know what they want,
and deserve to get it good and hard." - H. L. Mencken
http://www.ChaosReigns.com

Re: Mailspike Performance

Posted by Alex <my...@gmail.com>.
Hi,

> http://ruleqa.spamassassin.org/20110409-r1090548-n/T_RCVD_IN_HOSTKARMA_BL/detail
> HOSTKARMA_BL overlaps with MSPIKE_BL 88% of the time, but detects far fewer
> spam and with slightly more FP's.  Compared to last year, HOSTKARMA_BL's
> safety rating has improved considerably on a sustained basis, and if we were
> evaluating it alone it wouldn't be too bad.  But now that we see the
> overlaps, HOSTKARMA_BL at this very moment is pretty close to a redundant
> and slightly less safe subset of RCVD_IN_MSPIKE_BL.  Given these
> measurements, it probably isn't helpful to use HOSTKARMA_BL.

What is the recommended score for the MSPIKE rule(s)?

I currently have it set to 2.1. Should/can it be set higher?

Thanks for this update.

Alex

Re: Mailspike Performance

Posted by da...@chaosreigns.com.
On 04/14, John Hardin wrote:
> I'd first verify the assumption that the score generator will
> generate negative scores. I don't know that it does not, but there

It does, I've tried it with the DNSWL rules.

> If this is indeed the case, then maybe we need a tflags option to
> tell the score generator "this rule's score is allowed to go
> negative".

I suspect that's the function of "nice".

-- 
"The whole aim of practical politics is to keep the populace alarmed --
and hence clamorous to be led to safety -- by menacing it with an endless
series of hobgoblins, all of them imaginary." - H. L. Mencken
http://www.ChaosReigns.com

Re: Mailspike Performance

Posted by John Hardin <jh...@impsec.org>.
On Thu, 14 Apr 2011, Greg Troxel wrote:

> I suggest adding a metarule to combine two blacklists or two whitelists,
> and see what the existing score-generation procedure gives it.  If my
> idea is confused, then most such metarules might have near-zero scores.
> If one ends up with A=2 B=4 and A_and_B getting -1, that validates the
> concept.

I'd first verify the assumption that the score generator will generate 
negative scores. I don't know that it does not, but there are only 56 
rules with negative scores and almost all look manually assigned. I 
suspect that automatic generation of negative scores is intentionally 
suppressed to avoid inadvertently opening up "magical bypass" rules for 
spammers.

If this is indeed the case, then maybe we need a tflags option to tell the 
score generator "this rule's score is allowed to go negative".

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   My sidearm is a piece of emergency equipment. It absolutely must
   be reliable, not "smart".
-----------------------------------------------------------------------
  Today: the 146th anniversary of Lincoln's assassination

Re: Mailspike Performance

Posted by Greg Troxel <gd...@ir.bbn.com>.
darxus@chaosreigns.com writes:

>> If we force "not listed in any" to zero, sort of like rules not hitting
>> is zero score, then for 2 BLs we have 3 rules: A, B and A+B.  If A gets
>> 2 points and B 1 and they largely overlap, then it seems very likely
>> that A+B deserves 2.2ish rather than 3.  If one accepts the "score the
>
> How about giving A+B 2, the greater of the values for A and B?

The problem is that this is an artificial choice that constrains the
score for "A+B" and "A" to be the same.  If it turns out that A and B
are mostly independent, it might be that score(A+B) should be closer to
score(A) + score(B).

>> I suggest adding infrastructure to declare a set of k scoring rules as
>> non-independent, which has the effect of adding 2^k-k-1 joint-situation
>> rules that can then be assigned scores different from the sum of the
>> individual scores.  For k=3, one would need 7 rules total, and thus 4
>> more (AB, AC, BC, ABC).
>
> If we had sufficient mass-check participants, I agree that would probably
> be optimal.  But it looks like we're dealing with k=15, so you're talking
> about 32,752 more rules for 15 blacklists.  And about as many more for
> whitelists.  Exponents can be a bitch.

Agreed.

> So what do you think about adding the grouped-rule declaration, as you
> suggested, but instead of creating many more rules, when scores are being
> tallied for an email, only use the largest score hit out of any rule group?

I would suggest to use the full method, but at first to only group
whitelists/blacklists that we think are having problems due to
overlapping.  One could do score generation runs with various pairs in
groups and look at the answers.

I don't know what the results are going to be, but I suspect that seeing
the results of a half-dozen groupings would be very illuminating.

> Let those float in rescoring, the same way they're tallied, and the
> blacklist (and whitelist) tests should end up with larger scores, since
> they aren't forced to be lowered by overlap.  I bet a couple of them would
> float over 5.

I suspect they wouldn't, since any amount of FP in the strongest rule
will pull the score down.   But really I don't know.


I suggest adding a metarule to combine two blacklists or two whitelists,
and see what the existing score-generation procedure gives it.  If my
idea is confused, then most such metarules might have near-zero scores.
If one ends up with A=2 B=4 and A_and_B getting -1, that validates the
concept.

This is sort of like KHOP_DNSBL_BUMP, but letting the GA set the value.

Perhaps Adam can explain where those scores come from - I certainly
think they are a good manual guess, but it would be interesting if it's
more than that.

Re: Mailspike Performance

Posted by da...@chaosreigns.com.
On 04/12, Greg Troxel wrote:
> Do you mean rules like KHOP_DNSBL_BUMP and KHOP_DNSBL_ADJ?

I think so.

> The current score-setting algorithm seems to assume orthogonal rules, or
> rather a set of rules that test independent properties.  DNSBLs (and
> DNSWLs) are fundamentally different, because they are different entity's
> estimates of a single property.

Yep.

> If we force "not listed in any" to zero, sort of like rules not hitting
> is zero score, then for 2 BLs we have 3 rules: A, B and A+B.  If A gets
> 2 points and B 1 and they largely overlap, then it seems very likely
> that A+B deserves 2.2ish rather than 3.  If one accepts the "score the

How about giving A+B 2, the greater of the values for A and B?

> I suggest adding infrastructure to declare a set of k scoring rules as
> non-independent, which has the effect of adding 2^k-k-1 joint-situation
> rules that can then be assigned scores different from the sum of the
> individual scores.  For k=3, one would need 7 rules total, and thus 4
> more (AB, AC, BC, ABC).

If we had sufficient mass-check participants, I agree that would probably
be optimal.  But it looks like we're dealing with k=15, so you're talking
about 32,752 more rules for 15 blacklists.  And about as many more for
whitelists.  Exponents can be a bitch.


So what do you think about adding the grouped-rule declaration, as you
suggested, but instead of creating many more rules, when scores are being
tallied for an email, only use the largest score hit out of any rule group?

Let those float in rescoring, the same way they're tallied, and the
blacklist (and whitelist) tests should end up with larger scores, since
they aren't forced to be lowered by overlap.  I bet a couple of them would
float over 5.

-- 
"Let's just say that if complete and utter chaos was lightning, then
he'd be the sort to stand on a hilltop in a thunderstorm wearing wet
copper armour and shouting 'All gods are bastards'." - The Color of Magic
http://www.ChaosReigns.com

Re: Mailspike Performance

Posted by Greg Troxel <gd...@ir.bbn.com>.
  Thoughts about Future Rescoring
  ===============================
  Before that rescoring, we may want to have a serious discussion about
  reducing score pile-up in the case where multiple production DNSBL's
  all hit at the same time.  Adam Katz' approach is one possibility,
  albeit confusing to users because users see subtractions in the score
  reports. There may be other better approaches to this.

Do you mean rules like KHOP_DNSBL_BUMP and KHOP_DNSBL_ADJ?

The current score-setting algorithm seems to assume orthogonal rules, or
rather a set of rules that test independent properties.  DNSBLs (and
DNSWLs) are fundamentally different, because they are different
entities' estimates of a single property.

Consider a world where 100K IP addresses send spam, and there are 8
DNSBLs.  Some list 80K, some only 10K, and some list non-spammy
addresses.  Absent concerns about training on noise, one could take all
256 combinations of listed/not-listed, treat each combination as a
separate situation, and assign each a score.  The problem with this
approach is that as you get k blacklists, 2^k becomes big and the
number of messages in many bins becomes too small.

If we force "not listed in any" to zero, sort of like rules not hitting
having zero score, then for 2 BLs we have 3 rules: A, B, and A+B.  If A
gets 2 points and B 1, and they largely overlap, then it seems very
likely that A+B deserves 2.2-ish rather than 3.  If one accepts the
"score the overall situation" premise, letting all 3 scores float, then
the current method is much like forcing the 3 scores into a particular
relationship that may not make sense.

I suggest adding infrastructure to declare a set of k scoring rules as
non-independent, which has the effect of adding 2^k-k-1 joint-situation
rules that can then be assigned scores different from the sum of the
individual scores.  For k=3, one would need 7 rules total, and thus 4
more (AB, AC, BC, ABC).
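The 2^k - k - 1 count comes from taking every subset of two or more rules in the group (singletons keep their existing rules, and the empty "not listed in any" case scores zero).  A quick sketch:

```python
from itertools import combinations

# Sketch: enumerate the joint-situation rules needed for a group of k
# non-independent rules: every subset of size >= 2, i.e. 2^k - k - 1.

def joint_rules(rules):
    rules = sorted(rules)
    return [subset for r in range(2, len(rules) + 1)
            for subset in combinations(rules, r)]
```

For k=3 this yields the 4 extra rules (AB, AC, BC, ABC); for k=15 it is 32,752.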

Then, when finding sets of rules that have high overlap in the corpus
with the additional property that the rules are differing evidence of
the same underlying property, we could add a grouped-rule declaration.

Arguably almost all rules are correlated.  But the real problem is that
ham coming from blacklisted IP addresses is given multiple penalties
calculated under an incorrect assumption of independence.  (The same
problem exists for spam from whitelisted IP addresses.)  So perhaps we
need a way to search for correlations in need of addressing, which I'd
define as the A+B score being significantly different from the sum of
the A and B scores.