Posted to users@spamassassin.apache.org by Jeff Chan <je...@surbl.org> on 2004/09/05 12:22:48 UTC
Setting SpamAssassin scores for SURBL lists
Eric Kolve and I were looking at how to best set the default SpamCopURI
scores for the various SURBL lists and at first we tried looking at the
SpamAssassin 3.0 perceptron-generated scores as a possible guide:
> http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
>
> # The following block of scores were generated using the mass-checking
> # scripts, and a perceptron to determine the optimum scores which
> # resulted in minimum false positives or negatives. The scores are
> # weighted to produce roughly 1 false positive in 2500 non-spam messages
> # using the default threshold of 5.0.
> score URIBL_AB_SURBL 0 2.007 0 0.417
> score URIBL_OB_SURBL 0 1.996 0 3.213
> score URIBL_PH_SURBL 0 0.839 0 2.000
> score URIBL_SC_SURBL 0 3.897 0 4.263
> score URIBL_WS_SURBL 0 0.539 0 1.462
I was trying to figure out what the different score columns meant,
to which Theo Van Dinter cited:
> $ perldoc Mail::SpamAssassin::Conf
> [...]
> If four valid scores are listed, then the score that is used
> depends on how SpamAssassin is being used. The first score is used
> when both Bayes and network tests are disabled (score set 0). The
> second score is used when Bayes is disabled, but network tests are
> enabled (score set 1). The third score is used when Bayes is
> enabled and network tests are disabled (score set 2). The fourth
> score is used when Bayes is enabled and network tests are enabled
> (score set 3).
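To make the four columns concrete, here is a minimal sketch of the lookup the perldoc text describes. The function name pick_score is only an illustration and is not part of SpamAssassin; the score values are the URIBL_SC_SURBL line quoted above.

```python
from typing import Sequence


def pick_score(scores: Sequence[float], bayes: bool, net: bool) -> float:
    """Pick one of four scores by score set (hypothetical helper).

    Score set 0: Bayes off, network tests off
    Score set 1: Bayes off, network tests on
    Score set 2: Bayes on,  network tests off
    Score set 3: Bayes on,  network tests on
    """
    if len(scores) != 4:
        raise ValueError("this sketch only handles the four-score form")
    score_set = (2 if bayes else 0) + (1 if net else 0)
    return scores[score_set]


# "score URIBL_SC_SURBL 0 3.897 0 4.263" from the quoted 50_scores.cf
sc_scores = [0, 3.897, 0, 4.263]
print(pick_score(sc_scores, bayes=False, net=True))
print(pick_score(sc_scores, bayes=True, net=True))
```

The zeros in columns one and three follow from the rules being network tests: they cannot fire when network tests are disabled.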
We wondered if we could somehow use those scores with SpamCopURI
and were unable to come up with a good answer.
Theo suggested looking at Spam versus ham rates as a good way to
set scores, to which I mentioned:
> We have these test results from Justin from 25 June:
>
> OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>   121405    22516    98889  0.185   0.00   0.00  (all messages)
>  100.000  18.5462  81.4538  0.185   0.00   0.00  (all messages as %)
>   13.453  70.3766   0.4925  0.993   1.00   1.00  SURBL_WS
>    3.807  20.3811   0.0334  0.998   0.50   1.00  SURBL_SC
>    2.650  14.2565   0.0071  1.000   0.50   1.00  SURBL_AB
>    0.019   0.0933   0.0020  0.979   0.50   1.00  SURBL_PH
>   12.624  67.6275   0.1001  0.999   0.50   1.00  SURBL_OB
>
> which shows a pretty high FP rate for WS, less for the others.
> Do you happen to have access to any more recent corpus check data
> like this? Could be useful to have another snapshot for a more
> complete picture.
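For anyone reading these tables for the first time: the S/O column in the per-rule rows above is consistent with spam% divided by (spam% + ham%). The sketch below recomputes it from Justin's figures under that assumption; the helper name s_over_o is made up for this example.

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Fraction of a rule's hit rate that comes from spam (assumed S/O formula)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit, S/O is undefined")
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from Justin's 25 June results
rows = {
    "SURBL_WS": (70.3766, 0.4925),
    "SURBL_SC": (20.3811, 0.0334),
    "SURBL_AB": (14.2565, 0.0071),
    "SURBL_PH": (0.0933, 0.0020),
    "SURBL_OB": (67.6275, 0.1001),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{name}: S/O = {s_over_o(spam_pct, ham_pct):.3f}")
```

An S/O near 1.000 means nearly all hits are on spam, but it says nothing about how many ham messages are hit in absolute terms, which is why HAM% is worth reading separately.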
Which was followed up with more data and discussion:
> On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:
>> high spam + low ham is good from an FP standpoint, but having a "significant"
>> (for your definition thereof) ham hitrate means the score shouldn't be too
>> high. My handwaving scores would be something like:
[Theo's wild guess scores for Justin's June data: -- Jeff C.]
>> WS 1.2
>> SC 2.5
>> AB 3.5
>> OB 1.8
Theo then gave some of his own stats on a couple different corpora:
>>       OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>>         416072   365031    51041  0.877   0.00   0.00  (all messages)
>>        100.000  87.7327  12.2673  0.877   0.00   0.00  (all messages as %)
>> set1    30.923  35.2466   0.0000  1.000   0.99   0.00  URIBL_SC_SURBL
>> set1    72.231  82.3273   0.0274  1.000   0.98   1.00  URIBL_OB_SURBL
>> set1    19.375  22.0847   0.0000  1.000   0.98   1.00  URIBL_AB_SURBL
>> set1    74.883  85.2939   0.4310  0.995   0.74   0.00  URIBL_WS_SURBL
>> set1     0.001   0.0000   0.0059  0.000   0.48   0.00  URIBL_PH_SURBL
>
>>       OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>>         119215    67094    52121  0.563   0.00   0.00  (all messages)
>>        100.000  56.2798  43.7202  0.563   0.00   0.00  (all messages as %)
>> set3    39.217  69.6605   0.0288  1.000   0.98   1.00  URIBL_OB_SURBL
>> set3    10.340  18.3727   0.0000  1.000   0.97   0.00  URIBL_SC_SURBL
>> set3     5.998  10.6582   0.0000  1.000   0.94   1.00  URIBL_AB_SURBL
>> set3    42.730  75.5522   0.4797  0.994   0.73   0.00  URIBL_WS_SURBL
>> set3     0.008   0.0089   0.0058  0.608   0.49   0.00  URIBL_PH_SURBL
>
>> so for these results, I'd probably do something like:
>
>> WS 1.3
>> SC 4.0
>> AB 3.0
>> OB 2.2
>
>> since the hit rates and S/O are a bit higher for me, related to the fact I ran
>> more spam through than Justin did.
To which I added:
> Those final scores look like an excellent fit to the data to me.
and:
> Also while the PH spam hit rate [from Justin's stats] is low,
> the data is of hand checked phishing scams, which deserve to be
> blocked due to their potential danger and damage.
>
> Therefore I would tend to give PH a medium-high score like
> 3 to 5.
So we'll probably adjust the default scores on SpamCopURI
to something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5
and we recommend SpamCopURI users do likewise. Please be
sure to use the latest version of SpamCopURI with
multi.surbl.org:
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high at about 0.45 to 0.5% across
multiple corpora. That FP rate needs to be reduced for WS
to be more fully useful.
I think Chris or maybe Raymond suggested that they had a way to
reduce FPs in WS further. If so, ***please*** try to apply it.
We need to get the FPs to be much less than 0.5%. The other
lists have FP rates 5 to 50 times lower.
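As a rough check on how far apart the lists are, the sketch below divides the WS ham% by each other list's ham%, using only Justin's 25 June figures quoted earlier. The ratios will differ on other corpora, and a list with zero ham hits (as in some of Theo's runs) has no finite ratio, so treat this as illustration only.

```python
# HAM% per list, copied from Justin's 25 June results
ham_pct = {
    "WS": 0.4925,
    "SC": 0.0334,
    "AB": 0.0071,
    "PH": 0.0020,
    "OB": 0.1001,
}


def fp_ratio(baseline_pct: float, other_pct: float) -> float:
    """How many times higher the baseline FP rate is than the other list's."""
    if other_pct == 0:
        raise ValueError("other list has no ham hits, ratio is undefined")
    return baseline_pct / other_pct


for name, pct in ham_pct.items():
    if name != "WS":
        print(f"WS ham% is {fp_ratio(ham_pct['WS'], pct):.1f}x that of {name}")
```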
Basically the higher the FP rate, the less useful a list is.
Does anyone have other corpus stats to share, in particular
FP rates?
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists
Posted by Jeff Chan <je...@surbl.org>.
On Sunday, September 5, 2004, 10:32:57 AM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
>> Basically the higher the FP rate, the less useful a list is.
> ... or, rather, the lower it ought to be scored.
Yes, but please remember that not everyone has the ability to
"score" their SURBL hits. Not everyone using SURBLs is using
SpamAssassin.
>> Does anyone have other corpus stats to share, in particular
>> FP rates?
Thanks for sharing your data. I know this can be a somewhat
painful subject for people, but it's very important to clean
up the false positives and make the lists better and more useful.
> Sure. All of these messages were received in the past 10 days. A lot has
> happened since June. :-)
> WS: 44004/54185s, 61/19150s
> OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>    73335    54185    19150  0.739   0.00   0.00  (all messages)
>  100.000  73.8870  26.1130  0.739   0.00   0.00  (all messages as %)
>   60.087  81.2107   0.0836  0.999   0.00   0.00  WS_SURBL
> HOWEVER... I decided to go through the ham hits (61 of them), and look
> for false positive domains to submit.
That kind of checking should become a policy. For people who can
do that kind of checking, they should do it every time. Every
tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
> I found several, but, for the most
> part, they've *already* been cleaned up and are no longer listed in WS.
> (30 out of the 61 were in a massive mailing list thread for a single
> domain that has since been whitelisted).
> And, in that 19K ham corpus, I found the following FPs still listed
> in WS:
> buckeye-express.com      -- Used in a personal email address, looks legit;
>                             7 examples
> nm.ru                    -- Used in a personal email address, looks legit
> advanstar.com            -- Legit uses; found in a well-known dental
>                             newsletter; also personal email address of
>                             one of the editors; 3 messages
> 00fun.com                -- Confirmed, more than one user on our system
>                             sent or received eCards from them
> northstarconferences.com -- Legit conference host site subscribed to
>                             by two users; 9 messages in this corpus
> mardox.com               -- Search engine; registered 1875 days ago, and
>                             *looks* like the user did actually submit
>                             their site to them.
> postsnet.com             -- Registered exactly one year ago, 51 NANAS,
>                             blank home page, ehh... but I have 4
>                             different legit newsletters with links to
>                             them.
> webspawner.com           -- Created in 1996; free host/email
> npdor.com                -- Surveys; been around since 1999. 103 NANAS,
>                             but they've been advertised by some reputable
>                             "word of the day" mailers (dictionary.com)
>                             Maybe a good candidate for UC. :-) 2
>                             examples
> imninc.com               -- Domain is 507 days old; they do newsletters.
>                             At least one of them is legit. :-)
> worldhealth.net          -- It's 3468 days old today (1995). One of our
>                             users attended a conference of theirs, and
>                             signed up for a newsletter.
> hoteldiscounts.com       -- 2459 days old (1997), found in actual room
>                             booking confirmations for Comfort Inn.
Thanks. I agree those look like false positives and have
whitelisted all of them across SURBLs. Signing up for a
newsletter and then forgetting about it does not make a
message spam.
Instead of having these go into SURBLs, they should be checked
**before** they get added. Hopefully they would be detected
then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
> (I'll re-post these in another thread, just so everybody sees them).
> AND, I found 2 spams that were incorrectly hand-classified as ham.
> So, if I take those out, the numbers look more like:
> WS: 44006/54187s, 0/19148s
> OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>    73335    54187    19148  0.739   0.00   0.00  (all messages)
>  100.007  73.8897  26.1103  0.739   0.00   0.00
>   60.087  81.2111   0.0000  1.000   0.00   0.00  WS_SURBL
> Is that more like what you had in mind..? No, I'm not making that up.
> :-)
Looks good, but this corpus is perhaps too small to make
representative measurements for emails in general. That
said, any reduction in FPs is important and welcome.
> Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
> hand-check.
> - Ryan
Thanks for your stats and checking. And yes, anyone else
with ham corpora, please check for FPs.
Jeff C.
Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists
Posted by Ryan Thompson <ry...@sasknow.com>.
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
> Basically the higher the FP rate, the less useful a list is.
... or, rather, the lower it ought to be scored.
> Does anyone have other corpus stats to share, in particular
> FP rates?
Sure. All of these messages were received in the past 10 days. A lot has
happened since June. :-)
WS: 44004/54185s, 61/19150s
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
   73335    54185    19150  0.739   0.00   0.00  (all messages)
 100.000  73.8870  26.1130  0.739   0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836  0.999   0.00   0.00  WS_SURBL
HOWEVER... I decided to go through the ham hits (61 of them), and look
for false positive domains to submit. I found several, but, for the most
part, they've *already* been cleaned up and are no longer listed in WS.
(30 out of the 61 were in a massive mailing list thread for a single
domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed
in WS:
buckeye-express.com      -- Used in a personal email address, looks legit;
                            7 examples
nm.ru                    -- Used in a personal email address, looks legit
advanstar.com            -- Legit uses; found in a well-known dental
                            newsletter; also personal email address of
                            one of the editors; 3 messages
00fun.com                -- Confirmed, more than one user on our system
                            sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to
                            by two users; 9 messages in this corpus
mardox.com               -- Search engine; registered 1875 days ago, and
                            *looks* like the user did actually submit
                            their site to them.
postsnet.com             -- Registered exactly one year ago, 51 NANAS,
                            blank home page, ehh... but I have 4
                            different legit newsletters with links to
                            them.
webspawner.com           -- Created in 1996; free host/email
npdor.com                -- Surveys; been around since 1999. 103 NANAS,
                            but they've been advertised by some reputable
                            "word of the day" mailers (dictionary.com)
                            Maybe a good candidate for UC. :-) 2
                            examples
imninc.com               -- Domain is 507 days old; they do newsletters.
                            At least one of them is legit. :-)
worldhealth.net          -- It's 3468 days old today (1995). One of our
                            users attended a conference of theirs, and
                            signed up for a newsletter.
hoteldiscounts.com       -- 2459 days old (1997), found in actual room
                            booking confirmations for Comfort Inn.
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.
So, if I take those out, the numbers look more like:
WS: 44006/54187s, 0/19148s
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
   73335    54187    19148  0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103  0.739   0.00   0.00
  60.087  81.2111   0.0000  1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up.
:-)
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
hand-check.
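One possible way to pull out the candidates for that hand-check is sketched below. It assumes the ham corpus is an mbox file whose messages were already scanned, so each carries an X-Spam-Status header with a tests= list. The function names and the default rule name are placeholders, and it counts URL hostnames as written rather than registered domains.

```python
import mailbox
import re
from collections import Counter

URL_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def rules_hit(status_header: str) -> set:
    """Return the rule names listed after tests= in an X-Spam-Status header."""
    unfolded = re.sub(r"\s+", " ", status_header)  # undo header folding
    match = re.search(r"tests=(.*?)(?: \w+=|$)", unfolded)
    if not match:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def ham_hit_domains(mbox_path: str, rule: str = "URIBL_WS_SURBL") -> Counter:
    """Count URL hostnames in ham messages that hit the given rule."""
    counts = Counter()
    for msg in mailbox.mbox(mbox_path):
        if rule not in rules_hit(str(msg.get("X-Spam-Status", ""))):
            continue
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue  # skip multipart containers and attachments
            payload = part.get_payload(decode=True) or b""
            text = payload.decode("latin-1", errors="replace")
            counts.update(host.lower() for host in URL_HOST.findall(text))
    return counts


# Example (path is hypothetical):
# for host, n in ham_hit_domains("ham.mbox").most_common(20):
#     print(n, host)
```

The output is only a list of suspects. Each hostname still needs a human to decide whether it is a real false positive or a spam message misfiled as ham.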
- Ryan
--
Ryan Thompson <ry...@sasknow.com>
SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4
Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America
Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists
Posted by Jeff Chan <je...@surbl.org>.
On Sunday, September 5, 2004, 3:30:49 AM, Raymond Dijkxhoorn wrote:
> Seeing those data it would be very interesting if we could test a separate
> list. Is that possible? I would like to test the Prolo and Joe's list
> combined, without the rest of the WS list. I can generate the data for a
> test like that. I have seen almost zero FPs in the data I compose, so
> perhaps it's better to separate the lists. I think people would benefit
> from a less FP-stuffed list. The current WS list is just compiled out of
> too many data sources, I think.
If you can make the different lists available to me by rsync,
I can easily set up some temporary local SURBLs for testing
them. Thank you rbldnsd! :-)
Unfortunately I don't have my own test corpora, so I need to
rely on the generosity of others who do. So I'd probably
need to ask Theo, Daniel, Justin or others with corpora to
test against them.
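For anyone who does have a corpus and wants to try a candidate list offline before it goes into a test SURBL, a minimal sketch of the measurement follows. It assumes each message has already been reduced to a spam/ham label plus the set of domains found in its URIs. The function name, the data layout, and the example domains are all made up for illustration.

```python
from typing import Iterable, Set, Tuple


def hit_rates(
    messages: Iterable[Tuple[bool, Set[str]]],
    listed: Set[str],
) -> Tuple[float, float]:
    """Return (SPAM%, HAM%) of messages with at least one listed domain.

    messages: (is_spam, domains_in_message) pairs from a hand-classified corpus
    listed:   candidate blocklist domains to evaluate
    """
    spam_total = ham_total = spam_hits = ham_hits = 0
    for is_spam, domains in messages:
        hit = not listed.isdisjoint(domains)
        if is_spam:
            spam_total += 1
            spam_hits += hit
        else:
            ham_total += 1
            ham_hits += hit
    spam_pct = 100.0 * spam_hits / spam_total if spam_total else 0.0
    ham_pct = 100.0 * ham_hits / ham_total if ham_total else 0.0
    return spam_pct, ham_pct


# Toy corpus with placeholder domains
corpus = [
    (True, {"pills.example"}),
    (True, {"other.example"}),
    (False, {"newsletter.example"}),
    (False, {"pills.example", "friend.example"}),
]
print(hit_rates(corpus, listed={"pills.example"}))
```

Running the same corpus against each candidate list separately would show which data source contributes the ham hits.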
Jeff C.