Posted to users@spamassassin.apache.org by Jeff Chan <je...@surbl.org> on 2004/09/05 12:22:48 UTC

Setting SpamAssassin scores for SURBL lists

Eric Kolve and I were looking at how to best set the default SpamCopURI
scores for the various SURBL lists and at first we tried looking at the
SpamAssassin 3.0 perceptron-generated scores as a possible guide:

>   http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
> 
> # The following block of scores were generated using the mass-checking
> # scripts, and a perceptron to determine the optimum scores which
> # resulted in minimum false positives or negatives.  The scores are
> # weighted to produce roughly 1 false positive in 2500 non-spam messages
> # using the default threshold of 5.0.

> score URIBL_AB_SURBL 0 2.007 0 0.417
> score URIBL_OB_SURBL 0 1.996 0 3.213
> score URIBL_PH_SURBL 0 0.839 0 2.000
> score URIBL_SC_SURBL 0 3.897 0 4.263
> score URIBL_WS_SURBL 0 0.539 0 1.462

I was trying to figure out what the different score columns meant,
to which Theo Van Dinter cited:

> $ perldoc Mail::SpamAssassin::Conf
> [...]
>    If four valid scores are listed, then the score that is used
>    depends on how SpamAssassin is being used. The first score is used
>    when both Bayes and network tests are disabled (score set 0). The
>    second score is used when Bayes is disabled, but network tests are
>    enabled (score set 1). The third score is used when Bayes is
>    enabled and network tests are disabled (score set 2). The fourth
>    score is used when Bayes is enabled and network tests are enabled
>    (score set 3).
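
As a rough sketch of the rule quoted above, the choice of column can be
written as a small lookup. The function and variable names here are
illustrative only and are not part of SpamAssassin; the four values are
copied from the URIBL_SC_SURBL line quoted earlier.

```python
def pick_score(scores, bayes_enabled, net_enabled):
    """Return the score for the active score set.

    scores: the four values from a `score` line, in file order.
    Score set index: 0 = no Bayes, no network tests
                     1 = network tests only
                     2 = Bayes only
                     3 = Bayes and network tests
    """
    if len(scores) != 4:
        raise ValueError("expected four scores")
    index = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return scores[index]


# Values copied from the URIBL_SC_SURBL line quoted above.
sc_scores = [0, 3.897, 0, 4.263]

print(pick_score(sc_scores, bayes_enabled=False, net_enabled=True))  # 3.897
print(pick_score(sc_scores, bayes_enabled=True, net_enabled=True))   # 4.263
print(pick_score(sc_scores, bayes_enabled=True, net_enabled=False))  # 0
```

The zeros in the first and third columns fit this reading: the SURBL
rules are network tests, so they carry no score in the two score sets
where network tests are disabled.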

We wondered if we could somehow use those scores with SpamCopURI
and were unable to come up with a good answer.

Theo suggested looking at spam versus ham hit rates as a good way to
set scores, to which I mentioned:

> We have these test results from Justin from 25 June:
> 
> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>  121405    22516    98889    0.185   0.00    0.00  (all messages)
> 100.000  18.5462  81.4538    0.185   0.00    0.00  (all messages as %)
>  13.453  70.3766   0.4925    0.993   1.00    1.00  SURBL_WS
>   3.807  20.3811   0.0334    0.998   0.50    1.00  SURBL_SC
>   2.650  14.2565   0.0071    1.000   0.50    1.00  SURBL_AB
>   0.019   0.0933   0.0020    0.979   0.50    1.00  SURBL_PH
>  12.624  67.6275   0.1001    0.999   0.50    1.00  SURBL_OB
> 
> which shows a pretty high FP rate for WS, less for the others.
> Do you happen to have access to any more recent corpus check data
> like this?  Could be useful to have another snapshot for a more
> complete picture.
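
For anyone reading the table for the first time: the S/O column appears
to be the spam hit rate divided by the sum of the spam and ham hit
rates, and HAM% is the FP rate discussed here. The sketch below is only
a check of that reading against the figures above; the formula is
inferred from the numbers, and the function name is made up for
illustration.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O: share of a rule's rate-normalized hits that fall on spam."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.0  # rule never fired, so there is no ratio to report
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from Justin's 25 June table above.
june = {
    "SURBL_WS": (70.3766, 0.4925),
    "SURBL_SC": (20.3811, 0.0334),
    "SURBL_AB": (14.2565, 0.0071),
    "SURBL_PH": (0.0933, 0.0020),
    "SURBL_OB": (67.6275, 0.1001),
}

for name, (spam_pct, ham_pct) in june.items():
    s_o = spam_over_overall(spam_pct, ham_pct)
    print(f"{name}  S/O={s_o:.3f}  FP%={ham_pct:.4f}")
```

Rounded to three places, these reproduce the S/O column of the June
table (0.993, 0.998, 1.000, 0.979, 0.999).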

Which was followed up with more data and discussion:

> On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:

>> high spam + low ham is good from an FP standpoint, but having a "significant"
>> (for your definition thereof) ham hitrate means the score shouldn't be too
>> high.  My handwaving scores would be something like:

[Theo's wild guess scores for Justin's June data:  -- Jeff C.]

>> WS      1.2
>> SC      2.5
>> AB      3.5
>> OB      1.8

Theo then gave some of his own stats on a couple different corpora:

>>         OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>          416072   365031    51041    0.877   0.00    0.00  (all messages)
>>         100.000  87.7327  12.2673    0.877   0.00    0.00  (all messages as %)
>> set1     30.923  35.2466   0.0000    1.000   0.99    0.00  URIBL_SC_SURBL
>> set1     72.231  82.3273   0.0274    1.000   0.98    1.00  URIBL_OB_SURBL
>> set1     19.375  22.0847   0.0000    1.000   0.98    1.00  URIBL_AB_SURBL
>> set1     74.883  85.2939   0.4310    0.995   0.74    0.00  URIBL_WS_SURBL
>> set1      0.001   0.0000   0.0059    0.000   0.48    0.00  URIBL_PH_SURBL
> 
>>         OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>          119215    67094    52121    0.563   0.00    0.00  (all messages)
>>         100.000  56.2798  43.7202    0.563   0.00    0.00  (all messages as %)
>> set3     39.217  69.6605   0.0288    1.000   0.98    1.00  URIBL_OB_SURBL
>> set3     10.340  18.3727   0.0000    1.000   0.97    0.00  URIBL_SC_SURBL
>> set3      5.998  10.6582   0.0000    1.000   0.94    1.00  URIBL_AB_SURBL
>> set3     42.730  75.5522   0.4797    0.994   0.73    0.00  URIBL_WS_SURBL
>> set3      0.008   0.0089   0.0058    0.608   0.49    0.00  URIBL_PH_SURBL
> 
>> so for these results, I'd probably do something like:
> 
>> WS      1.3
>> SC      4.0
>> AB      3.0
>> OB      2.2
> 
>> since the hit rates and S/O are a bit higher for me, related to the fact I ran
>> more spam through than Justin did.

To which I added:

> Those final scores look like an excellent fit to the data to me.

and:

> Also while the PH spam hit rate [from Justin's stats] is low,
> the data is of hand checked phishing scams, which deserve to be
> blocked due to their potential danger and damage.
> 
> Therefore I would tend to give PH a medium-high score like
> 3 to 5.

So we'll probably adjust the default scores on SpamCopURI
to something like:

  WS      1.3
  SC      4.0
  AB      3.0
  OB      2.2
  PH      4.5

and we recommend SpamCopURI users do likewise.  Please be
sure to use the latest version of SpamCopURI with
multi.surbl.org:

  http://sourceforge.net/projects/spamcopuri/
  http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
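
To see what the proposed scores mean in practice, here is a minimal
sketch that adds up the scores for the lists a message hits and compares
the total with the default threshold of 5.0 mentioned in 50_scores.cf.
It assumes no other rules fire, which is rarely true of real mail, and
the short list names and helper functions are illustrative only.

```python
THRESHOLD = 5.0  # default SpamAssassin threshold cited in 50_scores.cf

# Proposed default scores from the list above.
proposed = {"WS": 1.3, "SC": 4.0, "AB": 3.0, "OB": 2.2, "PH": 4.5}


def total_score(lists_hit, scores=proposed):
    """Sum the scores of every SURBL list that matched."""
    return sum(scores[name] for name in lists_hit)


def reaches_threshold(lists_hit, threshold=THRESHOLD):
    """True if the SURBL hits alone would tag the message as spam."""
    return total_score(lists_hit) >= threshold


for hits in (["WS"], ["PH"], ["WS", "OB"], ["WS", "SC"], ["AB", "OB"]):
    print(hits, round(total_score(hits), 1), reaches_threshold(hits))
```

With these values no single list reaches 5.0 on its own, so one FP on
any list cannot tag a message without help from other rules. A WS hit
plus an SC hit (5.3) is enough, while WS plus OB (3.5) is not.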


One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high at about 0.45 to 0.5% across
multiple corpora.  That FP rate needs to be reduced for WS
to be more fully useful.

I think Chris or maybe Raymond suggested that they had a way to
reduce FPs in WS further.  If so, ***please*** try to apply it.
We need to get the FPs to be much less than 0.5%.  The other
lists have FP rates 5 to 50 times lower.
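
As a quick check of those ratios, the sketch below divides the WS ham%
by the ham% of each other list, using only Justin's 25 June figures
quoted earlier. It covers that one snapshot only; Theo's corpora show
zero ham hits for SC and AB, where no finite ratio exists.

```python
# HAM% values copied from Justin's 25 June table.
ham_pct = {"WS": 0.4925, "SC": 0.0334, "AB": 0.0071, "PH": 0.0020, "OB": 0.1001}

# How many times higher the WS FP rate is than each other list's.
ratios = {
    name: ham_pct["WS"] / pct
    for name, pct in ham_pct.items()
    if name != "WS"
}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"WS FP rate is about {ratio:.0f}x that of {name}")
```

On this snapshot the gap runs from roughly 5x (OB) through about 15x
(SC) to well above 50x (AB and PH).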

Basically the higher the FP rate, the less useful a list is.

Does anyone have other corpus stats to share, in particular
FP rates?

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/

Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists

Posted by Jeff Chan <je...@surbl.org>.

On Sunday, September 5, 2004, 10:32:57 AM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

>> Basically the higher the FP rate, the less useful a list is.

> ... or, rather, the lower it ought to be scored.

Yes, but please remember that not everyone has the ability to
"score" their SURBL hits.  Not everyone using SURBLs is using
SpamAssassin.

>> Does anyone have other corpus stats to share, in particular
>> FP rates?

Thanks for sharing your data.  I know this can be a somewhat
painful subject for people, but it's very important to clean
up the false positives and make the lists better and more useful.

> Sure. All of these messages were received in the past 10 days. A lot has
> happened since June. :-)

> WS: 44004/54185s, 61/19150s

>   OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>     73335    54185    19150    0.739   0.00    0.00  (all messages)
>   100.000  73.8870  26.1130    0.739   0.00    0.00  (all messages as %)
>    60.087  81.2107   0.0836    0.999   0.00    0.00  WS_SURBL

> HOWEVER... I decided to go through the ham hits (61 of them), and look
> for false positive domains to submit.

That kind of checking should become a policy.  People who can
do that kind of checking should do it every time.  Every
tool we have for reducing FPs should be used.

Letting FPs in just hurts the usefulness of the lists.

> I found several, but, for the most
> part, they've *already* been cleaned up and are no longer listed in WS.
> (30 out of the 61 were in a massive mailing list thread for a single
> domain that has since been whitelisted).

> And, in that 19K ham corpus, I found the following FPs still listed
> in WS:

> buckeye-express.com   -- Used in a personal email address, looks legit;
>                          7 examples
> nm.ru                 -- Used in a personal email address, looks legit
> advanstar.com         -- Legit uses; found in a well-known dental
>                          newsletter; also personal email address of
>                          one of the editors; 3 messages
> 00fun.com             -- Confirmed, more than one user on our system
>                           sent or received eCards from them
> northstarconferences.com Legit conference host site subscribed to
>                          by two users; 9 messages in this corpus
> mardox.com            -- Search engine; registered 1875 days ago, and
>                           *looks* like the user did actually submit
>                          their site to them.
> postsnet.com          -- Registered exactly one year ago, 51 NANAS,
>                          blank home page, ehh... but I have 4
>                          different legit newsletters with links to
>                          them.
> webspawner.com        -- Created in 1996; free host/email
> npdor.com             -- Surveys; been around since 1999. 103 NANAS,
>                          but they've been advertised by some reputable
>                          "word of the day" mailers (dictionary.com)
>                          Maybe a good candidate for UC. :-) 2
>                          examples
> imninc.com            -- Domain is 507 days old; they do newsletters.
>                          At least one of them is legit. :-)
> worldhealth.net       -- It's 3468 days old today (1995). One of our
>                          users attended a conference of theirs, and
>                          signed up for a newsletter.
> hoteldiscounts.com    -- 2459 days old (1997), found in actual room
>                           booking confirmations for Comfort Inn.

Thanks.  I agree those look like false positives and have
whitelisted all of them across SURBLs.  Signing up for a
newsletter and then forgetting about it does not make a
message spam.

Instead of having these go into SURBLs, they should be checked
**before** they get added.  Hopefully they would be detected
then and not get added to begin with.  Wouldn't that be better?

Should hand-checking catch these as mostly legitimate?

Are we hand-checking?  If not we should!

> (I'll re-post these in another thread, just so everybody sees them).

> AND, I found 2 spams that were incorrectly hand-classified as ham.

> So, if I take those out, the numbers look more like:

> WS: 44006/54187s, 0/19148s

>   OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>     73335    54187    19148    0.739   0.00    0.00  (all messages)
>   100.007  73.8897  26.1103    0.739   0.00    0.00
>    60.087  81.2111   0.0000    1.000   0.00    0.00  WS_SURBL

> Is that more like what you had in mind..? No, I'm not making that up.
> :-)

Looks good, but this corpus is perhaps too small to make
representative measurements for emails in general.  That
said, any reduction in FPs is important and welcome.
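
To put a rough number on "too small": with zero hits in N ham messages,
the true FP rate could still be as high as about 3/N at 95% confidence
(the so-called rule of three). This is a standard statistical bound
brought in here for illustration, not something from the hit-frequency
tools, and it assumes the ham messages are independent, which mailing
list threads clearly are not. The function name is made up.

```python
def zero_hit_upper_bound(n_ham, confidence=0.95):
    """Largest FP rate still consistent with 0 hits in n_ham messages.

    Solves (1 - p) ** n_ham = 1 - confidence for p, assuming each
    message is an independent trial.
    """
    if n_ham <= 0:
        raise ValueError("n_ham must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_ham)


# 19148 is the cleaned-up ham count from Ryan's second table.
bound = zero_hit_upper_bound(19148)
print(f"upper bound on FP rate: {bound * 100:.4f}%")
print(f"rule-of-three estimate: {3 / 19148 * 100:.4f}%")
```

So a clean run on about 19K ham only shows the FP rate is probably below
roughly 0.016%, which is in the same range as the non-zero rates other
corpora report for SC and OB.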

> Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
> hand-check.

> - Ryan

Thanks for your stats and checking, and yes, anyone else
with ham corpora, please check for FPs.

Jeff C.

Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists

Posted by Ryan Thompson <ry...@sasknow.com>.

Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

> Basically the higher the FP rate, the less useful a list is.

... or, rather, the lower it ought to be scored.

> Does anyone have other corpus stats to share, in particular
> FP rates?


Sure. All of these messages were received in the past 10 days. A lot has
happened since June. :-)

WS: 44004/54185s, 61/19150s

  OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
    73335    54185    19150    0.739   0.00    0.00  (all messages)
  100.000  73.8870  26.1130    0.739   0.00    0.00  (all messages as %)
   60.087  81.2107   0.0836    0.999   0.00    0.00  WS_SURBL

HOWEVER... I decided to go through the ham hits (61 of them), and look
for false positive domains to submit. I found several, but, for the most
part, they've *already* been cleaned up and are no longer listed in WS.
(30 out of the 61 were in a massive mailing list thread for a single
domain that has since been whitelisted).

And, in that 19K ham corpus, I found the following FPs still listed
in WS:

buckeye-express.com   -- Used in a personal email address, looks legit;
                         7 examples
nm.ru                 -- Used in a personal email address, looks legit
advanstar.com         -- Legit uses; found in a well-known dental
                         newsletter; also personal email address of
                         one of the editors; 3 messages
00fun.com             -- Confirmed, more than one user on our system
                         sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to
                         by two users; 9 messages in this corpus
mardox.com            -- Search engine; registered 1875 days ago, and
                         *looks* like the user did actually submit
                         their site to them.
postsnet.com          -- Registered exactly one year ago, 51 NANAS,
                         blank home page, ehh... but I have 4
                         different legit newsletters with links to
                         them.
webspawner.com        -- Created in 1996; free host/email
npdor.com             -- Surveys; been around since 1999. 103 NANAS,
                         but they've been advertised by some reputable
                         "word of the day" mailers (dictionary.com)
                         Maybe a good candidate for UC. :-) 2
                         examples
imninc.com            -- Domain is 507 days old; they do newsletters.
                         At least one of them is legit. :-)
worldhealth.net       -- It's 3468 days old today (1995). One of our
                         users attended a conference of theirs, and
                         signed up for a newsletter.
hoteldiscounts.com    -- 2459 days old (1997), found in actual room
                         booking confirmations for Comfort Inn.

(I'll re-post these in another thread, just so everybody sees them).

AND, I found 2 spams that were incorrectly hand-classified as ham.

So, if I take those out, the numbers look more like:

WS: 44006/54187s, 0/19148s

  OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
    73335    54187    19148    0.739   0.00    0.00  (all messages)
  100.007  73.8897  26.1103    0.739   0.00    0.00
   60.087  81.2111   0.0000    1.000   0.00    0.00  WS_SURBL

Is that more like what you had in mind..? No, I'm not making that up.
:-)

Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
hand-check.

- Ryan

-- 
   Ryan Thompson <ry...@sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists

Posted by Jeff Chan <je...@surbl.org>.

On Sunday, September 5, 2004, 3:30:49 AM, Raymond Dijkxhoorn wrote:
> Seeing those data it would be very interesting if we could test a separate
> list. Is that possible? I would like to test the Prolo and Joe's list
> combined, without the rest of the WS list. I can generate the data for a
> test like that. I have seen almost zero FPs in the data I compose, so
> perhaps it's better to separate the lists. I think people would benefit
> from a less FP-stuffed list. The current WS list is just compiled out of
> too many data sources, I think.

If you can make the different lists available to me by rsync,
I can easily set up some temporary local SURBLs for testing
them.  Thank you rbldnsd!  :-)

Unfortunately I don't have my own test corpora, so I need to
rely on the generosity of others who do.  So I'd probably
need to ask Theo, Daniel, Justin or others with corpora to
test against them.

Jeff C.