Posted to users@spamassassin.apache.org by da...@chaosreigns.com on 2011/04/12 03:59:05 UTC

Results you can expect with my IP reputation system

Without contributing any data:
RCVD_IN_IPREP_100 hits 29.104% ham, 0.396% spam.  S/O = 0.013.
RCVD_IN_IPREP_0   hits  0.461% ham, 7.470% spam.  S/O = 0.942.

It looks like there are plenty of rules in active use by SpamAssassin that
do worse.

After uploading a list of which IPs from 100 emails sent spam or ham:
RCVD_IN_IPREP_100 hits 63.568% ham,  0.396% spam.  S/O = 0.006.
RCVD_IN_IPREP_0   hits  0.461% ham, 29.259% spam.  S/O = 0.984.

I don't expect many people to provide data on 3,500 emails, but to show you
where this goes:
RCVD_IN_IPREP_100 hits 90.117% ham,  0.396% spam.  S/O = 0.004.
RCVD_IN_IPREP_0   hits  0.251% ham, 50.283% spam.  S/O = 0.995.

Detailed graph of the progression:
http://www.chaosreigns.com/iprep/results.svg
(Each value has three lines, one per run.  The variance between runs comes
from the random selection of training vs. testing sets.)

These results come from training on data from everyone who has contributed
except myself, and then testing on my own data.  I split my own data in
half, one half for training and one half for testing.  I trained on 1 ham
and 1 spam at a time (so the numbers above assume equal amounts of ham and
spam), and recalculated the scores after each step using the testing half.
Since my own email still makes up a fairly significant portion of the data
I have, I'm hoping that others will actually get better results.
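
For anyone who wants to reproduce the shape of that test, here is a rough
Python sketch of just the split and the one-ham-one-spam schedule.  It is
not my actual script: the function names and the placeholder addresses are
made up for illustration, and the reputation calculation itself is left out
entirely.

```python
import random

def split_half(items, rng):
    """Shuffle a copy of items and return (training_half, testing_half)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def training_steps(ham_train, spam_train):
    """Yield growing training sets, adding 1 ham and 1 spam per step."""
    steps = min(len(ham_train), len(spam_train))
    for n in range(1, steps + 1):
        yield ham_train[:n] + spam_train[:n]

# Placeholder data (documentation address ranges), not real results.
rng = random.Random(0)
ham = [("192.0.2.%d" % i, "ham") for i in range(10)]
spam = [("198.51.100.%d" % i, "spam") for i in range(10)]

ham_train, ham_test = split_half(ham, rng)
spam_train, spam_test = split_half(spam, rng)

for trained in training_steps(ham_train, spam_train):
    # A real run would upload `trained`, then re-score the rules against
    # ham_test and spam_test; that part is omitted here.
    print(len(trained), "emails trained")
```

Re-running with a different seed gives a different split, which is where
the run-to-run variance comes from.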


SpamAssassin rules to use it (currently via DNS), and instructions for
contributing data, are here:  http://www.chaosreigns.com/iprep/

I'm still anxious to get data from more people to increase the usefulness
of this for everybody.  (Just a list of IPs, time stamps, and whether they
were spam or not, collected and uploaded by my script.)  If anything is
at all unclear, please ask.  This is entirely free to everyone.


S/O is a score used by SpamAssassin's ruleqa to judge the usefulness of a
test.  Numbers closer to 0.000 are better for finding ham, and numbers
closer to 1.000 are better for finding spam.  It's calculated as
(% spam hits) / (% spam hits + % ham hits)
hence Spam / Overall.
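
As a quick check of the arithmetic, a small Python sketch of that formula
(the function name s_o is just mine for illustration, it is not part of
ruleqa):

```python
def s_o(spam_pct, ham_pct):
    """Spam / Overall: (% spam hits) / (% spam hits + % ham hits)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit no mail, S/O is undefined")
    return spam_pct / total

# The "without contributing any data" numbers from the top of this post.
print("%.3f" % s_o(0.396, 29.104))  # RCVD_IN_IPREP_100 -> 0.013
print("%.3f" % s_o(7.470, 0.461))   # RCVD_IN_IPREP_0   -> 0.942
```

Note that the inputs are percentages of ham and of spam separately, so the
result does not depend on how much ham versus spam is in the corpus.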

-- 
"You will need: a big heavy rock, something with a bit of a swing to it...
perhaps Mars" - How to destroy the Earth
http://www.ChaosReigns.com