Posted to dev@spamassassin.apache.org by da...@chaosreigns.com on 2011/03/30 03:09:09 UTC

Script to collect IP reputation data from SA mass-check targets

I'd like everybody to run it in a daily cron job (along with your
mass-checks, if you're doing them).

http://www.chaosreigns.com/iprep/dl/iprep.pl

Works like:

./iprep.pl ham:dir:~/masscheckwork/ham spam:dir:~/masscheckwork/spam/

Where the arguments are the same as for mass-check.

Config file is ~/.ipreprc :
$trusted_networks = '';
$user = 'username';
$pass = 'password';

Email me for an account.  There are more detailed instructions in the
Perl script (such as argument definitions, for those not familiar with
mass-check targets).

It uploads IP address and date of each ham and spam to my server via rsync.
(Everybody gets their own chroot jail, and I consider the data
confidential.)

I'm planning to aggregate the data and make it available as:

IP <percent ham> <count>

Where <count> is a logarithm of the total number of emails seen from that
IP, and <percent ham> is normalized the same way as the s/o value in ruleqa.
Old values will receive less weight than new values
(maybe 0.99^(age in days)?).
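
As a rough illustration of the aggregation described above, here is a minimal Python sketch. It is not the actual iprep code. It rests on several assumptions: that "normalized the same as s/o" means dividing the ham and spam weights by the total ham and spam corpus sizes before taking the ratio, that the logarithm is base 10, and that the 0.99^(age in days) decay applies to the ham/spam ratio but not to the count. The function and parameter names are hypothetical.

```python
import math
from collections import defaultdict

DECAY = 0.99  # per-day decay floated in the post; not a settled value


def aggregate(observations, total_ham, total_spam, decay=DECAY):
    """Aggregate (ip, is_ham, age_days) tuples into {ip: (percent_ham, log_count)}.

    total_ham / total_spam are the corpus sizes used for normalization
    (an assumption about what "normalized like s/o" means).
    """
    ham_weight = defaultdict(float)
    spam_weight = defaultdict(float)
    seen = defaultdict(int)

    for ip, is_ham, age_days in observations:
        weight = decay ** age_days  # older mail counts for less
        if is_ham:
            ham_weight[ip] += weight
        else:
            spam_weight[ip] += weight
        seen[ip] += 1  # raw, undecayed count (assumption)

    result = {}
    for ip, count in seen.items():
        # Normalize by corpus size so a ham-heavy or spam-heavy corpus
        # does not skew the percentage.
        ham_rate = ham_weight[ip] / total_ham if total_ham else 0.0
        spam_rate = spam_weight[ip] / total_spam if total_spam else 0.0
        denom = ham_rate + spam_rate
        percent_ham = 100.0 * ham_rate / denom if denom else 0.0
        result[ip] = (percent_ham, math.log10(count))  # base 10 is an assumption
    return result
```

With equal corpus sizes, an IP seen once as fresh ham and once as 100-day-old spam comes out above 50% ham, because the old spam carries a weight of only 0.99^100.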

I kind of like the idea of only making the data available via rsync.  Seems
like it would reduce bandwidth usage, relative to serving via DNS?  


Next I'm planning to create a plugin to create tests to record values
(like iprep_ham_<percent>, iprep_count_<count>).  Then I can use them
to determine what tests would be most useful.

Output from my own corpora:
http://www.chaosreigns.com/iprep/iprep.txt


With 2618 hams and 2956 spams, there were only *two* IP addresses that
were not 100% spam or 100% ham.  Both belong to Google.

For IPv6, I'm thinking about aggregating at /48, just because that's what
he.net is letting me allocate.  That leaves 80 bits of addresses.  This is
an attempt to deal with a problem Warren worded well:  "IPv6 makes it
possible to send one spam per IPv6 address and never run out of IP
addresses".
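
To make the /48 idea concrete, here is a small Python sketch using the standard `ipaddress` module. It maps every IPv6 address to its enclosing /48 and leaves IPv4 addresses as they are. The function name `aggregation_key` is hypothetical, and treating IPv4 addresses individually is an assumption, not something stated above.

```python
import ipaddress


def aggregation_key(addr, v6_prefix=48):
    """Return the key under which reputation data for addr would be pooled.

    IPv6 addresses collapse to their enclosing /48 (the prefix length
    proposed above); IPv4 addresses are kept as-is (an assumption).
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        # strict=False lets us pass a host address and get its network.
        net = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
        return str(net)
    return str(ip)
```

For example, 2001:db8:1234:5678::1 and 2001:db8:1234:ffff::2 both map to 2001:db8:1234::/48, so a sender rotating through the remaining 80 bits still accumulates a single reputation entry.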

-- 
"For every complex problem, there is a solution that is simple, neat,
and wrong." - H. L. Mencken
http://www.ChaosReigns.com

Re: [SA-dev] Script to collect IP reputation data from SA mass-check targets

Posted by da...@chaosreigns.com.
On 03/30, Adam Katz wrote:
> Be careful about measuring the usefulness of that data; you'll have to
> measure samples against each other, and even then you will have
> imperfect results.

If this ever gets added to the mass-check tests, I'll be more than happy to
create a separate set of the data based only on data from people who are
not contributing to mass-checks.  Right now, I only have data from 1796
emails that aren't run through mass-check, so it's not worth it.  But I'm
keeping all input data separated by who contributed it, so a special
version for mass-check folks will be easy.  

I just posted some test results to the users list that I'm pretty happy
with.  I'd really like to get more data though.

Graph of the results:  http://www.chaosreigns.com/iprep/results.svg
It is based on training on all corpora except mine, then training on mine
one spam and one ham at a time, calculating the accuracy at each step using
a separate test set of my email.  The three sets of lines come from three
runs using randomly selected training and scoring sets.

Project web page:  http://www.chaosreigns.com/iprep/

-- 
"I don't want to die... just yet... not while there's... women."
- J. Matthew Root, 8/23/02 (http://www.jmrart.com/)
http://www.ChaosReigns.com

Re: [SA-dev] Script to collect IP reputation data from SA mass-check targets

Posted by Adam Katz <an...@khopis.com>.
On 03/29/2011 06:09 PM, darxus@chaosreigns.com wrote:
> I'd like everybody to run it in a daily cron job (along with your
> mass-checks, if you're doing them).

Be careful about measuring the usefulness of that data; you'll have to
measure samples against each other, and even then you will have
imperfect results.