Posted to dev@spamassassin.apache.org by da...@chaosreigns.com on 2011/03/30 03:09:09 UTC
Script to collect IP reputation data from SA mass-check targets
I'd like everybody to run it in a daily cron job (along with your
mass-checks, if you're doing them).
http://www.chaosreigns.com/iprep/dl/iprep.pl
Works like:
./iprep.pl ham:dir:~/masscheckwork/ham spam:dir:~/masscheckwork/spam/
Where the arguments are the same as for mass-check.
Config file is ~/.ipreprc :
$trusted_networks = '';
$user = 'username';
$pass = 'password';
Email me for an account. There are more detailed instructions in the
Perl script (such as argument definitions, for those not familiar with
mass-check targets).
It uploads the IP address and date of each ham and spam to my server via rsync.
(Everybody gets their own chroot jail, and I consider the data
confidential.)
I'm planning to aggregate the data and make it available as:
IP <percent ham> <count>
Where <count> is a logarithm of the total number of emails seen from that
IP, and <percent ham> is normalized the same way as the s/o value in ruleqa.
Old values will receive less weight than new values.
(Maybe 0.99^(age in days)?)
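
The aggregation described above could be sketched roughly as follows. This is only an illustration, not the code behind iprep: the function name `aggregate` and the input tuple layout are hypothetical, the logarithm base (10) is a guess since the post does not name one, and "normalized the same as s/o" is assumed to mean dividing each class's weighted count by that class's corpus size before taking the ratio. The 0.99 per-day decay is the tentative value floated in the post.

```python
import math
from collections import defaultdict

DECAY = 0.99  # tentative per-day weight from the post: 0.99^(age in days)


def aggregate(observations, total_ham, total_spam, decay=DECAY):
    """Return {ip: (percent_ham, count)} from (ip, is_ham, age_days) tuples.

    total_ham / total_spam are the corpus sizes used for normalization
    (an assumption about what "normalized like s/o" means).
    """
    if total_ham <= 0 or total_spam <= 0:
        raise ValueError("both corpus sizes must be positive")

    ham = defaultdict(float)   # age-weighted ham count per IP
    spam = defaultdict(float)  # age-weighted spam count per IP
    seen = defaultdict(int)    # raw number of emails per IP

    for ip, is_ham, age_days in observations:
        weight = decay ** age_days  # older mail counts for less
        if is_ham:
            ham[ip] += weight
        else:
            spam[ip] += weight
        seen[ip] += 1

    rows = {}
    for ip, n in seen.items():
        ham_rate = ham[ip] / total_ham
        spam_rate = spam[ip] / total_spam
        percent_ham = 100.0 * ham_rate / (ham_rate + spam_rate)
        rows[ip] = (percent_ham, math.log10(n))  # base 10 is an assumption
    return rows
```

With equal corpus sizes, an IP seen once as ham and once as spam on the same day comes out at 50% ham; if the spam corpus is three times larger than the ham corpus, the same IP comes out at 75% ham, because a single spam is a smaller share of its corpus.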
I kind of like the idea of only making the data available via rsync. Seems
like it would reduce bandwidth usage, relative to serving via DNS?
Next I'm planning to create a plugin to create tests to record values
(like iprep_ham_<percent>, iprep_count_<count>). Then I can use them
to determine what tests would be most useful.
Output from my own corpora:
http://www.chaosreigns.com/iprep/iprep.txt
With 2618 hams and 2956 spams, there were only *two* IP addresses that
were not 100% spam or 100% ham. Both belong to Google.
For IPv6, I'm thinking about aggregating at /48, just because that's what
he.net is letting me allocate. That leaves 80 bits of addresses within
each /48. This is
an attempt to deal with a problem Warren worded well: "IPv6 makes it
possible to send one spam per IPv6 address and never run out of IP
addresses".
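
A minimal sketch of the /48 aggregation idea, using Python's standard `ipaddress` module. The function name `reputation_key` is hypothetical, and leaving IPv4 addresses unaggregated is an assumption; the post only discusses the IPv6 prefix length.

```python
import ipaddress


def reputation_key(address, v6_prefix=48):
    """Map an address to the key its reputation would be stored under.

    IPv6 addresses collapse to their enclosing /48 (the prefix length
    proposed in the post); IPv4 addresses are kept as-is (an assumption).
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 6:
        # strict=False masks off the host bits instead of raising an error
        network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
        return str(network)
    return str(ip)
```

Under this scheme `2001:db8:1234:5678::1` and `2001:db8:1234:ffff::2` both map to `2001:db8:1234::/48`, so a sender rotating through addresses inside one /48 still accumulates a single reputation entry.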
--
"For every complex problem, there is a solution that is simple, neat,
and wrong." - H. L. Mencken
http://www.ChaosReigns.com
Re: [SA-dev] Script to collect IP reputation data from SA
mass-check targets
Posted by da...@chaosreigns.com.
On 03/30, Adam Katz wrote:
> Be careful about measuring the usefulness of that data; you'll have to
> measure samples against each other, and even then you will have
> imperfect results.
If this ever gets added to the mass-check tests, I'll be more than happy to
create a separate set of the data based only on data from people who are
not contributing to mass-checks. Right now, I only have data from 1796
emails that aren't run through mass-check, so it's not worth it. But I'm
keeping all input data separated by who contributed it, so a special
version for mass-check folks will be easy.
I just posted some test results to the users list that I'm pretty happy
with. I'd really like to get more data though.
Graph of the results: http://www.chaosreigns.com/iprep/results.svg
The results are based on training on all corpora except mine, then training
on mine one spam and one ham at a time, calculating the accuracy at each step
using a separate test set of my email. There are 3 sets of lines, from 3 runs
using randomly selected training and scoring sets.
Project web page: http://www.chaosreigns.com/iprep/
--
"I don't want to die... just yet... not while there's... women."
- J. Matthew Root, 8/23/02 (http://www.jmrart.com/)
http://www.ChaosReigns.com
Re: [SA-dev] Script to collect IP reputation data from SA mass-check
targets
Posted by Adam Katz <an...@khopis.com>.
On 03/29/2011 06:09 PM, darxus@chaosreigns.com wrote:
> I'd like everybody to run it in a daily cron job (along with your
> mass-checks, if you're doing them).
Be careful about measuring the usefulness of that data; you'll have to
measure samples against each other, and even then you will have
imperfect results.