Posted to users@spamassassin.apache.org by da...@chaosreigns.com on 2011/03/30 18:44:20 UTC

Please report IPs delivering ham and spam with this script

My plan is to create another free reputation service, like a combination of
a whitelist and a blacklist, except providing the actual data instead of
just yes/no/maybe.  To help SpamAssassin filtering, obviously.

The data I'm planning to provide is, for every IP address, the percentage
of email from it which was ham (normalized like the S/O value in
SpamAssassin ruleqa), and a logarithm of the total count of recent emails
from that IP.  Output data based on my own email:

http://www.chaosreigns.com/iprep/iprep.txt
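
A minimal sketch of how the two per-IP values described above might be
computed.  This is not the code behind iprep.txt: the function name, the
base-10 logarithm, and the exact normalization (dividing by corpus size, in
the spirit of S/O) are all assumptions made for illustration.

```python
import math


def ip_reputation(ham_from_ip, spam_from_ip, total_ham, total_spam):
    """Return (ham_percent, log_count) for one IP, or None if it sent nothing.

    Hypothetical helper: the normalization and the log base are assumptions.
    """
    if ham_from_ip + spam_from_ip == 0:
        return None
    # Express each count as a share of its own corpus, so a corpus holding
    # more spam than ham (or the reverse) does not skew the percentage.
    ham_rate = ham_from_ip / total_ham if total_ham else 0.0
    spam_rate = spam_from_ip / total_spam if total_spam else 0.0
    ham_percent = round(100.0 * ham_rate / (ham_rate + spam_rate))
    # Logarithm of the raw message count seen from this IP.
    log_count = math.log10(ham_from_ip + spam_from_ip)
    return ham_percent, log_count
```

With the corpus sizes quoted in this message (2618 hams, 2956 spams), an IP
that sent 5 of each would come out slightly above 50% ham under this
normalization, because the ham corpus is the smaller one.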


With my 2618 hams and 2956 spams, there were only *two* IP addresses that
were not 100% spam or 100% ham (both belong to Google).  This kind of thing
is why blacklists and whitelists are useful for predicting whether an email
is spam or ham.  The highest ranked test in SpamAssassin is RCVD_IN_XBL, a
spamhaus.org blacklist.  #7 is RCVD_IN_PSBL, and #11 is RCVD_IN_DNSWL_HI,
which is also the highest ranking "nice" rule.


To do this, I need data from you.

Create a folder containing only email you've confirmed is ham, and another
containing what you've confirmed is spam.

http://www.chaosreigns.com/iprep/dl/iprep.pl

./iprep.pl ham:dir:~/masscheckwork/ham spam:dir:~/masscheckwork/spam/

The arguments are the same as the "targets" used by SpamAssassin's
mass-check (using its perl modules):

    <class>:<format>:<location>
    <class>       is "spam" or "ham"
    <format>      is "dir", "file", "mbx", "mbox", or "detect"
    <location>    is a file or directory name.  globbing of ~ and * is supported

You can specify many targets at once.  
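
For illustration only, the target syntax above could be parsed along these
lines.  This is a hedged sketch, not how iprep.pl or mass-check does it; the
function name parse_target is hypothetical.

```python
import glob
import os

CLASSES = {"spam", "ham"}
FORMATS = {"dir", "file", "mbx", "mbox", "detect"}


def parse_target(target):
    """Split '<class>:<format>:<location>' and expand ~ and * in location."""
    # Split on the first two colons only, so a location may contain colons.
    parts = target.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected <class>:<format>:<location>: %r" % target)
    cls, fmt, location = parts
    if cls not in CLASSES:
        raise ValueError("unknown class: %r" % cls)
    if fmt not in FORMATS:
        raise ValueError("unknown format: %r" % fmt)
    expanded = os.path.expanduser(location)
    # Keep the literal path when the glob matches nothing.
    locations = sorted(glob.glob(expanded)) or [expanded]
    return cls, fmt, locations
```

Several targets can then be handled by calling the function once per
command-line argument.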

Please run it as a daily cron job.

The required ~/.ipreprc config file:
$trusted_networks = '<space delimited list of trusted hosts>';
$user = 'username';
$pass = 'password';

$trusted_networks is very important, and needs to contain everything from
both your trusted_networks and internal_networks values from SpamAssassin,
which are documented here:  
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#network_test_options
http://wiki.apache.org/spamassassin/TrustPath
This is to prevent reporting the IP of your trusted relays instead of the
actual IP sending the email.  
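
A rough sketch of why $trusted_networks matters: walking the relay chain
from the most recent hop and skipping anything trusted yields the IP that
actually handed the mail to your systems.  The function name and the input
format (a list of relay IPs, newest first) are assumptions for illustration;
this does not parse Received headers, and it only accepts plain addresses
and CIDR notation, not every form SpamAssassin allows.

```python
import ipaddress


def first_untrusted_ip(relay_ips, trusted_networks):
    """Return the first relay IP not covered by the trusted list, or None.

    relay_ips: IP strings ordered from most recent hop to oldest.
    trusted_networks: space-delimited addresses or CIDR blocks.
    """
    nets = [ipaddress.ip_network(n, strict=False)
            for n in trusted_networks.split()]
    for ip in relay_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in nets):
            return ip
    # Every hop was trusted: there is nothing to report.
    return None
```

If a trusted relay were missing from the list, the function would return
that relay's address, which is exactly the misreport the config value is
meant to prevent.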

Email me to get an account to upload the data.  Please email me from a
non-freemail account, one not listed in
http://svn.apache.org/repos/asf/spamassassin/trunk/rules/20_freemail_domains.cf
Major examples of freemail accounts, which I don't want you to email me from,
are:  gmail.com, yahoo.com, and hotmail.com.  This is just to make it
slightly harder for spammers to send me bad data.  And if you're on this
list, I know you have a non-freemail account.

I won't tell anybody your email address, and I consider the uploaded data
confidential.


I'm thinking about providing the data only via rsync, instead of via DNS,
because I think that should reduce network load.  I'd create a plugin that
would grab the data directly.


Just as a disclosure, I have been involved with dnswl.org since November
2006.  I have no plan to use any of their data, other than to look for
problems in my data.

-- 
"Let's just say that if complete and utter chaos was lightning, then
he'd be the sort to stand on a hilltop in a thunderstorm wearing wet
copper armour and shouting 'All gods are bastards'." - The Color of Magic
http://www.ChaosReigns.com

Re: Please report IPs delivering ham and spam with this script

Posted by da...@chaosreigns.com.
On 04/01, David F. Skoll wrote:
> o 536,596 (5.8%) sent _only_ ham
> 
> o 7,821,574 (86%) sent _only_ spam
> 
> o The remaining 744,705 (8.2%) sent a mixture.  Most Yahoo! servers are in
>   this category.

Sounds reasonable.  It's nice to see the numbers, thanks.

> You saw less than 0.05% sending a mixture, which means you are probably
> not getting a good sample.

Yup.  I don't have enough data.  That's why I'm asking for more.

-- 
"Life is either a daring adventure or it is nothing at all."
- Helen Keller
http://www.ChaosReigns.com

Re: Please report IPs delivering ham and spam with this script

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Fri, 1 Apr 2011 14:34:16 -0400
darxus@chaosreigns.com wrote:

> Out of the 86,899 IPs I have data for, all but 38 are either 100%
> spam or 100% ham,

That sounds a bit funny.

We have data on over 17 million IP addresses (collected using
http://mimedefang.org/reputation).  Of those, about 9 million report at
least one ham or one spam -- the remainder either never made it past
greylisting or only tried emailing nonexistent recipient addresses.

Of those 9,102,875 hosts:

o 536,596 (5.8%) sent _only_ ham

o 7,821,574 (86%) sent _only_ spam

o The remaining 744,705 (8.2%) sent a mixture.  Most Yahoo! servers are in
  this category.

You saw less than 0.05% sending a mixture, which means you are probably
not getting a good sample.

Regards,

David.

PS: If anyone wants to contribute to and download *our* reputation
list, please see http://mimedefang.org/reputation and email me
off-list.  Please be aware that unlike darxus' list, ours is not
freely-available, though we generally give free downloads to
organizations willing to feed us reputation data if they do a
statistically-useful amount of mail (>= 50K messages/day).



Re: New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by da...@chaosreigns.com.
On 04/01, Michael Scheidell wrote:
> On 4/1/11 2:34 PM, darxus@chaosreigns.com wrote:
> >header   RCVD_IN_IPREPDNS_0         eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.0')
> >describe RCVD_IN_IPREPDNS_0         Sender listed at http://www.chaosreigns.com/iprep/, 0% ham
> >tflags   RCVD_IN_IPREPDNS_0         net
> >
> might actually need a quantity qualifier.
> 
> (if this ip is 0 % ham... does that actually mean it is 100% spam?)
> 
> or does that mean that I (so far) only saw one email hit it, and it is spam?

It means that all of the email seen from that IP so far has been spam.
Which may only have been one email.

> other than this is marking 'spam rates' and DCC commercial does the
> same thing for 'bulk' rates,  what is the difference between this
> and DCC?

The "commercial" part.  

> maybe 2nd or 3rd octet could contain 'confidence factor'.. eg:

It does, actually.  A logarithm of the count of emails seen from that IP
(newer emails weighted more than old emails, and scaled up so small old
counts are greater than 0).

I haven't studied the data enough to figure out what threshold is best for
what, and I don't think the existing rule definition language provides
a good way to specify a range.

Also, ignoring it is working quite well.

-- 
"I refuse to tip toe through life only to arrive safely at death."
http://www.ChaosReigns.com

Re: New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by Michael Scheidell <mi...@secnap.com>.
On 4/1/11 2:34 PM, darxus@chaosreigns.com wrote:
> header   RCVD_IN_IPREPDNS_0         eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.0')
> describe RCVD_IN_IPREPDNS_0         Sender listed at http://www.chaosreigns.com/iprep/, 0% ham
> tflags   RCVD_IN_IPREPDNS_0         net
>
might actually need a quantity qualifier.

(if this ip is 0 % ham... does that actually mean it is 100% spam?)

or does that mean that I (so far) only saw one email hit it, and it is spam?

other than this is marking 'spam rates' and DCC commercial does the same 
thing for 'bulk' rates,  what is the difference between this and DCC?

note: dcc uses (for large installs) a local, VLDB that they 'sync' 
(flood they call it) in real time.  but it not only tells you the bulk 
rate of the sender's ip, but the 'bulk hit rate' for the email you just got.

sounds similar, but bulk vs spam.

(and its inverse.. you collect percentages of HAM.  they collect 
percentages of BULK).

maybe 2nd or 3rd octet could contain 'confidence factor'.. eg:

some sliding scale of how many actual emails you have seen?



-- 
Michael Scheidell, CTO
o: 561-999-5000
d: 561-948-2259
ISN: 1259*1300
 >*| *SECNAP Network Security Corporation

    * Best Intrusion Prevention Product, Networks Product Guide
    * Certified SNORT Integrator
    * Hot Company Award, World Executive Alliance
    * Best in Email Security, 2010 Network Products Guide
    * King of Spam Filters, SC Magazine


Re: New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by Mark Martinec <Ma...@ijs.si>.
> > Do not forget to backslash-quote dots in a regular expression
> > if you mean a literal dot instead of 'any character'.
> 
> Eep.  That was copied from existing rules.  I believe you're right, and
> there are a bunch of rules that need more escaping.  Thanks.

True, there is a bunch of rules that need more escaping.
It is noted somewhere in the bug tracking (but not as a standalone ticket),
and needs a volunteer to do the cleaning :)

  Mark

Re: New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by da...@chaosreigns.com.
On 04/01, Mark Martinec wrote:
> > eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.100') describe
> 
> Do not forget to backslash-quote dots in a regular expression
> if you mean a literal dot instead of 'any character'.

Eep.  That was copied from existing rules.  I believe you're right, and
there are a bunch of rules that need more escaping.  Thanks.

-- 
"Will I ever learn? I hope not, I'm having too much fun."
- Brent "Minime" Avis, motorcycle.com
http://www.ChaosReigns.com

Re: New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by da...@chaosreigns.com.
On 04/01, Mark Martinec wrote:
> > eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.100') describe
> 
> Do not forget to backslash-quote dots in a regular expression
> if you mean a literal dot instead of 'any character'.

Updated rules (thanks again):


ifplugin Mail::SpamAssassin::Plugin::DNSEval
header   __RCVD_IN_IPREPDNS     eval:check_rbl('iprep-firsttrusted', 'iprep.chaosreigns.com.')
tflags   __RCVD_IN_IPREPDNS     nice net

header   RCVD_IN_IPREPDNS_100   eval:check_rbl_sub('iprep-firsttrusted', '^127\.\d+\.\d+\.100$')
describe RCVD_IN_IPREPDNS_100   Sender listed at http://www.chaosreigns.com/iprep/, 100% ham
tflags   RCVD_IN_IPREPDNS_100   nice net

header   RCVD_IN_IPREPDNS_50    eval:check_rbl_sub('iprep-firsttrusted', '^127\.\d+\.\d+\.50$')
describe RCVD_IN_IPREPDNS_50    Sender listed at http://www.chaosreigns.com/iprep/, 50% ham
tflags   RCVD_IN_IPREPDNS_50    nice net

header   RCVD_IN_IPREPDNS_0     eval:check_rbl_sub('iprep-firsttrusted', '^127\.\d+\.\d+\.0$')
describe RCVD_IN_IPREPDNS_0     Sender listed at http://www.chaosreigns.com/iprep/, 0% ham
tflags   RCVD_IN_IPREPDNS_0     net

meta     RCVD_NOT_IN_IPREPDNS   ( ! RCVD_IN_IPREPDNS_100 && ! RCVD_IN_IPREPDNS_50 && ! RCVD_IN_IPREPDNS_0 && ! NO_RELAYS )
describe RCVD_NOT_IN_IPREPDNS   Sender not listed at http://www.chaosreigns.com/iprep/
tflags   RCVD_NOT_IN_IPREPDNS   net

score    RCVD_IN_IPREPDNS_100   -0.1
score    RCVD_IN_IPREPDNS_50    -0.0001
score    RCVD_IN_IPREPDNS_0     0.1
score    RCVD_NOT_IN_IPREPDNS   0.0001
endif
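
To illustrate the escaping point, here is a small sketch comparing the
unescaped pattern from the earlier rules with the escaped, anchored one
above.  It uses Python's re module rather than SpamAssassin itself; the
constructs involved behave the same way in Perl regular expressions, and the
sample DNS answers are made up.

```python
import re

UNESCAPED = r"127.\d+.\d+.0"        # dots match any character, no anchors
ESCAPED = r"^127\.\d+\.\d+\.0$"     # literal dots, whole answer must match


def matches(pattern, answer):
    """True if the pattern is found anywhere in the answer string."""
    return re.search(pattern, answer) is not None


# A hypothetical answer whose last octet is 50, not 0.
answer = "127.0.100.50"
print(matches(UNESCAPED, answer))   # the "0% ham" pattern wrongly matches
print(matches(ESCAPED, answer))     # the corrected pattern does not
```

In the unescaped pattern the third dot can consume a digit, so the pattern
is satisfied by the prefix "127.0.100" and never looks at the last octet.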


-- 
"Go forth, and be excellent to one another." - http://www.jhuger.com/fredski.php
http://www.ChaosReigns.com

Re: New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by Mark Martinec <Ma...@ijs.si>.
> eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.100') describe

Do not forget to backslash-quote dots in a regular expression
if you mean a literal dot instead of 'any character'.

  Mark

New DNS white/blacklist + spamassassin rules Re: Please report IPs delivering ham and spam with this script

Posted by da...@chaosreigns.com.
While I still plan for this to primarily be used via rsync and a
spamassassin plugin, I've loaded the data into DNS records and created
spamassassin rules so it can easily be tested now.  It's updating
automatically once a day.

I'm hoping this will encourage people to contribute data.  Because now you
should get an immediate improvement in your spam filtration, based on data
you've provided on what IPs send you ham and spam.  

More info, including the script to submit data (either from spam/ham
folders, or individual emails piped to standard input) here:
http://www.chaosreigns.com/iprep/

The spamassassin rules:


ifplugin Mail::SpamAssassin::Plugin::DNSEval
header  __RCVD_IN_IPREP   eval:check_rbl('iprep-firsttrusted', 'iprep.chaosreigns.com.')
tflags  __RCVD_IN_IPREP   nice net

header   RCVD_IN_IPREPDNS_100       eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.100')
describe RCVD_IN_IPREPDNS_100       Sender listed at http://www.chaosreigns.com/iprep/, 100% ham
tflags   RCVD_IN_IPREPDNS_100       nice net

header   RCVD_IN_IPREPDNS_50        eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.50')
describe RCVD_IN_IPREPDNS_50        Sender listed at http://www.chaosreigns.com/iprep/, 50% ham
tflags   RCVD_IN_IPREPDNS_50        nice net

header   RCVD_IN_IPREPDNS_0         eval:check_rbl_sub('iprep-firsttrusted', '127.\d+.\d+.0')
describe RCVD_IN_IPREPDNS_0         Sender listed at http://www.chaosreigns.com/iprep/, 0% ham
tflags   RCVD_IN_IPREPDNS_0         net

meta     RCVD_NOT_IN_IPREPDNS       ( ! RCVD_IN_IPREPDNS_100 && ! RCVD_IN_IPREPDNS_50 && ! RCVD_IN_IPREPDNS_0 && ! NO_RELAYS )
describe RCVD_NOT_IN_IPREPDNS       Sender not listed at http://www.chaosreigns.com/iprep/
tflags   RCVD_NOT_IN_IPREPDNS       net

score RCVD_IN_IPREPDNS_100 -0.1
score RCVD_IN_IPREPDNS_50  -0.0001
score RCVD_IN_IPREPDNS_0    0.1
score RCVD_NOT_IN_IPREPDNS  0.0001
endif
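
As a rough illustration of what these rules ask DNS: a DNS list lookup
reverses the octets of the IP and prepends them to the zone, and the rules
above read the percentage of ham from the last octet of the 127.x.y.z
answer.  The helper names below are hypothetical, the sketch performs no
network lookup, and it deliberately leaves the middle octets uninterpreted.

```python
import ipaddress

ZONE = "iprep.chaosreigns.com"


def query_name(ip, zone=ZONE):
    """Build the DNS name to look up for an IPv4 address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def ham_percent(answer):
    """Read the ham percentage from the last octet of a 127.x.y.z answer."""
    first, _second, _third, last = (int(p) for p in answer.split("."))
    if first != 127:
        raise ValueError("unexpected answer: %r" % answer)
    return last
```

For example, 192.0.2.10 would be looked up as
10.2.0.192.iprep.chaosreigns.com, and an answer ending in .100 corresponds
to RCVD_IN_IPREPDNS_100.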



For people not contributing data, this is not likely to be useful yet.

Out of the 86,899 IPs I have data for, all but 38 are either 100% spam or
100% ham, so an IP's history is a great predictor of what the next email
from it will be.  This is why blacklists and whitelists, including
spamassassin's AWL (which is another combination of both), are nothing new.

The advantages I'm providing over SA's AWL are:
1) It's based on human verified ham and spam, not SA's previous opinions of
   emails.
2) Shared knowledge from other people's email.

What I hope to be an advantage over dnswl.org, which I've been involved in,
is increased automation.


Here's a test I ran using only the last 500 of my own emails.  All hand
categorized as spam or ham, and sorted by received date.  One by one it
learns the IP as a ham source, spammer, or mix, and using what it has
learned, guesses what the next email is.  Every 100 emails it reports its
success rate for the last 100 emails:

$ ./progress.pl
Rank 100, hit 51.7647058823529% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 0% of spam.
Rank none, hit 48.2352941176471% of ham, hit 100% of spam.

Rank 100, hit 76% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 28% of spam.
Rank none, hit 24% of ham, hit 72% of spam.

Rank 100, hit 72.3684210526316% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 4.16666666666667% of spam.
Rank none, hit 27.6315789473684% of ham, hit 95.8333333333333% of spam.

Rank 100, hit 79.4520547945205% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 48.1481481481481% of spam.
Rank none, hit 20.5479452054795% of ham, hit 51.8518518518519% of spam.

Rank 100, hit 79.2682926829268% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 27.7777777777778% of spam.
Rank none, hit 20.7317073170732% of ham, hit 72.2222222222222% of spam.


So after 400 emails, RCVD_IN_IPREPDNS_100 is hitting 79% of ham and no
spam.  I don't think anything else spamassassin uses can do this well.
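
The replay test described above can be sketched roughly as follows.  This is
not progress.pl: the function names are hypothetical, and lumping every
mixed-history IP into rank "50" is an assumption made to keep the example
short.

```python
from collections import defaultdict

RANKS = ("100", "50", "0", "none")


def rank_of(ham, spam):
    """Rank an IP from the ham and spam counts seen from it so far."""
    if ham + spam == 0:
        return "none"
    if spam == 0:
        return "100"
    if ham == 0:
        return "0"
    return "50"  # assumption: any mixture is treated as one bucket


def percentages(hits):
    """Convert per-rank hit counts into percent of ham and percent of spam."""
    totals = {k: sum(hits[r][k] for r in RANKS) for k in ("ham", "spam")}
    return {r: {k: (100.0 * hits[r][k] / totals[k] if totals[k] else 0.0)
                for k in ("ham", "spam")}
            for r in RANKS}


def replay(emails, window=100):
    """emails: (ip, is_ham) pairs in received order.  One report per window."""
    seen = defaultdict(lambda: [0, 0])  # ip -> [ham count, spam count]
    hits = {r: {"ham": 0, "spam": 0} for r in RANKS}
    reports = []
    for n, (ip, is_ham) in enumerate(emails, start=1):
        kind = "ham" if is_ham else "spam"
        # Guess first, using only what earlier emails have taught us.
        hits[rank_of(*seen[ip])][kind] += 1
        # Then learn from the hand-assigned label.
        seen[ip][0 if is_ham else 1] += 1
        if n % window == 0:
            reports.append(percentages(hits))
            hits = {r: {"ham": 0, "spam": 0} for r in RANKS}
    return reports
```

Each report maps a rank to the share of that window's ham and spam that
arrived from IPs holding that rank at the time, which is the shape of the
"Rank 100, hit ...% of ham, hit ...% of spam" lines.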

But I have data from 184,335 emails.  Using all that data, results for
the last 10,000 emails were:

Rank 100, hit 94.1176470588235% of ham, hit 0.0101553772722657% of spam.
Rank 50, hit 1.30718954248366% of ham, hit 0.0101553772722657% of spam.
Rank 0, hit 0% of ham, hit 64.2022951152635% of spam.
Rank none, hit 4.57516339869281% of ham, hit 35.7773941301919% of spam.

RCVD_IN_IPREPDNS_100 hits 94% of ham, and 0.01% of spam.
RCVD_IN_IPREPDNS_0 hits 64% of spam and no ham.  Again, I don't think
anything else spamassassin uses can do this well.  

But results this good can only be expected for people contributing data.
At least until we get more people contributing data.

-- 
"The price of freedom is the willingness to do sudden battle, anywhere,
at any time, and with utter recklessness." - Robert A. Heinlein
http://www.ChaosReigns.com