Posted to dev@spamassassin.apache.org by Jeff Chan <je...@surbl.org> on 2004/04/02 06:51:10 UTC

Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests

I am pleased to announce that Eric Kolve has added SURBL support
to his SpamAssassin 2.63 plugin called SpamCopURI:

  http://sourceforge.net/projects/spamcopuri/

In order to use the new RBL method, please comment out the
previous tests SPAMCOP_URI and SPAMCOP_URI_HOST and increase
the score for the new test up to something like 2.5:

  score SPAMCOP_URI_RBL  2.5

in the spamcop_uri.cf file.  Values higher than 2.5 may be
appropriate because the test is a highly accurate indicator
of spam, for some of the reasons mentioned at the SURBL site:

  http://www.surbl.org/

Note that unlike URIDNSBL, we are comparing *domains* found in
message bodies to *domains* in SURBL (a name-based list, or RHSBL),
rather than resolving the names into IP addresses (representing the
spam web site's hosting server) and comparing those addresses to a
number-based RBL.
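
To make the difference concrete, here is a rough sketch of how the two
kinds of DNS query names might be built.  The zone names
"rhsbl.example" and "rbl.example" and the function names are
placeholders assumed for illustration, not the real SURBL zone or any
SpamCopURI code.  The sketch also uses the URI's host name as-is; a
real check would probably reduce it to the registered domain first.

```python
import ipaddress
from urllib.parse import urlsplit


def uri_host(uri: str) -> str:
    """Return the lower-cased host part of a URI found in a message body."""
    host = urlsplit(uri).hostname
    if host is None:
        raise ValueError(f"no host in URI: {uri!r}")
    return host


def name_based_query(uri: str, zone: str = "rhsbl.example") -> str:
    """Domain-to-domain test: the name itself is looked up under the zone."""
    return f"{uri_host(uri)}.{zone}"


def ip_based_query(ip: str, zone: str = "rbl.example") -> str:
    """Number-based test: the resolved IPv4 address is reversed, then looked up."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


if __name__ == "__main__":
    print(name_based_query("http://Spam-Domain.example/buy?x=1"))
    print(ip_based_query("192.0.2.10"))
```

The name-based query needs no prior resolution of the spam domain, so
it still works if the site moves to a different hosting server.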

We consider this a direct approach to the problem of URIs
advertised in spam, and we're confident that the URI data
we are getting from SpamCop and scoring based on report
counts are very useful and relevant.  More information about
the data SURBL is built on can be found at:

  http://spamcheck.freeapp.net/

Would someone with access to large spam and ham corpora please
give SpamCopURI a try against their recent data, as Daniel
Quinlan did with URIDNSBL + SURBL, and kindly let us know what
kind of results they obtain?  Currently four trailing days of
SpamCop URI reports are represented in SURBL.

Thanks!

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://sc.surbl.org/

Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests

Posted by Jeff Chan <je...@surbl.org>.

On Friday, April 2, 2004, 2:11:39 AM, Jeff Chan wrote:
> If it's the case that domains expire out of the SpamCop
> URI data sooner than the particular spam domains remain
> a problem, then I could definitely see a need for a longer
> expiration.  Being somewhat new to the game, I don't
> have any data to support either argument.

OK I can see one flaw in my argument would be that if message
body domain blocking were already popular and successful then
*reporting* about spam URIs would taper off as fewer spams
reached victims, even if the spam-referenced domains stayed
up.  In that case we could simply increase our expiration
time to make the blocking persist long after the reports
tapered off.  (But there still should be some mechanism for
expiring domains off the block list, whatever time period
is used.  Or there should be some other method of removing
domains from the list.)

Does anyone have any data about the persistence of spam URI
domains?  I'll even settle for any data about spam web server
IP addresses.  :-)

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/

Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests

Posted by Jeff Chan <je...@surbl.org>.

On Thursday, April 1, 2004, 11:37:54 PM, Daniel Quinlan wrote:
> Jeff Chan <je...@surbl.org> writes:

>> Would someone with access to large spam and ham corpora please give
>> SpamCopURI a try against their recent data, as Daniel Quinlan did with
>> URIDNSBL + SURBL, and kindly let us know what kind of results they
>> obtain?  Currently four trailing days of SpamCop URI reports are
>> represented in SURBL.

> 2.6x modules, rules, and patches aren't very interesting right now.
> Give me a patch against URIDNSBL in 3.0 to add domain-to-domain testing
> and I'll gladly give it a whirl.

I would do that immediately if I knew how to write one.  I've
been rewriting my data stuff lately, while letting Eric update
SpamCopURI to now use SURBL.  (The somewhat frustrating thing is
that someone already familiar with SA 3.0 plugins could probably
make such a patch for URIDNSBL in a small fraction of the time it
would take me to come up to speed.  But I realize everyone else
is short of time also.)

> Four days still seems rather low.

What would be a better expiration time, and how do you suggest
removing from the blacklist domains that are no longer active in
spams?

We can expire after any arbitrary number of days.  I'm leaning
towards seven days right now since it's a typical DNS cacheout
interval. 
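
As a minimal sketch of what I mean by expiring after N days: a domain
stays listed only while it has reports inside the trailing window.
The function name, the (domain, date) input format and the
min_reports threshold are assumptions made for this example, not the
actual SURBL data code.

```python
from datetime import date, timedelta


def active_domains(reports, today, window_days=7, min_reports=1):
    """Count reports per domain inside the trailing window.

    reports: iterable of (domain, report_date) pairs.
    Returns {domain: count} for domains with at least min_reports
    reports in the last window_days days (today included).
    """
    cutoff = today - timedelta(days=window_days)
    counts = {}
    for domain, reported in reports:
        # Keep only reports newer than the cutoff and not in the future.
        if cutoff < reported <= today:
            counts[domain] = counts.get(domain, 0) + 1
    return {d: n for d, n in counts.items() if n >= min_reports}


if __name__ == "__main__":
    sample = [
        ("a.example", date(2004, 4, 1)),
        ("a.example", date(2004, 3, 30)),
        ("b.example", date(2004, 3, 27)),
    ]
    today = date(2004, 4, 2)
    print(active_domains(sample, today, window_days=4))
    print(active_domains(sample, today, window_days=7))
```

With this approach a domain drops off by itself once reports stop, and
changing the window is a one-parameter change.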

> Bear in mind that we're testing
> corpora that have spams somewhere between 0 and 3 months old (on
> average).  SpamCop is very hard to accurately gauge because stuff leaves
> so quickly.

True, but it also accurately reflects spams that people are
actually getting and reporting at any given moment.  To me
that feature has a significant value in timeliness.

If it's the case that domains expire out of the SpamCop
URI data sooner than the particular spam domains remain
a problem, then I could definitely see a need for a longer
expiration.  Being somewhat new to the game, I don't
have any data to support either argument.

My intuition is that if a domain continued to appear
in spam, people would continue to report it, and it
would therefore continue to show up in our SURBL data.
I'm interested in finding out what I may be overlooking
in this assumption.

Do you or anyone else here have some data that might shed
some light on this question?

> Expiring stuff quickly doesn't really reduce FPs unless
> you're testing old ham vs. new spam.  I care more about the S/O ratio
> (spam/overall where overall=ham+spam for a 50/50 mix of spam and ham).

My priorities are near zero FPs and near 100% accuracy in
the spams we do tag.  I don't guarantee that we will tag
all spams, but I'd like the ones we say are spam to actually
*be* spam.  Verity is important to me.

Other techniques may be able to catch spams which we miss, and we
may be able to improve our process to catch more spams our way.
I also think our spam% will be very high if the SpamCop reports
represent a good cross-section of actual spams at any given time.

Comments?  Surely I'm missing something...  ;)

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/

Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests

Posted by Daniel Quinlan <qu...@pathname.com>.

Jeff Chan <je...@surbl.org> writes:

> Would someone with access to large spam and ham corpora please give
> SpamCopURI a try against their recent data, as Daniel Quinlan did with
> URIDNSBL + SURBL, and kindly let us know what kind of results they
> obtain?  Currently four trailing days of SpamCop URI reports are
> represented in SURBL.

2.6x modules, rules, and patches aren't very interesting right now.
Give me a patch against URIDNSBL in 3.0 to add domain-to-domain testing
and I'll gladly give it a whirl.

Four days still seems rather low.  Bear in mind that we're testing
corpora that have spams somewhere between 0 and 3 months old (on
average).  SpamCop is very hard to accurately gauge because stuff leaves
so quickly.  Expiring stuff quickly doesn't really reduce FPs unless
you're testing old ham vs. new spam.  I care more about the S/O ratio
(spam/overall where overall=ham+spam for a 50/50 mix of spam and ham).
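
For reference, a small sketch of the S/O calculation as described
above.  Normalizing by per-corpus hit rates is how this example
models the "50/50 mix"; the function name and arguments are assumed
for illustration and are not taken from the SpamAssassin tools.

```python
def s_o_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Spam/overall ratio for a rule, as if the corpora were a 50/50 mix.

    Using hit *rates* rather than raw counts removes the effect of the
    spam and ham corpora having different sizes.
    """
    if spam_total <= 0 or ham_total <= 0:
        raise ValueError("both corpora must be non-empty")
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        raise ValueError("rule hit nothing; S/O is undefined")
    return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    # A rule that hits 40% of spam and no ham at all.
    print(s_o_ratio(400, 1000, 0, 2000))
    # A rule that hits 75% of spam and 25% of ham.
    print(s_o_ratio(750, 1000, 250, 1000))
```

An S/O of 1.0 means the rule hit only spam; 0.5 means it says nothing
either way.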

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting