Posted to users@spamassassin.apache.org by Jeff Chan <je...@surbl.org> on 2005/06/03 08:26:07 UTC

Re: Is Bayes Really Necessary?

On Thursday, May 26, 2005, 12:49:05 PM, Evan Langlois wrote:
> On Thu, 2005-05-26 at 10:42 -0400, Chris Santerre wrote:

>> For site wide, I'm pretty much against it. I know people will argue that
>> point. I'm obviously biased towards SARE rules updated with RDJ. And the use
>> of URIBL.com lists. But these allow general users, or a sitewide install,
>> to "set and forget", which is what we strive for, so SA can be more widely
>> accepted. 
>> 
>> I have a 99% filter rate without bayes. And I'm proud of that. 

> I've been testing URIBL and SURBL against just reversing the hostnames
> and looking them up on SBL-XBL,

SBL and XBL have numeric IP addresses, so they shouldn't match
host names.

SURBLs, on the other hand, have mostly domain names with a few IPs.
Whatever appears in the host portion of a URI is what goes into SURBLs,
and since URIs usually contain domain names, that's what most SURBL
records are.

Cheers,

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/


Re: Is Bayes Really Necessary?

Posted by Jeff Chan <je...@surbl.org>.
On Friday, June 3, 2005, 12:33:26 AM, Duncan Hill wrote:
> On Friday 03 June 2005 08:10, Loren Wilton typed:
>> It was basically "the spammer makes a zillion new domains, and they all
>> take time to get into SURBL, so some spam gets through.  But they all point
>> to the same dotted quad, and I can match on that lookup".
>>
>> If that statement is true, perhaps the surbl lists could automatically
>> include the dotquads for hosts that are known to be pure spam sources and
>> not mixed systems.  Then the client could get the ip for a suspect hostname
>> and see if it matched a known spam dotquad.

> I'd swear this came up before.  The one (slight?) problem with this tactic is 
> that you can have too many FPs if a spammer targets a legit hosting 
> operation.

Exactly.  Listing resolved IPs magnifies the problems with false
positives, joe jobs and collateral damage.  Please see:

  http://www.surbl.org/faq.html#numbered

"Are there plans to offer an RBL list with the domain names
resolved into IP addresses?"

> Postfix does have a neat restriction to reject based on the IP address of the 
> name server.  You run the same risk, but I've noticed that the pr1ces, al1v3 
> and so on spammer has used the same NS servers for each one....

Using sbl.spamhaus.org with uridnsbl in SA3 does something
similar.  SBL has many spammer nameservers listed in it and
uridnsbl checks a URI's nameservers against SBL.  It tends
to detect many spammy domains that way (and occasionally a few
relatively innocent bystanders).
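
A rough Python sketch of that nameserver check, assuming a recent dnspython
package for the NS lookups (the real uridnsbl plugin is Perl and does
considerably more):

    import socket
    import dns.resolver   # assumes the dnspython package is installed

    def sbl_lists_ip(ip):
        """True if the IPv4 address is listed in sbl.spamhaus.org."""
        query = ".".join(reversed(ip.split("."))) + ".sbl.spamhaus.org"
        try:
            socket.gethostbyname(query)   # any 127.0.0.x answer means listed
            return True
        except socket.gaierror:
            return False

    def nameservers_listed(domain):
        """Resolve the domain's NS records and test each nameserver's
        address against the SBL, roughly what uridnsbl does for URI domains."""
        for ns in dns.resolver.resolve(domain, "NS"):
            ns_name = str(ns.target).rstrip(".")
            for a in dns.resolver.resolve(ns_name, "A"):
                if sbl_lists_ip(a.address):
                    return True
        return False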

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/


Re: Is Bayes Really Necessary?

Posted by Duncan Hill <sa...@nacnud.force9.co.uk>.
On Friday 03 June 2005 08:10, Loren Wilton typed:
> It was basically "the spammer makes a zillion new domains, and they all
> take time to get into SURBL, so some spam gets through.  But they all point
> to the same dotted quad, and I can match on that lookup".
>
> If that statement is true, perhaps the surbl lists could automatically
> include the dotquads for hosts that are known to be pure spam sources and
> not mixed systems.  Then the client could get the ip for a suspect hostname
> and see if it matched a known spam dotquad.

I'd swear this came up before.  The one (slight?) problem with this tactic is 
that you can have too many FPs if a spammer targets a legit hosting 
operation.

Postfix does have a neat restriction to reject based on the IP address of the 
name server.  You run the same risk, but I've noticed that the pr1ces, al1v3 
and so on spammer has used the same NS servers for each one....

Re: Is Bayes Really Necessary?

Posted by Loren Wilton <lw...@earthlink.net>.
> SURBLs on the other hand have mostly domain names with a few IPs.
> Whatever appears in URI host portions is what goes into SURBLs.
> Usually URIs have domain names so that's what most of the SURBL
> records are.

Jeff, the OP (or someone) had an interesting idea, I thought.

It was basically "the spammer makes a zillion new domains, and they all take
time to get into SURBL, so some spam gets through.  But they all point to
the same dotted quad, and I can match on that lookup".

If that statement is true, perhaps the surbl lists could automatically
include the dotquads for hosts that are known to be pure spam sources and
not mixed systems.  Then the client could get the ip for a suspect hostname
and see if it matched a known spam dotquad.
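
Something like this minimal Python sketch of the client side, where the set
of known spam IPs is purely hypothetical:

    import socket

    # Hypothetical set of dotted quads already known to be pure spam hosts.
    KNOWN_SPAM_IPS = {"192.0.2.10", "192.0.2.11"}

    def points_at_spam_host(hostname):
        """Resolve a suspect URI hostname and see whether any of its
        A records land on a known pure-spam dotted quad."""
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except socket.gaierror:
            return False
        return any(addr in KNOWN_SPAM_IPS for addr in addresses)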

Possibly this would want to be a separate list.

Alternatively, this might be possible as 'backend processing' inside SURBL
itself.  For instance, you could run your own caching DNS.  Any hostname in a
lookup request that doesn't match the current list (or the whitelist) gets
resolved.  If the resulting IP address matches that of a known spam host, the
hostname is automatically added to the list and a positive hit is returned to
the original requestor.  Instant catching of unknown spam domains!
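
As a sketch of that back-end step (the names and data structures here are
only illustrative):

    import socket

    def backend_lookup(hostname, listed, whitelist, spam_ips):
        """Handle a query for a hostname: if it is not already listed or
        whitelisted, resolve it, and auto-list it when it resolves to a
        known pure-spam IP, answering the original requestor positively."""
        if hostname in whitelist:
            return False
        if hostname in listed:
            return True
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except socket.gaierror:
            return False
        if any(addr in spam_ips for addr in addresses):
            listed.add(hostname)    # instant coverage for the new domain
            return True
        return False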

Of course, with your policies you may simply want to add the domain name to a
list for manual review rather than directly including it.  Or perhaps
establish a new list that is deliberately scored at half the normal SURBL
score, add the domain to that list, and flag it for manual review.  If it is
spam, it will provide at least some early warning to people receiving it.  If
it turns out to be a false hit, it will be found in manual review and removed
from the list shortly, and in the meantime the low score means no great harm
is likely to be done.

I think this is a concept worth thinking about.  Domain names are nearly
infinite, but there is a limit on IPv4 addresses, so a lot of domain names
must end up mapping to the same IP address one way or another.  This is
something we should be able to exploit.

        Loren