Posted to users@spamassassin.apache.org by Jon Gerdes <GE...@whl.co.uk> on 2004/07/09 16:15:04 UTC

SiteWideBayesFeedback

Dear all

I am trying to set up a SiteWideBayesFeedback by following the suggestions on the Wiki but am a little confused as to what is needed in my setup.  

We use Novell GroupWise as a client.  The only forwarding option sends the mail as an attachment to wherever.  Now I think this is exactly what is needed, i.e. headers are preserved etc.

Reproduced below is an example message I have scraped out of spam's mailbox file.  Do I need to remove anything or is all the extra routing info for the encapsulating mail OK?

If I can get a definitive answer I'll update the Wiki accordingly, then I get to spend some quality time with sa-learn.

Cheers
Jon Gerdes

----------------------------8<------------------------

cesium:/home/spam # cat Mailbox
>From GERDESJ@whl.co.uk Fri Jul 09 13:51:08 2004
Return-Path: <GE...@whl.co.uk>
Delivered-To: spam@mail.whl.co.uk
Received: (qmail 21411 invoked by uid 508); 9 Jul 2004 13:51:08 -0000
Received: from GERDESJ@whl.co.uk by cesium by uid 502 with qmail-scanner-1.20
 (sweep: 2.20/3.83. spamassassin: 2.63.  Clear:RC:1(172.16.12.4):.
 Processed in 1.61962 secs); 09 Jul 2004 13:51:08 -0000
Received: from chlorine.whl.co.uk (172.16.12.4)
  by cesium.whl.co.uk with SMTP; 9 Jul 2004 13:51:06 -0000
Received: from gw.gkn-whl.co.uk (unverified) by chlorine.whl.co.uk
 (Content Technologies SMTPRS 4.2.10) with SMTP id <T6...@chlorine.whl.co.uk> for <sp...@mail.whl.co.uk>;
 Fri, 9 Jul 2004 14:51:05 +0100
Received: from GWPD-GKNWHL-Message_Server by gw.gkn-whl.co.uk
        with Novell_GroupWise; Fri, 09 Jul 2004 14:51:05 +0100
Message-Id: <s0...@gw.gkn-whl.co.uk>
X-Mailer: Novell GroupWise Internet Agent 5.5.7.1
Date: Fri, 09 Jul 2004 14:50:58 +0100
From: "Jon Gerdes" <GE...@whl.co.uk>
To: <sp...@mail.whl.co.uk>
Subject: Fwd: Environment Agency
Mime-Version: 1.0
Content-Type: message/rfc822

Received: from chlorine.whl.co.uk
        by gw.gkn-whl.co.uk; Wed, 07 Jul 2004 10:43:19 +0100
Received: from cesium.whl.co.uk (unverified) by chlorine.whl.co.uk
 (Content Technologies SMTPRS 4.2.10) with SMTP id <T6...@chlorine.whl.co.uk> for <ge...@mailsweeper.whl.co.uk>;
 Wed, 7 Jul 2004 10:43:19 +0100
Received: (qmail 24254 invoked by uid 500); 7 Jul 2004 09:43:19 -0000
Delivered-To: gerdesj@cesium.whl.co.uk
Received: (qmail 24250 invoked by uid 508); 7 Jul 2004 09:43:18 -0000
Received: from j@blp.net by cesium by uid 502 with qmail-scanner-1.20
 (sweep: 2.20/3.83. spamassassin: 2.63.  Clear:RC:1(193.37.69.18):SA:0(0.2/5.6):.
 Processed in 0.821099 secs); 07 Jul 2004 09:43:18 -0000
Received: from mail.whl.co.uk (HELO mail-dmz.whl.co.uk) (193.37.69.18)
  by cesium.whl.co.uk with SMTP; 7 Jul 2004 09:43:18 -0000
Received: from relay1.bt.net (relay1.bt.net [194.72.6.100])
        by mail-dmz.whl.co.uk (8.11.6/8.11.6) with ESMTP id i679hH015487
        for <ge...@whl.co.uk>; Wed, 7 Jul 2004 10:43:17 +0100
Received: from [212.90.33.10] (helo=blueloop.net)
        by relay1.bt.net with esmtp (Exim 3.36 #1)
        id 1Bi8xO-0005UZ-00
        for gerdesj@whl.co.uk; Wed, 07 Jul 2004 10:43:14 +0100
Received: from sarah (host213-122-49-49.in-addr.btopenworld.com [213.122.49.49])
        by blueloop.net (8.11.6/8.11.2) with SMTP id i679gCP12139
        for <ge...@whl.co.uk>; Wed, 7 Jul 2004 10:42:12 +0100
Message-ID: <00...@blueloop.int>
From: "Julie Grant" <j...@blp.net>
To: "Jon Gerdes" <ge...@whl.co.uk>
Subject: Environment Agency
Date: Wed, 7 Jul 2004 10:32:51 +0100
MIME-Version: 1.0
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
X-Spam-Level:
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on cesium.whl.co.uk
X-Spam-Status: No, hits=0.2 required=5.6 tests=EXCUSE_16,EXCUSE_3,
        REMOVE_IN_QUOTES autolearn=no version=2.63

Hi Jon

I have just spoken with ..........

----------------------------8<------------------------

Re: include IP lookups in SURBL lists

Posted by Jeff Chan <je...@surbl.org>.
OK I'm going to respond to several ideas in this thread in a
single reply.  It may help to go back and review some of the
thread messages.

1.  Regarding adding resolved IP addresses to SURBLs: Not gonna
happen.  FP potential is way too high.  A single (false) entry
resolving to a legitimate large shared web hosting server could
block hundreds or more legitimate sites.

2.  However the next version of sc.surbl.org data engine
will be a hybrid name/number system where:

  A.  the domains will get resolved internally,
  B.  the resulting IPs will get sorted into (CIDR) bins,
  C.  any fresh domain report that happens to resolve into one of
those bins will inherit the count of hits in the bins (perhaps
modulo some function), and most likely any fresh spam domains
resolving into a well-populated bin will get listed on the first
report instead of the tenth as sc does now.  We could even raise
the threshold to decrease FPs or change to a "top 500" or "top
1000" list.

So that should short-circuit most of the lag in detection for
domains resolving to persistent spammer IPs for the sc data.
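
A rough sketch of the hybrid name/number idea in point 2, for illustration only. The /24 bin size, the class and function names, and the plain addition of the two counts are assumptions that the post does not state; only the listing threshold of ten reports comes from the text. Resolved IPs are passed in rather than looked up.

```python
import ipaddress
from collections import Counter

BIN_PREFIX = 24       # assumed bin size; the post only says "(CIDR) bins"
LIST_THRESHOLD = 10   # "the tenth" report, as sc does now


def cidr_bin(ip, prefix=BIN_PREFIX):
    """Return the CIDR network (bin) that an IP address falls into."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


class HybridCounter:
    """Hypothetical report counter keyed by domain and by CIDR bin."""

    def __init__(self):
        self.domain_hits = Counter()
        self.bin_hits = Counter()

    def report(self, domain, resolved_ip):
        """Record one spam report; return True if the domain would be listed."""
        bin_ = cidr_bin(resolved_ip)
        inherited = self.bin_hits[bin_]   # hits already in the bin before this report
        self.domain_hits[domain] += 1
        self.bin_hits[bin_] += 1
        # Assumed combining function: own count plus inherited bin count.
        return self.domain_hits[domain] + inherited >= LIST_THRESHOLD
```

Under these assumptions, once nine reports have landed in a bin, a fresh domain resolving into that bin is listed on its first report, while a fresh domain in an empty bin is not.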

The resulting lists will still be mostly domains.  We probably
won't let the internal IPs out, at least not in the existing
SURBLs.  Perhaps we could turn them into a separate list which
could be scored lower.  But our focus will remain on domains
because they are highly specific and don't require the time-
consuming step of name resolution.  (Name resolution is no
problem on a small box, but on big mail systems it can make
content checking impractical.  Resolved IPs also have some
of the potential problems already mentioned, most importantly
FPs.)

3.  The outblaze data already has a "recentness of domain
registration factor" of 90 days.  It also includes extensive
spam traps.  The combination appears to catch many spammer
URI domains pretty quickly and with a low FP rate.  So it
already somewhat incorporates John Hardin's idea of catching
recently registered domains, with the added factor that they
actually got caught spamming.  Outblaze's traps apparently
are pretty well engineered, given the relatively low FP rate.

BTW, there's a longer discussion of this question in the FAQ:

  http://www.surbl.org/faq.html#numbered

"Are there plans to offer an RBL list with the domain names
resolved into IP addresses?"

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/


Re: include IP lookups in SURBL lists

Posted by John Hardin <jo...@aproposretail.com>.
On Sun, 2004-07-11 at 13:50, Marc Kool wrote:
> >>>Have you tried using ob.surbl.org? I think it catches most of the domains
> >>>you mentioned.
> 
> ob is good, but they lag behind the reality. This can never change because of 
> the flow of the process: catch spam, verify it and add a domain to the list.

Hmm.

Is there any way to tap into the registrar system so you get notified of
a new domain name when it's created?

Maybe a low-scoring (1 or 2 points) SURBL for domains that are less than
one month old? (the cutoff age may need to be tuned, of course...)

--
John Hardin  KA7OHZ                           <jo...@aproposretail.com>
Internal Systems Administrator                    voice: (425) 672-1304
Apropos Retail Management Systems, Inc.             fax: (425) 672-0192
-----------------------------------------------------------------------
  ...the Fates notice those who buy chainsaws...
                                             -- www.darwinawards.com
-----------------------------------------------------------------------
 2 days until Apropos Forum 2004


Re: include IP lookups in SURBL lists

Posted by Marc Kool <M....@vioro.nl>.
John Fawcett wrote:
> From: "Marc Kool"
> 
>>John Fawcett wrote:
>>
>>>If that processing logic were implemented, then you would be identifying
> 
> all
> 
>>>domains that were hosted on an ip where there is/was a spammer domain as
>>>spammers.
>>>That will potentially increase FPs, the rule would not be so useful and
> 
> its

Unfortunately I am not in a position to find out which set of domains
resolves to the spammer's IP address, but I strongly believe that very few providers
put a spammer on a shared box.

>>>score would have to be decreased.
>>>
>>>I cannot see any way to automatically tell whether 211.158.6.88 has ONLY
>>>spammer domains and therefore should be added to such a list of ips.
>>
>>Also true but somewhat theoretical if more than X spam domains are
> 
> served
> 
>>from the same IP address (where X >= 3 ?)
>>
> 
> 
> Is X=3 satisfactory to not create FPs for big virtual hosting providers
> which
> reuse IP addresses for many domains?
> 
> What is the right value of X which will scale so that it doesn't create FPs
> on large mail servers? (One of the features of the surbl lists is the low FP
> rate and some people are using them on very large mail servers).
> 
> 
>>>Have you tried using ob.surbl.org? I think it catches most of the
> 
> domains
> 
>>>you
>>>mentioned.

ob is good, but they lag behind the reality. This can never change because of 
the flow of the process: catch spam, verify it and add a domain to the list.

>>The surbl lists catch the mentioned domains _now_.  But this spammer
> 
> generates
> 
>>new ones regularly and it takes a while before the new domains are known
>>and included in the surbl lists.  I cannot estimate how many spams
>>can get through in "a while" but I have noticed on my system that
>>mails that were originally flagged non-spam were flagged spam a few hours
>>later because the URIs were then included in an updated surbl list.
>>
>>To stop this process where the new domain can be included in URI's and is
>>not (yet) included in surbl lists, the IP address could be included in
>>the surbl list and hence this spammer has no time window any more where
>>his spam gets undetected by surbl lookups.
> 
> 
> I think the ob list is already having quite a lot of success in blocking
> newly
> generated domains. When a spammer starts using a new domain and it hits
> an ob spamtrap, if that domain has been recently registered, it gets
> blocked.
> 
> Any idea about how many of the new domains on same ips are being missed
> currently by ob.surbl.org?
> 
> John

I administer an email server for 5 domains and 120 active users.
Since I only keep ham and spam for my own email account I can only report
in detail for this account:

in Jun 1 - Jun 10:
 93 correctly classified ham emails
131 correctly classified spam emails
900+ whitelisted emails of various mailing lists
 10  FN's (first classified as ham and "some" hours later correctly classified as spam)
where "some" is between 1 and 10 hours.

ob and the other lists are good but lag behind.
But note that registering a new domain name is relatively cheap and registering an IP address is not.
Since spammers are aggressive we need the means to fight their aggressive methods.
I believe that putting known IP addresses in a surbl list can be a good and effective way:
it makes operating costs for spammers higher and makes life difficult since they need
to get new IP addresses far more quickly.
Note that 211.158.6.88 has been used in spam since June 23, or *17 days*. With an IP address lookup,
many spams would be blocked without lagging behind the spammer.

To start another thread:
I am a contributor to the free URL database that can be used by squidguard and dansguardian
with a strong focus on sex sites (with 397000 domains). 
Although not every sex site sends spam, a mail administrator may want to implement
a local policy to block emails that refer to sex sites (I would :-)
Does anybody want/need/like sex.surbl.org?
For a list like sex.surbl.org a feature to include IP addresses would be a benefit,
since 57 percent of the sex domains sit on an IP address to which 10 or more domains resolve.

Marc


Re: include IP lookups in SURBL lists

Posted by John Fawcett <jo...@michaweb.net>.
From: "Marc Kool"
> John Fawcett wrote:
> >
> > If that processing logic were implemented, then you would be identifying
all
> > domains that were hosted on an ip where there is/was a spammer domain as
> > spammers.
> > That will potentially increase FPs, the rule would not be so useful and
its
> > score would have to be decreased.
> >
> > I cannot see any way to automatically tell whether 211.158.6.88 has ONLY
> > spammer domains and therefore should be added to such a list of ips.
>
> Also true but somewhat theoretical if more than X spam domains are
served
> from the same IP address (where X >= 3 ?)
>

Is X=3 satisfactory to not create FPs for big virtual hosting providers
which
reuse IP addresses for many domains?

What is the right value of X which will scale so that it doesn't create FPs
on large mail servers? (One of the features of the surbl lists is the low FP
rate and some people are using them on very large mail servers).

> > Have you tried using ob.surbl.org? I think it catches most of the
domains
> > you
> > mentioned.
>
> The surbl lists catch the mentioned domains _now_.  But this spammer
generates
> new ones regularly and it takes a while before the new domains are known
> and included in the surbl lists.  I cannot estimate how many spams
> can get through in "a while" but I have noticed on my system that
> mails that were originally flagged non-spam were flagged spam a few hours
> later because the URIs were then included in an updated surbl list.
>
> To stop this process where the new domain can be included in URI's and is
> not (yet) included in surbl lists, the IP address could be included in
> the surbl list and hence this spammer has no time window any more where
> his spam gets undetected by surbl lookups.

I think the ob list is already having quite a lot of success in blocking
newly
generated domains. When a spammer starts using a new domain and it hits
an ob spamtrap, if that domain has been recently registered, it gets
blocked.

Any idea about how many of the new domains on same ips are being missed
currently by ob.surbl.org?

John


Re: include IP lookups in SURBL lists

Posted by Marc Kool <M....@vioro.nl>.
John Fawcett wrote:
> From: "Marc Kool"
> 
>>Hi,
>>
>>Using quarantined spam and FN's I found out that the various SURBL lists
> 
> lag behind the spammers.
> 
>>I consider it "normal" but also like to improve it.
>>
>>I only receive 20-50 spams per day and did an analysis and found out that
> 
> the
> 
>>URLs of the spam messages are about domains using the same IP address.
>>
>>I found for example:
>>211.158.6.88 2giKe4V5C.simptompsakiana.org
>>211.158.6.88 5tYTNHYH.polishesofikals.org
>>211.158.6.88 7Z05PeUBKz.9H8UozoNv.pazdanimphos.org
>>211.158.6.88 9L88lRG.poisesneynano.org
>>211.158.6.88 9XA.1eX.fraklesneynano.org
>>211.158.6.88 BL4CLL.fraklesneynano.org
>>211.158.6.88 BlnXPOc7d.LURaH.bortsimisbortsimis.org
>>211.158.6.88 Cdj.2NJq2BanB.bortsimisbortsimis.org
>>211.158.6.88 DC.pikasxesros.org
>>(and lots more)
>>
>>So I wonder if we could extend the SURBL module in SA to also verify the
> 
> IP address of the URI
> 
>>in a (new?) surbl list.
>>
>>Marc
> 
> 
> Marc
> 
> The SURBL work only on urls found within spam. They do not resolve these to
> IPs.
> Resolving them to IPs and checking against a dnsbl would require a different
> processing logic (and more processing time).

true.

> If that processing logic were implemented, then you would be identifying all
> domains that were hosted on an ip where there is/was a spammer domain as
> spammers.
> That will potentially increase FPs, the rule would not be so useful and its
> score would have to be decreased.
> 
> I cannot see any way to automatically tell whether 211.158.6.88 has ONLY
> spammer domains and therefore should be added to such a list of ips.

Also true but somewhat theoretical if more than X spam domains are served
from the same IP address (where X >= 3 ?)

> Have you tried using ob.surbl.org? I think it catches most of the domains
> you
> mentioned.

The surbl lists catch the mentioned domains _now_.  But this spammer generates
new ones regularly and it takes a while before the new domains are known
and included in the surbl lists.  I cannot estimate how many spams
can get through in "a while" but I have noticed on my system that
mails that were originally flagged non-spam were flagged spam a few hours
later because the URIs were then included in an updated surbl list.

To stop this process where the new domain can be included in URI's and is
not (yet) included in surbl lists, the IP address could be included in
the surbl list and hence this spammer has no time window any more where
his spam gets undetected by surbl lookups.
-Marc
 
> John

Re: include IP lookups in SURBL lists

Posted by John Fawcett <jo...@michaweb.net>.
From: "Marc Kool"
> Hi,
>
> Using quarantined spam and FN's I found out that the various SURBL lists
lag behind the spammers.
> I consider it "normal" but also like to improve it.
>
> I only receive 20-50 spams per day and did an analysis and found out that
the
> URLs of the spam messages are about domains using the same IP address.
>
> I found for example:
> 211.158.6.88 2giKe4V5C.simptompsakiana.org
> 211.158.6.88 5tYTNHYH.polishesofikals.org
> 211.158.6.88 7Z05PeUBKz.9H8UozoNv.pazdanimphos.org
> 211.158.6.88 9L88lRG.poisesneynano.org
> 211.158.6.88 9XA.1eX.fraklesneynano.org
> 211.158.6.88 BL4CLL.fraklesneynano.org
> 211.158.6.88 BlnXPOc7d.LURaH.bortsimisbortsimis.org
> 211.158.6.88 Cdj.2NJq2BanB.bortsimisbortsimis.org
> 211.158.6.88 DC.pikasxesros.org
> (and lots more)
>
> So I wonder if we could extend the SURBL module in SA to also verify the
IP address of the URI
> in a (new?) surbl list.
>
> Marc

Marc

The SURBLs work only on URLs found within spam. They do not resolve these to
IPs.
Resolving them to IPs and checking against a dnsbl would require a different
processing logic (and more processing time).
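
To illustrate the difference in processing, here is a minimal sketch of how the two kinds of DNS query names are built. It performs no lookups. The zone ips.example is hypothetical, the function names are made up for this example, and reducing a host name to its last two labels is a simplification that is wrong for registries such as co.uk.

```python
import ipaddress


def base_domain(host):
    """Naive reduction to the last two labels (assumption; ignores co.uk etc.)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def surbl_query_name(host, zone="sc.surbl.org"):
    """Name-based lookup: the domain from the URI is prepended to the zone."""
    return f"{base_domain(host)}.{zone}"


def dnsbl_query_name(ip, zone="ips.example"):
    """IP-based lookup: octets are reversed and prepended to the zone.

    The IP must first be obtained by resolving the URI host, which is the
    extra step (and extra time) a name-based lookup avoids.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone
```

The name-based query can be built straight from the text of the message, whereas the IP-based query needs one additional resolution per URI host before it can even be formed.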

If that processing logic were implemented, then you would be identifying all
domains that were hosted on an ip where there is/was a spammer domain as
spammers.
That will potentially increase FPs, the rule would not be so useful and its
score would have to be decreased.

I cannot see any way to automatically tell whether 211.158.6.88 has ONLY
spammer domains and therefore should be added to such a list of ips.

Have you tried using ob.surbl.org? I think it catches most of the domains
you
mentioned.

John



include IP lookups in SURBL lists

Posted by Marc Kool <M....@vioro.nl>.
Hi,

Using quarantined spam and FN's I found out that the various SURBL lists lag behind the spammers.
I consider it "normal" but would also like to improve it.

I only receive 20-50 spams per day and did an analysis and found out that the
URLs in the spam messages point to domains using the same IP address.

I found for example:
211.158.6.88 2giKe4V5C.simptompsakiana.org
211.158.6.88 5tYTNHYH.polishesofikals.org
211.158.6.88 7Z05PeUBKz.9H8UozoNv.pazdanimphos.org
211.158.6.88 9L88lRG.poisesneynano.org
211.158.6.88 9XA.1eX.fraklesneynano.org
211.158.6.88 BL4CLL.fraklesneynano.org
211.158.6.88 BlnXPOc7d.LURaH.bortsimisbortsimis.org
211.158.6.88 Cdj.2NJq2BanB.bortsimisbortsimis.org
211.158.6.88 DC.pikasxesros.org
(and lots more)

So I wonder if we could extend the SURBL module in SA to also check the IP address of the URI
against a (new?) surbl list.
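
The analysis above can be sketched roughly as follows: group the host names seen in spam URIs by the IP address they resolve to, and flag any address that serves at least X distinct domains. The threshold X = 3 is the value floated elsewhere in this thread, not a tested setting. The function names and the last-two-labels domain reduction are assumptions, and the (ip, host) pairs are taken as input rather than resolved.

```python
from collections import defaultdict


def base_domain(host):
    """Naive reduction to the last two labels (assumption; ignores co.uk etc.)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def suspicious_ips(pairs, x=3):
    """Map each IP serving at least x distinct domains to those domains.

    pairs: iterable of (ip, host) tuples, e.g. taken from quarantined spam.
    """
    domains_by_ip = defaultdict(set)
    for ip, host in pairs:
        domains_by_ip[ip].add(base_domain(host))
    return {
        ip: sorted(domains)
        for ip, domains in domains_by_ip.items()
        if len(domains) >= x
    }
```

Note that this counts domains only; it cannot tell whether legitimate sites share the same address, which is the FP concern raised in the replies.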

Marc

Re: SiteWideBayesFeedback

Posted by Matt Kettler <mk...@comcast.net>.
At 03:15 PM 7/9/04 +0100, Jon Gerdes wrote:
>Reproduced below is an example message I have scraped out of spam's 
>mailbox file.  Do I need to remove anything or is all the extra routing 
>info for the encapsulating mail OK?

You need to scrap the forwarding headers. SA will otherwise interpret the 
second set of headers as a part of the message body, which is not the 
desired result.

It will also learn tokens in the first set of headers quite aggressively.
Eliminate them. sa-learn needs to see the message with pretty close to
original headers. An extra Received: or two is ok, but other than that, no
changes are good changes.
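
One way to do this is to pull the embedded message/rfc822 part out of each forwarded mail before handing it to sa-learn. The following is a minimal sketch using Python's standard email package. The function name is hypothetical, and it assumes the forward arrives as a message/rfc822 part, as in the GroupWise example at the top of this thread.

```python
import email
from email import policy


def extract_original(raw_bytes):
    """Return the first embedded message/rfc822 message as bytes, or None.

    The forwarding (outer) headers are discarded; the embedded message is
    returned with its own headers and body.
    """
    # compat32 keeps unmodified header values as they were parsed.
    outer = email.message_from_bytes(raw_bytes, policy=policy.compat32)
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            inner = part.get_payload(0)   # the wrapped message object
            return inner.as_bytes()
    return None
```

The returned bytes could then be written to a file and passed to sa-learn --spam (or --ham). Splitting the mbox file into individual messages first is left out of this sketch.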