Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2005/01/14 04:00:23 UTC

SURBL whitelist volume chicken-egg problem

So, as you may be aware, we have a minor issue in terms of figuring out
which whitelisted domains should be skipped in queries.

  SpamAssassin now ships with a list of domains that are excluded from
  SURBL lookups, taken from the SURBL whitelist.  This list is the 125
  most commonly queried domains.

  SURBL counts the number of queries each domain receives to track the
  most commonly queried domains so we can produce an accurate list of
  domains.

  But, once we skip a domain, its relative volume is going to drop way
  off in the SURBL data.

One idea I had to fix this is that SA not use the SURBL whitelist for 1
in 10 queries and that those be directed to a different zone.  However,
that would be somewhat counterproductive in terms of DNS caching and I'm
not sure how happy Jeff would be about the idea.
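Daniel's 1-in-10 idea could be sketched roughly as follows (a hypothetical illustration in Python rather than SA's Perl; the skip-list entries, the `sample.surbl.org` zone name, and the function itself are all invented for this sketch, not real SURBL zones or SA code):

```python
import random

# Hypothetical skip list: the ~125 most-queried whitehat domains
# shipped with SpamAssassin (illustrative entries only).
SKIP_LIST = {"yahoo.com", "w3.org", "google.com"}

SAMPLE_RATE = 0.1  # 1 in 10 queries bypass the whitelist

def surbl_query_name(domain):
    """Return the DNS name to look up, or None to skip the query.

    Normally, domains on the skip list are never queried.  For a
    random 10% of lookups the skip list is ignored and the query is
    directed to a separate zone, so SURBL could still measure the
    relative volume of whitelisted domains.
    """
    if random.random() < SAMPLE_RATE:
        return domain + ".sample.surbl.org"  # invented sampling zone
    if domain in SKIP_LIST:
        return None  # whitelisted: no DNS query at all
    return domain + ".multi.surbl.org"
```

The separate sampling zone is exactly what hurts DNS caching: the 10% of queries sent there cannot share cached answers with the main zone.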

Another way would be to not use the exclusion list for certain periods
of time if you could select just those times for generating volume
data.  A bit too hacky.

Another way to fix the problem would be to rank the domains with some
other source of volume data (not SURBL-related) such as looking at a DNS
cache at a large ISP.

Any other ideas?

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: SURBL whitelist volume chicken-egg problem

Posted by Sidney Markowitz <si...@sidney.com>.
Daryl C. W. O'Shea said:
> The emails generated could be used to calculate
> the domains most often seen.

I would be afraid that it would be too easy for malicious people to abuse,
by sending in false data, mounting DoS attacks on the email addresses,
etc. Also there is no reason to load down some email address with data
from everyone who is running SpamAssassin. Feeds from a few large ISPs
would be accurate enough for the purpose and more trustworthy.

 Sidney Markowitz
 http://www.sidney.com



Re: SURBL whitelist volume chicken-egg problem

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Daniel Quinlan wrote:
 > Any other ideas?

I've got a plugin (attached) that tracks the occurrence of
non-blacklisted domains in messages, storing the data in an SQL database
and emailing the data out periodically.  The plugin expires its
database automatically.

The emails generated could be used to calculate the domains most often seen.

Two tables, described in the attached TABLES file, are needed.
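A minimal sketch of that kind of bookkeeping might look like this (this is not the attached plugin; the table layout and function names are invented, and SQLite stands in for whatever SQL database the plugin uses):

```python
import sqlite3
import time

def open_db(path=":memory:"):
    """Open the tracking database, creating the (invented) table."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS domain_counts (
                      domain    TEXT PRIMARY KEY,
                      hits      INTEGER NOT NULL,
                      last_seen INTEGER NOT NULL)""")
    return db

def record_domain(db, domain, now=None):
    """Count one occurrence of a non-blacklisted domain in a message."""
    now = int(now if now is not None else time.time())
    db.execute("""INSERT INTO domain_counts (domain, hits, last_seen)
                  VALUES (?, 1, ?)
                  ON CONFLICT(domain) DO UPDATE SET
                      hits = hits + 1,
                      last_seen = excluded.last_seen""",
               (domain, now))

def expire(db, max_age_days=90, now=None):
    """Drop domains not seen recently (the automatic expiry)."""
    now = int(now if now is not None else time.time())
    db.execute("DELETE FROM domain_counts WHERE last_seen < ?",
               (now - max_age_days * 86400,))

def top_domains(db, n=125):
    """The most commonly seen domains, for the periodic email."""
    return [row[0] for row in db.execute(
        "SELECT domain FROM domain_counts ORDER BY hits DESC LIMIT ?",
        (n,))]
```

Periodic reports built from `top_domains()` are what would feed the whitelist ranking.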


Daryl

Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Friday, January 14, 2005, 7:00:07 PM, Robert Menschel wrote:
DQ>>   But, once we skip a domain, its relative volume is going to drop
DQ>> way off in the SURBL data.

> But that doesn't affect the validity of the 125 domains, does it?

No, but the first cut at the top 125 was based on more limited
data (less time covered).  We were hoping to get a more accurate
"125" by looking at a longer time period of 90 days instead of
10.  In other words, we wanted to fine tune the data a little
better, ignore some shorter-lived popular domains, like those
from the U.S. election, etc.

> Assume that the top 125 domains are stable and reliable and should
> remain whitelisted (changes to that decision should be made manually,
> and almost never). Let SURBL collect its new data, and look at the top
> 50-100 domains. Assume that at least some of these first top 125 will
> not be listed, since they aren't being SURBL queried. Don't drop the
> 125, but simply add to the whitelist a number of the "new" top 100, if
> warranted.

It would work, but I think there was supposed to be some kind of limit
on the manual whitelist.  IIRC 125 was arbitrarily chosen, but it
happens to correspond almost exactly to the 50th percentile of
ham domain hits.
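That "50th percentile" remark can be made concrete: the cutoff is the smallest N such that the N most-hit domains account for half of all ham domain hits. A sketch (illustrative only, with invented names; not how SURBL actually derived the 125):

```python
def coverage_cutoff(hit_counts, fraction=0.5):
    """Smallest N such that the N most-hit domains cover `fraction`
    of all hits.  `hit_counts` maps domain -> ham hit count."""
    counts = sorted(hit_counts.values(), reverse=True)
    total = sum(counts)
    running = 0
    for n, count in enumerate(counts, start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(counts)
```

On SURBL's ham data, a call like `coverage_cutoff(hits, 0.5)` would have landed near 125, per Jeff's recollection.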

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Loren Wilton <lw...@earthlink.net>.
> I was thinking along the lines of something that SpamAssassin downloads
> once a month, or queries to find out if there is an update available and
> only downloads if there is. Since the idea is to limit DNS queries, of

While it isn't part of the official SA project, this sounds like exactly
the sort of job RDJ was made for.

Just make a something.cf that contains the whitelist, and add it to RDJ
as one of the things to download when it changes.

        Loren


Re: SURBL whitelist volume chicken-egg problem

Posted by Sidney Markowitz <si...@sidney.com>.
Jeff Chan said:
> There are a number of reasons for not doing a whitelist RBL:
>
> 1.  Excessive queries:  Whitehat domains come up a lot
> in messages.

I was thinking along the lines of something that SpamAssassin downloads
once a month, or queries to find out if there is an update available and
only downloads if there is. Since the idea is to limit DNS queries, of
course it would not be implemented as a DNS-based whitelist that is
checked for every URI. It could be served over DNS if you could trust
people not to misuse it, but it must be designed for infrequent bulk
downloads, with queries of URIs done against a local database.
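The client-side flow being described could be sketched like this (names are assumptions: `local_whitelist` is the bulk-downloaded top-N set, and `dns_lookup` stands in for whatever performs the actual SURBL query):

```python
def check_uri_domains(domains, local_whitelist, dns_lookup):
    """Query SURBL only for domains not in the local whitelist.

    `local_whitelist` is a set refreshed infrequently in bulk (e.g.
    monthly); `dns_lookup` does the real DNS query and returns True
    on a blacklist hit.  Whitelisted domains never generate DNS
    traffic, which is the whole point of the list.
    """
    hits = []
    for domain in domains:
        if domain in local_whitelist:
            continue  # skip the DNS query entirely
        if dns_lookup(domain):
            hits.append(domain)
    return hits
```

The key property is that misuse is hard: the list is consulted only to suppress queries, never to produce a score.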

> 2.  Potential misuse:  Inadvertently blacklisting whitehats,
> i.e. user error.

If it is separate enough from the blacklist, i.e., it is queried and used
in a totally different way than a DNS query of each URI domain, then I
don't see much potential for misuse. You simply have a list of the top n
non-spam domains that can be downloaded in bulk, and you document how to
download it and that it exists only to reduce the number of DNS queries.

> 3.  Possibility of negative scoring:  Some application would
> probably try to negative score them

SpamAssassin would not do it. You would not encourage that. Your
documentation would make it clear that it is a list of domains not to
bother DNS querying, and that those domains indicate neither spam nor ham
when they appear in an email. Even if some misguided programmer missed
all that, I don't see how it would end up in a mainstream, popular
antispam program with enough use to affect spammers' behavior.

 Sidney Markowitz
 http://sidney.com



Re: SURBL whitelist volume chicken-egg problem

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Jeff Chan wrote:
> Currently there's no provision for updating the hard coded list
> other than releasing a new version of SA.  Something more dynamic
> could perhaps be engineered, short of another RBL.
> 
> There are a number of reasons for not doing a whitelist RBL:
> 
> 1.  Excessive queries:  Whitehat domains come up a lot in
> messages.

Setting the whitelisted TTLs to 7 days (the default bind max-cache-ttl) 
should take care of excessive load to the SURBL servers.

Of course, to be of any advantage, the whitelisted records would have to 
be part of the zone containing the blacklisted records.  Which would be 
a problem for any application just doing a simple lookup, paying no 
attention to what it actually resolves as...

> 2.  Potential misuse:  Inadvertently blacklisting whitehats, i.e.
> user error.

With a new zone called 'whiteandblack.surbl.org' (or something similar),
containing the data from the current zones plus the whitelisted domains,
it should be pretty obvious that the zone contains both whitelisted and
blacklisted domains, and that an application needs to pay attention to
what a lookup actually resolves to.

The existing blacklist zones would continue as they operate now.
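Interpreting a lookup against such a combined zone might look like this sketch (the whitelist return address is invented; real SURBL blacklist zones return 127.0.0.x bitmask values, and no whitelist records exist):

```python
# Hypothetical return address for whitelist records in a combined
# white+black zone.  Invented for illustration only.
WHITELIST_ADDR = "127.0.1.1"

def classify(resolved_addr):
    """Decide what a combined-zone lookup result means.

    `resolved_addr` is the A record as a string, or None when the
    lookup returned NXDOMAIN (domain on neither list).
    """
    if resolved_addr is None:
        return "unlisted"        # not on either list
    if resolved_addr == WHITELIST_ADDR:
        return "whitelisted"     # skip scoring entirely
    if resolved_addr.startswith("127.0.0."):
        return "blacklisted"     # interpret bits as the existing zones do
    return "unknown"             # e.g. a misbehaving resolver
```

An application doing a naive "listed means spam" check against such a zone would misfire, which is exactly the objection raised above.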

> 3.  Possibility of negative scoring:  Some application would
> probably try to negative score them, which would simply cause
> spammers to load up their spams with a lot of whitehat domains,
> which would drive up mail processing loads, DNS queries, etc.,
> and potentially get spam through filters.
> ...

Protecting end users from their own stupidity is bad enough.  So long as
the RBL is clearly, and visibly, documented (citing reasons why not to
assign negative scores to whitelist hits) I don't believe there is a need
to protect application developers from their stupidity too.  Anyone who
goes adding RBLs to their application without looking into info about
that RBL deserves bad results anyway.

Developers could fall into the same trap with SPF records, assigning
negative scores, but we still use them.  I don't see much of a difference.


Daryl


Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Friday, January 14, 2005, 7:31:06 PM, Sidney Markowitz wrote:
> Robert Menschel said:
>> Don't drop the 125, but simply add to the
>> whitelist a number of the "new" top 100

> I like that idea as a second choice. If the list is only updated when
> there is a new release of SpamAssassin then it will not grow too rapidly.
> It would take quite a few years to reach a table of a thousand entries.

> There would be a minor problem with a whitelisted domain expiring and
> getting snapped up by a spammer. That could be taken care of by the SURBL
> people checking if a domain that is being added to the SURBL is on the
> whitelist and informing the SA team so it can be removed.

Problem is, the 125 (or 250 or whatever) are hard-coded into SA
versions.  Those versions don't get updated very often (and
some people never or only seldom upgrade their code), so it would
be hard to remove hard-coded, formerly legitimate domains that
got recycled by spammers.  Therefore the ones chosen for an
ignore list like this need to be very stable and certain, like
yahoo, w3.org, etc.

> But it is only my second choice, if there is no way for SURBL to monitor
> domains in email independent of the SA queries. I like the idea of them
> getting feeds from ISPs like the one sonic.net offered. That way they can
> maintain a current list of most common domains in ham mail independent of
> the SpamAssassin release cycle. SpamAssassin could download the list more
> or less often depending on how volatile the list is. My guess is that
> monthly is fine, as that is much better than once per SA release cycle.

Currently there's no provision for updating the hard coded list
other than releasing a new version of SA.  Something more dynamic
could perhaps be engineered, short of another RBL.

There are a number of reasons for not doing a whitelist RBL:

1.  Excessive queries:  Whitehat domains come up a lot in
messages.
2.  Potential misuse:  Inadvertently blacklisting whitehats, i.e.
user error.
3.  Possibility of negative scoring:  Some application would
probably try to negative score them, which would simply cause
spammers to load up their spams with a lot of whitehat domains,
which would drive up mail processing loads, DNS queries, etc.,
and potentially get spam through filters.
...

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Sidney Markowitz <si...@sidney.com>.
Robert Menschel said:
> Don't drop the 125, but simply add to the
> whitelist a number of the "new" top 100

I like that idea as a second choice. If the list is only updated when
there is a new release of SpamAssassin then it will not grow too rapidly.
It would take quite a few years to reach a table of a thousand entries.

There would be a minor problem with a whitelisted domain expiring and
getting snapped up by a spammer. That could be taken care of by the SURBL
people checking if a domain that is being added to the SURBL is on the
whitelist and informing the SA team so it can be removed.

But it is only my second choice, if there is no way for SURBL to monitor
domains in email independent of the SA queries. I like the idea of them
getting feeds from ISPs like the one sonic.net offered. That way they can
maintain a current list of most common domains in ham mail independent of
the SpamAssassin release cycle. SpamAssassin could download the list more
or less often depending on how volatile the list is. My guess is that
monthly is fine, as that is much better than once per SA release cycle.

 Sidney Markowitz
 http://www.sidney.com



Re: SURBL whitelist volume chicken-egg problem

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Daniel,

I don't quite understand the problem.

Thursday, January 13, 2005, 7:00:23 PM, you wrote:

DQ> So, as you may be aware, we have a minor issue in terms of
DQ> figuring out which whitelisted domains should be skipped in
DQ> queries.  

DQ>   SpamAssassin now ships with a list of domains that are excluded
DQ> for SURBL lookups from the SURBL whitelist.  This list is the 125
DQ> most commonly queried domains.

Good.

DQ>  SURBL counts the number of queries each domain receives to track
DQ> the most commonly queried domains so we can produce an accurate
DQ> list of domains.

Understood.

DQ>   But, once we skip a domain, its relative volume is going to drop
DQ> way off in the SURBL data.

But that doesn't affect the validity of the 125 domains, does it?

Assume that the top 125 domains are stable and reliable and should
remain whitelisted (changes to that decision should be made manually,
and almost never). Let SURBL collect its new data, and look at the top
50-100 domains. Assume that at least some of these first top 125 will
not be listed, since they aren't being SURBL queried. Don't drop the
125, but simply add to the whitelist a number of the "new" top 100, if
warranted.

Would that not work?

Bob Menschel




Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Friday, January 14, 2005, 6:39:07 AM, Raymond Dijkxhoorn wrote:
>> Why not make the SURBL whitelist into a DNSBL? I'm assuming that the
>> whitelist has something to do with correctness of results. Or is it just
>> to reduce the number of queries?

> It was mainly to reduce hits on the nameservers. On all SURBLs the
> whitelist is already used, so for lookups it would be a waste to do an
> extra lookup. It doesn't bring much extra, or would you like to add a
> negative score for whitelisted domains? Then it would make sense. But on
> the other hand spammers will then add random links with whitelisted
> domains, since the whitelist is publicly available.

Yes, that's a reason why we didn't do it.  :-)

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Raymond Dijkxhoorn <ra...@prolocation.net>.
Hi!

>> So, as you may be aware, we have a minor issue in terms of figuring out
>> which whitelisted domains should be skipped in queries.

>>   SpamAssassin now ships with a list of domains that are excluded for
>>   SURBL lookups from the SURBL whitelist.  This list is the 125
>>   most commonly queried domains.

> Why not make the SURBL whitelist into a DNSBL? I'm assuming that the
> whitelist has something to do with correctness of results. Or is it just
> to reduce the number of queries?

It was mainly to reduce hits on the nameservers. On all SURBLs the
whitelist is already used, so for lookups it would be a waste to do an
extra lookup. It doesn't bring much extra, or would you like to add a
negative score for whitelisted domains? Then it would make sense. But on
the other hand spammers will then add random links with whitelisted
domains, since the whitelist is publicly available.

Bye,
Raymond.

Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Friday, January 14, 2005, 5:34:06 AM, Tony Finch wrote:
> On Thu, 13 Jan 2005, Daniel Quinlan wrote:

>> So, as you may be aware, we have a minor issue in terms of figuring out
>> which whitelisted domains should be skipped in queries.
>>
>>   SpamAssassin now ships with a list of domains that are excluded for
>>   SURBL lookups from the SURBL whitelist.  This list is the 125
>>   most commonly queried domains.

> Why not make the SURBL whitelist into a DNSBL? I'm assuming that the
> whitelist has something to do with correctness of results. Or is it just
> to reduce the number of queries?

> Tony.

We prevent about half of all DNS queries by whitelisting the top
125 and not checking those on the client side (in SA).

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Tony Finch <do...@dotat.at>.
On Thu, 13 Jan 2005, Daniel Quinlan wrote:

> So, as you may be aware, we have a minor issue in terms of figuring out
> which whitelisted domains should be skipped in queries.
>
>   SpamAssassin now ships with a list of domains that are excluded for
>   SURBL lookups from the SURBL whitelist.  This list is the 125
>   most commonly queried domains.

Why not make the SURBL whitelist into a DNSBL? I'm assuming that the
whitelist has something to do with correctness of results. Or is it just
to reduce the number of queries?

Tony.
-- 
f.a.n.finch  <do...@dotat.at>  http://dotat.at/
ST DAVIDS HEAD TO COLWYN BAY, INCLUDING ST GEORGES CHANNEL: SOUTH OR SOUTHEAST
5 OR 6, OCCASIONALLY 7. OUTBREAK OF RAIN AND DRIZZLE. MODERATE OR GOOD,
POSSIBLY POOR AT TIMES. MODERATE OR ROUGH.

Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Friday, January 14, 2005, 3:01:27 AM, Raymond Dijkxhoorn wrote:
> Hi!

>>>>>> Another way to fix the problem would be to rank the domains with some
>>>>>> other source of volume data (not SURBL-related) such as looking at a DNS
>>>>>> cache at a large ISP.

>>>> Not everyone may have been included in an earlier discussion.
>>>> Since SpamAssassin is whitelisting a top 125 of domains and
>>>> not checking them, those 125 tend to be underrepresented in
>>>> the DNS queries.  Daniel was interested in finding a more
>>>> representative sample of the whitehat domains to feed back
>>>> into the process to revise the 125.

>>> So we should do some counting again. That's not a big problem, is it? :)

>> Since SpamAssassin isn't even checking those 125 domains
>> they won't appear in the queries.  We can't count what
>> isn't there.  ;-)

> Partly; still, a lot of people are using SA 2.x with the plugin.
> But true, you've got a point there :)

Yes, and there are other programs using SURBLs other
than SpamAssassin, but it's probably the main one.

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Raymond Dijkxhoorn <ra...@prolocation.net>.
Hi!

>>>>> Another way to fix the problem would be to rank the domains with some
>>>>> other source of volume data (not SURBL-related) such as looking at a DNS
>>>>> cache at a large ISP.

>>> Not everyone may have been included in an earlier discussion.
>>> Since SpamAssassin is whitelisting a top 125 of domains and
>>> not checking them, those 125 tend to be underrepresented in
>>> the DNS queries.  Daniel was interested in finding a more
>>> representative sample of the whitehat domains to feed back
>>> into the process to revise the 125.

>> So we should do some counting again. That's not a big problem, is it? :)

> Since SpamAssassin isn't even checking those 125 domains
> they won't appear in the queries.  We can't count what
> isn't there.  ;-)

Partly; still, a lot of people are using SA 2.x with the plugin.
But true, you've got a point there :)

Bye,
Raymond.




Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
Hello Raymond,


On Friday, January 14, 2005, 1:56:11 AM, Raymond Dijkxhoorn wrote:
> Hi!

>>>> Another way to fix the problem would be to rank the domains with some
>>>> other source of volume data (not SURBL-related) such as looking at a DNS
>>>> cache at a large ISP.

>> Not everyone may have been included in an earlier discussion.
>> Since SpamAssassin is whitelisting a top 125 of domains and
>> not checking them, those 125 tend to be underrepresented in
>> the DNS queries.  Daniel was interested in finding a more
>> representative sample of the whitehat domains to feed back
>> into the process to revise the 125.

> So we should do some counting again. That's not a big problem, is it? :)

Since SpamAssassin isn't even checking those 125 domains
they won't appear in the queries.  We can't count what
isn't there.  ;-)

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Raymond Dijkxhoorn <ra...@prolocation.net>.
Hi!

>>> Another way to fix the problem would be to rank the domains with some
>>> other source of volume data (not SURBL-related) such as looking at a DNS
>>> cache at a large ISP.

> Not everyone may have been included in an earlier discussion.
> Since SpamAssassin is whitelisting a top 125 of domains and
> not checking them, those 125 tend to be underrepresented in
> the DNS queries.  Daniel was interested in finding a more
> representative sample of the whitehat domains to feed back
> into the process to revise the 125.

So we should do some counting again. That's not a big problem, is it? :)

Bye,
Raymond.

Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Friday, January 14, 2005, 1:00:04 AM, Raymond Dijkxhoorn wrote:
>> One idea I had to fix this is that SA not use the SURBL whitelist for 1
>> in 10 queries and that those be directed to a different zone.  However,
>> that would be somewhat counterproductive in terms of DNS caching and I'm
>> not sure how happy Jeff would be about the idea.

> Please don't; this won't scale at all, and people running their own
> copies of the RBLs won't be happy with it.

>> Another way would be to not use the exclusion list for certain periods
>> of time if you could select just those times for generating volume
>> data.  A bit too hacky.
>>
>> Another way to fix the problem would be to rank the domains with some
>> other source of volume data (not SURBL-related) such as looking at a DNS
>> cache at a large ISP.

> We already do these things: we monitor traffic on some of the SURBL
> servers and have pretty good stats available on what the 'top domains' are.

Hi Raymond,
Not everyone may have been included in an earlier discussion.
Since SpamAssassin is whitelisting a top 125 of domains and
not checking them, those 125 tend to be underrepresented in
the DNS queries.  Daniel was interested in finding a more
representative sample of the whitehat domains to feed back
into the process to revise the 125.

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/


Re: SURBL whitelist volume chicken-egg problem

Posted by Raymond Dijkxhoorn <ra...@prolocation.net>.
Hi!

> One idea I had to fix this is that SA not use the SURBL whitelist for 1
> in 10 queries and that those be directed to a different zone.  However,
> that would be somewhat counterproductive in terms of DNS caching and I'm
> not sure how happy Jeff would be about the idea.

Please don't; this won't scale at all, and people running their own
copies of the RBLs won't be happy with it.

> Another way would be to not use the exclusion list for certain periods
> of time if you could select just those times for generating volume
> data.  A bit too hacky.
>
> Another way to fix the problem would be to rank the domains with some
> other source of volume data (not SURBL-related) such as looking at a DNS
> cache at a large ISP.

We already do these things: we monitor traffic on some of the SURBL
servers and have pretty good stats available on what the 'top domains' are.

Bye,
Raymond.

Re: SURBL whitelist volume chicken-egg problem

Posted by Jeff Chan <wh...@surbl.org>.
On Thursday, January 13, 2005, 7:00:23 PM, Daniel Quinlan wrote:
> So, as you may be aware, we have a minor issue in terms of figuring out
> which whitelisted domains should be skipped in queries.

>   SpamAssassin now ships with a list of domains that are excluded for
>   SURBL lookups from the SURBL whitelist.  This list is the 125
>   most commonly queried domains.

>   SURBL counts the number of queries each domain receives to track the
>   most commonly queried domains so we can produce an accurate list of
>   domains.

>   But, once we skip a domain, its relative volume is going to drop way
>   off in the SURBL data.

> One idea I had to fix this is that SA not use the SURBL whitelist for 1
> in 10 queries and that those be directed to a different zone.  However,
> that would be somewhat counterproductive in terms of DNS caching and I'm
> not sure how happy Jeff would be about the idea.

> Another way would be to not use the exclusion list for certain periods
> of time if you could select just those times for generating volume
> data.  A bit too hacky.

> Another way to fix the problem would be to rank the domains with some
> other source of volume data (not SURBL-related) such as looking at a DNS
> cache at a large ISP.

> Any other ideas?

> Daniel

As a matter of fact, Sonic (a medium-large ISP) has offered me
a ham and spam URI host feed, but I have not had a chance to
look at it yet.  The ham data could be a source of good white
list domains.

Jeff C.
-- 
Jeff Chan
mailto:whitelist@surbl.org
http://www.surbl.org/