Posted to users@spamassassin.apache.org by Marc Perkel <ma...@perkel.com> on 2008/04/26 18:49:58 UTC

Starting a URIBL - Howto? [OT]

I was just wondering from those of you who have done it - how to start a 
URIBL. I'm guessing the process (simplified) is:

1) Mine messages for links
2) Subtract out anything matching a fairly large white list

So my first question here is - what do most of you use to mine the 
links in a message? Can someone point me in the right direction? 
Also - I'm willing to work with and share data with others who are 
already doing this.
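
As a rough illustration of the two steps above, a minimal Python sketch
might look like the following. The regular expression, the function names
(extract_domains, candidate_domains) and the sample whitelist are all
assumptions made for illustration, not a description of how any existing
URIBL mines links. It also keeps full hostnames rather than reducing them
to registered domains, which a real list would probably want to do.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: matches http/https URLs in plain-text or HTML bodies.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_domains(message_body):
    """Step 1: mine a message body for links and return the set of hostnames."""
    domains = set()
    for url in URL_RE.findall(message_body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs instead of aborting the whole message
        if host:
            domains.add(host.lower())
    return domains

def candidate_domains(message_body, whitelist):
    """Step 2: subtract anything matching the whitelist (host or a parent domain)."""
    def whitelisted(host):
        parts = host.split(".")
        # Check the host itself and each parent, e.g. a.b.example.org -> b.example.org -> example.org
        return any(".".join(parts[i:]) in whitelist for i in range(len(parts) - 1))
    return {d for d in extract_domains(message_body) if not whitelisted(d)}

# Illustrative data only.
WHITELIST = {"apache.org", "example.org"}
body = ('Buy now at http://pills.example.net/x?id=1 or see '
        '<a href="https://spamassassin.apache.org/">this</a>')
print(sorted(candidate_domains(body, WHITELIST)))
```

Whatever survives the whitelist is only a candidate; as the replies in this
thread point out, avoiding FPs takes much more than this.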

Re: Starting a URIBL - Howto? [OT]

Posted by Rob McEwen <ro...@invaluement.com>.

Dallas Engelken wrote:
> No, you're right, that's not fair.   If I compare only recent reactive 
> listings, minus the subdomain hosters that we list, you hit about 60% 
> whereas before it was more like 27%.
>
> ivmURI stats from last 5000 URIBL black listings
> -> 2981 hits
> -> 2019 misses
Dallas, I've made some recent *substantial* improvements to ivmURI. (1) 
I've added *several* new spam sources... it was always a weakness of 
ivmURI that the raw data that fed ivmURI wasn't "wide" enough. That 
incoming data is much wider now! ...and... (2) I improved ivmSIP's 
response time (previously, it was getting bogged down in some auditing 
tasks that were delaying writes to the rsync files... that has been fixed).

RESULTS...

stats from 5/23/2008 (a few minutes ago).
---------------------
322/500 (ivmURI hits from the latest 500 URIBL listings)
(whereas a couple of tests in April showed 186/500 and 225/500)

301/500 (URIBL hits from the latest 500 ivmURI listings)
NOTE: to compare apples-to-apples, subdomain listings in URIBL were removed

Let me know if you'd like a snapshot of ivmURI for your own analysis of 
these latest improvements.

ALSO...

In spite of your off-list explanation, I'm STILL confused about what you 
mean when you refer to URIBL's *pro-active listings*???

You must be either referring to:

(A) Listings *currently* in URIBL-GOLD, but *not* *yet* in URIBL-BLACK
--or--
(B) Listings *currently* in URIBL-BLACK which were *previously* listed 
in URIBL-GOLD

Which is it? "A" or "B"? (or something else?)

OF COURSE: The silly part about all these stats is that the *superior* 
comparison between DNSBLs is "hit rates" on spams sent to mail servers 
combined with low FP rates. It is possible for a DNSBL to have far fewer 
listings, but, in "real world" testing, hit on higher numbers of spams 
with fewer FPs.

Rob McEwen

Re: Re: Starting a URIBL - Howto? [OT]

Posted by Dallas Engelken <da...@uribl.com>.

Rob McEwen wrote:
> Dallas Engelken wrote:
>> Yes, of course, but your results.txt is biased as it only shows 
>> where ivmURI hits.
>>
>> Based on the last 20k adds to URIBL,  it appears to me that ivmURI 
>> has less coverage?
>> <snip>:
> Dallas,
>
> Yes, you are right!
>
> URIBL *does* cast a wider net than ivmURI.
>
> So, in general, I agree with your statement that ivmURI has less 
> coverage than URIBL. But I'm confused about your stats... and they 
> look really weird. (but maybe I'm just not understanding them?)
>
> So here is what I did.
>
> I took the last 500 additions to URIBL, (not including geocities and 
> blogspot items... so that this comparison would compare apples to 
> apples!) I then ran those against ivmURI.
>
> 186 of the 500 latest additions to URIBL were also found in ivmURI.
>
> I then reversed this testing and ran URIBL against the last 500 
> additions to ivmURI.
>
> 328 of the latest 500 additions to ivmURI were listed on URIBL.
>
> So yes, basically, you're right, URIBL does have greater coverage than 
> ivmURI.
>
> Your point is well made. For the most part, URIBL casts a wider net 
> than ivmURI. Also, if you were to include geocities and blogspot hits, 
> of course, that would throw the comparison wildly in URIBL's favor... 
> but I'm not so sure that would be a fair comparison.

No, you're right, that's not fair.   If I compare only recent reactive 
listings, minus the subdomain hosters that we list, you hit about 60% 
whereas before it was more like 27%.

ivmURI stats from last 5000 URIBL black listings
 -> 2981 hits
 -> 2019 misses

>
> (In both tests, I checked against the 2nd list just about 2-3 minutes 
> after grabbing the latest data from first list. This is important as 
> I was seeing those stats quickly grow for BOTH after my initial 
> collection of stats... because items not yet in both lists are 
> continuously getting into the other list fast. So timing is mission 
> critical in this kind of testing and the time between gathering and 
> checking MUST be the same both ways.)
>
> However, I think you missed my point about 
> http://invaluement.com/results.txt
>
> I wasn't saying that this proved that ivmURI is better than URIBL or 
> SURBL. Only that this proves ivmURI as being *relevant* and *useful* 
> ...even for those who are already using *both* URIBL and SURBL.  (and 
> this is just one such proof!)

you said,

"and ALL 3 catch stuff the other 2 miss... FOR EXAMPLE: http://invaluement.com/results.txt )"

your EXAMPLE contradicts the statement that precedes it.  I can only take it in the context of how I read it.  


>
> For example, if ivmURI were only catching stuff already caught by 
> URIBL and SURBL, ivmURI wouldn't be relevant or helpful to anyone. 
> Moreover, I believe that URIBL or SURBL could easily create a 
> similarly impressive page as my http://invaluement.com/results.txt page.

Probably.

>
> Bottom line is that you are correct... AND... I'm sorry you took this 
> as me dissing URIBL!
>

I didn't take it that way.  I was just pointing out that your statement 
didn't match your accompanying example.

> Simply put, there are some series of spams that each of the three URI 
> blacklists are better at catching than the other two. That is ALL that 
> I meant by this.
>
Okay, if you had said that, I would have agreed and never posted :)

-- 
Dallas Engelken
dallase@uribl.com
http://uribl.com

Re: Starting a URIBL - Howto? [OT]

Posted by Rob McEwen <ro...@invaluement.com>.

Dallas Engelken wrote:
> Yes, of course, but your results.txt is biased as it only shows 
> where ivmURI hits.
>
> Based on the last 20k adds to URIBL,  it appears to me that ivmURI has 
> less coverage?
> <snip>:
Dallas,

Yes, you are right!

URIBL *does* cast a wider net than ivmURI.

So, in general, I agree with your statement that ivmURI has less 
coverage than URIBL. But I'm confused about your stats... and they look 
really weird. (but maybe I'm just not understanding them?)

So here is what I did.

I took the last 500 additions to URIBL, (not including geocities and 
blogspot items... so that this comparison would compare apples to 
apples!) I then ran those against ivmURI.

186 of the 500 latest additions to URIBL were also found in ivmURI.

I then reversed this testing and ran URIBL against the last 500 
additions to ivmURI.

328 of the latest 500 additions to ivmURI were listed on URIBL.

So yes, basically, you're right, URIBL does have greater coverage than 
ivmURI.

Your point is well made. For the most part, URIBL casts a wider net than 
ivmURI. Also, if you were to include geocities and blogspot hits, of 
course, that would throw the comparison wildly in URIBL's favor... but 
I'm not so sure that would be a fair comparison.

(In both tests, I checked against the 2nd list just about 2-3 minutes 
after grabbing the latest data from first list. This is important as I 
was seeing those stats quickly grow for BOTH after my initial collection 
of stats... because items not yet in both lists are continuously getting 
into the other list fast. So timing is mission critical in this kind of 
testing and the time between gathering and checking MUST be the same 
both ways.)
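
The two-way comparison described above can be sketched roughly as follows.
The overlap function and the sample data are hypothetical; the sketch
assumes the latest additions to one list are available as a sequence of
domains and the other list as a set, and that subdomain hosters are
dropped from the sample first so both lists are compared on the same kind
of entry.

```python
def overlap(latest_from_a, listed_in_b, excluded_suffixes=()):
    """Count how many of list A's latest additions are also listed in B.

    excluded_suffixes: hosting domains (e.g. free subdomain hosters) removed
    from A's sample before comparing.
    """
    sample = [d for d in latest_from_a
              if not any(d == s or d.endswith("." + s) for s in excluded_suffixes)]
    hits = sum(1 for d in sample if d in listed_in_b)
    return hits, len(sample)

# Illustrative data only; the tests in this thread used the latest 500 additions.
list_a_latest = ["a.example", "b.example", "user.hoster.example", "c.example"]
list_b = {"a.example", "c.example", "z.example"}

hits, total = overlap(list_a_latest, list_b, excluded_suffixes=("hoster.example",))
print(f"{hits}/{total} = {hits / total:.1%}")
```

For the figures reported above, 186/500 works out to 37.2% and 328/500 to
65.6%. Running the same function in both directions, with the same delay
between grabbing the sample and checking it, keeps the comparison
symmetric.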

However, I think you missed my point about 
http://invaluement.com/results.txt

I wasn't saying that this proved that ivmURI is better than URIBL or 
SURBL. Only that this proves ivmURI as being *relevant* and *useful* 
...even for those who are already using *both* URIBL and SURBL.  (and 
this is just one such proof!)

For example, if ivmURI were only catching stuff already caught by URIBL 
and SURBL, ivmURI wouldn't be relevant or helpful to anyone. Moreover, I 
believe that URIBL or SURBL could easily create a similarly impressive 
page as my http://invaluement.com/results.txt page.

Bottom line is that you are correct... AND... I'm sorry you took this as 
me dissing URIBL!

Simply put, there are some series of spams that each of the three URI 
blacklists are better at catching than the other two. That is ALL that I 
meant by this.

I'm trying to NOT turn this into a pissing contest. Can we end this 
here?  (Frankly, I'm keeping a LOT of powder dry right now as a gesture 
of good-will.)

BTW - How do you have access? Direct queries are not allowed... even for 
my paying subscribers. And I don't recall ever setting you up for RSYNC 
access? (I recall offering...  I just don't recall it ever happening.) 
Where is your access coming from?

ALSO: Does this mean that I now am not allowed to make the official 
invaluement.com site launch announcement on the URIBL list? ...I hope 
not... then again, we might all be old and gray by the time that happens :)

Rob McEwen

Re: Starting a URIBL - Howto? [OT]

Posted by Clayton Keller <in...@ruraltel.net>.

Dallas Engelken wrote:
> Rob McEwen wrote:
>> (on-list follow-up)
>>
>> By "proactive listings", I discovered in my off-list conversation with 
>> Dallas that this refers to URIBL-Gold listings... where items are 
>> listed in "uribl-gold" in advance of seeing them in actual spams. But 
>> this uribl-gold list isn't available to the public and is not even 
>> prescribed as a list to use for fighting spam.
> 
> We do ask anyone with access to it to use it.  Since it's basically 
> uribl black for domains that we believe will show up in future spam 
> campaigns, there is no reason not to.  I'm sure there are some on this 
> list that can comment further in regards to its effectiveness.
> 
>> I'm really disappointed that Dallas would have presented that kind of 
>> comparison to ivmURI. This is like comparing some kid's best 
>> basketball game on an X-Box to Michael Jordan's best basketball game 
>> on the court. I'm glad that URIBL-Gold is helping URIBL black get 
>> better... but until the listing actually makes it into URIBL-Black... 
>> and is then actually *usable* for blocking spam...
> 
> From an RBL perspective, the purpose of the data in there is to catch 
> the front end of spam runs.  Assuming it takes ~5 minutes to list, 
> rebuild, and redistribute new zone data in reactive mode, we could miss 
> 50% of a 10 minute campaign.  Obviously the longer the campaign draws 
> out, the better the miss rate looks.   But those using gold+black have 
> 100% hitrates on a lot of these campaigns, which is something that is 
> difficult if not impossible to achieve on a reactive blacklist based 
> solely on trap data or user feedback.
> 
> As you can see at http://www.uribl.com/gold.shtml, over 20% (14k of 57k) 
> of the domains that have been listed in gold for hours, days, even 
> weeks, have since moved to black.    So,  assume each of those 14k 
> domains returned NXDOMAIN on black.uribl.com for the first ~5 minutes of 
> each of their campaigns, how much spam do you think we missed?  Quite a 
> lot I'd say.   That short window is what we are targeting here.   It 
> doesn't result in a huge hitrate because it only hits in gold during the 
> rebuild and redistribute window, but it does serve its purpose quite well.
> 
> Aside from client side spam filtering,  I could see 
> registries/registrars, web hosts, ip space owners and the like 
> benefiting from this data as well.  Knowing there is potential for abuse 
> before the abuse actually occurs could be quite a powerful tool.    
> For example, I can tell you that ns1.tuhaerge.com is the next NS that 
> will be spewing up VPXL crapmail 
> (http://www.spamtrackers.hk/wiki/index.php?title=VPXL)..    That NS and 
> every domain registered against that NS should be instantly nuked, but 
> getting those Chinese registrars to action anything like this, even with 
> proper evidence, is nearly impossible... just think if you asked them to 
> kill it before the abuse started.  ;)

Hi, I just wanted to comment that only a few hours after Dallas sent his 
last email we did see that NS spewing junk.

I know it's a little late in response, but I thought I'd pass this info 
along to everyone involved in the thread just so you know your work does 
appear to be paying off.

Re: Starting a URIBL - Howto? [OT]

Posted by Dallas Engelken <da...@uribl.com>.

Rob McEwen wrote:
> (on-list follow-up)
>
> By "proactive listings", I discovered in my off-list conversation with 
> Dallas that this refers to URIBL-Gold listings... where items are 
> listed in "uribl-gold" in advance of seeing them in actual spams. But 
> this uribl-gold list isn't available to the public and is not even 
> prescribed as a list to use for fighting spam.

We do ask anyone with access to it to use it.  Since it's basically 
uribl black for domains that we believe will show up in future spam 
campaigns, there is no reason not to.  I'm sure there are some on this 
list that can comment further in regards to its effectiveness.

> I'm really disappointed that Dallas would have presented that kind of 
> comparison to ivmURI. This is like comparing some kid's best 
> basketball game on an X-Box to Michael Jordan's best basketball game 
> on the court. I'm glad that URIBL-Gold is helping URIBL black get 
> better... but until the listing actually makes it into URIBL-Black... 
> and is then actually *usable* for blocking spam...

From an RBL perspective, the purpose of the data in there is to catch 
the front end of spam runs.  Assuming it takes ~5 minutes to list, 
rebuild, and redistribute new zone data in reactive mode, we could miss 
50% of a 10 minute campaign.  Obviously the longer the campaign draws 
out, the better the miss rate looks.   But those using gold+black have 
100% hitrates on a lot of these campaigns, which is something that is 
difficult if not impossible to achieve on a reactive blacklist based 
solely on trap data or user feedback.
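
The arithmetic behind the "50% of a 10 minute campaign" figure can be
sketched as below. The missed_fraction name is hypothetical, and the sketch
assumes spam is sent at a constant rate for the whole campaign, which is a
simplification.

```python
def missed_fraction(listing_delay_min, campaign_length_min):
    """Fraction of a campaign sent before a reactive listing takes effect.

    Assumes a constant sending rate over the campaign (a simplification).
    """
    if campaign_length_min <= 0:
        raise ValueError("campaign length must be positive")
    # A campaign shorter than the delay is missed entirely, hence the cap at 1.0.
    return min(listing_delay_min / campaign_length_min, 1.0)

# ~5 minutes to list, rebuild and redistribute, against campaigns of various lengths.
for length in (10, 30, 60, 240):
    print(f"{length:>4} min campaign: {missed_fraction(5, length):.1%} missed")
```

Under that assumption the missed share shrinks as the campaign gets longer,
while a listing that is already in place before the campaign starts has no
such window at all.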

As you can see at http://www.uribl.com/gold.shtml, over 20% (14k of 57k) 
of the domains that have been listed in gold for hours, days, even 
weeks, have since moved to black.    So,  assume each of those 14k 
domains returned NXDOMAIN on black.uribl.com for the first ~5 minutes of 
each of their campaigns, how much spam do you think we missed?  Quite a 
lot I'd say.   That short window is what we are targeting here.   It 
doesn't result in a huge hitrate because it only hits in gold during the 
rebuild and redistribute window, but it does serve its purpose quite well.
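
For readers unfamiliar with how such a lookup works, here is a minimal
sketch of a DNS-based check against a zone such as black.uribl.com. The
function names are hypothetical, and the sketch assumes the usual URI
blacklist convention of prepending the domain to the zone name. It treats
any resolution failure as "not listed", which is cruder than a real client
would want, since a timeout is not the same as NXDOMAIN.

```python
import socket

def uribl_query_name(domain, zone="black.uribl.com"):
    """Build the DNS name to query: the domain prepended to the blacklist zone."""
    return f"{domain.strip('.').lower()}.{zone}"

def is_listed(domain, zone="black.uribl.com", resolver=socket.gethostbyname):
    """Return True if the zone answers with an A record, False otherwise.

    resolver is injectable so the logic can be exercised without network access.
    """
    try:
        resolver(uribl_query_name(domain, zone))
        return True
    except socket.gaierror:
        # NXDOMAIN (not listed) and other lookup failures both land here.
        return False

# Example (performs a real DNS query, so it is left commented out):
# print(is_listed("example.com"))
```

A domain that is not yet in the zone simply fails to resolve, which is the
NXDOMAIN window described above.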

Aside from client side spam filtering,  I could see 
registries/registrars, web hosts, ip space owners and the like 
benefiting from this data as well.  Knowing there is potential for abuse 
before the abuse actually occurs could be quite a powerful tool.    
For example, I can tell you that ns1.tuhaerge.com is the next NS that 
will be spewing up VPXL crapmail 
(http://www.spamtrackers.hk/wiki/index.php?title=VPXL)..    That NS and 
every domain registered against that NS should be instantly nuked, but 
getting those Chinese registrars to action anything like this, even with 
proper evidence, is nearly impossible... just think if you asked them to 
kill it before the abuse started.  ;) 

-- 
Dallas Engelken
dallase@uribl.com
http://uribl.com

Re: Starting a URIBL - Howto? [OT]

Posted by Rob McEwen <ro...@invaluement.com>.

(on-list follow-up)

First, earlier I presented these stats:
186/500 (ivmURI hits from the latest 500 URIBL listings)
328/500 (URIBL hits from the latest 500 ivmURI listings)

A follow-up *identical* test... only conducted later... gave these stats:
225/500 (ivmURI hits from the latest 500 URIBL listings)
282/500 (URIBL hits from the latest 500 ivmURI listings)

(geocities/blogspots/etc URIs excluded from both tests)

Why the difference? Why the improvement in ivmURI? How did ivmURI 
*significantly* narrow that gap?

Two reasons:
(1) ivmURI's engine works faster during non-EST-business hours and 
weekend hours (for various reasons) ...(I'm working on ivmURI's engine 
right now. I've made these needed improvements with ivmSIP... now I just 
need to do the same with ivmURI)
(2) While much of URIBL is automated, user-submissions to URIBL wane a 
bit when both America and Europe are experiencing non-business hours.. 
even non-waking hours... and weekend hours

That is the reason why ivmURI does BETTER in that testing than it did 
several hours ago.

...but none of this matters that much... as I'll prove later... but I 
present this anyways "for the record"

Dallas Engelken wrote:
> ivmURI stats from last 20000 URIBL reactive listings.
> -> 5519 hits
> -> 14481 misses
Dallas confirmed that these initial stats he posted DID include all 
those geocities, blogspot, and other subdomains in URIBL that ivmURI 
doesn't even try to catch... and there are TONS of those now in the 
URIBL list. So Dallas's stats here are comparing "apples to oranges". 
According to Dallas's off-list comments to me, when the "subdomains" are 
removed, the ivmURI hits on recent URIBL listings are significantly 
higher than these stats he originally posted. Of course, I don't make it 
my goal in life to list every last domain in URIBL. But this would 
partially explain why my stats look so different from Dallas's stats... 
and why these stats (unfairly and artificially) made ivmURI look so bad.

> ivmURI stats from last 20000 URIBL proactive listings.
> -> 351 hits
> -> 19649 misses
By "proactive listings", I discovered in my off-list conversation with 
Dallas that this refers to URIBL-Gold listings... where items are listed 
in "uribl-gold" in advance of seeing them in actual spams. But this 
uribl-gold list isn't available to the public and is not even prescribed 
as a list to use for fighting spam. I'm really disappointed that Dallas 
would have presented that kind of comparison to ivmURI. This is like 
comparing some kid's best basketball game on an X-Box to Michael 
Jordan's best basketball game on the court. I'm glad that URIBL-Gold is 
helping URIBL black get better... but until the listing actually makes 
it into URIBL-Black... and is then actually *usable* for blocking 
spam... it really doesn't count for anything. Therefore, such a 
comparison is not only unfair, it is downright laughable. (To be extra 
clear, in contrast to URIBL-gold, ALL the items reported on 
http://invaluement.com/results.txt HAVE been seen "in the wild" and I do 
have corresponding evidence spams "on file")

A LARGER QUESTION:

What matters more, how many items are in a list? Or (1) the amount of 
"real world" spam sent to *real* users (NOT dictionary attack spam sent 
to "unknown users") that a list "hits" on? Along with (2) low FP-rates.

At the moment:

SURBL has 1.34 MILLION listings
URIBL has 310K listings
ivmURI has 233K listings

But those numbers don't tell the whole story. ivmURI stands up quite 
well when measuring real world "hits" on spam sent to real users. When 
measured in the real world, ivmURI compares quite well in 
head-to-head-to-head tests against SURBL and URIBL... even with its 
smaller footprint... and ivmURI is at least as good in the low-FPs 
department.
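
To make the distinction between list size and real-world performance
concrete, here is a small sketch of how hit rate and FP rate could be
measured per list. The score_list function and all of the data are
hypothetical; none of the numbers come from SURBL, URIBL or ivmURI.

```python
def score_list(blacklist, spam_domains, ham_domains):
    """Return (hit_rate, fp_rate) for one blacklist.

    spam_domains / ham_domains hold one entry per message: the domain found
    in a spam or a legitimate message delivered to a real user.
    """
    hits = sum(1 for d in spam_domains if d in blacklist)
    fps = sum(1 for d in ham_domains if d in blacklist)
    return hits / len(spam_domains), fps / len(ham_domains)

# Illustrative corpus: one heavily spammed domain, one rare one, two legitimate ones.
spam = ["s1.example", "s1.example", "s1.example", "s2.example"]
ham = ["h1.example", "h2.example"]

big_list = {"s2.example", "x1.example", "x2.example", "x3.example", "h1.example"}
small_list = {"s1.example"}

for name, bl in (("big", big_list), ("small", small_list)):
    hit_rate, fp_rate = score_list(bl, spam, ham)
    print(f"{name}: {len(bl)} listings, hit rate {hit_rate:.0%}, FP rate {fp_rate:.0%}")
```

In this made-up corpus the smaller list hits more of the spam with fewer
FPs, which is the sense in which listing counts alone don't tell the whole
story.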

But, like I said, ALL three lists are indispensable and block spam that 
the other two miss.

Rob McEwen

Re: Re: Starting a URIBL - Howto? [OT]

Posted by Dallas Engelken <da...@uribl.com>.

Rob McEwen wrote:
>
>  and ALL 3 catch stuff the other 2 miss... FOR EXAMPLE: 
> http://invaluement.com/results.txt )
>
Yes, of course, but your results.txt is biased as it only shows where 
ivmURI hits.

Based on the last 20k adds to URIBL,  it appears to me that ivmURI has 
less coverage?

ivmURI stats from last 20000 URIBL reactive listings.
 -> 5519 hits
 -> 14481 misses

ivmURI stats from last 20000 URIBL proactive listings.
 -> 351 hits
 -> 19649 misses


-- 
Dallas Engelken
dallase@uribl.com
http://uribl.com

Re: Starting a URIBL - Howto? [OT]

Posted by Rob McEwen <ro...@invaluement.com>.

Jeremy Fairbrass wrote:
> Hi Rob,
> Are your invaluement.com DNSBLs available for us to use? Your 
> http://invaluement.com/results.txt page tells me why I should be using 
> it TODAY ;) but I can't find any info about how...!!
>
> Cheers,
> Jeremy
[Note, others have asked the same on-line. This will be my only on-list 
answer. The others will get the same answer off-list. Anyone else 
interested should e-mail me off-list, rob@invaluement.com ]

Jeremy,

The beta testing period is over. I now only allow paying subscribers. 
Hopefully, the web site to sign up will be available soon. (I keep 
trying to finish it... but it gets delayed all the time as I continually 
get carried away finding new ways to improve my lists!) In the meantime, 
you can get access immediately... before the web site launches... by 
filling out the following form below (anyone is welcome to do this... 
just make sure you send this back to *me* and not to the list). I'll then 
respond with further instructions as well as a button where you can 
subscribe via PayPal. The subscription includes a trial period where you 
pay $1 for the first 10 days. You can cancel at any time. (Other methods 
of payment are available upon request and I often grant very large 
potential subscribers longer periods of free testing time.)

Also, direct queries to my DNSBLs are never allowed and will always 
fail... even for subscribers! Instead, subscribers get access via RSYNC 
to either rbldnsd-formatted files, or BIND-formatted files. (I provide 
detailed instructions about that!) Additionally, I have to have a clear 
understanding of who and what each subscription will be used for. For 
example, at this point, I don't know enough about Jeremy Fairbrass to 
send that subscription button... but I'm sure he will help me out with that!

*********************************************************
Obtaining a subscription to the invaluement.com DNSBLs

(1) Name & contact information, including phone number & e-mail address,
company, etc.

(2) Tell me the approximate number of mailboxes/users that your use of this
product will protect if/when you decide to officially subscribe. (also include
your spam filtering customers' users if applicable... basically, anyone who
benefits by your use of this product should be included in the total)

(3) Let me know if you provide either spam filtering software or 
filtering appliances or DNSBLs or any other spam filtering technologies 
to third parties where the actual filtering is then done outside of your 
network. There is a different pricing plan for those situations.

(4) What type of access do you require:

(a) RSYNC to rbldnsd-formatted files (RECOMMENDED!)
...OR..
(b) RSYNC to (dns) bind-formatted files

(5) What IP address should I grant permission for your RSYNC 
client to access the lists from? (and a backup IP is welcome)

******************************************

Send that information and I'll respond with further instructions as well 
as my now-finalized price list (the same one that will be posted on my 
web site soon).

Thanks for your interest!

Rob McEwen

Re: Starting a URIBL - Howto? [OT]

Posted by Jeremy Fairbrass <je...@fairbrass.co.nz>.

"Rob McEwen" <ro...@invaluement.com> wrote in message news:48137B9C.7090906@invaluement.com...
> Marc Perkel wrote:
>> I was just wondering from those of you who have done it - how to start a URIBL. I'm guessing the process (simplified) is:
>>
>> 1) Mine messages for links
>> 2) Subtract out anything matching a fairly large white list
>>
>> So my first question here is - what do most of you use to mine the links in a message? Can someone point me in the right 
>> direction? Also - I'm willing to work with and share data with others who are already doing this.
>>
> Marc,
>
> Just like a regular sender's IP dnsbl (aka "RBL"), the hardest part is not having FPs... in fact, this is probably *harder* for 
> URIBLs compared to RBLs. The second hardest part is being able to list spammer's URIs *quickly* (particularly since trying to do 
> so exacerbates the first problem.)
>
> The process you described is the best way to start... it is where everyone starts. But many have started with amazing whitelists, 
> done what you described, and have failed. It takes much more than a great whitelist to make a great blacklist.
>
> In fact, I know someone who frequents these anti-spam lists ...who I consider smarter than either you or me... and I happen to 
> consider him the world's foremost authority on how to create and maintain a *great* RBL. (I'm not allowed to mention who he is... 
> in this context... but just about everyone reading this would recognize his name... NO, this is NOT Steve Linford... please, no 
> questions or guesses about this!)  Anyway, over the past several months... he tried to create a great URIBL and, so far, his URIBL 
> falls far short of SURBL and URIBL and ivmURI.
>
> Marc, if I had to make a short list of those who I thought might be able to pull this off... you'd definitely be on the short 
> list.
>
> However, don't be discouraged if you come up short and/or if it takes many months... even years... to accomplish what you seek. If 
> the guy I described can't do it (at least last I checked...), then believe me, this is NOT an easy task.
>
> I know MUCH about this. I've been one of the admins for SURBL for the past 4+ years. Additionally, I created my own URIBL called 
> "ivmURI", which is now *easily* in the same league as SURBL and URIBL... In fact, ivmSIP is probably even better... at least, 
> according to the hit stats and FP stats that some of my users have provided me where all three URI blacklists are compared to each 
> other. (Of course, all three lists are indispensable... I use ALL of them in my spam filtering... and ALL 3 catch stuff the other 
> 2 miss... FOR EXAMPLE: http://invaluement.com/results.txt )
>
> At this time, there is no other publicly available URI blacklist that comes close to SURBL and URIBL and ivmURI. No "close" 4th 
> place. Again, *not* *even* *close*.
>
> I hope this helps and doesn't discourage you. I had a wise college professor tell me "big problem, big solution... little problem, 
> little solution". Spammers' URIs are a big problem that requires a big solution. Knowing what you're up against in creating a URI 
> blacklist might seem discouraging in the short term, but might give you the proper long-term focus and patience you need to really 
> pull this off.
>
> Best wishes for your success in this endeavor!
>
> Rob McEwen
> (creator of the "invaluement.com" DNSBLs, ivmURI & ivmSIP)
>


Hi Rob,
Are your invaluement.com DNSBLs available for us to use? Your http://invaluement.com/results.txt page tells me why I should be using 
it TODAY ;) but I can't find any info about how...!!

Cheers,
Jeremy

Re: Starting a URIBL - Howto? [OT]

Posted by Rob McEwen <ro...@invaluement.com>.

Marc Perkel wrote:
> I was just wondering from those of you who have done it - how to start 
> a URIBL. I'm guessing the process (simplified) is:
>
> 1) Mine messages for links
> 2) Subtract out anything matching a fairly large white list
>
> So my first question here is - what do most of you use to mine the 
> links in a message? Can someone point me in the right direction? 
> Also - I'm willing to work with and share data with others who are 
> already doing this.
>
Marc,

Just like a regular sender's IP dnsbl (aka "RBL"), the hardest part is 
not having FPs... in fact, this is probably *harder* for URIBLs compared 
to RBLs. The second hardest part is being able to list spammer's URIs 
*quickly* (particularly since trying to do so exacerbates the first 
problem.)

The process you described is the best way to start... it is where 
everyone starts. But many have started with amazing whitelists, done 
what you described, and have failed. It takes much more than a great 
whitelist to make a great blacklist.

In fact, I know someone who frequents these anti-spam lists ...who I 
consider smarter than either you or me... and I happen to consider him 
the world's foremost authority on how to create and maintain a *great* 
RBL. (I'm not allowed to mention who he is... in this context... but 
just about everyone reading this would recognize his name... NO, this is 
NOT Steve Linford... please, no questions or guesses about this!)  
Anyway, over the past several months... he tried to create a great URIBL 
and, so far, his URIBL falls far short of SURBL and URIBL and ivmURI.

Marc, if I had to make a short list of those who I thought might be able 
to pull this off... you'd definitely be on the short list.

However, don't be discouraged if you come up short and/or if it takes 
many months... even years... to accomplish what you seek. If the guy I 
described can't do it (at least last I checked...), then believe me, 
this is NOT an easy task.

I know MUCH about this. I've been one of the admins for SURBL for the 
past 4+ years. Additionally, I created my own URIBL called "ivmURI", which 
is now *easily* in the same league as SURBL and URIBL... In fact, ivmSIP 
is probably even better... at least, according to the hit stats and FP 
stats that some of my users have provided me where all three URI 
blacklists are compared to each other. (Of course, all three lists are 
indispensable... I use ALL of them in my spam filtering... and ALL 3 
catch stuff the other 2 miss... FOR EXAMPLE: 
http://invaluement.com/results.txt )

At this time, there is no other publicly available URI blacklist that 
comes close to SURBL and URIBL and ivmURI. No "close" 4th place. Again, 
*not* *even* *close*.

I hope this helps and doesn't discourage you. I had a wise college 
professor tell me "big problem, big solution... little problem, little 
solution". Spammers' URIs are a big problem that requires a big solution. 
Knowing what you're up against in creating a URI blacklist might seem 
discouraging in the short term, but might give you the proper long-term 
focus and patience you need to really pull this off.

Best wishes for your success in this endeavor!

Rob McEwen
(creator of the "invaluement.com" DNSBLs, ivmURI & ivmSIP)