Posted to users@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2009/04/29 02:40:59 UTC

my emailBL is live!

This was actually rather simple to set up.  I'll publish the code
(AGPL) that runs it in a bit (I need to clean it up to withstand the
heavy-handed criticism on this list ...).  Note, I'm using ZoneEdit's
free NS mirroring, which has limited bandwidth.  I'm willing to pay
their minimum threshold if it gets that popular, but any more than
that and I'll be looking for other options.  (NOT PRODUCTION GRADE!)

A SpamAssassin plugin will be needed to get it working, too ... I
suspect there are gurus here who can do that part as easily as I did
the scraper and BIND code.  If nobody bites, I'll get to it in time.

For now, we have a functional proof-of-concept.  I'll post the code, a
more formal announcement, and more documentation to my blog and
website in a few days ("a few" might be a large number).  The emailBL
syncs with the upstream every 4h (I'd reduce the TTL and increase the
syncing frequency, but I'd risk running out of bandwidth).

(Note, the DNS will take another 1-4 hours to propagate.)

The structure of the upstream list:

    ADDRESS,TYPE[TYPE...],DATE

ADDRESS is an email address like <test@ emailbl.khopesh.com>
TYPE is one or more letters of A B C D as follows:
    A (reply-to)
    B (from, !reply-to)
    C (msg body has ADDRESS)
    D (msg body has ADDRESS obfuscated)
DATE is the last time it was seen, formatted YYYYMMDD, in UTC(?).
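
As a rough illustration of that record format, here is a minimal Python sketch that splits one upstream line into its three fields. The names `Entry` and `parse_entry` are hypothetical (they are not part of my scraper), and it only accepts the four TYPE letters listed above.

```python
from dataclasses import dataclass
from datetime import date, datetime

VALID_TYPES = set("ABCD")  # the four TYPE letters described above


@dataclass
class Entry:
    address: str
    types: str
    last_seen: date


def parse_entry(line: str) -> Entry:
    """Parse one 'ADDRESS,TYPE[TYPE...],DATE' line of the upstream list."""
    # Split from the right so the address field is kept in one piece.
    address, types, stamp = line.strip().rsplit(",", 2)
    if not types or not set(types) <= VALID_TYPES:
        raise ValueError(f"unexpected TYPE field: {types!r}")
    # DATE is YYYYMMDD; the time zone (UTC?) is not encoded in the field.
    last_seen = datetime.strptime(stamp, "%Y%m%d").date()
    return Entry(address.lower(), types, last_seen)
```

Lower-casing the address is an assumption made here so that later lookups are case-insensitive.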

The structure of domains in my emailBL index:

    USER.DOMAIN.emailbl.khopesh.com  TXT  <DATE>
    USER.DOMAIN.emailbl.khopesh.com  A    127.0.0.<N_TYPE>

USER is the ADDRESS's username, altered as follows:
  s/^([^@+]{1,16})[^@]*@.*/$1/;  # truncate to 16 characters
  s/^[^a-z0-9]*|[^a-z0-9]*$//g;  # fix leading/trailing chars
  s/[^-a-z.0-9]/-/g;             # fix illegal chars
DOMAIN is the ADDRESS's domain
N_TYPE is a numerical version of TYPE above (A=1, B=2, C=3, D=4)
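
For anyone who wants to build lookup names outside Perl, the three substitutions above can be ported roughly as follows. This is a sketch, not my production code: the function name `lookup_name` is hypothetical, and lower-casing the address first is an assumption (the character classes above only allow a-z).

```python
import re

ZONE = "emailbl.khopesh.com"


def lookup_name(address: str, zone: str = ZONE) -> str:
    """Build the USER.DOMAIN.<zone> name for an email address."""
    address = address.strip().lower()  # assumption: normalise case first
    domain = address.rpartition("@")[2]
    # Keep at most 16 leading characters of the username, dropping any
    # "+tag" suffix and everything from the at sign onwards.
    user = re.sub(r"^([^@+]{1,16})[^@]*@.*", r"\1", address)
    # Strip leading/trailing characters that are not a-z or 0-9.
    user = re.sub(r"^[^a-z0-9]*|[^a-z0-9]*$", "", user)
    # Replace anything else that is not legal in a hostname label.
    user = re.sub(r"[^-a-z.0-9]", "-", user)
    return f"{user}.{domain}.{zone}"
```

Note that a dot inside the username survives as a label separator, so the result can have more labels than the address had.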

Main test points (with no space after the at sign, obviously):

    test@ example.com
        -> test.example.com.emailbl.khopesh.com
    test@ emailbl.khopesh.com
        -> test.emailbl.khopesh.com.emailbl.khopesh.com

Alternate test point (mimicking DNSBLs):

    2.0.0.127.emailbl.khopesh.com

Let's pretend we're in a shell (I've spaced all emails):
################

# Look up TXT record (last-seen DATE) for <test@ example.com>
$ host -t txt test.example.com.emailbl.khopesh.com.
test.example.com.emailbl.khopesh.com descriptive text "20090328"
$

# Look up A record (inclusion TYPE[s]) for <test@ example.com>
$ host test.example.com.emailbl.khopesh.com.
test.example.com.emailbl.khopesh.com has address 127.0.0.3
test.example.com.emailbl.khopesh.com has address 127.0.0.4
test.example.com.emailbl.khopesh.com has address 127.0.0.1
test.example.com.emailbl.khopesh.com has address 127.0.0.2
$

################
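
A client would then turn the returned A records back into TYPE letters. Here is a minimal Python sketch of that decoding step, using the A=1 ... D=4 mapping given earlier; the function name `decode_types` is hypothetical. The addresses themselves could come from any resolver call, for example the third element of the tuple returned by `socket.gethostbyname_ex`.

```python
# Last octet of the 127.0.0.N answer -> TYPE letter, per the mapping above.
N_TYPE_TO_LETTER = {"1": "A", "2": "B", "3": "C", "4": "D"}


def decode_types(a_records):
    """Map a list of A-record strings to a sorted string of TYPE letters."""
    letters = []
    for ip in a_records:
        octets = ip.split(".")
        if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
            continue  # not an emailBL-style answer; ignore it
        letter = N_TYPE_TO_LETTER.get(octets[3])
        if letter:
            letters.append(letter)
    return "".join(sorted(letters))
```

With the four answers shown in the lookup above, this yields all four letters.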

More comments in-line:

Jesse Thompson (developer of anti-phishing-email-reply) wrote me:
> Yes, I and others have thought of it.  But I don't need it since we
> only use the list to scan log files and populate mapping tables.  I
> don't have time or money to do any of this, and I'm kept pretty
> busy just updating the list...on top of my other bazillion other
> responsibilities.
> 
> You are welcome to use the list to create your own URIBL of course.

(Jesse is BCC'd.)  And so I did.  Thanks for keeping the list updated.
 Hopefully this emailBL will open your list to new horizons.  Clearly,
credit for the real work goes to you and the other APER developers.

Rob McEwen wrote:
>>> Personally, I think the obfuscation is overkill. Instead, I'd
>>> prefer to change the "@" symbol to an underscore (and any other
>>> minor change that might be needed to work with dns queries) and
>>> be done with it. This would also make the implementation easier,
>>> and research by ISPs easier.

Mike Cardwell contended:
>> It would definitely require a hashing algorithm, like MD5. IIRC
>> there is a maximum length for a hostname, and that is 255
>> characters. What if the hostname in your email address is 255
>> characters long on its own...?

When MD5sums were first proposed (in place of my wild escaping), it
seemed like a great idea.  However, a voice in the back of my head,
now spoken (typed?) by Rob, has been growing louder.  My
implementation now merely truncates email usernames to 16 characters
(plus the noted defanging, which makes it complicated again ...) and
replaces the @ with a dot (not an underscore, that's not a legal
character).

In fact, collisions here could be regarded as good, as usernames that
long can include tracking strings (e.g. the mailer for our list,
users-return-12345-joe=bob.com@ spamassassin.apache.org, becomes
users-return-123.spamassassin.apache.org), which should help.

I did fully implement my proposed latter 16 characters (of MD5's 32)
plus dot plus the domain, complete with hash lookups, but I just
removed it (which is why non-test lookups will fail for the next ~4h).
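
For the record, the removed hashed scheme looked roughly like this Python sketch. It is a reconstruction under assumptions, not the code I ran: the function name `hashed_lookup_name` is hypothetical, and hashing the full lower-cased address (rather than just the username) is an assumption.

```python
import hashlib


def hashed_lookup_name(address, zone="emailbl.khopesh.com"):
    """Latter 16 hex characters of the MD5, plus dot, plus the domain."""
    address = address.strip().lower()  # assumption: normalise case first
    domain = address.rpartition("@")[2]
    # hexdigest() is 32 hex characters; keep only the last 16.
    digest = hashlib.md5(address.encode("utf-8")).hexdigest()
    return f"{digest[16:]}.{domain}.{zone}"
```

The first label is always 16 characters, which sidesteps the length concern but makes the zone unreadable to a human, which is the trade-off discussed above.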

>> Having access to the plain text email address would only make it
>> easier for ISPs to do anything if they had access to the zone file.
>> In which case, you could just give them access to a separate list
>> which has the email addresses in plain text.

Unless we're replacing the currently well-groomed upstream source at
http://anti-phishing-email-reply.googlecode.com/#, I see no reason to
offer such services (since they do it better).

>> So in rbldnsd, ...

Whoa, what's that?!  Interesting ... it's even in Debian.  I think I'm
happy with BIND for the moment, since my origin point is hidden from
use and the actual NS records are merely slaves run by zoneedit (so
efficiency isn't really important).  I probably need to stay on BIND
as I doubt I could use rbldnsd to host my SpamAssassin channels.

-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-spam

Re: 419 emailBL?

Posted by Henrik K <he...@hege.li>.

On Mon, May 04, 2009 at 10:51:14PM +0200, mouss wrote:
>
> That said, I am surprised because you defended the fact that the
> freemail plugin includes the list of freemail domains...

Think about it. Maybe a few thousand freemail domains, which hardly change. Why
would that require realtime updating? They can simply be updated with
sa-update. It's strange that someone would have to "defend" this.

> This wasn't intended as a list to download. I didn't even check the
> license. I was simply replying to your "I'm surprised there still hasn't
> been an emailBL around". or if you prefer: the idea of an "emailBL
> around" isn't new. note also that SARE has a ruleset with phone numbers
> and "snail mail" infos found in spam.

Ideas are one thing, but implementing it is actually simple. We are
already at the alpha stage, with an almost finished plugin for SA, and are
harvesting lots of addresses. Results remain to be seen.

Re: 419 emailBL?

Posted by mouss <mo...@ml.netoyen.net>.

Henrik K wrote:
> On Sun, May 03, 2009 at 06:25:01PM +0200, mouss wrote:
>> I can't use a dnsbl on recipient addresses in postfix. This requires
>> additional code (exceptionally if the records are hashed...). MySQL on
>> the other hand is supported by many daemons. Sure, SA would need a mysql
>> access db plugin, but that would be beneficial for other things I think.
> 
> MySQL is not a global solution.
> 

it is for me since I use it. I don't see why I should load gigabytes into
rbldnsd when I can query mysql. but I agree that this is a personal
view. so let's leave it like this.

That said, I am surprised because you defended the fact that the
freemail plugin includes the list of freemail domains...

> Fixing up a postfix policyd is no problem and exim supports it out of the
> box, md5 is hardly an "exceptional" function.
> 

sure.

>>> Personally I'm only interested in "freemails", I don't know how feasible it
>>> would be to create a global email blacklist. 419/phishers are pretty much
>>> the only spam that's hard to catch. I'm surprised there still hasn't been an
>>> emailBL around, 
>>
>> http://www.419scam.org/419-bl.htm
> 
> Sorry I'm not interested in wgetting a humongous list, which happens also to
> be 2 days old, also no mention of anything about freshness. :)
> 

This wasn't intended as a list to download. I didn't even check the
license. I was simply replying to your "I'm surprised there still hasn't
been an emailBL around". or if you prefer: the idea of an "emailBL
around" isn't new. note also that SARE has a ruleset with phone numbers
and "snail mail" infos found in spam.

Re: 419 emailBL?

Posted by Henrik K <he...@hege.li>.

On Sun, May 03, 2009 at 06:25:01PM +0200, mouss wrote:
>
> I can't use a dnsbl on recipient addresses in postfix. This requires
> additional code (exceptionally if the records are hashed...). MySQL on
> the other hand is supported by many daemons. Sure, SA would need a mysql
> access db plugin, but that would be beneficial for other things I think.

MySQL is not a global solution.

Fixing up a postfix policyd is no problem and exim supports it out of the
box, md5 is hardly an "exceptional" function.

> > Personally I'm only interested in "freemails", I don't know how feasible it
> > would be to create a global email blacklist. 419/phishers are pretty much
> > the only spam that's hard to catch. I'm surprised there still hasn't been an
> > emailBL around, 
> 
> 
> http://www.419scam.org/419-bl.htm

Sorry I'm not interested in wgetting a humongous list, which happens also to
be 2 days old, also no mention of anything about freshness. :)

Re: 419 emailBL?

Posted by mouss <mo...@ml.netoyen.net>.

Benny Pedersen wrote:
> On Sun, May 3, 2009 18:25, mouss wrote:
>> stock postfix. something I can't do with a dnsbl since there is no
>> reject_rhsbl_recipient...
> 

correction: There is no DNSBL check that acts on the full email address.
reject_rhsbl_recipient will lookup the domain part.

> http://www.docunext.com/blog/2006/12/07/sorbs-settings/

or simply

http://www.postfix.org/postconf.5.html#reject_rhsbl_recipient

Re: 419 emailBL?

Posted by Benny Pedersen <me...@junc.org>.

On Sun, May 3, 2009 18:25, mouss wrote:
> stock postfix. something I can't do with a dnsbl since there is no
> reject_rhsbl_recipient...

http://www.docunext.com/blog/2006/12/07/sorbs-settings/

-- 
http://localhost/ 100% uptime and 100% mirrored :)

Re: 419 emailBL?

Posted by mouss <mo...@ml.netoyen.net>.

Henrik K wrote:
> On Sun, May 03, 2009 at 03:14:22PM +0200, mouss wrote:
>> Henrik K wrote:
>>> On Sun, May 03, 2009 at 03:40:47AM +0200, mouss wrote:
>>>> with rsync or the like, you can simply add the addresses (no MD5, no
>>>> anything) to an access list that your MTA can use.
>>> You don't get free rsyncs for big players like uribl for a reason (um, traffic
>>> etc?).
>> some DNSBLs are available via rsync.
>>
>> $ wc -l psbl.txt
>>  1494939 psbl.txt
>> $ ls -l psbl.txt
>> ... 20969353 ...
> 
> Like I said, no one is stopping offering it. It's up to the list or someone
> donating resources to such list. But the bigger/more popular the list,
> the harder it is to create a reliable rsync-network that can handle hordes of
> clients checking stuff every 15 minutes.
> 
>>> If we had a big emailbl, obviously it would be impractical as well.
>>> You really want to be updated every 5-15 minutes, which DNS allows.
>>>
>> It is possible to use a mechanism similar to SA update:
>> - use DNS to see if there is an update
>> - if so, download changes since some recent version
> 
> See the DNS part? You already got answer there so why complicate things? ;)
> 

not the same.

1- here, you do one dns check every 5-15 minutes. the number has nothing
to do with the amount of mail you see.
2- and the query is not done while checking mail. it's asynchronous and
adds no latency to mail checking.
3- it requires no integration with MTA or whatever. I can use this with
stock postfix. something I can't do with a dnsbl since there is no
reject_rhsbl_recipient...



>>> Of course no one stops such list offering the plain text emails as plain
>>> file. But do you want potentially millions of emails in a file?
>>>
>> 1- I prefer that over latency
> 
> You can use rbldnsd, if the data is available.. I just meant why would you
> want to have a complicated setup, especially if you are going to use the
> data possibly on several levels (MTA, SA). Transferring files around and
> reloading daemons is silly.
> 

I can't use a dnsbl on recipient addresses in postfix. This requires
additional code (exceptionally if the records are hashed...). MySQL on
the other hand is supported by many daemons. Sure, SA would need a mysql
access db plugin, but that would be beneficial for other things I think.

(and with local data, you can support regular expressions [except for
the "simple" wildcard things]. AFAIK, rbldnsd doesn't support these).

>> - the disabled addresses do not need to be "shared" anymore.
> 
> I'm asking because I don't know: is that reality? Do you get confirmation
> from e.g. gmail that some account is disabled? From the list point of view
> it's simple enough to wait a month or so to see if the email is still found
> in spams. Reporting etc is another thing and not necessarily concern of the
> list.
> 

I have no evidence for email addresses, but fraud domains/subdomains get
disabled (except at "uncollaborative" sites or registrars. but there I
blacklist the whole domain...).

> Personally I'm only interested in "freemails", I don't know how feasible it
> would be to create a global email blacklist. 419/phishers are pretty much
> the only spam that's hard to catch. I'm surprised there still hasn't been an
> emailBL around, 


http://www.419scam.org/419-bl.htm


> but maybe this time it becomes reality.. at least to have
> some scoring in SA.
> 
>> I don't have a "fixed" opinion. I am just trying to see if using the
>> well-known dns hack (dnsbl) is the best choice.
> 
> DNS is a simple and effective remote database for simple queries. Unless
> someone invents an even better, easy-to-use global solution.
>

Re: 419 emailBL?

Posted by Henrik K <he...@hege.li>.

On Sun, May 03, 2009 at 03:14:22PM +0200, mouss wrote:
> Henrik K wrote:
> > On Sun, May 03, 2009 at 03:40:47AM +0200, mouss wrote:
> >> with rsync or the like, you can simply add the addresses (no MD5, no
> >> anything) to an access list that your MTA can use.
> > 
> > You don't get free rsyncs for big players like uribl for a reason (um, traffic
> > etc?).
> 
> some DNSBLs are available via rsync.
> 
> $ wc -l psbl.txt
>  1494939 psbl.txt
> $ ls -l psbl.txt
> ... 20969353 ...

Like I said, no one is stopping offering it. It's up to the list or someone
donating resources to such list. But the bigger/more popular the list,
the harder it is to create a reliable rsync-network that can handle hordes of
clients checking stuff every 15 minutes.

> > If we had a big emailbl, obviously it would be impractical as well.
> > You really want to be updated every 5-15 minutes, which DNS allows.
> > 
> 
> It is possible to use a mechanism similar to SA update:
> - use DNS to see if there is an update
> - if so, download changes since some recent version

See the DNS part? You already got answer there so why complicate things? ;)

> > Of course no one stops such list offering the plain text emails as plain
> > file. But do you want potentially millions of emails in a file?
> > 
> 
> 1- I prefer that over latency

You can use rbldnsd, if the data is available.. I just meant why would you
want to have a complicated setup, especially if you are going to use the
data possibly on several levels (MTA, SA). Transferring files around and
reloading daemons is silly.

> - the disabled addresses do not need to be "shared" anymore.

I'm asking because I don't know: is that reality? Do you get confirmation
from e.g. gmail that some account is disabled? From the list point of view
it's simple enough to wait a month or so to see if the email is still found
in spams. Reporting etc is another thing and not necessarily concern of the
list.

Personally I'm only interested in "freemails", I don't know how feasible it
would be to create a global email blacklist. 419/phishers are pretty much
the only spam that's hard to catch. I'm surprised there still hasn't been an
emailBL around, but maybe this time it becomes reality.. at least to have
some scoring in SA.

> I don't have a "fixed" opinion. I am just trying to see if using the
> well-known dns hack (dnsbl) is the best choice.

DNS is a simple and effective remote database for simple queries. Unless
someone invents an even better, easy-to-use global solution.

Cheers,
Henrik

Re: 419 emailBL?

Posted by mouss <mo...@ml.netoyen.net>.

Henrik K wrote:
> On Sun, May 03, 2009 at 03:40:47AM +0200, mouss wrote:
>> with rsync or the like, you can simply add the addresses (no MD5, no
>> anything) to an access list that your MTA can use.
> 
> You don't get free rsyncs for big players like uribl for a reason (um, traffic
> etc?).

some DNSBLs are available via rsync.

$ wc -l psbl.txt
 1494939 psbl.txt
$ ls -l psbl.txt
... 20969353 ...


> If we had a big emailbl, obviously it would be impractical as well.
> You really want to be updated every 5-15 minutes, which DNS allows.
> 

It is possible to use a mechanism similar to SA update:
- use DNS to see if there is an update
- if so, download changes since some recent version
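
The two steps above can be sketched in Python as follows. This is only an outline of the idea: `sync_once` and its three callables are hypothetical placeholders (a real client would plug in a DNS TXT lookup, an HTTP or rsync fetch, and a local database update).

```python
def sync_once(local_version, fetch_version, fetch_changes, apply_changes):
    """One polling cycle: cheap version check, download only if needed.

    fetch_version()        -> published version number (e.g. via DNS)
    fetch_changes(old,new) -> changes between the two versions
    apply_changes(changes) -> store them locally (e.g. in an access map)
    Returns the version the local copy is at after this cycle.
    """
    published = fetch_version()
    if published <= local_version:
        return local_version  # nothing new, so no download happens
    apply_changes(fetch_changes(local_version, published))
    return published
```

The cycle runs on a timer, independent of mail flow, so the number of DNS queries does not grow with the amount of mail checked.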


> Of course no one stops such list offering the plain text emails as plain
> file. But do you want potentially millions of emails in a file?
> 

1- I prefer that over latency
2- do we _now_ have millions of such addresses? if not, premature
optimization...


here is how I see things:

- criminals (AFF, phish, ...) use some email addresses
- these addresses get listed
- the addresses are reported to domains owners
- domain owners disable these addresses (if the domain owner is the
criminal, then the full domain can be listed, and/or it can be reported
to the registrar... etc.)
- the disabled addresses do not need to be "shared" anymore.
- ... etc


I don't have a "fixed" opinion. I am just trying to see if using the
well-known dns hack (dnsbl) is the best choice.

Re: 419 emailBL?

Posted by Henrik K <he...@hege.li>.

On Sun, May 03, 2009 at 03:40:47AM +0200, mouss wrote:
> 
> with rsync or the like, you can simply add the addresses (no MD5, no
> anything) to an access list that your MTA can use.

You don't get free rsyncs for big players like uribl for a reason (um, traffic
etc?). If we had a big emailbl, obviously it would be impractical as well.
You really want to be updated every 5-15 minutes, which DNS allows.

Of course no one stops such list offering the plain text emails as plain
file. But do you want potentially millions of emails in a file?

Re: 419 emailBL?

Posted by Theo Van Dinter <fe...@apache.org>.

On Wed, Apr 29, 2009 at 7:56 PM, Adam Katz <an...@khopis.com> wrote:
>> I guess it depends what you mean by "enormous".  A sought rule update is 135k.
>
> And 135k doesn't add up to a lot of bandwidth?  I suppose it depends
> on the number of users, and I'm figuring worst-case scenario, e.g.
> when/if it ships enabled in the default SA install.

Well, it depends what you're measuring.  :)

The update itself isn't large, it's just 135k, which is the not
"enormous" bit.  135k in and of itself is a pretty tiny file, but I'm
not sure what "enormous" means in this context -- megs?  gigs?

The aggregate bandwidth could very well be large, depending on update
publish frequency, client update frequency, number of clients, client
bandwidth, etc.  From what I've seen, the standard SA updates w/ the
same ~130k size and the current number of users ... isn't a lot of
bandwidth.
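
To put rough numbers on that, here is a back-of-the-envelope Python sketch. The client count and download rates in the usage note are purely hypothetical figures for illustration, not measurements; only the ~135k update size comes from this thread.

```python
def daily_bandwidth_bytes(update_bytes, clients, downloads_per_client_per_day):
    """Aggregate bytes served per day if every download is a full update."""
    return update_bytes * clients * downloads_per_client_per_day


UPDATE_BYTES = 135 * 1000  # the ~135k sought update mentioned above
```

With a hypothetical 10,000 clients, one download a day is `daily_bandwidth_bytes(UPDATE_BYTES, 10_000, 1)`, about 1.35 GB per day, while a new update picked up every 15 minutes (96 per day) would be about 129.6 GB per day. Clients only download when the published version changes, so the publish frequency sets the upper bound.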

There are some pretty standard ways to deal with this issue though, such as:

a) have lots of mirrors, same idea as your P2P idea though less
dynamic  (oh, that was another thought I had ... go short of using
torrents since they're resource heavy and instead make our own P2P
protocol doing a dynamic http/mirrored.by system)

b) split the channel into a frequent / not frequent channel (or stable
/ testing, or split based on content, or ...) for patterns which don't
change often, there's no reason to keep sending them out.  same idea I
mentioned before.

c) shrink or hold update size steady in face of updates.  hard.

d) make updates less frequently.  defeats the purpose?  clearly every
15m is different than every day is different than weekly ...

To be perfectly honest, I really don't worry about the "omg, update
bandwidth" issue right now.  I worry that there aren't enough updates
right now.  The only auto-generated one, sought, is daily, and the
manual ones now are more than weekly on average.  I don't know if
sought could even be produced faster, you need a certain amount of
incoming ham and spam to sample and produce test rules, and enough
diversity of mails to test against to avoid "obvious" bad rules...

Re: [SA] 419 emailBL?

Posted by Adam Katz <an...@khopis.com>.

>> And if bandwidth at the server is a problem, would publishing the ruleset
>> updates via the Coral Cache network work?
> 
> Unfortunately, no.  In fact, they kind of suck as a CDN.  We
> originally were putting updates through there and would regularly have
> issues w/ 404s, corrupt or incomplete downloads, etc.
> 
> It may have improved since the 2005 or so timeframe when we started w/
> updates, but ...  Haven't checked in a while.

Still has the same issues.  I'll be removing them from my sa-update
channels mirror files very soon.

Re: 419 emailBL?

Posted by John Hardin <jh...@impsec.org>.

On Wed, 29 Apr 2009, Theo Van Dinter wrote:

> On Wed, Apr 29, 2009 at 8:06 PM, John Hardin <jh...@impsec.org> wrote:
>>> And 135k doesn't add up to a lot of bandwidth?
>> And if bandwidth at the server is a problem, would publishing the ruleset
>> updates via the Coral Cache network work?
>
> Unfortunately, no.  In fact, they kind of suck as a CDN.  We
> originally were putting updates through there and would regularly have
> issues w/ 404s, corrupt or incomplete downloads, etc.
>
> It may have improved since the 2005 or so timeframe when we started w/
> updates, but ...  Haven't checked in a while.

I've edited my MIRRORED.BY, we'll see how it goes...

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The real opiate of the masses isn't religion; it's the belief that
   somewhere there is a benefit that can be delivered without a
   corresponding cost.                       -- Tom of "Radio Free NJ"
-----------------------------------------------------------------------
  9 days until the 64th anniversary of VE day

Re: 419 emailBL?

Posted by Theo Van Dinter <fe...@apache.org>.

On Wed, Apr 29, 2009 at 8:06 PM, John Hardin <jh...@impsec.org> wrote:
>> And 135k doesn't add up to a lot of bandwidth?
>
> ...so don't look for updates more than once every day or two.

Yeah, but I think the point was that a frequently changing ruleset
would be downloaded frequently.

> And if bandwidth at the server is a problem, would publishing the ruleset
> updates via the Coral Cache network work?

Unfortunately, no.  In fact, they kind of suck as a CDN.  We
originally were putting updates through there and would regularly have
issues w/ 404s, corrupt or incomplete downloads, etc.

It may have improved since the 2005 or so timeframe when we started w/
updates, but ...  Haven't checked in a while.

Re: 419 emailBL?

Posted by John Hardin <jh...@impsec.org>.

On Wed, 29 Apr 2009, Adam Katz wrote:

> Theo Van Dinter wrote:
>> On Wed, Apr 29, 2009 at 6:24 PM, Adam Katz <an...@khopis.com> wrote:
>>> The mechanism for sa-update is brilliant, but
>>> doesn't lend itself to enormous indices of frequently-changing rulesets.
>>
>> I guess it depends what you mean by "enormous".  A sought rule update is 135k.
>
> And 135k doesn't add up to a lot of bandwidth?

...so don't look for updates more than once every day or two.

And if bandwidth at the server is a problem, would publishing the ruleset 
updates via the Coral Cache network work?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   A superior gunman is one who uses his superior judgment to keep
   himself out of situations that would require the use of his
   superior skills.
-----------------------------------------------------------------------
  9 days until the 64th anniversary of VE day

Re: 419 emailBL?

Posted by Adam Katz <an...@khopis.com>.

Theo Van Dinter wrote:
> On Wed, Apr 29, 2009 at 6:24 PM, Adam Katz <an...@khopis.com> wrote:
>> The mechanism for sa-update is brilliant, but
>> doesn't lend itself to enormous indices of frequently-changing rulesets.
> 
> I guess it depends what you mean by "enormous".  A sought rule update is 135k.

And 135k doesn't add up to a lot of bandwidth?  I suppose it depends
on the number of users, and I'm figuring worst-case scenario, e.g.
when/if it ships enabled in the default SA install.

> The likelihood is, imo, that you would probably split up your updates
> into multiple channels before they really got out of control in size.
> For example, you could do something like a weekly, daily, and
> sub-daily channel, and move rules appropriately between them.  Yes, a
> little more of a PITA for clients, but how much churn do you really
> expect?

How about hierarchical channel support, e.g. a channel's MIRRORED.BY
file is merely itself a sa-update-channels file.

>> Justin:  Perhaps sa-update could support [version].torrent in addition
>> to [version].tar.gz on each mirror?  (This doesn't touch the current
>> DNS-based version/announce system.)  Channels hosted for versions of
>> SA after the supporting release (e.g. 0.4.3.[channel] and "higher")
>> would be allowed to host only the torrent file.
> 
> I had actually thought about doing a P2P sa-update so as to better
> withstand DoS issues, skip the need for a mirrored.by file, etc.  But
> the main issue is that most channel updates are rather small, and so
> therefore the downloads are rather fast.  Compared to doing a torrent,
> which takes relatively a long time to get setup, and just as you
> start, you're done.  Also, it means clients are serving data, which
> makes the "quick sa-update and move on" more of a procedure and you
> have to worry about remote connectivity, etc, etc.
> 
> In the end it didn't seem worthwhile beyond the security aspect, so I
> didn't move beyond the "thinking about" stage.
> 
> (and yes, I know I'm not Justin. ;))

You're close enough on the SA development order.  For BT, I was
actually envisioning much larger rulesets with sought merely heralding
a future with lots of large auto-generated rulesets, but perhaps it
doesn't scale at the right point.  I think I'm trying to squeeze too
much :-p

-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-spam

Re: [SA] 419 emailBL?

Posted by Theo Van Dinter <fe...@apache.org>.

On Wed, Apr 29, 2009 at 6:24 PM, Adam Katz <an...@khopis.com> wrote:
> The mechanism for sa-update is brilliant, but
> doesn't lend itself to enormous indices of frequently-changing rulesets.

I guess it depends what you mean by "enormous".  A sought rule update is 135k.

The likelihood is, imo, that you would probably split up your updates
into multiple channels before they really got out of control in size.
For example, you could do something like a weekly, daily, and
sub-daily channel, and move rules appropriately between them.  Yes, a
little more of a PITA for clients, but how much churn do you really
expect?

> Justin:  Perhaps sa-update could support [version].torrent in addition
> to [version].tar.gz on each mirror?  (This doesn't touch the current
> DNS-based version/announce system.)  Channels hosted for versions of
> SA after the supporting release (e.g. 0.4.3.[channel] and "higher")
> would be allowed to host only the torrent file.

I had actually thought about doing a P2P sa-update so as to better
withstand DoS issues, skip the need for a mirrored.by file, etc.  But
the main issue is that most channel updates are rather small, and so
therefore the downloads are rather fast.  Compared to doing a torrent,
which takes relatively a long time to get setup, and just as you
start, you're done.  Also, it means clients are serving data, which
makes the "quick sa-update and move on" more of a procedure and you
have to worry about remote connectivity, etc, etc.

In the end it didn't seem worthwhile beyond the security aspect, so I
didn't move beyond the "thinking about" stage.

(and yes, I know I'm not Justin. ;))

Re: my emailBL is live!

Posted by John Hardin <jh...@impsec.org>.

On Wed, 29 Apr 2009, Jesse Thompson wrote:

> A word of caution.  Be very careful how you use the list.  The intended 
> usage for the list is to prevent (or monitor) local users from sending 
> email to the listed addresses.  The phishers frequently use compromised 
> end-user accounts to receive the phishing replies, so there is a high 
> risk of false positives, especially if you attempt to classify messages 
containing one of these addresses as spam.

+1

Given the context of this information, the only safe way to use it is as a 
component of a meta that also requires phishy text fragments.
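
The logic of such a meta can be sketched in Python like this. It is an illustration of the "both conditions" idea only, not SpamAssassin rule syntax; the `PHISHY` pattern and the name `meta_hit` are hypothetical stand-ins for real phishy-text rules.

```python
import re

# Hypothetical stand-in for "phishy text fragments"; real rules would be
# far more careful than this single pattern.
PHISHY = re.compile(r"(verify|confirm|update) your (account|password)", re.I)


def meta_hit(address_listed: bool, body: str) -> bool:
    """Fire only when the address is listed AND the text looks phishy."""
    return address_listed and PHISHY.search(body) is not None
```

A listed address alone never fires, which is the point: a compromised end-user account that still sends ordinary mail should not be scored.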

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   You do not examine legislation in the light of the benefits it
   will convey if properly administered, but in the light of the
   wrongs it would do and the harms it would cause if improperly
   administered.                                  -- Lyndon B. Johnson
-----------------------------------------------------------------------
  9 days until the 64th anniversary of VE day

Re: [SA] emailBL code

Posted by John Hardin <jh...@impsec.org>.

On Fri, 1 May 2009, Adam Katz wrote:

> John Hardin wrote:
>> How would the phisher collect the password info from their target using 
>> a forged sender address?
>
> A web form.

Hrm. Okay, I'll buy that. If you're going to spear-phish a specific 
organization then it would be reasonable to put the effort into forging a 
password capture website that looks plausible.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Ignorance doesn't make stuff not exist.               -- Bucky Katt
-----------------------------------------------------------------------
  7 days until the 64th anniversary of VE day

Re: [SA] emailBL code

Posted by Adam Katz <an...@khopis.com>.

John Hardin wrote:
> How would the phisher collect the password info from their target using
> a forged sender address?

A web form.

Re: emailBL code

Posted by Adam Katz <an...@khopis.com>.

Jesse Thompson wrote:
>     Possible values for TYPE:
>         E: The ADDRESS (usually in the From header) might receive replies
>             but it was not intended to receive the replies.

Oh!  That's a new one.  Changes my code.  My code now supports Z as
requesting a hidden email address, A-J as codes (with FGHIJ being
currently undefined), and ignores K-Y (as both undefined and not noted).

  $type_list =~ s/.*,([A-Z]+),.*/$1/;
  if ($type_list =~ /Z/) {
    $email =~ s/\t".*"/\t"\@hidden\@"/; # hide the email address
  }
  $type_list =~ s/[K-Z]//g; # remove unhandled types K-Y and Z
  $type_list =~ s/(?=.)/+2**/g;
  $type_list =~ tr [A-J] [0-9]; # this needs rewriting when we get a K!
  $type_list = eval 0 . $type_list;
  $type_list = "\tA\t127.0.0.$type_list\n";
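For anyone who does not read Perl regex tricks fluently, the same encoding can be sketched in Python. This is an illustrative rewrite, not the code that runs the list: the names `encode_types` and `a_record` are made up here, collapsing duplicate letters is an assumption, and the range check is an addition (bits for I and J would not fit in a single octet).

```python
def encode_types(type_list: str) -> tuple[int, bool]:
    """Return (bitmask, hidden) for a TYPE string such as "AB" or "CZ"."""
    hidden = "Z" in type_list          # Z asks for the address to be hidden
    mask = 0
    for letter in set(type_list):      # set() collapses duplicates (assumption)
        if "A" <= letter <= "J":       # K-Y are undefined and ignored, Z handled above
            mask |= 1 << (ord(letter) - ord("A"))   # A=1, B=2, C=4, D=8, E=16
    return mask, hidden


def a_record(type_list: str) -> str:
    """Build the zone-file fragment for the A record, like the last Perl line."""
    mask, _hidden = encode_types(type_list)
    if mask > 255:
        # Added guard (assumption): I and J would overflow the last octet.
        raise ValueError("types I and J do not fit in the last octet")
    return f"\tA\t127.0.0.{mask}\n"
```

A client can reverse the mapping by testing individual bits of the last octet, which is the point of the SURBL/chmod style.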

Other suggestions to my list before somebody works on a plugin?
Other sources with which to seed it?
Volunteers to test it?  I'm not sure if I have enough volume surviving
greylisting (which nabs ~90% of my incoming mail) for useful stats, e.g.
my hit count on malware-patrol is exactly zero (and yes, I run ClamAV *after* SA).

-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-spam

Re: emailBL code

Posted by Henrik K <he...@hege.li>.

On Fri, May 01, 2009 at 02:36:28PM -0500, Jesse Thompson wrote:
> John Hardin wrote:
>> On Fri, 1 May 2009, Adam Katz wrote:
>>
>>> The emailBL mechanism could easily be populated by a spamtrap, but the
>>> danger from false positives (forged sender addresses) would be quite
>>> real.
>
> On a related note: you also need to worry about the phishers  
> intentionally forging the Reply-To with normal addresses in an attempt  
> to poison the list.

Especially if one only lists freemail addresses on the list (like we are
going to do for now), the worry is pretty small. Why would the 419/phishers
want to spend time blocking normal people's freemail addresses? And if the
spamtraps are well hidden (or the process is otherwise manual), it would take
serious effort to get someone listed.

Re: emailBL code

Posted by Jesse Thompson <je...@doit.wisc.edu>.

John Hardin wrote:
> On Fri, 1 May 2009, Adam Katz wrote:
> 
>> The emailBL mechanism could easily be populated by a spamtrap, but the
>> danger from false positives (forged sender addresses) would be quite
>> real.

On a related note: you also need to worry about the phishers 
intentionally forging the Reply-To with normal addresses in an attempt 
to poison the list.

> Suggestion: ignore the sender address if there is a Reply-To: header or 
> if there is an email address in the body of the message. There might 
> need to be some logic around detecting the contact address in the 
> message body - there could be garbage addresses inserted to get the 
> phishtrap to ignore the sender address...

That's what we do.  We've had lengthy discussions about this issue.  It 
all boils down to accurately gauging the intention of the phisher, which 
is essentially impossible to automate.

It gets tricky when you consider the situation where the phisher 
intended the user to reply to the address included in the body, but the 
user doesn't pay attention and replies to the From instead, *and* the 
phisher happens to still have access to the original compromised account 
(the From address) used to send the phish.  So, it makes sense to add 
the From to the list in this case.  However, the account in question is 
usually cleaned up by the email provider quickly, so now a normal user's 
address is on the list.  And... to make matters worse, that user will 
potentially start receiving credentials from other users that are 
replying to the phish messages.

Anyway, here is the current state of how we classify the addresses:

     Possible values for TYPE:

         A: The ADDRESS was used in the Reply-To header.

         B: The ADDRESS was used in the From header.

         C: The content of the phishing message contained the ADDRESS.

         D: The content of the phishing message contained the ADDRESS,
             and it was obfuscated.

         E: The ADDRESS (usually in the From header) might receive replies
             but it was not intended to receive the replies.

     Note: unless otherwise specified, in order for the ADDRESS to
           qualify for each TYPE, it must have been intended to
           receive the replies.
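A rough Python sketch of reading one line of the list into a record, assuming the ADDRESS,TYPE[TYPE...],DATE layout described earlier in this thread. The names `Entry`, `parse_line` and `TYPE_MEANINGS` are made up for illustration, and lowercasing the address is an assumption.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Short paraphrases of the TYPE definitions above.
TYPE_MEANINGS = {
    "A": "used in the Reply-To header",
    "B": "used in the From header",
    "C": "appeared in the message content",
    "D": "appeared in the message content, obfuscated",
    "E": "might receive replies but was not intended to",
}


@dataclass
class Entry:
    address: str
    types: str
    last_seen: date


def parse_line(line: str) -> Entry:
    """Split 'ADDRESS,TYPES,YYYYMMDD' from the right, so the address stays whole."""
    address, types, stamp = line.strip().rsplit(",", 2)
    last_seen = datetime.strptime(stamp, "%Y%m%d").date()
    return Entry(address.lower(), types, last_seen)
```

Keeping TYPE as the raw letter string leaves room for new letters without changing the parser.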

Jesse

-- 
   Jesse Thompson
   Division of Information Technology, University of Wisconsin-Madison
   Email/IM: jesse.thompson@doit.wisc.edu

Re: emailBL code

Posted by John Hardin <jh...@impsec.org>.

On Fri, 1 May 2009, Adam Katz wrote:

> The emailBL mechanism could easily be populated by a spamtrap, but the
> danger from false positives (forged sender addresses) would be quite
> real.

How would the phisher collect the password info from their target using a 
forged sender address?

Suggestion: ignore the sender address if there is a Reply-To: header or if 
there is an email address in the body of the message. There might need to 
be some logic around detecting the contact address in the message body - 
there could be garbage addresses inserted to get the phishtrap to ignore 
the sender address...
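A naive Python sketch of that suggestion, for discussion only. The function name `reply_targets` is made up, the TYPE letters follow the list's A/B/C convention, and the sketch does nothing about garbage addresses planted in the body, which is the hard part.

```python
def reply_targets(from_addr, reply_to=None, body_addrs=()):
    """Return (address, TYPE) pairs the phisher most likely wants replies sent to."""
    targets = []
    if reply_to:
        targets.append((reply_to.lower(), "A"))        # Reply-To header
    targets.extend((addr.lower(), "C") for addr in body_addrs)  # addresses in the body
    if not targets:
        # No Reply-To and no address in the body: From is the only reply path.
        targets.append((from_addr.lower(), "B"))
    return targets
```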

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Warning Labels we'd like to see #1: "If you are a stupid idiot while
  using this product you may hurt yourself. And it won't be our fault."
-----------------------------------------------------------------------
  7 days until the 64th anniversary of VE day

Re: emailBL code

Posted by Adam Katz <an...@khopis.com>.

Yet Another Ninja wrote:
>> I'm trying hard to convince myself this data is really useful.
>> 
>> the whole 
>> http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses
>> file has 4518 entries, including vintage 2008
>> 
>> compared to the big_boyz my trap feed is quite small and I
>> collected 1598 entries during the last 4 hrs

Well, this is different from traps ... though admittedly not by much.
 The fact that it's updated so frequently is a merit, and the reason
dates are noted is so that you can adjust accordingly.

The emailBL mechanism could easily be populated by a spamtrap, but the
danger from false positives (forged sender addresses) would be quite
real.  Maybe only publish addresses that pass or fail SPF/DKIM/etc, so
that domains without a way to verify authenticity are immune to it?

>> does anybody have any hit metrics?

Mike Cardwell responded:
> The list was set up to satisfy a very specific group of users that
> were being targeted by a very specific scam. Spear Phishing
> against Higher Education institutions in the UK and USA. It was
> originally discussed on a mailing list run by "nd.edu" which can
> only be subscribed to by people who are in that particular sector.
> For that particular group, the list has been useful. How useful it
> is for people outside of that scenario, I don't know.

This is why I set up the emailbl in the first place:  to see what it
does.  We need an SA plugin next.

Re: emailBL code

Posted by Mike Cardwell <sp...@lists.grepular.com>.

Yet Another Ninja wrote:

>>> This is not to suggest that I ever understood the part about using
>>> half-length MD5.
>>
>> No need.  I'm using full-length hashes now, plus the SURBL/chmod style
>> IP addresses.  I must have lost the email I was composing on the topic,
>> but it's fully propagated by now.  I've attached my code.
>>
>> Note that the code still supports the old truncated string.  I'll rip
>> that out soon.  Also note that I'm not an advanced perl coder (almost
>> all of my perl scripts start as POSIX shell scripts, including this one)
>> .... so while I'm happy to get *suggestions*, I'm not so eager for the
>> insults and harsh words this list tends to give instead.
> 
> I'm trying hard to convince myself this data is really useful.
> 
> the whole 
> http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses 
> file has 4518 entries, including vintage 2008
> 
> compared to the big_boyz my trap feed is quite small and I collected 
> 1598 entries during the last 4 hrs
> 
> hmmmmm
> 
> does anybody have any hit metrics?

The list was set up to satisfy a very specific group of users that were 
being targeted by a very specific scam. Spear Phishing against Higher 
Education institutions in the UK and USA. It was originally discussed on 
a mailing list run by "nd.edu" which can only be subscribed to by people 
who are in that particular sector. For that particular group, the list 
has been useful. How useful it is for people outside of that scenario, I 
don't know.

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Anti-phishing outside of just detection

Posted by Adam Katz <an...@khopis.com>.

I wrote:
>> I'd still rather block the offending message than intercept responses
>> to it (as that means it has suckered users, which means it has wasted
>> their time).  I see APER as a possible aid in that pursuit, though as
>> Jesse has mentioned, it is not fully reliable (as to be determined).
>> Still, these little checks add up, so even if APER gives a message 0.1
>> points, that might be enough to mark it as spam or even block it at
>> the door.
>>
>> As a secondary defense, blocking replies sounds like a grand idea.

Mandy wrote:
> I absolutely agree that the messages should be stopped on their way
> in.  I'd rather our users not have an opportunity to be suckered.  But
> at least knowing about the replies gives us a way to target our
> education efforts (now, where'd I put that LART...)

In this vein, I'd love to honeypot it; complement phishing
detection with an automated responder along the lines of "okay, here's
my login information" which of course is connected to a meaningless
account that merely informs the admins that somebody has logged on.

With that information, the admins can dig up the offending message and
see who else received it, they can examine the IP of the login and
track who else it has logged in as, and of course, the authorities can
be involved.  All before the users would have concluded there was a
problem.

Going the other direction, I read (maybe a year ago?) that some US
government organization was actually sending fake phishing emails to
their users.  When the user clicks on it, they are informed of what
they did and how to prevent it.  KnujOn (or maybe it was somebody else
presenting at this year's MIT Spam Conference?) is now pushing for
sites taken down for phishing (et al) to be replaced with information
on what happened rather than generic placeholders or nothing at all.
This is a GRAND idea!

Re: emailBL code

Posted by Mandy <me...@gmail.com>.

On Fri, May 1, 2009 at 3:37 PM, Adam Katz <an...@khopis.com> wrote:
> Can you determine how many of those were out-of-office messages?  Then
> again, even at just two, if you can stop such compromises, it's worth
> it (and then some).

The replies I was talking about were, sadly, manually filtered to
remove everything that looked like an auto response.  What I couldn't
tell was how many were "yeah, right!" or "die, spammer, die!" style
responses.  Thankfully we only had 2 compromised accounts (but that's
two too many).

> I'd still rather block the offending message than intercept responses
> to it (as that means it has suckered users, which means it has wasted
> their time).  I see APER as a possible aid in that pursuit, though as
> Jesse has mentioned, it is not fully reliable (as to be determined).
> Still, these little checks add up, so even if APER gives a message 0.1
> points, that might be enough to mark it as spam or even block it at
> the door.
>
> As a secondary defense, blocking replies sounds like a grand idea.

I absolutely agree that the messages should be stopped on their way
in.  I'd rather our users not have an opportunity to be suckered.  But
at least knowing about the replies gives us a way to target our
education efforts (now, where'd I put that LART...)

As far as blocking inbound messages, I'm going to have to remove a few
addresses from the list before I can do that.  My initial search
results were chock full of false positives.  One of the people who
made the list corresponds very regularly with 10 - 20 people in my
organization.  Granted, at 0.1, it's not a big deal, and such a rule
would probably make a fantastic META companion (warning, fictional,
unlinted rule follows)

meta   L_PHISHY   FROM_ON_APER && WEBMAIL_SUBJECT
score  L_PHISHY   2.5

anyone?

Re: emailBL code

Posted by Adam Katz <an...@khopis.com>.

I forgot to also mention honeypots here.

Create a few accounts whose sole purpose is finding these phishing
attacks.  They are email accounts which will appear to fall victim to
the attack, sending their "password" which gains "access" to the
company's web portal.  Of course, all this "access" does is tell the
admins that something bad is happening (watch that IP!) and that an
announcement to the user base is probably in order.

I believe that the best response to spam is ... responses to spam.
Especially for phishers, giving bad data and creating automated methods
by which to notice the attacks will help fight against them and will
alert them to the fact that phishing is not profitable.

After all, they only do it because it makes money.  An unfortunately
large amount of money.  Report your spam to places like KnujOn and
SpamCop!  KnujOn's mission statement is something along the lines of
stopping the profitability of spam, one domain (registrar) at a time,
and SpamCop sends out nice little complaint letters to registrars and
upstream network admins.

I'd really like to see the KnujOn reporting bug (SpamAssassin bug 6085)
filled so that it is easier to report directly to them, especially for
phishing spam for its obvious importance over standard spam.

Garth and Robert:  This thread is a bit big (sorry) ... basically, we're
(I'm?) working on putting up a URIBL based on phishing email reply-tos
(emailBL).  This email you're reading should be indexed as a child of
this post: http://www.nabble.com/Phishing-tt23226790.html#a23339685 (so
you can climb up the thread) ... I cc'd you because this might be of
interest to you (aside from the plug).  No reason you can't pursue email
domains in addition to web domains...

Re: emailBL code

Posted by Adam Katz <an...@khopis.com>.

Mandy wrote:
> I work for a Canadian provincial government, on a system with about
> 50,000 mailboxes.  I scanned our outbound mail logs over the past 6
> months with this data.  There were 31 replies to "Your webmail is
> expired!! !" type messages in that period.
> 
> If we had been blocking outbound mail based on this list, the two
> compromised accounts we had to deal with (one of which made the list
> in its turn) wouldn't have happened.
> 
> I definitely see value here.

Can you determine how many of those were out-of-office messages?  Then
again, even at just two, if you can stop such compromises, it's worth
it (and then some).

I'd still rather block the offending message than intercept responses
to it (as that means it has suckered users, which means it has wasted
their time).  I see APER as a possible aid in that pursuit, though as
Jesse has mentioned, it is not fully reliable (as to be determined).
Still, these little checks add up, so even if APER gives a message 0.1
points, that might be enough to mark it as spam or even block it at
the door.

As a secondary defense, blocking replies sounds like a grand idea.

Re: emailBL code

Posted by Mandy <me...@gmail.com>.

On Fri, May 1, 2009 at 7:52 AM, Jesse Thompson
<je...@doit.wisc.edu> wrote:
> Yet Another Ninja wrote:
>>
>> I'm trying hard to convince myself this data is really useful.

I work for a Canadian provincial government, on a system with about
50,000 mailboxes.  I scanned our outbound mail logs over the past 6
months with this data.  There were 31 replies to "Your webmail is
expired!! !" type messages in that period.

If we had been blocking outbound mail based on this list, the two
compromised accounts we had to deal with (one of which made the list
in its turn) wouldn't have happened.
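The matching step of such a scan is small once recipients have been pulled out of the logs. A minimal Python sketch, assuming log parsing is done elsewhere (it is site-specific) and that both inputs are plain iterables of addresses; the name `listed_recipients` is made up here.

```python
def listed_recipients(recipients, listed):
    """Return the outbound recipients that appear on the reply-address list."""
    # Normalise once so matching ignores case and stray whitespace.
    listed_set = {addr.strip().lower() for addr in listed}
    return [r for r in recipients if r.strip().lower() in listed_set]
```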

I definitely see value here.

>> compared to the big_boyz my trap feed is quite small and I collected 1598
>> entries during the last 4 hrs
>
> Hello Yet Another Ninja,
>
> "big_boyz": as in a small collection of university postmasters?  I guess we
> should be honored, but I have a feeling that you were being condescending.

I got the impression he was talking about the major RBL providers
(spamhaus, spamcop), and the commercial filtering vendors.

[snip]

> Even the largest password-reply phishing campaign we've seen was only sent
> to 2500 of our users (and that was using the same reply-to).  On average, we
> see around 200 messages (30 unique reply-to's; not all new) of this type of
> phishing attempt every day.  I assume that the other universities see
> something similar.

After I spend some more time evaluating things, and looking for this
specific type of campaign, I'm planning to start blocking outbound
mail based on your list.  If I develop some tools for finding the
campaigns I'd be happy to contribute the messages.

Austin.

Re: emailBL code

Posted by John Hardin <jh...@impsec.org>.

On Fri, 1 May 2009, Yet Another Ninja wrote:

> Only little drawback is how to centralize (or not) all this gold to make 
> it useful to more than me and my dog.

I (and I'm sure others) would be willing to feed phishing corpora from our 
quarantines, so long as it's easy to do.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Warning Labels we'd like to see #1: "If you are a stupid idiot while
  using this product you may hurt yourself. And it won't be our fault."
-----------------------------------------------------------------------
  7 days until the 64th anniversary of VE day

Re: emailBL code

Posted by Yet Another Ninja <sa...@alexb.ch>.

On 5/1/2009 4:52 PM, Jesse Thompson wrote:
> Yet Another Ninja wrote:
>> I'm trying hard to convince myself this data is really useful.
>>
>> the whole 
>> http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses 
>> file has 4518 entries, including vintage 2008
>>
>> compared to the big_boyz my trap feed is quite small and I collected 
>> 1598 entries during the last 4 hrs
> 
> Hello Yet Another Ninja,
> 
> "big_boyz": as in a small collection of university postmasters?  I guess 
> we should be honored, but I have a feeling that you were being 
> condescending.

Feel as you please.
I manage a relatively small trap space compared to some of the players 
here, so I meant what I said. Traps never correlate to a number of 
specific rcpt addresses, only.

> If you are the opposite of a "big_boy", that must mean that your domain 
> is smaller than a large university's, so you must have less than, say, 
> 50,000 unique active users.  
I'm definitely smaller, that doesn't mean that trap traffic can't be 
huge. Traps aren't active - they sit there and get hammered.

> Are you truly saying that every 4 hours you 
> have 1598 unique (as in the reply-to is unique) phishing attempts, in 
> which the phisher asks one of your users to reply with their credentials?

nope - I'm collecting generic drop boxes type of stuff and not specific 
phishes for a specific group.
these include phishes, lotto scams, etc using specific domains. (not 
rcpt domains)

> If what you are saying is true, then you are standing on a gold mine. 
> Would you mind contributing to the project?

every school, corp, ISP, soho server, etc is standing on a similar gold 
mine, I'm not re-inventing the wheel.
Only little drawback is how to centralize (or not) all this gold to make 
it useful to more than me and my dog.
Until I have some minimal metrics I can't say.

> As for the vintage of the addresses.  No, I don't have metrics.  But 
> most of the addresses are in the freemail domains, and we have no 
> indication that the freemail providers are shutting down this type of 
> account.  I don't mind scanning logs for, or blocking mail to, the "old" 
> addresses.  But we do include the date (however accurate it is) so you 
> can choose to filter the list any way you desire.

no need to go through that trouble - you guys know its value, once apps 
are here to test the data, then others outside your space will report, 
I'm sure.

We have different targets. I misunderstood APER's

this is all work in progress so keep tuned....

Axb

Re: emailBL code

Posted by Jesse Thompson <je...@doit.wisc.edu>.

Yet Another Ninja wrote:
> I'm trying hard to convince myself this data is really useful.
> 
> the whole 
> http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses 
> file has 4518 entries, including vintage 2008
> 
> compared to the big_boyz my trap feed is quite small and I collected 
> 1598 entries during the last 4 hrs

Hello Yet Another Ninja,

"big_boyz": as in a small collection of university postmasters?  I guess 
we should be honored, but I have a feeling that you were being 
condescending.

What exactly are you collecting?  Keep in mind that the APER project is 
very focused on preventing email replies to phishing (hence the name). 
We aren't trying to stop the phishing itself (directly); there are 
others that do that.

If you are the opposite of a "big_boy", that must mean that your domain 
is smaller than a large university's, so you must have less than, say, 
50,000 unique active users.  Are you truly saying that every 4 hours you 
have 1598 unique (as in the reply-to is unique) phishing attempts, in 
which the phisher asks one of your users to reply with their credentials?

If what you are saying is true, then you are standing on a gold mine. 
Would you mind contributing to the project?

Even the largest password-reply phishing campaign we've seen was only 
sent to 2500 of our users (and that was using the same reply-to).  On 
average, we see around 200 messages (30 unique reply-to's; not all new) 
of this type of phishing attempt every day.  I assume that the other 
universities see something similar.

As for the vintage of the addresses.  No, I don't have metrics.  But 
most of the addresses are in the freemail domains, and we have no 
indication that the freemail providers are shutting down this type of 
account.  I don't mind scanning logs for, or blocking mail to, the "old" 
addresses.  But we do include the date (however accurate it is) so you 
can choose to filter the list any way you desire.
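Filtering on the date could be as simple as the following Python sketch. It assumes each line is ADDRESS,TYPES,YYYYMMDD as described earlier in the thread; the function name `recent_entries` and the age threshold are illustrative choices, not project policy.

```python
from datetime import date, datetime, timedelta


def recent_entries(lines, max_age_days, today):
    """Keep addresses whose last-seen date is within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for line in lines:
        address, _types, stamp = line.strip().rsplit(",", 2)
        if datetime.strptime(stamp, "%Y%m%d").date() >= cutoff:
            kept.append(address)
    return kept
```

Passing `today` in explicitly keeps the function easy to test against an archived copy of the list.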

Jesse

-- 
   Jesse Thompson
   Division of Information Technology, University of Wisconsin-Madison
   Email/IM: jesse.thompson@doit.wisc.edu

Re: emailBL code

Posted by Yet Another Ninja <sa...@alexb.ch>.

On 5/1/2009 3:56 PM, Adam Katz wrote:
> Jeff Moss wrote:
>> This is not to suggest that I ever understood the part about using
>> half-length MD5.
> 
> No need.  I'm using full-length hashes now, plus the SURBL/chmod style
> IP addresses.  I must have lost the email I was composing on the topic,
> but it's fully propagated by now.  I've attached my code.
> 
> Note that the code still supports the old truncated string.  I'll rip
> that out soon.  Also note that I'm not an advanced perl coder (almost
> all of my perl scripts start as POSIX shell scripts, including this one)
> .... so while I'm happy to get *suggestions*, I'm not so eager for the
> insults and harsh words this list tends to give instead.

I'm trying hard to convince myself this data is really useful.

the whole 
http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses 
file has 4518 entries, including vintage 2008

compared to the big_boyz my trap feed is quite small and I collected 
1598 entries during the last 4 hrs

hmmmmm

does anybody have any hit metrics?

emailBL code

Posted by Adam Katz <an...@khopis.com>.

Jeff Moss wrote:
> This is not to suggest that I ever understood the part about using
> half-length MD5.

No need.  I'm using full-length hashes now, plus the SURBL/chmod style
IP addresses.  I must have lost the email I was composing on the topic,
but it's fully propagated by now.  I've attached my code.

Note that the code still supports the old truncated string.  I'll rip
that out soon.  Also note that I'm not an advanced perl coder (almost
all of my perl scripts start as POSIX shell scripts, including this one)
... so while I'm happy to get *suggestions*, I'm not so eager for the
insults and harsh words this list tends to give instead.

RE: my emailBL is live!

Posted by Jeff Moss <jm...@Huffmancorp.com>.

>> The chance of a collision really is much smaller than I thought, even
>> including the birthday paradox.  But rather than just say it's small and
>> ask you to take my word for it I'm providing a link.  The Wikipedia page
>> for Birthday Attack has a chart that shows the probability of collision
>> for hashes of various lengths.
>>
>> http://en.wikipedia.org/wiki/Birthday_attack
>
> Well nuts.  Unless my estimation is wrong, my half-length MD5sum would
> be 64-bit and thus the 10^-18 probability of collisions would require
> a db of 190 entries rather than full-length MD5sum's 820 billion.
>
> Unless corrected, I'll revise my algorithm this evening.

Well, a 64-bit hash with a 10^-18 probability of collisions would only
require 6 entries in the DB.  However a 10^-12 probability should be good
enough because there probably aren't a trillion unique email addresses.  Per
the chart, a 10^-12 probability of collision would allow about 6,100 entries
in the DB; 6 million entries corresponds to a probability of about 10^-6.
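
The chart's figures follow from the usual birthday approximation n = sqrt(2 * 2^bits * p), which holds for small p. A minimal Python sketch for reproducing the entries discussed in this thread; the function name `max_entries` is made up here, and the formula is the standard approximation rather than anything specific to the emailBL.

```python
import math


def max_entries(bits: int, p: float) -> float:
    """Approximate number of random hashes that can be stored before the
    chance of any collision reaches p (birthday bound, valid for small p)."""
    return math.sqrt(2.0 * (2.0 ** bits) * p)


# Print the table cells relevant to half-length (64-bit) and full (128-bit) MD5.
for bits in (64, 128):
    for p in (1e-18, 1e-15, 1e-12, 1e-6):
        print(f"{bits}-bit, p={p:g}: about {max_entries(bits, p):,.0f} entries")
```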
 
This is not to suggest that I ever understood the part about using half-length MD5.

  Jeff Moss

Re: my emailBL is live!

Posted by Adam Katz <an...@khopis.com>.

Jeff Moss wrote:
> The chance of a collision really is much smaller than I thought, even
> including the birthday paradox.  But rather than just say it's small and
> ask you to take my word for it I'm providing a link.  The Wikipedia page
> for Birthday Attack has a chart that shows the probability of collision
> for hashes of various lengths.
> 
> http://en.wikipedia.org/wiki/Birthday_attack

Well nuts.  Unless my estimation is wrong, my half-length MD5sum would
be 64-bit and thus the 10^-18 probability of collisions would require
a db of 190 entries rather than full-length MD5sum's 820 billion.

Unless corrected, I'll revise my algorithm this evening.

RE: my emailBL is live!

Posted by Jeff Moss <jm...@Huffmancorp.com>.

Rob McEwen wrote:

>>> A word of caution.  Be very careful how you use the list.
>>
>> OK. I was wrong. Due to this discussion, I'm convinced that MD5 of the
>> whole (lower case!) e-mail address is best, with the entire e-mail
>> address still showing up in plain text in the DNS txt record.
>>
>> But I have some questions:
>>
>> (1) Is MD5 of the entire address reasonably safe from collisions?
>> (consider the 'birthday paradox' before being too quick to answer)
>
>Yes. The chance of a collision is ridiculously small. Not worth worrying
>about.

The chance of a collision really is much smaller than I thought, even
including the birthday paradox.  But rather than just say it's small and
ask you to take my word for it, I'm providing a link.  The Wikipedia page
for Birthday Attack has a chart that shows the probability of collision
for hashes of various lengths.

http://en.wikipedia.org/wiki/Birthday_attack

  Jeff Moss

Re: my emailBL is live!

Posted by Mike Cardwell <sp...@lists.grepular.com>.

Rob McEwen wrote:

>> A word of caution.  Be very careful how you use the list.
> 
> OK. I was wrong. Due to this discussion, I'm convinced that MD5 of the
> whole (lower case!) e-mail address is best, with the entire e-mail
> address still showing up in plain text in the DNS txt record.
> 
> But I have some questions:
> 
> (1) Is MD5 of the entire address reasonably safe from collisions?
> (consider the 'birthday paradox' before being too quick to answer)

Yes. The chance of a collision is ridiculously small. Not worth worrying 
about.

> (2) I'm also interested in knowing more specifics about the data found
> at
> http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses
> 
> (2.a.) how frequently are new scam addresses added to that list?
> 
> (2.b.) how long does an address take to expire since the last e-mail
> address is used for scams "in the wild"
> 
> (2.c.) Is the data auto-added? or must e-mail addresses go through a
> manual review first?
> 
> (2.d.) Moreover, what is a typical time between the "419" spammer's last
> spotted use of the e-mail, and appearance in that list?
> 
> (I don't need exactly precise answers which spammers might use to 'game'
> the system... just basic estimates will do)

There's actually a mailing list for the project. You're probably better 
off asking these questions there:

http://groups.google.com/group/anti-phishing-email-reply-discuss

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: my emailBL is live!

Posted by Jesse Thompson <je...@doit.wisc.edu>.

Rob McEwen wrote:
> Jesse Thompson wrote:
>> A word of caution.  Be very careful how you use the list.
> 
> OK. I was wrong. Due to this discussion, I'm convinced that MD5 of the
> whole (lower case!) e-mail address is best, with the entire e-mail
> address still showing up in plain text in the DNS txt record.
> 
> But I have some questions:
> 
> (1) Is MD5 of the entire address reasonably safe from collisions?
> (consider the 'birthday paradox' before being too quick to answer)
> 
> (2) I'm also interested in knowing more specifics about the data found
> at
> http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses
> 
> (2.a.) how frequently are new scam addresses added to that list?

Every day.  Contributors add addresses when they find them.

> (2.b.) how long does an address take to expire since the last e-mail
> address is used for scams "in the wild"

They don't expire.  You can use the date to make up your own policies 
depending on what you are doing.

We do have a 'phishing_cleared_addresses' list which we use when we get 
confirmation that an account has been locked down.  Addresses on the 
cleared list are automatically removed from the 
'phishing_reply_addresses' list if the activity date is older than the 
cleared date.
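
The removal rule can be sketched in a few lines of Python. The dict-based layout is an assumption (the format of the cleared list is not described here), the name `apply_cleared` is made up, and keeping entries whose activity date equals the cleared date is a guess at the edge case.

```python
def apply_cleared(reply_addresses, cleared_addresses):
    """Both arguments map address -> 'YYYYMMDD'. Returns the surviving entries.

    An address is dropped only if it is on the cleared list and its last
    activity is older than the date it was cleared. YYYYMMDD strings sort
    chronologically, so plain string comparison is enough.
    """
    return {
        addr: seen
        for addr, seen in reply_addresses.items()
        if not (addr in cleared_addresses and seen < cleared_addresses[addr])
    }
```

An address that shows new phishing activity after being cleared therefore stays listed.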

> (2.c.) Is the data auto-added? or must e-mail addresses go through a
> manual review first?

Manually added.  But I can't speak for the methods of everyone that 
contributes.

> (2.d.) Moreover, what is a typical time between the "419" spammer's last
> spotted use of the e-mail, and appearance in that list?

It's reactive, so the spam must be received before it can be discovered.

> (I don't need exactly precise answers which spammers might use to 'game'
> the system... just basic estimates will do)

Jesse

-- 
   Jesse Thompson
   Division of Information Technology, University of Wisconsin-Madison
   Email/IM: jesse.thompson@doit.wisc.edu

Re: my emailBL is live!

Posted by Rob McEwen <ro...@invaluement.com>.

Jesse Thompson wrote:
> A word of caution.  Be very careful how you use the list.

OK. I was wrong. Due to this discussion, I'm convinced that MD5 of the
whole (lower case!) e-mail address is best, with the entire e-mail
address still showing up in plain text in the DNS txt record.
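
A minimal Python sketch of what building such a lookup name might look like. The zone `emailbl.example.org` is a placeholder, the function name `query_name` is made up, and using the full hex digest as a single DNS label is an assumption about the layout.

```python
import hashlib

ZONE = "emailbl.example.org"  # hypothetical zone name, not the real list


def query_name(address: str, zone: str = ZONE) -> str:
    """Return '<md5 of the lower-cased address>.<zone>' for a DNS lookup."""
    normalised = address.strip().lower()            # lower case before hashing
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    return f"{digest}.{zone}"                       # 32 hex chars fit in one label
```

Lower-casing before hashing is what lets `Test@Example.COM` and `test@example.com` hit the same record.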

But I have some questions:

(1) Is MD5 of the entire address reasonably safe from collisions?
(consider the 'birthday paradox' before being too quick to answer)

(2) I'm also interested in knowing more specifics about the data found
at
http://anti-phishing-email-reply.googlecode.com/svn/trunk/phishing_reply_addresses

(2.a.) how frequently are new scam addresses added to that list?

(2.b.) how long does an address take to expire since the last e-mail
address is used for scams "in the wild"

(2.c.) Is the data auto-added? or must e-mail addresses go through a
manual review first?

(2.d.) Moreover, what is a typical time between the "419" spammer's last
spotted use of the e-mail, and appearance in that list?

(I don't need exactly precise answers which spammers might use to 'game'
the system... just basic estimates will do)

-- 
Rob McEwen
http://dnsbl.invaluement.com/
rob@invaluement.com
+1 (478) 475-9032

Re: 419 emailBL?

Posted by Mike Cardwell <sp...@lists.grepular.com>.

mouss wrote:

>>> Is the best way to do this - not via DNS.
>> Depends what you're trying to achieve. I thought the objective was a
>> block list of email addresses that could be queried via the DNS by any
>> application... Your suggestion doesn't really capture the requirements.
> and what is the benefit of using DNS? why not rsync/svn/wget/... ?
> 
>> In this particular example, the list should be used for preventing your
>> users sending emails *to* those addresses. Many organisations rightly or
>> wrongly don't perform spam filtering on their outgoing relays so
>> spamassassin is a bit over the top when you can just use another dns
>> based bl.
>
> with rsync or the like, you can simply add the addresses (no MD5, no
> anything) to an access list that your MTA can use.

It sounds like you're asking me what the benefit of distributing a block
list via DNS is. If so, type "dnsbl" into Google. If not, please
clarify ...

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: 419 emailBL?

Posted by mouss <mo...@ml.netoyen.net>.

Mike Cardwell a écrit :
> Steve Freegard wrote:
> [snip]
>>
>> Is the best way to do this - not via DNS.
> 
> Depends what you're trying to achieve. I thought the objective was a
> block list of email addresses that could be queried via the DNS by any
> application... Your suggestion doesn't really capture the requirements.
> 

and what is the benefit of using DNS? why not rsync/svn/wget/... ?


> In this particular example, the list should be used for preventing your
> users sending emails *to* those addresses. Many organisations rightly or
> wrongly don't perform spam filtering on their outgoing relays so
> spamassassin is a bit over the top when you can just use another dns
> based bl.
> 

with rsync or the like, you can simply add the addresses (no MD5, no
anything) to an access list that your MTA can use.
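
A minimal sketch of that approach, assuming the upstream ADDRESS,TYPE,DATE layout described at the top of the thread. The "address ACTION" output format, the skipping of "#" comment lines, and the function name are all assumptions for illustration; adapt them to whatever your MTA's access list expects.

```python
def to_access_list(upstream_lines, action="REJECT"):
    """Turn upstream 'ADDRESS,TYPE,DATE' lines into 'address ACTION' entries."""
    entries = []
    for line in upstream_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed: blank lines and '#' comments carry no data
        address = line.split(",", 1)[0].lower()
        entries.append("%s %s" % (address, action))
    return entries
```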

Re: [SA] 419 emailBL?

Posted by Mike Cardwell <sp...@lists.grepular.com>.

Adam Katz wrote:

>>>> For listing both emails and uri's it would be useful if you could add
>>>> regular expressions. [...]
> 
> Steve Freegard responded:
>>> Yuck; if you want to do stuff using regexp then:
>>>
>>> uri RULE_NAME /<regexp>/
>>> score RULE_NAME nn.nnn
>>>
>>> Is the best way to do this - not via DNS.
> 
> Mike Cardwell defended:
>> Depends what you're trying to achieve. I thought the objective was a
>> block list of email addresses that could be queried via the DNS by any
>> application... Your suggestion doesn't really capture the requirements.
>>
>> In this particular example, the list should be used for preventing your
>> users sending emails *to* those addresses. Many organisations rightly or
>> wrongly don't perform spam filtering on their outgoing relays so
>> spamassassin is a bit over the top when you can just use another dns
>> based bl.
> 
> If by "any application" you mean "any application that can handle
> full-blown perl regular expressions" ... your regex examples are
> nontrivial, so you're already pretty much catering to SA anyway.

You completely misunderstood what I was suggesting. On the server side I 
shove this in my list:

^foo-\d+@example\.com$

Then when the client looks up foo-5@example.com I return a positive 
result. The client needs no regex capability.
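
A minimal sketch of the idea, not the code of any real DNS server: the pattern list lives only on the server side (the PATTERNS list and is_listed name are hypothetical), and the client just asks about a literal address.

```python
import re

# Hypothetical server-side pattern list; clients never see it.
PATTERNS = [re.compile(r"^foo-\d+@example\.com$")]

def is_listed(address):
    """Return True if the looked-up address matches any server-side pattern."""
    return any(pattern.search(address) for pattern in PATTERNS)
```

A real server would call something like this from its query handler and answer with a 127.0.0.x record on a match.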

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: [SA] 419 emailBL?

Posted by Adam Katz <an...@khopis.com>.

Mike Cardwell wrote:
>>> For listing both emails and uri's it would be useful if you could add
>>> regular expressions. [...]

Steve Freegard responded:
>> Yuck; if you want to do stuff using regexp then:
>>
>> uri RULE_NAME /<regexp>/
>> score RULE_NAME nn.nnn
>>
>> Is the best way to do this - not via DNS.

Mike Cardwell defended:
> Depends what you're trying to achieve. I thought the objective was a
> block list of email addresses that could be queried via the DNS by any
> application... Your suggestion doesn't really capture the requirements.
> 
> In this particular example, the list should be used for preventing your
> users sending emails *to* those addresses. Many organisations rightly or
> wrongly don't perform spam filtering on their outgoing relays so
> spamassassin is a bit over the top when you can just use another dns
> based bl.

If by "any application" you mean "any application that can handle
full-blown perl regular expressions" ... your regex examples are
nontrivial, so you're already pretty much catering to SA anyway.

There's also the question of handling quotes and other forbidden
characters in the TXT field, plus its length limit.  Once that's all
solved, the question of feasibility and efficiency still looms.

Given the options of putting that kind of thing in (A) DNS or (B)
sa-channels, I'd lean towards (B) on the way to (C) something else:

I'm sure Justin Mason (for his sought channel) has thought long and
hard about this.  The mechanism for sa-update is brilliant, but
doesn't lend itself to enormous indices of frequently-changing
rulesets.  Even if it were revised to enable a diff/patch system (hint
hint), it would still fail to distribute the remaining load.

Justin:  Perhaps sa-update could support [version].torrent in addition
to [version].tar.gz on each mirror?  (This doesn't touch the current
DNS-based version/announce system.)  Channels hosted for versions of
SA after the supporting release (e.g. 0.4.3.[channel] and "higher")
would be allowed to host only the torrent file.

Either the self-healing nature of BT would implement the diffing
portion for free, or SA's BT client would merely choose which files in
the torrent to download (assuming there are perl-based clients that
support that... libtorrent does, but that's C-based), as it would
contain full.cf, [n-1].diff, [n-2].diff, [n-3].diff, and [last release
yesterday].diff (or the like).

... this is similar to my proposal for a distributed Blue Frog rehash,
http://khopesh.com/wiki/Ending_spam

-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-spam

Re: 419 emailBL?

Posted by Mike Cardwell <sp...@lists.grepular.com>.

Steve Freegard wrote:

>> For listing both emails and uri's it would be useful if you could add
>> regular expressions. I'm not sure how you'd serve such an RBL though
>> without writing your own custom software or modifying an existing dns
>> server. Eg, it would be nice if you could add entries like this to the rbl:
>>
>> ^(?i)https?://[a-z]+\.example\.com/unsubscribe\.cgi\?id=\d+$
>>
>> And:
>>
>> ^(?i)customer-service-[A-Z]\d+@example\.(?:com|co\.uk)$
>>
> 
> Yuck; if you want to do stuff using regexp then:
> 
> uri RULE_NAME /<regexp>/
> score RULE_NAME nn.nnn
> 
> Is the best way to do this - not via DNS.

Depends what you're trying to achieve. I thought the objective was a 
block list of email addresses that could be queried via the DNS by any 
application... Your suggestion doesn't really capture the requirements.

In this particular example, the list should be used for preventing your 
users sending emails *to* those addresses. Many organisations rightly or 
wrongly don't perform spam filtering on their outgoing relays so 
spamassassin is a bit over the top when you can just use another dns 
based bl.

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: 419 emailBL?

Posted by Steve Freegard <st...@stevefreegard.com>.

Mike Cardwell wrote:
> Steve Freegard wrote:
> 
>>>> A word of caution.  Be very careful how you use the list.  The
>>>> intended usage for the list is to prevent (or monitor) local users
>>>> from sending email to the listed addresses.  The phishers frequently
>>>> use compromised end-user accounts to receive the phishing replies, so
>>>> there is a high risk of false positives, especially if you attempt to
>>>> classify messages containing one of these addresses as spam.
>>> Thread fork!
>>>
>>> Would it be useful to have a similar list for 419 fraud contact
>>> addresses?
>>>
>>> Discuss...
>>
>> That was always my intention - there are a couple of us looking at
>> several methods of automatically listing e-mail addresses present in the
>> body of spam or the Reply-To header to specifically target stuff that
>> often slips through with low scores.
>>
>> I'm also looking at listing URIs that are impossible to list in the
>> traditional URIBLs  e.g. groups.yahoo.com/groupname/message/1
> 
> For listing both emails and uri's it would be useful if you could add
> regular expressions. I'm not sure how you'd serve such an RBL though
> without writing your own custom software or modifying an existing dns
> server. Eg, it would be nice if you could add entries like this to the rbl:
> 
> ^(?i)https?://[a-z]+\.example\.com/unsubscribe\.cgi\?id=\d+$
> 
> And:
> 
> ^(?i)customer-service-[A-Z]\d+@example\.(?:com|co\.uk)$
> 

Yuck; if you want to do stuff using regexp then:

uri RULE_NAME /<regexp>/
score RULE_NAME nn.nnn

Is the best way to do this - not via DNS.

Regards,
Steve.

Re: 419 emailBL?

Posted by Mike Cardwell <sp...@lists.grepular.com>.

Steve Freegard wrote:

>>> A word of caution.  Be very careful how you use the list.  The
>>> intended usage for the list is to prevent (or monitor) local users
>>> from sending email to the listed addresses.  The phishers frequently
>>> use compromised end-user accounts to receive the phishing replies, so
>>> there is a high risk of false positives, especially if you attempt to
>>> classify messages containing one of these addresses as spam.
>> Thread fork!
>>
>> Would it be useful to have a similar list for 419 fraud contact addresses?
>>
>> Discuss...
> 
> That was always my intention - there are a couple of us looking at
> several methods of automatically listing e-mail addresses present in the
> body of spam or the Reply-To header to specifically target stuff that
> often slips through with low scores.
> 
> I'm also looking at listing URIs that are impossible to list in the
> traditional URIBLs  e.g. groups.yahoo.com/groupname/message/1

For listing both emails and uri's it would be useful if you could add 
regular expressions. I'm not sure how you'd serve such an RBL though 
without writing your own custom software or modifying an existing dns 
server. Eg, it would be nice if you could add entries like this to the rbl:

^(?i)https?://[a-z]+\.example\.com/unsubscribe\.cgi\?id=\d+$

And:

^(?i)customer-service-[A-Z]\d+@example\.(?:com|co\.uk)$

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: 419 emailBL?

Posted by Steve Freegard <st...@stevefreegard.com>.

John Hardin wrote:
> On Wed, 29 Apr 2009, Jesse Thompson wrote:
> 
>> A word of caution.  Be very careful how you use the list.  The
>> intended usage for the list is to prevent (or monitor) local users
>> from sending email to the listed addresses.  The phishers frequently
>> use compromised end-user accounts to receive the phishing replies, so
>> there is a high risk of false positives, especially if you attempt to
>> classify messages containing one of these addresses as spam.
> 
> Thread fork!
> 
> Would it be useful to have a similar list for 419 fraud contact addresses?
> 
> Discuss...
> 

That was always my intention - there are a couple of us looking at
several methods of automatically listing e-mail addresses present in the
body of spam or the Reply-To header to specifically target stuff that
often slips through with low scores.

I'm also looking at listing URIs that are impossible to list in the
traditional URIBLs  e.g. groups.yahoo.com/groupname/message/1

Cheers,
Steve.

419 emailBL?

Posted by John Hardin <jh...@impsec.org>.

On Wed, 29 Apr 2009, Jesse Thompson wrote:

> A word of caution.  Be very careful how you use the list.  The intended 
> usage for the list is to prevent (or monitor) local users from sending 
> email to the listed addresses.  The phishers frequently use compromised 
> end-user accounts to receive the phishing replies, so there is a high 
> risk of false positives, especially if you attempt to classify messages 
> containing one of these addresses as spam.

Thread fork!

Would it be useful to have a similar list for 419 fraud contact addresses?

Discuss...

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   You do not examine legislation in the light of the benefits it
   will convey if properly administered, but in the light of the
   wrongs it would do and the harms it would cause if improperly
   administered.                                  -- Lyndon B. Johnson
-----------------------------------------------------------------------
  9 days until the 64th anniversary of VE day

Re: my emailBL is live!

Posted by John Hardin <jh...@impsec.org>.

On Wed, 29 Apr 2009, Adam Katz wrote:

> Okay, back to using the second half of the MD5 (simple enough, since
> that was my original implementation).  Relevant code:
>
> $hash =~ s/@.*//;
> $hash =~ tr [A-Z] [a-z];
> $hash = substr(Digest::MD5::md5_hex($hash),16); # 2nd 16 of 32 chars

...can you go through your logic for throwing away half of the MD5 again?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   We are hell-bent and determined to allocate the talent, the
   resources, the money, the innovation to absolutely become a
   powerhouse in the ad business.       -- Microsoft CEO Steve Ballmer
   ...because allocating talent to securing Windows isn't profitable?
-----------------------------------------------------------------------
  9 days until the 64th anniversary of VE day

Re: my emailBL is live!

Posted by Adam Katz <an...@khopis.com>.

Jesse Thompson wrote:
> A word of caution.  Be very careful how you use the list.  The
> intended usage for the list is to prevent (or monitor) local users
> from sending email to the listed addresses.  The phishers
> frequently use compromised end-user accounts to receive the
> phishing replies, so there is a high risk of false positives,
> especially if you attempt to classify messages containing one of these
> addresses as spam.

That might just mean that the SpamAssassin rule that uses it should
require some other phishing-detection rule(s) to hit as well.

Rob McEwen wrote:
> OK. I was wrong. Due to this discussion, I'm convinced that MD5 of
> the whole (lower case!) e-mail address is best, with the entire
> e-mail address still showing up in plain text in the DNS txt
> record.

Okay, back to using the second half of the MD5 (simple enough, since
that was my original implementation).  Relevant code:

$hash =~ s/@.*//;
$hash =~ tr [A-Z] [a-z];
$hash = substr(Digest::MD5::md5_hex($hash),16); # 2nd 16 of 32 chars

You can look up an address by hash by pretending its domain is "hash"
(note the collision in the example).
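
For anyone not using Perl, here is a Python sketch of the same hashing steps (drop the domain, lower-case, keep the second half of the hex MD5). The function name is made up for illustration.

```python
import hashlib

def ebl_hash(address):
    """Second 16 hex characters of the MD5 of the lower-cased local part."""
    local_part = address.split("@", 1)[0].lower()  # same effect as s/@.*//
    return hashlib.md5(local_part.encode("utf-8")).hexdigest()[16:]
```

Because only the local part is hashed, test@example.com and test@emailbl.khopesh.com produce the same hash, which is the collision mentioned above.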

I've added support for a new type, Z.  Z means the email address
should not be revealed (perhaps this should always be the case?).  The
TXT record for hash lookup will return "@hidden@" ... see the test for
hidden@ example.com below.

David B Funk recommended SURBL's merged results as a bandwidth-saver:
> EG: A == 127.0.0.2
>     B == 127.0.0.4
>     C == 127.0.0.8
>     D == 127.0.0.16
> 
> thus AB == 127.0.0.6
>     AC == 127.0.0.10
> 
> etc.

I like it!  Why not start at one? A=1 B=2 C=4 D=8 Z=n/a.  This
facilitates:

  $type_list =~ s/.*,([A-IZ]+),.*/$1/;
  if ($type_list =~ /Z/) {
    $email =~ s/\t".*"/\t"\@hidden\@"/; # hide the email address
    $type_list =~ s/Z//g;
  }
  $type_list =~ s/(?=.)/+/g;
  $type_list =~ tr [ABCD] [1248]; # rewrite when we get an E!
  $type_list = eval 0 . $type_list;
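
The same A=1 B=2 C=4 D=8 mapping as a Python sketch, for comparison. The names are hypothetical, and unlike the Perl above it raises a KeyError on letters it does not know rather than passing them through.

```python
TYPE_BITS = {"A": 1, "B": 2, "C": 4, "D": 8}

def encode_types(type_list):
    """Return (A-record value, hidden flag) for a TYPE string such as 'ACZ'."""
    hidden = "Z" in type_list  # Z only hides the address; it sets no bit
    mask = 0
    for letter in type_list:
        if letter != "Z":
            mask |= TYPE_BITS[letter]
    return "127.0.0.%d" % mask, hidden
```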

Here are some tests:

$ eblhash() { perl -MDigest::MD5 -e \
  'print substr(Digest::MD5::md5_hex(q('$1')),16)."\n"'; }
$ eblhash test
cade4e832627b4f6
$ host -t txt `eblhash test`.hash.emailbl.khopesh.com.
cade4e832627b4f6.hash.emailbl.khopesh.com descriptive text
"test@example.com"
cade4e832627b4f6.hash.emailbl.khopesh.com descriptive text
"test@emailbl.khopesh.com"
$ host `eblhash test`.example.com.emailbl.khopesh.com.
cade4e832627b4f6.example.com.emailbl.khopesh.com has address 127.0.0.15
$ host -t txt `eblhash hidden`.hash.emailbl.khopesh.com.
e8238a6c0be92190.hash.emailbl.khopesh.com descriptive text "@hidden@"

For test purposes, "test" and "hidden" are also included as their own
hashes.  For legacy purposes, the truncated username model used
earlier is still there.  I'll remove it in time.

I should also mention that this index is updated regularly and will
stay up until it suffers from its success or misuse.

The last-seen date stamp on test@example.com serves as the last day it
was updated, and the SOA record for my khopesh.com domain shows the
last DNS update (the last two digits are the EDT hour or higher if
I've been toying with it).  My sa-channels update every four hours.

-----

I'm also toying with the idea of making the khop-sc-neighbors list
(currently an sa-update channel) a DNSBL with return codes indicating
its networks' rank on an inverse scale from 1-100 (though not
necessarily with a hundred entries), so 127.0.0.100 means the
top-ranked spamming network and 127.0.0.1 is the lowest noted spamming
network.  This becomes a percent to apply a plugin configuration
option's multiplier, so a multiplier of 3 would give the top-ranked
network (127.0.0.100) a score of 3 and a network in the middle
(127.0.0.50) a score of 1.5, etc.  The khop-sc-neighbors channel
examines /24 CIDRs (x in a.b.c.x) and /8 CIDRs (x.y.z in a.x.y.z),
which would be 127.0.CIDR.PERCENT, so 127.0.24.100 would be the top
/24 offender and 127.0.8.100 would be the top /8 offender.
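
To make the arithmetic concrete, here is a sketch of how a plugin might turn such an answer into a score. Nothing like this exists yet; the function name and the error handling are assumptions.

```python
def neighbor_score(answer, multiplier):
    """Split a 127.0.CIDR.PERCENT answer into (prefix length, score)."""
    octets = [int(part) for part in answer.split(".")]
    if len(octets) != 4 or octets[0] != 127 or octets[1] != 0:
        raise ValueError("unexpected DNSBL answer: %r" % answer)
    prefix_len, percent = octets[2], octets[3]
    return prefix_len, multiplier * percent / 100.0
```

With a multiplier of 3, the answer 127.0.24.100 yields a score of 3.0 for the /24 check and 127.0.8.50 yields 1.5 for the /8 check.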

-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-spam

Re: my emailBL is live!

Posted by Jesse Thompson <je...@doit.wisc.edu>.

Adam Katz wrote:
> This was actually rather simple to set up.  I'll publish the code
[snip]

Thanks for your efforts with this.  I forwarded your message to the APER 
mailing list.

A word of caution.  Be very careful how you use the list.  The intended 
usage for the list is to prevent (or monitor) local users from sending 
email to the listed addresses.  The phishers frequently use compromised 
end-user accounts to receive the phishing replies, so there is a high 
risk of false positives, especially if you attempt to classify messages 
containing one of these addresses as spam.

Jesse

-- 
   Jesse Thompson
   Division of Information Technology, University of Wisconsin-Madison
   Email/IM: jesse.thompson@doit.wisc.edu

Re: my emailBL is live!

Posted by John Wilcock <jo...@tradoc.fr>.

Le 29/04/2009 02:40, Adam Katz a écrit :
> replaces the @ with a dot (not an underscore, that's not a legal
> character).

Won't that pose problems distinguishing between fred.bloggs@example.tld 
and fred@bloggs.example.tld ?
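
A quick illustration of the clash, using a simplified version of the mapping (no truncation or defanging) and a placeholder zone name, both of which are assumptions:

```python
def naive_name(address, zone="emailbl.example"):
    """Simplified mapping: replace '@' with '.' and append the zone."""
    return address.lower().replace("@", ".") + "." + zone

same = naive_name("fred.bloggs@example.tld") == naive_name("fred@bloggs.example.tld")
```

Here `same` is True: both addresses map to fred.bloggs.example.tld under the zone.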

John.

-- 
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages    - www.tradoc.fr

Re: my emailBL is live!

Posted by Mike Cardwell <sp...@lists.grepular.com>.

Adam Katz wrote:

> Mike Cardwell contended:
>>> It would definitely require a hashing algorithm, like MD5. IIRC
>>> there is a maximum length for a hostname, and that is 255
>>> characters. What if the hostname in your email address is 255
>>> characters long on its own...?
> 
> When MD5sums were first proposed (in place of my wild escaping), it
> seemed like a great idea.  However, a voice in the back of my head,
> now spoken (typed?) by Rob, has been growing louder.  My
> implementation now merely truncates email usernames to 16 characters
> (plus the noted defanging, which makes it complicated again ...) and
> replaces the @ with a dot (not an underscore, that's not a legal
> character).

Hmmm. I'm still not convinced you've done it the best way. That 
conversion sounds a lot more complicated than a straight MD5 conversion, 
and it doesn't deal with the fact that there is a maximum length for an 
FQDN.

> In fact, collisions here could be regarded as good, as usernames that
> long can include tracking strings (e.g. the mailer for our list,
> users-return-12345-joe=bob.com@ spamassassin.apache.org, becomes
> users-return-123.spamassassin.apache.org), which should help.

That could be seen as an advantage I suppose. But, the particular source 
list being used here wasn't meant to be used that way. Some people might 
consider such hits as false positives.

> I did fully implement my proposed latter 16 characters (of MD5's 32)
> plus dot plus the domain, complete with hash lookups, but I just
> removed it (which is why non-test lookups will fail for the next ~4h).
> 
>>> Having access to the plain text email address would only make it
>>> easier for ISPs to do anything if they had access to the zone file.
>>> In which case, you could just give them access to a separate list
>>> which has the email addresses in plain text.
> 
> Unless we're replacing the currently well-groomed upstream source at
> http://anti-phishing-email-reply.googlecode.com/#, I see no reason to
> offer such services (since they do it better).
> 
>>> So in rbldnsd, ...
> 
> Whoa, what's that?!  Interesting ... it's even in Debian.  I think I'm
> happy with BIND for the moment, since my origin point is hidden from
> use and the actual NS records are merely slaves run by zoneedit (so
> efficiency isn't really important).  I probably need to stay on BIND
> as I doubt I could use rbldnsd to host my SpamAssassin channels.

I implemented pretty much exactly the same thing that you did, except it 
uses a straight hexadecimal MD5 digest of the full address. I know this 
isn't strictly correct as the local part of an email address is 
technically case sensitive, but as email addresses in the real world are 
case *insensitive* I convert it to lower case before hashing.

Eg:

root@haven:/var/lib/rbldns# host -t a bda05135a5b8a92d5d2934531864442d.phishing.email.rbl.grepular.com
bda05135a5b8a92d5d2934531864442d.phishing.email.rbl.grepular.com A       127.0.0.3
bda05135a5b8a92d5d2934531864442d.phishing.email.rbl.grepular.com A       127.0.0.1
root@haven:/var/lib/rbldns# host -t txt bda05135a5b8a92d5d2934531864442d.phishing.email.rbl.grepular.com
bda05135a5b8a92d5d2934531864442d.phishing.email.rbl.grepular.com TXT     "20090411"
root@haven:/var/lib/rbldns#
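
For reference, a query name of that shape can be built as in the Python sketch below. The function name is made up; the zone is the temporary test zone shown above.

```python
import hashlib

def full_md5_query(address, zone="phishing.email.rbl.grepular.com"):
    """Hex MD5 of the whole lower-cased address, prepended to the zone."""
    digest = hashlib.md5(address.lower().encode("utf-8")).hexdigest()
    return digest + "." + zone
```

The first label is always 32 characters, so it stays well inside the 63-character label limit regardless of how long the address is.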

That RBL won't stay public for long so don't use it for anything other 
than a quick test.

Here's the code I use to download the data and populate an rbldnsd file:

https://secure.grepular.com/phishing_addresses.txt

You might find something you can strip out and re-use.

Here are the Exim acls I use to query it for the envelope sender, From 
header and Reply-to headers:

acl_smtp_mail:

deny dnslists   = phishing.email.rbl.grepular.com/${md5:${lc:$sender_address}}

acl_smtp_data:

deny dnslists   = phishing.email.rbl.grepular.com/${md5:${lc:${address:$h_From:}}}

deny dnslists   = phishing.email.rbl.grepular.com/${md5:${lc:${address:$h_Reply-To:}}}

I'm not familiar enough with writing SpamAssassin rules yet to write a 
SpamAssassin recipe.

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: my emailBL is live!

Posted by Mike Cardwell <sp...@lists.grepular.com>.

David B Funk wrote:

>> When MD5sums were first proposed (in place of my wild escaping), it
>> seemed like a great idea.  However, a voice in the back of my head,
>> now spoken (typed?) by Rob, has been growing louder.  My
>> implementation now merely truncates email usernames to 16 characters
>> (plus the noted defanging, which makes it complicated again ...) and
>> replaces the @ with a dot (not an underscore, that's not a legal
>> character).
> 
> Repeat after me, ALMOST ALL characters (octets actually) are now
> LEGAL in DNS queries (see RFC-2181 section 11).
> 
> There is NO need for -any- kind of munging.

That same RFC says labels are limited to 63 chars and FQDNs are limited 
to 255 chars. So you'd need to munge for those two cases, wouldn't you? 
Also, are you 100% sure there are no characters that are allowed in an 
email address local part which aren't allowed in a domain name?
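
A small sketch of those two length checks, assuming the 255 limit is counted on the wire encoding (one length octet per label plus the terminating root octet). The function name is hypothetical, and it checks lengths only, not which characters are acceptable.

```python
def fits_dns_limits(name):
    """Check the 63-octet label limit and the 255-octet name limit."""
    if name.endswith("."):
        name = name[:-1]  # ignore the trailing root dot
    labels = [label.encode("utf-8") for label in name.split(".")]
    if any(len(label) == 0 or len(label) > 63 for label in labels):
        return False
    # Wire format: one length octet per label plus the root's zero octet.
    wire_length = sum(len(label) + 1 for label in labels) + 1
    return wire_length <= 255
```

An unmunged address with a local part longer than 63 characters fails the first check no matter what the server allows.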

> I've set up an emailBL directly from the Google list, try:
> 
>  host abuse-t@live.com.phish.icaen.uiowa.edu.

"host" on my Debian system spits out warnings. It does however do the 
lookup correctly. You must recognise that there will be compatibility 
problems with your solution in the wild though. One example being Exim's 
dnsdb lookup type, which fails outright doing that lookup.

Here's the warning I get from "host".

host -t a abuse-t@live.com.phish.icaen.uiowa.edu
  *** invalid answer name abuse-t\@live.com.phish.icaen.uiowa.edu after A query for abuse-t@live.com.phish.icaen.uiowa.edu
abuse-t\@live.com.phish.icaen.uiowa.edu	A	127.0.0.2
  !!! abuse-t\@live.com.phish.icaen.uiowa.edu A record has illegal name

What exactly is the problem with hashing the address anyway? We'll 
forget accidental collisions as they simply won't happen.

> IE "address.phish.icaen.uiowa.edu"
> 
> NO need for hashing, no collisions, etc.
>
> Also makes it easier to deploy into an address filter/blocker in
> your smtp-MTA (to prevent local llusers from replying to one
> of those addresses).
>
> 
> BTW notice that the Google data is multi-valued in the TYPE field.
> rather than a simple enumeration of that data into an address it
> is better to turn it into a bit-mask, as then multiple values can
> be represented (and queried) in a single address/operation.
> 
> EG: A == 127.0.0.2
>     B == 127.0.0.4
>     C == 127.0.0.8
>     D == 127.0.0.16
> 
> thus AB == 127.0.0.6
>     AC == 127.0.0.10
> 
> etc.
> 
> So the entry for 'abuse-t@live.com' only has an 'A' type.
> 
>  host account-teamdept@live.com.phish.icaen.uiowa.edu. => 127.0.0.10
> 
> so the entry for 'account-teamdept@live.com' has an 'A' & 'C' type.

Yeah, that might be a good idea.
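
On the client side, decoding such an answer is a few bit tests. This sketch uses the values quoted above (A=2, B=4, C=8, D=16); the names are made up for illustration.

```python
import ipaddress

FUNK_BITS = {"A": 2, "B": 4, "C": 8, "D": 16}

def decode_types(answer):
    """Return the TYPE letters encoded in the last octet of a 127.0.0.x answer."""
    last_octet = int(ipaddress.IPv4Address(answer)) & 0xFF
    return "".join(letter for letter, bit in sorted(FUNK_BITS.items())
                   if last_octet & bit)
```

One lookup of 127.0.0.10 therefore yields both A and C, where the round-robin model needs a separate A record per type.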

-- 
Mike Cardwell
(https://secure.grepular.com/) (http://perlcv.com/)

Re: my emailBL is live!

Posted by Adam Katz <an...@khopis.com>.

David B Funk wrote:
> Umm, I guess you didn't understand what the ".phish.icaen.uiowa.edu" part
> of "address.phish.icaen.uiowa.edu" meant.

D'oh!  Sorry, doing too many things at once.  You're right, that
worked for me.  However, you still have Mike's issue of 63 characters
per label and 255 characters total, the support issue, plus all the
wasted bandwidth with such a long name.

Also, I'd be hesitant dealing with certain special characters,
technically legal but potentially dangerous, like [@*"'?%,] et al.

> Unless you've got an obsolete version of software this does work.
> In bind if you use the "check-names ignore" option for that zone it
> does -NOT- require munging. (I'm running mine that way, so I know
> that it works.)

Isn't there a reason they recently re-enabled check-names by default?

> Have you followed the development of the SURBL service? They
> explicitly switched to the bit-mask format to reduce DNS load.

Obviously not.  Interesting.

Re: my emailBL is live!

Posted by David B Funk <db...@engineering.uiowa.edu>.

On Wed, 29 Apr 2009, Adam Katz wrote:

> David B Funk wrote:
> > Repeat after me, ALMOST ALL characters (octets actually) are now
> > LEGAL in DNS queries (see RFC-2181 section 11).
> >
> > There is NO need for -any- kind of munging.
>
> First, you must start and end a domain label ("octet" refers to IP
> addresses) with a letter or number, so munging is still required.
> Second, DNS thrives on caching, peering, and slaves; if BIND or other
> major name servers can't handle it, it won't fly.  I'm running the
> latest version of BIND and it required each of the munging steps I
> implemented (except the truncation to 16 chars, which was for
> bandwidth) in order to work.
>
> Also, some of the addresses are forged and should not be listed in the
> plain anyway.  More on that in my next email announcing my md5-enabled
> list, in which I'll propose a type Z for "do not reveal this address."
>
> >     host abuse-t@live.com.phish.icaen.uiowa.edu.
> > NO need for hashing, no collisions, etc.
>
> How about the first entry in the upstream list:
> $ host -- -helpdesk@live.com
> Host -helpdesk@live.com not found: 3(NXDOMAIN)
> $
>
> I guess you have to munge it.

Umm, I guess you didn't understand what the ".phish.icaen.uiowa.edu" part
of "address.phish.icaen.uiowa.edu" meant.

Try:
  host -- -helpdesk@live.com.phish.icaen.uiowa.edu.

Unless you've got an obsolete version of software this does work.
In bind if you use the "check-names ignore" option for that zone it
does -NOT- require munging. (I'm running mine that way, so I know
that it works.)

-- 
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: my emailBL is live!

Posted by David B Funk <db...@engineering.uiowa.edu>.

On Wed, 29 Apr 2009, Adam Katz wrote:

> But your very next topic is contrary to that philosophy...
>
> > BTW notice that the Google data is multi-valued in the TYPE field.
> > rather than a simple enumeration of that data into an address it
> > is better to turn it into a bit-mask, as then multiple values can
> > be represented (and queried) in a single address/operation.
> >
> > EG: A == 127.0.0.2
> >     B == 127.0.0.4
> >     C == 127.0.0.8
> >     D == 127.0.0.16
> >
> > thus AB == 127.0.0.6
> >     AC == 127.0.0.10
> >
> > etc.
>
> I was just following the model used by all other DNSBL/URIBLs.  Round
> robin A records for each letter.  To quote somebody you hold near and
> dear:  it "makes it easier to deploy into an address filter/blocker in
> your smtp-MTA ..."

Have you followed the development of the SURBL service? They explicitly
switched to the bit-mask format to reduce DNS load.


-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: my emailBL is live!

Posted by Adam Katz <an...@khopis.com>.

David B Funk wrote:
> Repeat after me, ALMOST ALL characters (octets actually) are now
> LEGAL in DNS queries (see RFC-2181 section 11).
> 
> There is NO need for -any- kind of munging.

First, you must start and end a domain label ("octet" refers to IP
addresses) with a letter or number, so munging is still required.
Second, DNS thrives on caching, peering, and slaves; if BIND or other
major name servers can't handle it, it won't fly.  I'm running the
latest version of BIND and it required each of the munging steps I
implemented (except the truncation to 16 chars, which was for
bandwidth) in order to work.
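
For readers who want to reproduce the munging without Perl, here is a Python port of the three substitutions posted at the top of this thread. It is a sketch: the function name is made up, and the initial lower-casing is an added assumption (the posted expressions only recognise a-z and 0-9).

```python
import re

def munge_user(address):
    """Port of the three USER substitutions from the original announcement."""
    user = address.lower()  # assumption: input is lower-cased first
    # Keep at most 16 characters of the local part, dropping any
    # "+extension" and the domain.
    user = re.sub(r"^([^@+]{1,16})[^@]*@.*", r"\1", user)
    # Strip leading and trailing characters that are not a-z or 0-9.
    user = re.sub(r"^[^a-z0-9]*|[^a-z0-9]*$", "", user)
    # Replace anything else outside [-a-z.0-9] with a hyphen.
    return re.sub(r"[^-a-z.0-9]", "-", user)
```

The list mailer address users-return-12345-joe=bob.com@spamassassin.apache.org comes out as users-return-123, and -helpdesk@live.com comes out as helpdesk.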

Also, some of the addresses are forged and should not be listed in the
plain anyway.  More on that in my next email announcing my md5-enabled
list, in which I'll propose a type Z for "do not reveal this address."

>     host abuse-t@live.com.phish.icaen.uiowa.edu.
> NO need for hashing, no collisions, etc.

How about the first entry in the upstream list:
$ host -- -helpdesk@live.com
Host -helpdesk@live.com not found: 3(NXDOMAIN)
$

I guess you have to munge it.

> Also makes it easier to deploy into an address filter/blocker in
> your smtp-MTA (to prevent local llusers from replying to one
> of those addresses).

But your very next topic is contrary to that philosophy...

> BTW notice that the Google data is multi-valued in the TYPE field.
> Rather than a simple enumeration of that data into an address, it
> is better to turn it into a bit-mask, as then multiple values can
> be represented (and queried) in a single address/operation.
> 
> EG: A == 127.0.0.2
>     B == 127.0.0.4
>     C == 127.0.0.8
>     D == 127.0.0.16
> 
> thus AB == 127.0.0.6
>     AC == 127.0.0.10
> 
> etc.

I was just following the model used by all other DNSBL/URIBLs.  Round
robin A records for each letter.  To quote somebody you hold near and
dear:  it "makes it easier to deploy into an address filter/blocker in
your smtp-MTA ..."
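
A minimal sketch of how a client might read the one-A-record-per-type
model, assuming the mapping from the original announcement (A=1, B=2,
C=3, D=4 in the last octet).  The function name and the list-of-strings
input are assumptions for illustration; no DNS lookup is performed.

```python
# Last octet of 127.0.0.N mapped to a TYPE letter (from the announcement).
TYPE_BY_LAST_OCTET = {1: "A", 2: "B", 3: "C", 4: "D"}


def types_from_a_records(records):
    """Collect TYPE letters from a set of round-robin A records."""
    found = set()
    for ip in records:
        octets = ip.split(".")
        # Only 127.0.0.N answers carry a TYPE; ignore anything else.
        if len(octets) == 4 and octets[:3] == ["127", "0", "0"] and octets[3].isdigit():
            letter = TYPE_BY_LAST_OCTET.get(int(octets[3]))
            if letter:
                found.add(letter)
    return "".join(sorted(found))


print(types_from_a_records(["127.0.0.3", "127.0.0.1"]))
```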

Re: my emailBL is live!

Posted by David B Funk <db...@engineering.uiowa.edu>.

> When MD5sums were first proposed (in place of my wild escaping), it
> seemed like a great idea.  However, a voice in the back of my head,
> now spoken (typed?) by Rob, has been growing louder.  My
> implementation now merely truncates email usernames to 16 characters
> (plus the noted defanging, which makes it complicated again ...) and
> replaces the @ with a dot (not an underscore, that's not a legal
> character).

Repeat after me, ALMOST ALL characters (octets actually) are now
LEGAL in DNS queries (see RFC-2181 section 11).

There is NO need for -any- kind of munging.

I've set up an emailBL directly from the Google list, try:

 host abuse-t@live.com.phish.icaen.uiowa.edu.

IE "address.phish.icaen.uiowa.edu"

NO need for hashing, no collisions, etc.
Also makes it easier to deploy into an address filter/blocker in
your smtp-MTA (to prevent local llusers from replying to one
of those addresses).


BTW notice that the Google data is multi-valued in the TYPE field.
Rather than a simple enumeration of that data into an address, it
is better to turn it into a bit-mask, as then multiple values can
be represented (and queried) in a single address/operation.

EG: A == 127.0.0.2
    B == 127.0.0.4
    C == 127.0.0.8
    D == 127.0.0.16

thus AB == 127.0.0.6
    AC == 127.0.0.10

etc.

So the entry for 'abuse-t@live.com' only has an 'A' type.

 host account-teamdept@live.com.phish.icaen.uiowa.edu. => 127.0.0.10

so the entry for 'account-teamdept@live.com' has an 'A' & 'C' type.
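
The bit-mask encoding above can be sketched in Python as below.  The
bit values come from the table in this message; the function names are
hypothetical, and the code only converts between TYPE letters and the
returned address without doing any DNS query.

```python
# Bit value assigned to each TYPE letter, as listed above.
TYPE_BITS = {"A": 2, "B": 4, "C": 8, "D": 16}


def encode_types(types):
    """Combine TYPE letters into a single 127.0.0.N address."""
    mask = 0
    for letter in types:
        mask |= TYPE_BITS[letter]
    return f"127.0.0.{mask}"


def decode_types(ip):
    """Recover the TYPE letters from the last octet of the answer."""
    mask = int(ip.rsplit(".", 1)[1])
    return "".join(letter for letter, bit in TYPE_BITS.items() if mask & bit)


print(encode_types("AC"))
print(decode_types("127.0.0.10"))
```

A filter can then test for one type with a single AND on the last
octet, rather than comparing against several separate answers.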

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{