Posted to dev@spamassassin.apache.org by Ryan Thompson <ry...@sasknow.com> on 2004/09/04 18:45:44 UTC

Re: [SURBL-Discuss] checking plain domains in message bodies against SURBLs reportedly effective

Jeff Chan wrote to SpamAssassin Developers:

> Randy Brukardt of rrsoftware.com mentioned that checking
> plain domains occurring in message bodies against SURBLs
> was pretty productive.  (E.g., look for domain.com in
> addition to www.domain.com or http://www.domain.com).
>
> Perhaps this could be something interesting to at least try
> experimentally or to think about.

Yep. Good idea, overall. There are a few gotchas:

TLDs sometimes match file extensions. We might have to whitelist
command.com, and the entire country of Poland. :-)

Looking at the above sentence, leading/trailing punctuation might be a
potential snag.  I.e.: 4 cheap pillz, go to somethingsleazy.com, and
give us your money.
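
A minimal Python sketch of the punctuation point, not taken from any
existing implementation: it trims punctuation from both ends of each
whitespace-delimited token and keeps tokens that still contain a dot.
The function names are hypothetical.

```python
import string

def strip_edge_punctuation(token):
    """Remove leading and trailing ASCII punctuation from one token."""
    return token.strip(string.punctuation)

def dotted_tokens(text):
    """Return whitespace-delimited tokens that still contain a dot after trimming."""
    stripped = (strip_edge_punctuation(word) for word in text.split())
    return [token for token in stripped if "." in token]
```

On the example sentence above this keeps only somethingsleazy.com, but
an abbreviation such as "I.e.:" also survives (as "I.e"), so trimming
alone does not decide what is a domain.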

Since the domain is in plain text and doesn't contain a protocol or
subdomain (i.e., 'www'), I haven't yet seen a mail client that will
display it as a clickable URL. Thus, with this, we're probably mostly
fighting the "type this in" or "cut and paste into your browser" type of
spammer. So, if we do this, implementers could force spammers to
obfuscate the domains beyond recognition. They'll have to do their own
munging, and we might try to catch it, but that's risky. "i looked on
the boss' computer and found porn. info forthcoming...", or even,
"spammer dot com operations are a plague on civilized nations".
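
To make the risk concrete, here is a deliberately naive deobfuscation
sketch in Python. The function and its two rewrite rules are
hypothetical, chosen only to show how both quoted sentences turn into
domain-like strings.

```python
import re

def naive_deobfuscate(text):
    """Hypothetical, deliberately naive reversal of common munging."""
    # Rule 1 (assumption): treat the word "dot" between spaces as a period.
    text = re.sub(r"\s+dot\s+", ".", text, flags=re.IGNORECASE)
    # Rule 2 (assumption): close up whitespace between a period and a lowercase letter.
    text = re.sub(r"\.\s+(?=[a-z])", ".", text)
    return text
```

With these rules, "found porn. info forthcoming..." comes out
containing "porn.info", and "spammer dot com operations" comes out
starting with "spammer.com", which is exactly the false extraction
being warned about.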

Any implementations will probably have to run against large ham corpora
to see if anything like the above becomes falsely *extracted* as a URI,
regardless of whether the current data happens to cause a FP.
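
One way such a corpus run might be sketched in Python: count how many
ham messages each candidate is extracted from, independent of any
blocklist lookup. Both function names are hypothetical, and
toy_extract is only a stand-in for whatever extractor is under test.

```python
from collections import Counter

def extraction_counts(ham_bodies, extract):
    """Count, per candidate, the number of ham messages it was extracted from."""
    counts = Counter()
    for body in ham_bodies:
        # set() so a candidate counts once per message, not once per mention.
        counts.update(set(extract(body)))
    return counts

def toy_extract(body):
    """Hypothetical stand-in extractor: trimmed tokens that contain a dot."""
    tokens = (word.strip(".,;:!?\"'()") for word in body.split())
    return [token.lower() for token in tokens if "." in token]
```

Sorting the result with counts.most_common() would put the most
frequently extracted strings first, which is where filenames and
abbreviations are likely to show up.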

I'd advise keeping implementations simple and strict by default (i.e.,
no deobfuscation; maybe just clickable links only), and allow the user
to control the amount of fuzziness they'd like to match on.

- Ryan

-- 
   Ryan Thompson <ry...@sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

Re: [SURBL-Discuss] checking plain domains in message bodies against SURBLs reportedly effective

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sat, Sep 04, 2004 at 10:45:44AM -0600, Ryan Thompson wrote:
> Yep. Good idea, overall. There are a few gotchas:
> 
> TLDs sometimes match file extensions. We might have to whitelist
> command.com, and the entire country of Poland. :-)
> 
> Since the domain is in plain text and doesn't contain a protocol or
> subdomain (i.e., 'www'), I haven't yet seen a mail client that will
> display it as a clickable URL.

This is generally the tack we're taking in SpamAssassin -- if a general
MUA doesn't display it as a link, then we don't consider it a URL.

Another issue for the generic domains thing is performance -- lots of
messages have lots of things that could potentially look like a domain,
and querying for them all adds a bit of a load on the client and the
server.

For instance:  /\b([a-zA-Z0-9_.-]{1,256}\.[a-zA-Z]{2,6})\b/

in theory (I haven't tested it), will grab anything that looks like a
generic domain name in text.  If you check that list against a list of
valid TLDs, you'd probably end up with a decent list, but you'd still hit
the first issue quoted above: in "Go take a look at command.com", it isn't
clear whether command.com is a URL or a filename.
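
A rough Python sketch of that approach, under stated assumptions: the
pattern is transcribed unchanged from the message, VALID_TLDS is a
tiny illustrative subset rather than a real TLD list, and the function
name is hypothetical. Duplicates are dropped so each candidate would
be queried at most once per message, which speaks to the load concern.

```python
import re

# Pattern from the message, transcribed unchanged (described there as untested).
CANDIDATE_RE = re.compile(r"\b([a-zA-Z0-9_.-]{1,256}\.[a-zA-Z]{2,6})\b")

# Assumption: illustrative subset only; a real check would load a full TLD list.
VALID_TLDS = {"com", "net", "org", "pl"}

def candidate_domains(text):
    """Return unique lowercased matches whose last label is in VALID_TLDS."""
    seen = set()
    ordered = []
    for match in CANDIDATE_RE.finditer(text):
        candidate = match.group(1).lower()
        tld = candidate.rsplit(".", 1)[1]
        if tld in VALID_TLDS and candidate not in seen:
            seen.add(candidate)
            ordered.append(candidate)
    return ordered
```

Calling candidate_domains("Go take a look at command.com") returns
["command.com"], so the TLD filter narrows the list but leaves the
URL-or-filename ambiguity untouched.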

-- 
Randomly Generated Tagline:
"Brevity is the soul of lingerie." - Dorothy Parker