Posted to dev@spamassassin.apache.org by Theo Van Dinter <fe...@kluge.net> on 2004/09/14 06:25:23 UTC

Being more aggressive about finding URLs in messages

On Mon, Sep 13, 2004 at 11:12:36PM +0200, Raymond Dijkxhoorn wrote:
> >>Why not parse them? If it's something like command.com it's very unlikely
> >>that it gets parsed. And if it's spammerpills.com they talk about, it will
> >>get some score, so it will be biased... Any examples where this might go
> >>wrong?
> 
> >Sure, "command.com". ;)   Do you mind if we took this to
> >dev@spamassassin.apache.org?  I don't really want to chat about development
> >stuff off list since it's not private.

Ok, here we are on the dev list... :)

For those not involved, we were chatting about how to be more aggressive
about finding non-canonical URLs in messages to check against SURBL and such,
a la:

"paste this in your browser

example.com

and get a big surprise!"


Right now we look for fairly definite things: either canonical URIs
(http://www.example.com/), URIs in HTML (<a href="http://www.example.com">),
or things that look likely to be a URI (www.example.com).

We could potentially be more aggressive, but the problem becomes false
positive (FP) rates.  You can look for anything like \w+\.\w+, but then
things like "Run command.com and then ...", "The structure variable is
foo.structvar ...", etc, will be caught.  All of those would be checked
against the URI rules, and some (command.com here) would get sent out as
SURBL queries.
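
To make the FP problem concrete, here is a rough Python sketch (not
SpamAssassin code; the function name guess_hosts and the sample strings are
made up for illustration) that applies the \w+\.\w+ pattern to the cases
above.  It picks up the real target and the two false positives alike:

```python
import re

# The aggressive pattern from the discussion: any "word.word" token.
BARE_HOST = re.compile(r"\w+\.\w+")

def guess_hosts(text):
    """Return every substring matching the aggressive pattern (hypothetical helper)."""
    return BARE_HOST.findall(text)

samples = [
    "paste this in your browser example.com and get a big surprise!",
    "Run command.com and then reboot",
    "The structure variable is foo.structvar here",
]

for line in samples:
    # Each sample yields exactly one guess, wanted or not.
    print(guess_hosts(line))
```

The pattern alone has no way to tell example.com from command.com or
foo.structvar, which is the whole issue.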

An easy heuristic would be to send all of the guesses through uri_to_domain,
and skip ones that don't come back as valid, but then we're still stuck with
the 'command.com' issue.  I'm not sure how often that would come up though.
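
A rough sketch of that heuristic in Python, under loud assumptions:
looks_like_domain is a hypothetical stand-in for the validity check (it is
not uri_to_domain and does not reproduce its behavior), and KNOWN_TLDS is a
deliberately tiny made-up list.  It only illustrates the shape of the filter:

```python
import re

BARE_HOST = re.compile(r"\w+\.\w+")

# Assumption: a tiny illustrative TLD set; a real check would need a full list.
KNOWN_TLDS = {"com", "net", "org", "info", "biz"}

def looks_like_domain(candidate):
    """Hypothetical validity check: accept only guesses ending in a known TLD."""
    tld = candidate.rsplit(".", 1)[-1].lower()
    return tld in KNOWN_TLDS

def filter_guesses(text):
    """Collect aggressive guesses, then drop the ones that fail the check."""
    return [c for c in BARE_HOST.findall(text) if looks_like_domain(c)]

print(filter_guesses("The structure variable is foo.structvar here"))
print(filter_guesses("Run command.com and then reboot"))
```

With this check foo.structvar is dropped, but command.com still gets
through, since it is a perfectly valid-looking domain.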

Thoughts?

-- 
Randomly Generated Tagline:
Lurleen, I can't get your song outta my mind.  I haven't felt this way 
 since `Funky Town.'
 
 		-- Homer Simpson
 		   Colonel Homer