Posted to dev@spamassassin.apache.org by Benoit Panizzon <be...@imp.ch> on 2005/08/10 10:55:10 UTC

URI extractor sometimes getting wrong URL?

Hi all

We use the URI extractor from Spamassassin to feed our URI blacklists.

Now I seem to have found some really weird behaviour.

If the email contains something like:
<a href="http://some.url.com">please.click.here</a> we get a Blacklist entry 
for some.url.com and one for please.click.here

As please.click.here could be something that is found in multiple emails that 
are not at all spam, they get falsely scored by the blacklists.

Is this some sort of known bug?

Regards
-- 
Benoît Panizzon, <bp...@imp.ch>
------------------------------------------------------------------------
ImproWare AG, UNIXSP & ISP                   Phone:   +41 61 826 93 00
			     Kabelinternet-Hotline:   +41 61 826 93 07
Zurlindenstrasse 29                            Fax:   +41 61 826 93 01
CH-4133 Pratteln                               Net:   http://www.imp.ch/
------------------------------------------------------------------------

Re: URI extractor sometimes getting wrong URL?

Posted by Daniel Quinlan <qu...@pathname.com>.
Theo Van Dinter <fe...@apache.org> writes:

> you'd see those as well.  I'd take a look at 3.1's get_uri_detail_list()
> which lets you see more information about where the URI was found.

I did not notice that you added POD for this (meaning it's a public API).

Is this function really an API we want to lock in going forward?  I'm
still not entirely sure this is the way we want to go.

  raw_uri => {
    types => { a => 1, img => 1, parsed => 1 },
    cleaned => [ canonified_uri ],
    anchor_text => [ "click here", "no click here" ],
    domains => { domain1 => 1, domain2 => 1 },
  }

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: URI extractor sometimes getting wrong URL?

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Aug 10, 2005 at 10:55:10AM +0200, Benoit Panizzon wrote:
> If the email contains something like:
> <a href="http://some.url.com">please.click.here</a> we get a Blacklist entry 
> for some.url.com and one for please.click.here
> 
> As please.click.here could be something that is found in multiple emails that 
> are not at all spam, they get falsely scored by the blacklists.
> 
> Is this some sort of known bug?

Without a real sample, we can't answer any questions.

However, if "please.click.here" is a URL or an FQDN or something similar,
it's fully expected to get tagged as a URI.  get_uri_list() deliberately
casts a wide net.  Even if it didn't, when spammers use a bunch of
"empty text" tags, à la:

<a href="http://www.amazon.com/"></a>

you'd see those as well.  I'd take a look at 3.1's get_uri_detail_list()
which lets you see more information about where the URI was found.
That way you could, for instance, only take URIs found in A (w/ non-empty
text blocks), FORM, and IMG HTML tags, and skip the rest.
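
To make that filtering idea concrete, here is a minimal sketch in Python
rather than Perl. It does not call SpamAssassin at all: it assumes the
per-URI data has already been copied into plain dicts shaped like the
raw_uri hash quoted elsewhere in this thread (types, anchor_text). The
sample entries and the select_uris() name are hypothetical, and whether
"please.click.here" really shows up as a separate "parsed" entry is an
assumption made for illustration.

```python
# Hypothetical sample, shaped like the raw_uri hash quoted in this thread.
# The entries are invented for illustration, not taken from a real message.
uri_details = {
    "http://some.url.com": {
        "types": {"a": 1},
        "anchor_text": ["please.click.here"],
    },
    # Assumption: the anchor text was also picked up as a URI in its own right.
    "http://please.click.here": {
        "types": {"parsed": 1},
    },
    # An "empty text" link of the kind mentioned above.
    "http://www.amazon.com/": {
        "types": {"a": 1},
        "anchor_text": [""],
    },
    "http://img.example.com/x.gif": {
        "types": {"img": 1},
    },
}


def select_uris(details):
    """Keep URIs seen in FORM or IMG tags, or in A tags with non-empty text."""
    kept = []
    for uri, info in details.items():
        types = info.get("types", {})
        # An A tag only counts if at least one anchor text is non-blank.
        has_text = any(text.strip() for text in info.get("anchor_text", []))
        if "form" in types or "img" in types or ("a" in types and has_text):
            kept.append(uri)
    return sorted(kept)


print(select_uris(uri_details))
```

With this sample, the entry that exists only as "parsed" and the link with
empty anchor text are both skipped, so only the IMG URI and the A URI with
visible text would reach the blacklist feed.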

-- 
Randomly Generated Tagline:
"HR people are generally capable of producing swank holiday parties and
 finding a dentist in your HMO group, but don't count on them to help you find
 a job." - David Clark