Posted to users@spamassassin.apache.org by Kris Deugau <kd...@vianet.ca> on 2023/04/26 15:06:56 UTC

Fine-tuning SA URI extraction

SA has long gone to great lengths to extract URIs from things which are 
not strictly URIs, on the basis that mail clients do the same and SA 
needs to inspect such things for DNSBL lookups.  I'm fine with this.

However, once in a while I come across a case where something is clearly 
being extracted and canonicalized a little too enthusiastically, which 
usually comes to my attention in the context of an FP due in large part 
to a hit on our local DNSBL.  (Which listing is in turn likely due to 
the same extraction and canonicalization on a batch of missed spam, and 
the minimal "is this an abused legit domain or a spammer domain" check I 
do before adding an entry to the DNSBL.)

The latest case is mail from the Cornell Lab of Ornithology, which has 
some message element that SA extracts "none" from, and converts it to 
"none.com" to try to look up "none.com" in DNSBLs.  At a guess, it's an 
image tag with a "background" attribute of "none".

"uridnsbl_skip_domain none" doesn't seem to suppress this lookup, either 
in 3.4.6 or a recent test install from SVN trunk.

I've worked around this specific case, and past ones, in one way or 
another, but I'd like to more precisely target the bad URI extraction. 
In particular, I'd like to suppress this at the "random crap that looks 
like a URI" stage rather than later on.  I specifically do NOT want to 
suppress lookups of the canonicalized URI, since that may be justifiably 
listed on the local DNSBL.

Am I missing some configuration option that can do this, or am I left 
with doing one of:
  - just suppressing lookups of the canonicalized URI
  - removing the canonicalized URI from the DNSBL, even if the listing 
might be justified where the *NON*-canonical version absolutely isn't
  - applying the welcomelist_* sledgehammer

-kgd

Re: Fine-tuning SA URI extraction

Posted by Benny Pedersen <me...@junc.eu>.
Kris Deugau skrev den 2023-04-26 17:06:

...
> Am I missing some configuration option that can do this, or am I left
> with doing one of:
>  - just suppressing lookups of the canonicalized URI
>  - removing the canonicalized URI from the DNSBL, even if the listing
> might be justified where the *NON*-canonical version absolutely isn't
>  - applying the welcomelist_* sledgehammer

Does ensuring the URI is an FQDN first not work?

"none" is not an FQDN, imho.

Imho you have found a bug that should be filed as a ticket.

Re: Fine-tuning SA URI extraction

Posted by Henrik K <he...@hege.li>.
On Thu, Apr 27, 2023 at 01:45:58AM +0200, Matija Nalis wrote:
> 
> - complex but emulating browser behaviour better:
>   Add full handling of relative URIs. i.e. have push_uri() detect all
>   relative URIs and convert them to absolute URIs before adding them
>   to the list of URIs.

If you had looked at sub push_uri(), you would have seen that it already does this:

sub push_uri {
....
  my $target = target_uri($self->{base_href} || "", $uri);

Sure, some things like this could probably be handled more intelligently
when base is missing..


Re: Fine-tuning SA URI extraction

Posted by Matija Nalis <mn...@voyager.hr>.
On Wed, Apr 26, 2023 at 03:21:50PM -0400, Kris Deugau wrote:
> http://deepnet.cx/~kdeugau/spamtools/cornell-birds.eml

Thanks. After adding some dbg() calls to HTML.pm in my SA 3.4.6, it seems
to be triggered by this part of the email:

<td ... background="none">

"background" is a deprecated (but still supported) HTML attribute:
https://www.w3.org/TR/html4/struct/global.html#adef-background


It seems to happen in this part of the SA HTML.pm code (dbg line added by myself):

sub html_uri {
  my ($self, $tag, $attr) = @_;

  use Data::Dumper; dbg ("/mn/ html_uri tag=$tag attr=" . Dumper($attr));
  
  # ordered by frequency of tag groups
  if ($tag =~ /^(?:body|table|tr|td)$/) {
    if (defined $attr->{background}) {
      $self->push_uri($tag, $attr->{background});
    }

My reading of the HTML specs (tested in Firefox and Chromium on Debian
Bullseye) is that "background=none" is not a special value (as the HTML
author may have intended), but is simply taken as a relative URI -
meaning a picture file literally named "none" in the same directory as
the HTML being viewed.

However, the issue is not restricted to that deprecated "background" attribute.
E.g. <img src="none"> or even <a href="none.com"> would likely confuse SA in the same way.


The browser would treat them as relative URLs. 

I.e. if you were viewing "https://example.com/dir/example.html" those
two would resolve to:

<img src="none">    ==> https://example.com/dir/none
<a href="none.com"> ==> https://example.com/dir/none.com

instead of "http://www.none.com" as SA seems to do (and as a browser
might do if you typed "none.com" in the address bar -- but NOT when it is
reached via HTML elements).
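
For illustration, the same resolution can be reproduced with Python's
standard urllib.parse.urljoin, which resolves relative references against
a base URL. This is only a sketch of the browser behaviour described
above, not of anything SA does; the base URL is the example one used in
this message.

```python
from urllib.parse import urljoin

# Page being viewed (the example base URL used above).
base = "https://example.com/dir/example.html"

# Attribute values with no scheme are relative references, so they
# resolve against the directory of the base document.
for value in ("none", "none.com"):
    print(f"{value!r:12} ==> {urljoin(base, value)}")
```

Neither value ever turns into a hostname of its own; both stay on the
host of the page that referenced them.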

One should also read comments about "<base>" handling in that same
file.

Now, I see two ways to change SA behaviour here:

- simple but lacking: do not call push_uri() if the assumed URI does not look
  like an absolute URI (i.e. if it does not contain at least '//')
  
  This would avoid false positives, but will not add relative URIs.
  e.g. it might add:
  http://www.example.com/dir
  but it would NOT also add:
  http://www.example.com/newdir/photo1.jpg 
  if for example "<a href=/newdir/photo1.jpg>" was in there.

- complex but emulating browser behaviour better:
  Add full handling of relative URIs. i.e. have push_uri() detect all
  relative URIs and convert them to absolute URIs before adding them
  to the list of URIs.
  Might not be that hard in the basic case, as $self->{base_href} seems to
  be saved, but what happens if there are, for example, multiple HTML
  attachments in an e-mail? Would/should it propagate? And if no
  "<base>" is specified, are those relative URIs then invalid?
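
The two options can be sketched roughly as follows, in Python rather than
Perl to keep it short. Everything here is hypothetical: looks_absolute(),
collect_uris() and the sample values are made up for illustration and do
not correspond to SA's push_uri() internals. The "absolute" test is the
crude "contains '//'" check from the first option.

```python
from urllib.parse import urljoin


def looks_absolute(value):
    """Crude test from the 'simple' option: absolute URIs contain '//'."""
    return "//" in value


def collect_uris(values, base_href=None):
    """Return the URIs worth looking up, given raw attribute values.

    - Absolute values are kept as-is.
    - Relative values are resolved against base_href when one is known
      (the 'complex' option).
    - Relative values with no base are dropped instead of guessed at
      (the 'simple' option), so "none" never becomes a hostname.
    """
    uris = []
    for value in values:
        if looks_absolute(value):
            uris.append(value)
        elif base_href:
            uris.append(urljoin(base_href, value))
        # else: nothing to resolve against, skip the value entirely
    return uris


values = ["http://www.example.com/dir", "/newdir/photo1.jpg", "none"]

# No <base>: only the absolute URI survives.
print(collect_uris(values))

# With a <base>: relative values resolve onto the base's host.
print(collect_uris(values, base_href="http://www.example.com/dir/"))
```

Either way, "none" never produces a lookup for a domain that is not
already present in the message. The open questions above (multiple HTML
parts, missing "<base>") are exactly the cases where this sketch falls
back to dropping the value.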

-- 
Opinions above are GNU-copylefted.

Re: Fine-tuning SA URI extraction

Posted by Kris Deugau <kd...@vianet.ca>.
Bill Cole wrote:
> On 2023-04-26 at 11:06:56 UTC-0400 (Wed, 26 Apr 2023 11:06:56 -0400)
> Kris Deugau <kd...@vianet.ca>
> is rumored to have said:
> 
>> Am I missing some configuration option that can do this, or am I left 
>> with doing one of:
>>  - just suppressing lookups of the canonicalized URI
>>  - removing the canonicalized URI from the DNSBL, even if the listing 
>> might be justified where the *NON*-canonical version absolutely isn't
>>  - applying the welcomelist_* sledgehammer
> 
> It's extremely hard to say, given that you've not provided an actual 
> example of what you're talking about.

When I come up against these odd issues, I try not to include too much 
case-specific information, because everyone jumps in with highly 
case-specific solutions, most of which don't generalize to solve the 
actual problem.


> Yes,  I do mean an actual message. Evidence that your analysis of what 
> is happening is not entirely wrong.

I took a closer look and it was easier than expected to redact/replace 
customer details with filler or my address.

http://deepnet.cx/~kdeugau/spamtools/cornell-birds.eml


> You may be able to nail down what is actually happening by scanning a 
> problematic message with "-D all" and determining *exactly* what SA is 
> parsing as a URI that it should not.

As far as I've ever seen, the URI extraction debug output from -D doesn't
include any surrounding text to show where it got whatever it got.  So I
have no way to tell which message element SA found the literal text "none"
in, in a place that usually contains a "real" URI.  An extract from a run
on this message, around some key lines for the problem non-URI:

Apr 26 14:57:31.646 [16796] dbg: uri: canonicalizing parsed uri: https://www.macaulaylibrary.org/
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: https://www.macaulaylibrary.org/
Apr 26 14:57:31.646 [16796] dbg: uri: added host: www.macaulaylibrary.org domain: macaulaylibrary.org
Apr 26 14:57:31.646 [16796] dbg: uri: canonicalizing html uri: none
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: http://none
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: none
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: http://www.none.com
Apr 26 14:57:31.646 [16796] dbg: uri: added host: www.none.com domain: none.com
Apr 26 14:57:31.646 [16796] dbg: uri: canonicalizing html uri: https://secure.birds.cornell.edu/sso-static/img/lab-logo-short.png
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: https://secure.birds.cornell.edu/sso-static/img/lab-logo-short.png
Apr 26 14:57:31.646 [16796] dbg: uri: added host: secure.birds.cornell.edu domain: cornell.edu
Apr 26 14:57:32.133 [16796] dbg: uri: canonicalizing domainkeys uri: domainkeys:birds.cornell.edu

which is...  not helpful in locating whatever SA grabbed "none" from.
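
One way to at least narrow this down from the debug output alone is to
pair each "canonicalizing" line with the "added host" lines that follow
it, and flag hosts that never appeared in the original string. A rough
Python sketch, assuming the debug line format shown above with one entry
per unwrapped line; suspicious_hosts() is a made-up helper, not part of
SA, and it only reports the raw value and its source type, not the
message element it came from.

```python
import re

# Sample lines copied from the debug output above, unwrapped.
LOG = """\
Apr 26 14:57:31.646 [16796] dbg: uri: canonicalizing parsed uri: https://www.macaulaylibrary.org/
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: https://www.macaulaylibrary.org/
Apr 26 14:57:31.646 [16796] dbg: uri: added host: www.macaulaylibrary.org domain: macaulaylibrary.org
Apr 26 14:57:31.646 [16796] dbg: uri: canonicalizing html uri: none
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: http://none
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: none
Apr 26 14:57:31.646 [16796] dbg: uri: cleaned uri: http://www.none.com
Apr 26 14:57:31.646 [16796] dbg: uri: added host: www.none.com domain: none.com
"""

CANON = re.compile(r"dbg: uri: canonicalizing (\w+) uri: (\S+)")
ADDED = re.compile(r"dbg: uri: added host: (\S+) domain: (\S+)")


def suspicious_hosts(log_text):
    """Yield (source, raw, host) where host does not occur in the raw string."""
    source = raw = None
    for line in log_text.splitlines():
        m = CANON.search(line)
        if m:
            # Remember the most recent raw value and where it came from.
            source, raw = m.groups()
            continue
        m = ADDED.search(line)
        if m and raw is not None and m.group(1).lower() not in raw.lower():
            yield source, raw, m.group(1)


for source, raw, host in suspicious_hosts(LOG):
    print(f"{source} uri {raw!r} produced guessed host {host}")
```

On the sample above this flags only the "none" entry, and the "html"
source type at least confirms it came from an HTML attribute rather than
from the plain-text parser.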

-kgd

Re: Fine-tuning SA URI extraction

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 2023-04-26 at 11:06:56 UTC-0400 (Wed, 26 Apr 2023 11:06:56 -0400)
Kris Deugau <kd...@vianet.ca>
is rumored to have said:

> Am I missing some configuration option that can do this, or am I left 
> with doing one of:
>  - just suppressing lookups of the canonicalized URI
>  - removing the canonicalized URI from the DNSBL, even if the listing 
> might be justified where the *NON*-canonical version absolutely isn't
>  - applying the welcomelist_* sledgehammer

It's extremely hard to say, given that you've not provided an actual 
example of what you're talking about.

Yes,  I do mean an actual message. Evidence that your analysis of what 
is happening is not entirely wrong.

You may be able to nail down what is actually happening by scanning a 
problematic message with "-D all" and determining *exactly* what SA is 
parsing as a URI that it should not.


-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire