Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/07/20 20:35:02 UTC

[Bug 6159] New: Text parsed URIs duplicated in URI detail list

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6159

           Summary: Text parsed URIs duplicated in URI detail list
           Product: Spamassassin
           Version: 3.2.5
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: guenther@rudersport.de


URIs parsed from text parts without a protocol prepended are duplicated, showing
up as 2 entries in M::SA::PerMsgStatus get_uri_detail_list(). The entries are as
follows.

$uri_raw .......... http://www.example.net
$info->{cleaned} .. http://www.example.net

$uri_raw .......... www.example.net
$info->{cleaned} .. http://www.example.net www.example.net

This is a little redundant, isn't it, given that the cleaned list of the second
entry already contains a version with the protocol? The second entry alone
should be sufficient, as it is with HTML extracted URIs.

Test case, to quickly see this:

  $ echo -e "\n example.net" | spamassassin -D uridnsbl
  [14033] dbg: uridnsbl: domain example.net in skip list
  [14033] dbg: uridnsbl: domain example.net in skip list

I can also come up with a tiny plugin that dumps the relevant info, if anyone
is interested.

Text parsed URIs with protocol are fine, as are HTML extracted URIs from links,
with or without protocol.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

--- Comment #1 from Sidney Markowitz <si...@sidney.com>  2009-07-20 11:46:34 PST ---
Is this the same in Trunk? (Not set up to try it at the moment)

--- Comment #2 from Karsten Bräckelmann <gu...@rudersport.de>  2009-07-22 09:10:59 PST ---
The reason is that _get_parsed_uri_list() and get_uri_detail_list() partially
do the same work.

_get_parsed_uri_list() actually parses the rendered text parts for URIs.
Extracted URIs are stored in parsed_uri_list. The raw URI as extracted, as well
as a version with the protocol prepended (if the raw URI has none), are pushed
onto that list.

get_uri_detail_list() then runs each parsed URI in parsed_uri_list through
Util::uri_list_canonify(), which generates (possibly various) cleaned versions
in addition to the source URIs. One of these cleaned versions has a protocol
prepended, if the source URI has none. It returns the cleaned versions, merged
with the source URI list.


> $uri_raw .......... http://www.example.net
> $info->{cleaned} .. http://www.example.net
> 
> $uri_raw .......... www.example.net
> $info->{cleaned} .. http://www.example.net www.example.net

The existence of both of these uri_detail data structures is due to
_get_parsed_uri_list().

get_uri_detail_list() takes care of expanding the cleaned list for the second
one above (the raw URI parsed out of the text parts) with a protocol prepended
version.
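
To make the interaction easier to follow, here is a rough toy model of the two
stages in Python. It is not SpamAssassin's Perl code. The function names
parse_text_uris, canonify and uri_detail_list are made up for illustration, the
"has a protocol" check is simplified to looking for "://", and the real
canonification does much more than prepend "http://".

```python
# Toy model of the duplication described above. NOT SpamAssassin code:
# all function names here are hypothetical, and the logic is deliberately
# reduced to the "prepend a protocol" step.

def has_protocol(uri):
    # Simplifying assumption: a URI has a protocol if it contains "://".
    return "://" in uri

def parse_text_uris(raw_uris):
    """Stage 1, modelling _get_parsed_uri_list(): keep the raw URI and,
    if it lacks a protocol, also push a protocol-prepended copy."""
    parsed = []
    for raw in raw_uris:
        parsed.append(raw)
        if not has_protocol(raw):
            parsed.append("http://" + raw)
    return parsed

def canonify(uri):
    """Stage 2 helper, modelling Util::uri_list_canonify() for one URI:
    return the cleaned versions merged with the source URI."""
    cleaned = [uri]
    if not has_protocol(uri):
        # The same protocol-prepending work is done a second time here.
        cleaned.insert(0, "http://" + uri)
    return cleaned

def uri_detail_list(raw_uris):
    """Modelling get_uri_detail_list(): one entry per parsed URI."""
    return {uri: canonify(uri) for uri in parse_text_uris(raw_uris)}

details = uri_detail_list(["www.example.net"])
for raw, cleaned in details.items():
    print(raw, "->", " ".join(cleaned))
```

In this model a single protocol-less URI in the text yields two entries, and
the cleaned list of the http:// entry is fully contained in the cleaned list of
the raw entry, which is the redundancy reported above.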
