Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/07/20 20:35:02 UTC
[Bug 6159] New: Text parsed URIs duplicated in URI detail list
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6159
Summary: Text parsed URIs duplicated in URI detail list
Product: Spamassassin
Version: 3.2.5
Platform: Other
OS/Version: All
Status: NEW
Severity: minor
Priority: P5
Component: Libraries
AssignedTo: dev@spamassassin.apache.org
ReportedBy: guenther@rudersport.de
Text parsed URIs without a protocol prepended are duplicated, showing 2 entries
in M::SA::PerMsgStatus get_uri_detail_list(). The entries are as follows.
$uri_raw .......... http://www.example.net
$info->{cleaned} .. http://www.example.net
$uri_raw .......... www.example.net
$info->{cleaned} .. http://www.example.net www.example.net
This is a little bit redundant, isn't it, given that the cleaned list of the
second entry already contains a version with the protocol? The second entry
alone should be sufficient, as it is with HTML extracted URIs.
Test case, to quickly see this:
$ echo -e "\n example.net" | spamassassin -D uridnsbl
[14033] dbg: uridnsbl: domain example.net in skip list
[14033] dbg: uridnsbl: domain example.net in skip list
I can also come up with a tiny plugin that dumps the relevant info, if anyone
is interested.
Text parsed URIs with protocol are fine, as are HTML extracted URIs from links,
with or without protocol.
--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
[Bug 6159] Text parsed URIs duplicated in URI detail list
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6159
--- Comment #1 from Sidney Markowitz <si...@sidney.com> 2009-07-20 11:46:34 PST ---
Is this the same in Trunk? (Not set up to try it at the moment)
[Bug 6159] Text parsed URIs duplicated in URI detail list
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6159
--- Comment #2 from Karsten Bräckelmann <gu...@rudersport.de> 2009-07-22 09:10:59 PST ---
The reason is that _get_parsed_uri_list() and get_uri_detail_list() partially
do the same work.
_get_parsed_uri_list() actually parses the rendered text parts for URIs.
Extracted URIs are stored in parsed_uri_list. The raw URI as extracted, as well
as a version with the protocol prepended (if the raw URI has none), are pushed
onto that list.
get_uri_detail_list() then runs each parsed URI in parsed_uri_list through
Util::uri_list_canonify(), which generates (possibly several) cleaned versions
in addition to the source URIs. One of these cleaned versions has a protocol
prepended, if the source URI has none. uri_list_canonify() returns the cleaned
versions, merged with the list of source URIs.
> $uri_raw .......... http://www.example.net
> $info->{cleaned} .. http://www.example.net
>
> $uri_raw .......... www.example.net
> $info->{cleaned} .. http://www.example.net www.example.net
The existence of both of these uri_detail data structures is due to
_get_parsed_uri_list().
get_uri_detail_list() takes care of expanding the cleaned list for the second
one above (the raw URI parsed out of the text parts) with a protocol-prepended
version.
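To make the overlap easier to see, here is a rough model of the two stages
described above. It is a Python sketch, not the actual Perl code. The function
names parse_text_uris(), canonify() and uri_detail_list() are made up for
illustration, and the "no protocol" check is deliberately simplistic. The real
uri_list_canonify() does considerably more cleaning than this.

```python
def parse_text_uris(raw_uris):
    """Stage 1, modelling _get_parsed_uri_list(): keep each raw URI and,
    if it has no protocol, also push a protocol-prepended copy."""
    parsed = []
    for raw in raw_uris:
        parsed.append(raw)
        if "://" not in raw:  # simplistic check, an assumption of this sketch
            parsed.append("http://" + raw)
    return parsed


def canonify(uri):
    """Stage 2 helper, modelling Util::uri_list_canonify() for a single URI:
    return the cleaned versions merged with the source URI."""
    cleaned = []
    if "://" not in uri:
        cleaned.append("http://" + uri)  # same prepending as in stage 1
    cleaned.append(uri)
    return cleaned


def uri_detail_list(raw_uris):
    """Modelling get_uri_detail_list(): one entry per parsed URI."""
    return {uri: {"cleaned": canonify(uri)} for uri in parse_text_uris(raw_uris)}


# A single protocol-less URI found in the text yields two entries.
details = uri_detail_list(["www.example.net"])
for uri_raw, info in details.items():
    print(uri_raw, "->", " ".join(info["cleaned"]))
```

In this model both stages prepend the protocol independently, so the entry for
the raw URI already carries everything the separate protocol-prepended entry
does.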