Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2007/12/01 14:02:15 UTC
[Bug 5701] Enhancing SpamAssassin Anti-Phishing Detection Capability
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5701
------- Additional Comments From jm@jmason.org 2007-12-01 05:02 -------
(In reply to comment #4)
> URI detection in plain text is nicely implemented in
> _get_parsed_uri_list and there are a couple of tests where
> the functionality is needed for testing anchor text but
> implemented in a more ad-hoc way.
>
> One of the PILFER tests is whether the plain text URIs in
> the anchor text match the target URI.
>
> It seems like it is possible to implement this functionality
> as a separate function under Utils without losing too much
> from performance. Am I missing something?
Hi Umut --
Take a look at PerMsgStatus::get_uri_detail_list(); that should be very
helpful. Here's the POD doc:
$status->get_uri_detail_list ()

    Returns a hash reference of all unique URIs found in the message and
    various data about where the URIs were found in the message. It takes a
    combination of the URIs found in the rendered (decoded and HTML stripped)
    body and the URIs found when parsing the HTML in the message. Will also
    set $status->{uri_detail_list} (the hash reference as returned by this
    function). This function will also set $status->{uri_domain_count} (count of
    unique domains).

    The hash format looks something like this:

      raw_uri => {
        types => { a => 1, img => 1, parsed => 1 },
        cleaned => [ canonified_uri ],
        anchor_text => [ "click here", "no click here" ],
        domains => { domain1 => 1, domain2 => 1 },
      }

    C<raw_uri> is whatever the URI was in the message itself
    (http://spamassassin.apache%2Eorg/).

    C<types> is a hash of the HTML tags (lowercase) which referenced
    the raw_uri. I<parsed> is a faked type which specifies that the
    raw_uri was seen in the rendered text.

    C<cleaned> is an array of the raw and canonified version of the raw_uri
    (http://spamassassin.apache%2Eorg/, http://spamassassin.apache.org/).

    C<anchor_text> is an array of the anchor text (text between <a> and
    </a>), if any, which linked to the URI.

    C<domains> is a hash of the domains found in the canonified URIs.

...so the anchor text for each link can be easily found that way. Does that help?
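
For what it's worth, here is a rough, untested-against-SpamAssassin sketch of how the "does the URI in the anchor text match the target?" check could sit on top of that hash. The data is a made-up stand-in shaped like the POD example (in a real rule it would come from $status->get_uri_detail_list()), mismatched_anchors() is a hypothetical helper name rather than anything that exists in Utils, and the URI regex is deliberately crude:

```perl
use strict;
use warnings;

# Stand-in data shaped like the POD example. The URIs and anchor texts
# are invented; a real rule would call $status->get_uri_detail_list().
my $uri_detail = {
  'http://phish.example.net/login' => {
    types       => { a => 1 },
    cleaned     => [ 'http://phish.example.net/login' ],
    anchor_text => [ 'http://www.example.com/account' ],
    domains     => { 'example.net' => 1 },
  },
  'http://spamassassin.apache.org/' => {
    types       => { a => 1, parsed => 1 },
    cleaned     => [ 'http://spamassassin.apache.org/' ],
    anchor_text => [ 'click here', 'http://spamassassin.apache.org/' ],
    domains     => { 'apache.org' => 1 },
  },
};

# Hypothetical helper: return the raw URIs of <a> links whose anchor text
# contains a URI-looking string whose host is not under any target domain.
sub mismatched_anchors {
  my ($detail) = @_;
  my (@hits, %seen);
  for my $raw (sort keys %$detail) {
    my $info = $detail->{$raw};
    next unless $info->{types}{a};            # only links from <a> tags
    my @domains = keys %{ $info->{domains} || {} };
    for my $text (@{ $info->{anchor_text} || [] }) {
      # Crude host extraction, for illustration only.
      while ($text =~ m{\bhttps?://([^/\s:]+)}gi) {
        my $host = lc $1;
        my $ok = grep { $host eq $_ || $host =~ /\.\Q$_\E\z/ } @domains;
        push @hits, $raw unless $ok || $seen{$raw}++;
      }
    }
  }
  return @hits;
}

print "$_\n" for mismatched_anchors($uri_detail);
```

Anchor texts like "click here" carry no URI and are skipped, so only links whose visible text names a different host than the real target get reported. Real code would probably want to reuse the existing URI parsing instead of the regex above.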