Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/10/09 19:14:08 UTC

[Bug 6219] New: URI detection issues since the great URI detection reform of Jan 2008

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6219

           Summary: URI detection issues since the great URI detection
                    reform of Jan 2008
           Product: Spamassassin
           Version: 3.3.0
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: julian@mehnle.net


Keywords: URI URL detection parse parsing

The great URL detection reform by Sidney Markowitz in January 2008 (r616097)
vastly improved things, but it also broke detection of a few URL forms that
were previously detected, and it now detects other forms that aren't legal URLs.

For example, scheme-less URLs enclosed in parentheses are no longer detected:

  (example.com)
  (example.com/foo)

Enclosing brackets, angle brackets, or curly braces don't pose a problem.
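
A regression test for the parenthesized case could look roughly like the
following Python sketch. The SCHEMELESS pattern and the find_schemeless()
helper are hypothetical and far simpler than the actual regexps in
PerMsgStatus.pm; they only illustrate the expected result, namely that all
four kinds of enclosing delimiters are treated alike.

```python
import re

# Hypothetical, simplified pattern for a scheme-less host with optional path.
# This is NOT SpamAssassin's regexp; it only illustrates the expected behavior.
SCHEMELESS = re.compile(
    r"(?<![\w.@-])"                  # not preceded by a word char, '.', '@' or '-'
    r"((?:[a-z0-9-]+\.)+[a-z]{2,}"   # dotted host name ending in a TLD-like label
    r"(?:/[^\s()<>\[\]{}]*)?)",      # optional path, stopping at any delimiter
    re.IGNORECASE,
)

def find_schemeless(text):
    """Return the scheme-less URL candidates found in text."""
    return [m.group(1) for m in SCHEMELESS.finditer(text)]

# Parentheses should behave like the other enclosing delimiters.
for sample in ["(example.com)", "[example.com]", "<example.com>", "{example.com}"]:
    assert find_schemeless(sample) == ["example.com"]
assert find_schemeless("(example.com/foo)") == ["example.com/foo"]
```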

Then, URLs with oddly-cased "http", "https", "ftp" schemes such as the
following are not detected:

  Http://www.example.com
  HTTP://www.example.com
  ftP://ftp.example.com

(FWIW, this had been fixed once before the great reform per bug 4111 but
apparently no test case had been added.)
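
Since URI schemes are case-insensitive (RFC 3986, section 3.1), the scheme
part should presumably be matched without regard to case. A minimal Python
sketch of that idea follows; KNOWN_SCHEME and find_known_scheme() are
hypothetical names, and the pattern is not the one used in PerMsgStatus.pm.

```python
import re

# Hypothetical pattern: the IGNORECASE flag makes "Http", "HTTP" and "ftP"
# match just like their lowercase forms.
KNOWN_SCHEME = re.compile(r"\b(?:https?|ftp)://[^\s<>()]+", re.IGNORECASE)

def find_known_scheme(text):
    """Return URLs with an explicit http, https or ftp scheme, in any case."""
    return KNOWN_SCHEME.findall(text)

# The three forms from this report double as test cases.
for url in ["Http://www.example.com", "HTTP://www.example.com", "ftP://ftp.example.com"]:
    assert find_known_scheme(url) == [url]
```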

Then, URLs in the "known-scheme" category (cf. the regexps used in
PerMsgStatus.pm), i.e., ones starting with, e.g., "http:" or "www.", are
detected even if their domain name(!) (not URL path or query string) contains
extended characters such as "(" or ")":

  www.example(.com) --> www.example(.com
  http://example(.com) --> http://example(.com

Obviously those aren't linkified by most MUAs, and even if they were they
wouldn't lead the user anywhere.
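
One way to reject such candidates would be to validate the host part
separately from the path and query string. The Python sketch below assumes
hypothetical helpers host_of() and has_plausible_host(); it ignores userinfo,
ports and internationalized names, so it is an illustration rather than a
proposed patch.

```python
import re

# A host label: letters, digits and hyphens, not starting or ending with '-'.
HOST_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", re.IGNORECASE)

def host_of(candidate):
    """Strip an optional scheme, then cut at the first '/', '?' or '#'."""
    rest = re.sub(r"^[a-z][a-z0-9+.-]*://", "", candidate, flags=re.IGNORECASE)
    return re.split(r"[/?#]", rest, maxsplit=1)[0]

def has_plausible_host(candidate):
    """True if every dot-separated label of the host is a valid host label."""
    labels = host_of(candidate).split(".")
    return len(labels) >= 2 and all(HOST_LABEL.fullmatch(label) for label in labels)

# "(" in the domain name disqualifies the candidate ...
assert not has_plausible_host("www.example(.com")
assert not has_plausible_host("http://example(.com")
# ... but the same character in the path is fine.
assert has_plausible_host("http://www.example.com/foo(bar)")
```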

Finally, bare e-mail addresses starting with "www."(!) are misdetected as
"http:" URLs and *not* "mailto:" ones:

  www.x@example.com --> http://www.x@example.com --> http://example.com

I guess the $uriknownscheme regexp should be split into
$uri(really)knownscheme and $uriassumedscheme, and the latter should be
deprioritized below/after $urimailscheme in the definition of $tbirdurire.
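
The effect of that reordering can be sketched in Python, whose regex
alternation, like Perl's, tries the alternatives left to right at each
position. REALLY_KNOWN, MAIL and ASSUMED below are hypothetical, simplified
stand-ins for $uri(really)knownscheme, $urimailscheme and $uriassumedscheme,
not the actual patterns.

```python
import re

# Hypothetical stand-ins; the order in the join below decides the priority.
REALLY_KNOWN = r"(?P<known>(?:https?|ftp)://[^\s<>()]+)"
MAIL = r"(?P<mail>[\w.+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,})"
ASSUMED = r"(?P<assumed>www\.[^\s<>()@]+)"

# Mail addresses are tried before the assumed-scheme ("www.") form.
COMBINED = re.compile("|".join([REALLY_KNOWN, MAIL, ASSUMED]), re.IGNORECASE)

def classify(text):
    """Return the detected URIs, each normalized with its scheme."""
    results = []
    for m in COMBINED.finditer(text):
        if m.group("mail"):
            results.append("mailto:" + m.group("mail"))
        elif m.group("assumed"):
            results.append("http://" + m.group("assumed"))
        else:
            results.append(m.group("known"))
    return results

assert classify("www.x@example.com") == ["mailto:www.x@example.com"]
assert classify("www.example.com") == ["http://www.example.com"]
```

With ASSUMED listed before MAIL, the first assertion would fail, because the
"www." alternative would claim the start of the address first.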
