Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2012/11/01 10:55:12 UTC

[jira] [Created] (NUTCH-1485) TableUtil reverseURL to keep userinfo part

Sebastian Nagel created NUTCH-1485:
--------------------------------------

             Summary: TableUtil reverseURL to keep userinfo part
                 Key: NUTCH-1485
                 URL: https://issues.apache.org/jira/browse/NUTCH-1485
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 2.1
            Reporter: Sebastian Nagel
            Priority: Minor


The reversed URL key does not contain the userinfo part of a URL (user name and password, e.g. {{ftp://user:password@ftp.xyz/file.txt}}; cf. [RFC 3986|http://tools.ietf.org/html/rfc3986] and [http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it easy to crawl a fixed list of protected content. However, URLs with userinfo can be tricky, e.g. [http://cnn.com&story=breaking_news@199.239.136.200/mostpopular], where everything before the {{@}} is userinfo and the actual host is the IP address. So it is fine for the default behavior to be removing the userinfo, but that removal should be done by the default URL normalizers rather than in TableUtil.
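
To illustrate the request, here is a minimal sketch, not the actual TableUtil code. The class name {{ReverseUrlSketch}}, the {{keepUserInfo}} flag, and the key layout (reversed host, then scheme, optional port, optional {{@userinfo}}, then path and query) are all assumptions made for the example. It only shows that a key built from the host alone loses the userinfo, and one hypothetical way to keep it.

```java
import java.net.URI;

// Hypothetical sketch; names and key layout are assumptions, not Nutch's API.
public class ReverseUrlSketch {

    /** Reverses a dot-separated host name: "bar.foo.com" -> "com.foo.bar". */
    public static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    /**
     * Builds a reversed-host key. When keepUserInfo is false, the userinfo
     * part is silently dropped, which is the behavior the issue describes.
     * Assumed layout: reversedHost:scheme[:port][@userinfo]path[?query]
     */
    public static String reverseUrl(String urlString, boolean keepUserInfo) {
        URI uri = URI.create(urlString);
        StringBuilder key = new StringBuilder();
        key.append(reverseHost(uri.getHost()));
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        String userInfo = uri.getRawUserInfo();
        if (keepUserInfo && userInfo != null) {
            // Placed after the host so keys still sort by reversed host.
            key.append('@').append(userInfo);
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        String url = "ftp://user:password@ftp.xyz/file.txt";
        System.out.println(reverseUrl(url, false));
        System.out.println(reverseUrl(url, true));
    }
}
```

In this sketch the userinfo is appended after the reversed host so that keys for one host stay adjacent when sorted. Whether the userinfo is kept at all would then be decided earlier, by the URL normalizers, as proposed above.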


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira