Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/06/22 13:49:00 UTC
[jira] [Commented] (NUTCH-2547) urlnormalizer-basic fails on special characters in path/query

    [ https://issues.apache.org/jira/browse/NUTCH-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520369#comment-16520369 ] 

ASF GitHub Bot commented on NUTCH-2547:
---------------------------------------

sebastian-nagel opened a new pull request #353: NUTCH-2547 NUTCH-2609 urlnormalizer-basic
URL: https://github.com/apache/nutch/pull/353
 
 
   Fixes for
   - NUTCH-2547 urlnormalizer-basic fails on special characters in path/query
   - NUTCH-2609 urlnormalizer-basic to normalize path of file: URLs
   In detail:
   - escape more special characters
   - escape percent when not followed by a valid escape sequence
     (two-digit hex number)
   - escape special characters before normalizing the path
     so that URI.normalize() can be used on valid URIs
   - also normalize path '/..'
   - normalize path on file: URLs
   - complete unit tests
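
The "escape first, then normalize" idea from the list above could look roughly like the following. This is only a minimal sketch and not the code from the pull request: the class name EscapeThenNormalize, the helper names, and the exact set of escaped characters are assumptions made for illustration. It escapes the whole URL string, so it does not handle unsafe characters in the host part, nor the '/..' and file: cases listed above.

```java
import java.net.URI;

// Hypothetical illustration, not the actual BasicURLNormalizer code.
final class EscapeThenNormalize {

  // Characters named in NUTCH-2547 that java.net.URI rejects (illustrative subset).
  private static final String UNSAFE = "|\"<>^`";

  private static boolean isHex(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
  }

  /** Percent-encodes unsafe characters and any '%' not followed by two hex digits. */
  public static String escape(String s) {
    StringBuilder sb = new StringBuilder(s.length() + 8);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '%') {
        boolean validEscape = i + 2 < s.length()
            && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2));
        // Keep valid escape sequences, encode a lone percent sign as %25.
        sb.append(validEscape ? "%" : "%25");
      } else if (UNSAFE.indexOf(c) >= 0 || c <= 0x20 || c == 0x7f) {
        // Unsafe ASCII, space, or control character: emit %XX.
        sb.append(String.format("%%%02X", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Escapes first, so that URI.normalize() operates on a syntactically valid URI. */
  public static String normalize(String url) {
    return URI.create(escape(url)).normalize().toString();
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/a/c/../b/search?q=foobar";
    for (String suffix : new String[] {"", "|", "%", "\"", "^", "<", ">", "`"}) {
      System.out.println(normalize(base + suffix));
    }
  }
}
```

With this sketch, every test URL from the issue has its '/c/..' segment removed, and the trailing special character appears in percent-encoded form.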

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> urlnormalizer-basic fails on special characters in path/query
> -------------------------------------------------------------
>
>                 Key: NUTCH-2547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2547
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> If a URL contains one of the characters {{|"<>^`}} or a single {{%}} (not followed by a 2-character hex value), BasicURLNormalizer fails to normalize the URL path (here: remove {{/c/..}}):
> {noformat}
> % for c in "" $(echo '|%"^<>`' | grep -o .); do
>     echo "http://www.example.com/a/c/../b/search?q=foobar$c"
>   done \
>   | nutch normalizerchecker -normalizer urlnormalizer-basic -stdin
> Checking combination of these URLNormalizers: BasicURLNormalizer 
> http://www.example.com/a/b/search?q=foobar
> http://www.example.com/a/c/../b/search?q=foobar|
> http://www.example.com/a/c/../b/search?q=foobar%
> http://www.example.com/a/c/../b/search?q=foobar"
> http://www.example.com/a/c/../b/search?q=foobar^
> http://www.example.com/a/c/../b/search?q=foobar<
> http://www.example.com/a/c/../b/search?q=foobar>
> http://www.example.com/a/c/../b/search?q=foobar`
> {noformat}
> The reason is that these characters (and probably more, including control characters, which should also be checked) are not valid as part of a [URI|https://docs.oracle.com/javase/9/docs/api/java/net/URI.html] (cf. [RFC3986|https://tools.ietf.org/html/rfc3986]). BasicURLNormalizer normalizes the path by converting the URL to a URI and calling [normalize()|https://docs.oracle.com/javase/9/docs/api/java/net/URI.html#normalize--].
> There are two possible solutions:
> # do not use java.net.URI
> # ensure that every URL returned (or used internally) by urlnormalizer-basic is a valid URI (or rather, the String representation of one).
> I would opt for #2 because the class URI is used practically everywhere in Nutch and libraries (e.g. HttpClient). Any thoughts?
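
The root cause described in the quoted issue can be reproduced with java.net.URI alone. The following is a small sketch; the class name UriValidityCheck and the helper isValidUri are made up for illustration and are not part of Nutch. It only checks whether the single-argument URI constructor accepts a string.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration, not part of Nutch.
final class UriValidityCheck {

  /** Returns true if java.net.URI can parse the given string. */
  public static boolean isValidUri(String s) {
    try {
      new URI(s);
      return true;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/a/c/../b/search?q=foobar";
    // The characters from the issue, plus the empty suffix as a control case.
    for (String suffix : new String[] {"", "|", "%", "\"", "^", "<", ">", "`"}) {
      System.out.println(isValidUri(base + suffix) + "\t" + base + suffix);
    }
  }
}
```

Only the URL without a trailing special character parses. The others raise URISyntaxException, which matches the unnormalized paths in the normalizerchecker output quoted above.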



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)