Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2015/07/21 23:13:04 UTC

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

    [ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635824#comment-14635824 ] 

Sebastian Nagel commented on NUTCH-2064:
----------------------------------------

Definitely a nice-to-have feature: it would get rid of duplicates and avoid errors if a protocol plugin does not support non-ASCII characters. -1 for the patch so far:
* given the unfortunate discussions in NUTCH-1098, it's probably better to write a patch from scratch (I would volunteer!)
* the patched urlnormalizer-basic unescapes more than it should according to [RFC3986|https://tools.ietf.org/html/rfc3986#section-2.1]. Ampersand and colon (and other reserved characters) should stay escaped, but the patch decodes them:
{noformat}
% cat test_urls.txt 
http://x.com/s?q=a%26b&m=10
http://x.com/show?http%3A%2F%2Fx.com%2Fb
% cat test_urls.txt | nutch plugin urlnormalizer-basic org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
http://x.com/s?q=a&b&m=10
http://x.com/show?http:%2F%2Fx.com%2Fb
{noformat}
* it would be good to have unit tests with realistic URLs that cover such cases
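
To make the expected behavior concrete, here is a minimal sketch (not the Nutch implementation, and not the attached patch) of escape handling that follows RFC 3986: a percent-escape is decoded only if it stands for an unreserved character (letters, digits, "-", ".", "_", "~"); every other escape is kept and only its hex digits are uppercased. The class name EscapeSketch and the method normalizeEscapes are hypothetical names chosen for illustration.

```java
// Hypothetical sketch; class and method names are assumptions, not Nutch API.
class EscapeSketch {

    // RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Value of an ASCII hex digit, or -1 if the character is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decode escapes of unreserved characters only; keep all other escapes
    // (reserved characters such as '&', ':' and '/') with uppercase hex digits.
    static String normalizeEscapes(String url) {
        final String hexDigits = "0123456789ABCDEF";
        StringBuilder sb = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            char ch = url.charAt(i);
            if (ch == '%' && i + 2 < url.length()) {
                int hi = hexValue(url.charAt(i + 1));
                int lo = hexValue(url.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        sb.append((char) value);
                    } else {
                        sb.append('%').append(hexDigits.charAt(hi)).append(hexDigits.charAt(lo));
                    }
                    i += 3;
                    continue;
                }
            }
            // Not a well-formed escape: copy the character unchanged.
            sb.append(ch);
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two URLs from the test above should come out unchanged.
        System.out.println(normalizeEscapes("http://x.com/s?q=a%26b&m=10"));
        System.out.println(normalizeEscapes("http://x.com/show?http%3A%2F%2Fx.com%2Fb"));
        // Escaped unreserved characters are decoded: prints http://x.com/~user/A
        System.out.println(normalizeEscapes("http://x.com/%7Euser/%41"));
    }
}
```

The sketch deliberately leaves out the other half of this issue, percent-encoding non-ASCII characters, and it treats a "%" that is not followed by two hex digits as a literal character.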

> URLNormalizer basic to properly encode non-ASCII characters
> -----------------------------------------------------------
>
>                 Key: NUTCH-2064
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2064
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)