Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2015/07/21 23:13:04 UTC
[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635824#comment-14635824 ]
Sebastian Nagel commented on NUTCH-2064:
----------------------------------------
Definitely a nice-to-have feature, to get rid of duplicates or to avoid errors if a protocol plugin does not support non-ASCII characters. -1 for the patch so far:
* given the unfortunate discussions in NUTCH-1098 it's probably better to write a patch from scratch (I would volunteer!)
* the patched urlnormalizer-basic unescapes more than it should according to [RFC3986|https://tools.ietf.org/html/rfc3986#section-2.1]. Reserved characters such as ampersand and colon should stay escaped:
{noformat}
% cat test_urls.txt
http://x.com/s?q=a%26b&m=10
http://x.com/show?http%3A%2F%2Fx.com%2Fb
% cat test_urls.txt | nutch plugin urlnormalizer-basic org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
http://x.com/s?q=a&b&m=10
http://x.com/show?http:%2F%2Fx.com%2Fb
{noformat}
* would be good to have unit tests with realistic URLs to test for such cases
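To illustrate the rule from the second point, here is a minimal sketch of unescaping that decodes only the unreserved characters of RFC 3986 (letters, digits, "-", ".", "_", "~") and leaves every other escape in place. The class and method names (PercentEscapeSketch, normalizeEscapes) are hypothetical and not part of Nutch; this is not the urlnormalizer-basic code, and it does not cover the encoding of non-ASCII characters that the issue is about.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Nutch BasicURLNormalizer.
class PercentEscapeSketch {

  // Matches one percent-encoded octet, e.g. %26 or %3a.
  private static final Pattern ESCAPED = Pattern.compile("%([0-9A-Fa-f]{2})");

  // RFC 3986, section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
  static boolean isUnreserved(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
  }

  // Decodes escapes of unreserved characters; all other escapes are kept
  // and only their hex digits are upper-cased.
  static String normalizeEscapes(String url) {
    Matcher m = ESCAPED.matcher(url);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      int c = Integer.parseInt(m.group(1), 16);
      String replacement = isUnreserved(c)
          ? String.valueOf((char) c)
          : "%" + m.group(1).toUpperCase(Locale.ROOT);
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }

  public static void main(String[] args) {
    // The two URLs from the test above: both should come back unchanged.
    System.out.println(normalizeEscapes("http://x.com/s?q=a%26b&m=10"));
    System.out.println(normalizeEscapes("http://x.com/show?http%3A%2F%2Fx.com%2Fb"));
    // An escaped unreserved character (%7E is "~") is decoded.
    System.out.println(normalizeEscapes("http://x.com/%7Euser/"));
  }
}
```

The same three inputs could serve as a starting point for the unit tests mentioned in the last point.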
> URLNormalizer basic to properly encode non-ASCII characters
> -----------------------------------------------------------
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.10
> Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.