Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2013/12/23 11:42:50 UTC

[jira] [Closed] (NUTCH-1685) URLUtil.toUNICODE fails on IDNs

     [ https://issues.apache.org/jira/browse/NUTCH-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel closed NUTCH-1685.
----------------------------------

    Resolution: Duplicate

You are right, [~markus17]. 

> URLUtil.toUNICODE fails on IDNs
> -------------------------------
>
>                 Key: NUTCH-1685
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1685
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.7, 2.2.1
>         Environment: Java 7, OpenJDK 64-Bit, 1.7.0_25
>            Reporter: Sebastian Nagel
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1685-2x-test.patch
>
>
> URLUtil.toUNICODE() fails on IDNs and returns null instead of the Unicode URL. The constructor of URI evidently does not accept IDN host names. For {{http://www.xn--evir-zoa.com/}} the URI constructor throws the exception:
> {code}
> java.net.URISyntaxException: Illegal character in hostname at index 11: http://www.çevir.com/
> {code}
> In principle, IDN.toUnicode() can convert whole URLs (not only domain or host names). However, it does not convert URLs whose host name consists of only two labels, such as {{http://xn--uni-tbingen-xhb.de/}}. Is that the reason why we need URLUtil.toUNICODE()?
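
A likely explanation for the two-label case: java.net.IDN.toUnicode() splits its input at dots and decodes only labels that begin with the "xn--" prefix. Given a full URL, the first label of a two-label host is fused with the scheme ("http://xn--uni-tbingen-xhb"), so it is left untouched. The following is a minimal sketch of one way around this: decode only the host part and reassemble the URL as a plain string. The class IdnUrlSketch and the method hostToUnicode are hypothetical names for illustration; this is not the code in Nutch's URLUtil, nor necessarily the fix that was applied.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

public class IdnUrlSketch {

    /**
     * Hypothetical helper (not Nutch's URLUtil): converts only the host
     * of an ASCII/Punycode URL to Unicode and rebuilds the URL as a
     * String, so that java.net.URI never sees the non-ASCII host.
     */
    static String hostToUnicode(String urlString) {
        URL url;
        try {
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + urlString, e);
        }

        // Pass the bare host name, so every label is seen on its own.
        String unicodeHost = IDN.toUnicode(url.getHost());

        StringBuilder sb = new StringBuilder();
        sb.append(url.getProtocol()).append("://");
        if (url.getUserInfo() != null) {
            sb.append(url.getUserInfo()).append('@');
        }
        sb.append(unicodeHost);
        if (url.getPort() != -1) {
            sb.append(':').append(url.getPort());
        }
        sb.append(url.getFile()); // path plus query string, if any
        if (url.getRef() != null) {
            sb.append('#').append(url.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String twoLabels = "http://xn--uni-tbingen-xhb.de/";

        // Whole URL passed to IDN.toUnicode(): the input comes back unchanged.
        System.out.println(IDN.toUnicode(twoLabels));

        // Host-only conversion: the Punycode label is decoded.
        System.out.println(hostToUnicode(twoLabels));
        System.out.println(hostToUnicode("http://www.xn--evir-zoa.com/"));
    }
}
```

Building the result with a StringBuilder rather than a java.net.URI instance sidesteps the URISyntaxException quoted above, at the cost of returning a String instead of a parsed object.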



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)