Posted to dev@nutch.apache.org by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org> on 2012/03/30 12:14:32 UTC

[jira] [Commented] (NUTCH-1321) IDNNormalizer

    [ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242219#comment-13242219 ] 

Markus Jelsma commented on NUTCH-1321:
--------------------------------------

...or we could call toUnicode on outlinks, or directly in the fetcher. This also makes sense because in ASCII form these URLs are longer, sometimes much longer, which can cause trouble for filters that rely partly on string length. If both conversions are implemented in the fetcher or the protocol library, we don't have to worry about it, and we get better logging in the fetcher!
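
To illustrate the length difference, here is a minimal sketch using the JDK's java.net.IDN class. The class name IdnHostDemo and its helper methods are hypothetical and not part of Nutch; the sketch converts a bare host name only and assumes the caller has already extracted the host from the URL.

```java
import java.net.IDN;

public class IdnHostDemo {

    /** Decodes a Punycode (ASCII) host name to its Unicode form. */
    static String hostToUnicode(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    /** Encodes a Unicode host name to its ASCII (Punycode) form. */
    static String hostToAscii(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // Example host chosen for illustration: the ASCII form of m\u00fcnchen.de
        String ascii = "xn--mnchen-3ya.de";
        String unicode = hostToUnicode(ascii);

        // The ASCII form is noticeably longer than the Unicode form,
        // which matters for filters that look at string length.
        System.out.println("ASCII length:   " + ascii.length());
        System.out.println("Unicode length: " + unicode.length());

        // Converting back yields the original ASCII host.
        System.out.println("Round trip ok:  " + ascii.equals(hostToAscii(unicode)));
    }
}
```

For this host the ASCII form is 17 characters and the Unicode form is 10. Whether the conversion belongs in the fetcher, the protocol library or a normalizer is still the open question here; the sketch only shows the two JDK calls involved.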


                
> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an indexer so that it converts ASCII URLs to their proper Unicode equivalent.
