Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 16:27:06 UTC

[jira] [Closed] (NUTCH-18) Windows servers include illegal characters in URLs

     [ https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-18.
------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
>                 Key: NUTCH-18
>                 URL: https://issues.apache.org/jira/browse/NUTCH-18
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Stefan Groschupf
>            Priority: Minor
>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include 
> illegal characters in URLs -- specifically, characters with 
> the high bit set to produce non-English letters. In 
> addition, both Firefox and IE will accept URLs with
> high-bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would 
> help if high-bit characters (and other illegal characters) 
> in URLs could be escaped (using percent-hex notation) 
> as part of the URL fix-up process, probably right after 
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit
> characters), taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
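
The escaping requested in the report could look roughly like the Java sketch below. It is an illustration only, not code from Nutch: the class name HighBitEscaper and the method escapeHighBit are hypothetical, and the ISO-8859-1 charset is an assumption inferred from the single-byte escapes (%e7, %e3) in the reporter's example. UTF-8 would produce two-byte escapes instead. The sketch handles only characters above 0x7F and leaves ASCII, including existing %xx sequences, untouched.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: percent-escapes high-bit characters.
final class HighBitEscaper {

    // Returns url with every non-ASCII code point replaced by %xx escapes
    // of its bytes in the given charset. ASCII characters are copied as-is.
    public static String escapeHighBit(String url, Charset charset) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                // Encode the whole code point (handles surrogate pairs too).
                byte[] bytes = url.substring(i, i + len).getBytes(charset);
                for (byte b : bytes) {
                    out.append('%');
                    out.append(Character.forDigit((b >> 4) & 0xf, 16));
                    out.append(Character.forDigit(b & 0xf, 16));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e7 is c-cedilla, \u00e3 is a-tilde, as in the reported file name.
        String raw = "Nota%20tecnica%20-%20Altera\u00e7\u00e3o%20de%20escopo.doc";
        System.out.println(escapeHighBit(raw, StandardCharsets.ISO_8859_1));
    }
}
```

A character that the chosen charset cannot represent is encoded by Java as a replacement byte (typically '?'), so a real URL normalizer would need to decide how to treat such input.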

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira