Posted to dev@nutch.apache.org by "Dominic Xu (JIRA)" <ji...@apache.org> on 2011/03/04 02:20:36 UTC

[jira] Created: (NUTCH-968) Crawling - File Error 404 when fetching file with an chinese word in the file name

Crawling - File Error 404 when fetching file with an chinese word in the file name 
-----------------------------------------------------------------------------------

                 Key: NUTCH-968
                 URL: https://issues.apache.org/jira/browse/NUTCH-968
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.2
         Environment: CentOS 5.4 with zh_CN.UTF8
            Reporter: Dominic Xu


I am crawling a local file system.
The problem: files whose names contain Chinese characters are not crawled.
Example:
fetching  /mnt/中文.txt

This fails with the error: org.apache.nutch.protocol.file.FileError: File Error: 404.

I read NUTCH-824 (https://issues.apache.org/jira/browse/NUTCH-824) and applied its patch from trunk (committed revision 1056394), but the bug is still not fixed.

I fixed the problem by modifying the file src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java:

     for (int i=0; i<list.length; i++) {
       f = list[i];
       String name = f.getName();
+      try {
+        // specify the encoding via the config later?
+        name = java.net.URLEncoder.encode(name, "UTF-8");
+      } catch (java.io.UnsupportedEncodingException ex) {
+        // should not happen: every JVM supports UTF-8; the unencoded name is kept
+      }
+
       String time = HttpDateFormat.toString(f.lastModified());

The file name must be encoded as UTF-8.
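To illustrate what this encoding step produces, here is a minimal standalone sketch. The class FileNameEncoding and its two methods are hypothetical names for this example and are not part of Nutch. It assumes the name is a single path segment, and it rewrites "+" to "%20" because URLEncoder is designed for form data and encodes a space as "+"; the patch above does not do that.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for illustration only; not part of Nutch.
class FileNameEncoding {

    // Percent-encode one file name (a single path segment) as UTF-8.
    static String encodeName(String name) {
        try {
            // URLEncoder emits "+" for a space; a literal "+" becomes "%2B",
            // so replacing "+" afterwards only touches encoded spaces.
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM is required to support.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encodeName: turn the percent-encoded form back into the name.
    static String decodeName(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character name used in the report above.
        String encoded = encodeName("\u4e2d\u6587.txt");
        System.out.println(encoded);              // %E4%B8%AD%E6%96%87.txt
        System.out.println(decodeName(encoded));  // the original name again
    }
}
```

Whatever reads the link back has to decode it with the same charset, otherwise the resulting path will not match the file on disk.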

I also added a charset meta tag to the generated content:
- StringBuffer x = new StringBuffer("<html><head>");
+ StringBuffer x = new StringBuffer("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
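The two changes can be sketched together as a small directory-listing generator. This is not the actual FileResponse code: the class DirectoryListing, its methods, and the HTML layout are assumptions made for this example. It only shows that the charset declared in the meta tag, the encoding of the link targets, and the bytes actually written should all agree on UTF-8. HTML escaping of the visible name is left out for brevity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the real listing is built in FileResponse.java.
class DirectoryListing {

    // Build an HTML index whose links are percent-encoded as UTF-8.
    static String render(String dirPath, String[] names) {
        StringBuilder x = new StringBuilder("<html><head>");
        // Declare the charset so a parser does not have to guess it.
        x.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
        x.append("<title>Index of ").append(dirPath).append("</title></head><body>\n");
        for (String name : names) {
            x.append("<a href='").append(encode(name)).append("'>")
             .append(name).append("</a><br/>\n");
        }
        x.append("</body></html>\n");
        return x.toString();
    }

    // Same call as in the patch, with the checked exception handled here.
    static String encode(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // The bytes handed to the parser must match the declared charset.
    static byte[] toBytes(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(render("/mnt", new String[] {"\u4e2d\u6587.txt"}));
    }
}
```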



 

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (NUTCH-968) Crawling - File Error 404 when fetching file with an chinese word in the file name

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052100#comment-13052100 ] 

Markus Jelsma commented on NUTCH-968:
-------------------------------------

Hi, can you submit the modification as a patch?

> Crawling - File Error 404 when fetching file with an chinese word in the file name 
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-968
>                 URL: https://issues.apache.org/jira/browse/NUTCH-968
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>         Environment: CentOS 5.4 with zh_CN.UTF8
>            Reporter: Dominic Xu
>   Original Estimate: 96h
>  Remaining Estimate: 96h
