Posted to dev@nutch.apache.org by "Dominic Xu (JIRA)" <ji...@apache.org> on 2011/03/04 02:20:36 UTC
[jira] Created: (NUTCH-968) Crawling - File Error 404 when fetching
file with an chinese word in the file name
Crawling - File Error 404 when fetching file with an chinese word in the file name
-----------------------------------------------------------------------------------
Key: NUTCH-968
URL: https://issues.apache.org/jira/browse/NUTCH-968
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.2
Environment: CentOS 5.4 with zh_CN.UTF8
Reporter: Dominic Xu
I am crawling the local file system.
My problem is the following: files whose names contain Chinese characters do not get crawled.
For example, when fetching /mnt/中文.txt
I get the error: org.apache.nutch.protocol.file.FileError: File Error: 404.
I read NUTCH-824 (https://issues.apache.org/jira/browse/NUTCH-824)
and applied the corresponding trunk change (committed revision 1056394),
but the bug is still not fixed.
I fixed the problem by modifying the file src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java:
  for (int i=0; i<list.length; i++) {
    f = list[i];
    String name = f.getName();
+   try {
+     // specify the encoding via the config later?
+     name = java.net.URLEncoder.encode(name, "UTF-8");
+   } catch (UnsupportedEncodingException ex) {
+   }
+
    String time = HttpDateFormat.toString(f.lastModified());
The file name must be encoded as UTF-8.
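To illustrate what the encoding step does to a name such as 中文.txt, here is a minimal standalone sketch. The class and method names (FileNameEncodingSketch, encodeName, decodeName) are hypothetical and not part of Nutch; only java.net.URLEncoder and java.net.URLDecoder are real JDK classes. Note that URLEncoder implements form encoding, so a space becomes "+" rather than "%20"; whether that matters for file URLs in the protocol-file plugin is not covered here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of Nutch: shows the effect of
// percent-encoding a non-ASCII file name as UTF-8.
final class FileNameEncodingSketch {

    // Encode a file name the same way the patch above does.
    static String encodeName(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is a charset every JVM must support, so this is not expected.
            throw new IllegalStateException(ex);
        }
    }

    // Reverse step: recover the original name from the encoded form.
    static String decodeName(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587.txt" is 中文.txt written with Unicode escapes so the
        // source compiles regardless of the compiler's file encoding.
        String name = "\u4e2d\u6587.txt";
        String encoded = encodeName(name);
        System.out.println(encoded); // %E4%B8%AD%E6%96%87.txt
        System.out.println(decodeName(encoded).equals(name)); // true
    }
}
```

The encoded form contains only ASCII characters, so it survives being written into a link and read back without depending on the platform's default charset.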
I also added a charset meta tag to the generated content:
- StringBuffer x = new StringBuffer("<html><head>");
+ StringBuffer x = new StringBuffer("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
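As a rough illustration of why declaring the charset can matter, the sketch below renders a small HTML page as UTF-8 bytes and decodes it two ways. The class ListingCharsetSketch and its methods are hypothetical and do not mirror the actual FileResponse code; the sketch also assumes, without confirming it, that the consumer of the page picks its decoder from the meta tag.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Nutch code: the same bytes read back with the
// declared charset versus a mismatched one.
final class ListingCharsetSketch {

    // Page head carrying the charset declaration from the change above.
    static String listingHead() {
        return "<html><head><meta http-equiv=\"Content-Type\" "
             + "content=\"text/html; charset=utf-8\" />";
    }

    // Render a one-entry page and serialize it as UTF-8 bytes.
    static byte[] render(String fileName) {
        String html = listingHead() + "</head><body>" + fileName + "</body></html>";
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u4e2d\u6587.txt"; // 中文.txt
        byte[] bytes = render(name);

        // Decoding with the declared charset recovers the file name.
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        // Decoding with a mismatched charset turns each byte into one
        // Latin-1 character, so the name is garbled.
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.contains(name));   // true
        System.out.println(asLatin1.contains(name)); // false
    }
}
```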
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-968) Crawling - File Error 404 when
fetching file with an chinese word in the file name
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052100#comment-13052100 ]
Markus Jelsma commented on NUTCH-968:
-------------------------------------
Hi, can you submit the modification as a patch?
> Crawling - File Error 404 when fetching file with an chinese word in the file name
> -----------------------------------------------------------------------------------
>
> Key: NUTCH-968
> URL: https://issues.apache.org/jira/browse/NUTCH-968
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.2
> Environment: CentOS 5.4 with zh_CN.UTF8
> Reporter: Dominic Xu
> Original Estimate: 96h
> Remaining Estimate: 96h