Posted to dev@nutch.apache.org by "Michela Becchi (JIRA)" <ji...@apache.org> on 2010/05/27 18:30:44 UTC
[jira] Resolved: (NUTCH-824) Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
[ https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michela Becchi resolved NUTCH-824.
----------------------------------
Fix Version/s: 1.0.0
Resolution: Fixed
Hi,
I fixed (or, at least, circumvented) this by modifying the org.apache.nutch.protocol.file.FileResponse class belonging to the protocol-file plugin.
In particular, at line 120, I added
120  String path = "".equals(url.getPath()) ? "/" : url.getPath();
121 +String decoded_path = path; //@Michela
122
123 +try {
124 +  decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
125 +} catch (Exception ex) {
126 +}
Then, rather than
- java.io.File f = new java.io.File(path);
I have
+ java.io.File f = new java.io.File(decoded_path);
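In case it helps anyone hitting the same problem, here is a minimal standalone sketch of the decoding step, outside Nutch. The path below is just the file name taken from the fetch log, and whether the exists() checks print true of course depends on having the dump on disk:

import java.net.URLDecoder;

public class DecodePathDemo {
    public static void main(String[] args) throws Exception {
        // The percent-encoded path exactly as it appears in the fetch log.
        String rawPath =
            "/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";

        // Same call as in the FileResponse change above: %28 -> "(", %29 -> ")".
        String decodedPath = URLDecoder.decode(rawPath, "UTF-8");

        System.out.println("raw:     " + rawPath);
        System.out.println("decoded: " + decodedPath);

        // java.io.File does no decoding of its own, so the raw path only matches
        // a file literally named "A.M._%28album%29_8a09.html", while the decoded
        // path matches the on-disk name "A.M._(album)_8a09.html".
        System.out.println(new java.io.File(rawPath).exists());
        System.out.println(new java.io.File(decodedPath).exists());
    }
}

One caveat worth noting: URLDecoder.decode() implements application/x-www-form-urlencoded decoding, so it also turns a literal '+' in the path into a space; file names containing '+' would therefore still not resolve with this approach.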
Thanks,
Michela
> Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
> --------------------------------------------------------------------------------------------
>
> Key: NUTCH-824
> URL: https://issues.apache.org/jira/browse/NUTCH-824
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux
> Reporter: Michela Becchi
> Fix For: 1.0.0
>
>
> Hello,
> I am performing a local file system crawl.
> My problem is the following: files whose names contain hexadecimal escape sequences (such as %28) do not get crawled.
> For example, I see the following error:
> fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
> org.apache.nutch.protocol.file.FileError: File Error: 404
> at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
> fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
> I am using nutch-1.0.
> Among other standard settings, I configured nutch-site.xml as follows:
> <property>
> <name>plugin.includes</name>
> <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please enable
> protocol-httpclient, but be aware of possible intermittent problems with the
> underlying commons-httpclient library.
> </description>
> </property>
> <property>
> <name>file.content.limit</name>
> <value>-1</value>
> </property>
> Moreover, crawl-urlfilter.txt looks like:
> # skip http:, ftp:, & mailto: urls
> -^(http|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> # accept everything else
> +.*
> ---
> Thanks,
> Michela
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.