Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2013/08/07 22:55:50 UTC

[jira] [Updated] (NUTCH-911) recrawls file protocol causes Errors/Exceptions when actually not modified or gone

     [ https://issues.apache.org/jira/browse/NUTCH-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-911:
----------------------------------

    Fix Version/s:     (was: 1.9)
                   1.8
    
> recrawls file protocol causes Errors/Exceptions when actually not modified or gone
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-911
>                 URL: https://issues.apache.org/jira/browse/NUTCH-911
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, protocol
>    Affects Versions: 1.1
>            Reporter: Peter Lundberg
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-911-trunk.patch
>
>
> When recrawling file systems, files are marked as errors and log entries such as the following appear:
> java.net.MalformedURLException
> 	at java.net.URL.<init>(URL.java:601)
> 	at java.net.URL.<init>(URL.java:464)
> 	at java.net.URL.<init>(URL.java:413)
> 	at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
> 	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
> fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter Lundberg 20090929.pdf failed with: java.net.MalformedURLException
> This is due to FileResponse and File not working well together. The same is true for files that disappear after a while from the file system being crawled (i.e. they are reported as an error instead of GONE). I am too new to Nutch to know the design rationale behind this or any side effects. Below is a patch that I have used; it cleans up the segment data and removes the false errors from the log file.
> --- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java	(revision 997976)
> +++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java	(working copy)
> @@ -79,6 +79,10 @@
>          if (code == 200) {                          // got a good response
>            return new ProtocolOutput(response.toContent());              // return it
>    
> +        } else if (code == 404) {                   // handle no such file
> +          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE );  
> +        } else if (code == 304) {                   // handle not modified
> +          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED );  
>          } else if (code >= 300 && code < 400) {     // handle redirect
>            if (redirects == MAX_REDIRECTS)
>              throw new FileException("Too many redirects: " + url);
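
The branch order in the patch matters: 304 lies inside the 300-399 range, so without a dedicated check it falls through to the redirect branch, which is presumably where the MalformedURLException in the trace originates. The following is a minimal, self-contained sketch of that ordering only. FileStatusSketch, Outcome and classify are hypothetical names used for illustration; they are not part of the Nutch API, and the real File.getProtocolOutput returns ProtocolOutput objects rather than an enum.

```java
public class FileStatusSketch {

    /** Hypothetical stand-in for the outcomes the patch distinguishes. */
    public enum Outcome { SUCCESS, GONE, NOT_MODIFIED, REDIRECT, FAILED }

    /**
     * Maps an HTTP-style response code to an outcome, checking the
     * specific codes (404, 304) before the generic 3xx range, in the
     * same order as the patched if/else chain.
     */
    public static Outcome classify(int code) {
        if (code == 200) {                          // got a good response
            return Outcome.SUCCESS;
        } else if (code == 404) {                   // no such file
            return Outcome.GONE;
        } else if (code == 304) {                   // not modified; must precede the 3xx check
            return Outcome.NOT_MODIFIED;
        } else if (code >= 300 && code < 400) {     // any other 3xx is treated as a redirect
            return Outcome.REDIRECT;
        } else {                                    // everything else counts as a failure
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        // Print the outcome for a few representative codes.
        for (int code : new int[] {200, 304, 301, 404, 500}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

If the 304 check were placed after the range check, classify(304) would return REDIRECT, which mirrors the behaviour described in the report.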

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira