Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2013/08/07 22:55:50 UTC
[jira] [Updated] (NUTCH-911) recrawls file protocol causes Errors/Exceptions when actually not modified or gone
[ https://issues.apache.org/jira/browse/NUTCH-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-911:
----------------------------------
Fix Version/s: 1.8 (was: 1.9)
> recrawls file protocol causes Errors/Exceptions when actually not modified or gone
> ----------------------------------------------------------------------------------
>
> Key: NUTCH-911
> URL: https://issues.apache.org/jira/browse/NUTCH-911
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, protocol
> Affects Versions: 1.1
> Reporter: Peter Lundberg
> Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-911-trunk.patch
>
>
> When recrawling file systems, files are marked as errors and log output such as the following appears:
> java.net.MalformedURLException
> at java.net.URL.<init>(URL.java:601)
> at java.net.URL.<init>(URL.java:464)
> at java.net.URL.<init>(URL.java:413)
> at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
> fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter Lundberg 20090929.pdf failed with: java.net.MalformedURLException
> This is due to FileResponse and File not working well together. The same is true for files that disappear, after a while, from the file system being crawled (i.e., they are reported as an error instead of GONE). I am too new to Nutch to know the design rationale behind this or any side effects. Below is a patch that I have used; it cleans up the segment data and removes false errors from the log file.
> --- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (revision 997976)
> +++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (working copy)
> @@ -79,6 +79,10 @@
> if (code == 200) { // got a good response
> return new ProtocolOutput(response.toContent()); // return it
>
> + } else if (code == 404) { // handle no such file
> + return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE );
> + } else if (code == 304) { // handle not modified
> + return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED );
> } else if (code >= 300 && code < 400) { // handle redirect
> if (redirects == MAX_REDIRECTS)
> throw new FileException("Too many redirects: " + url);
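The ordering in the patch matters: 304 lies inside the generic 3xx range, so it has to be tested before the redirect branch. Below is a minimal, standalone sketch of that dispatch. It is not Nutch code: FileStatusSketch, Outcome, classify, classifyUnpatched and redirectTargetIsMalformed are hypothetical names used only for illustration, and the "unpatched" variant (including its fall-through to an error) is an assumption reconstructed from the diff context. The last helper illustrates one plausible source of the message-less MalformedURLException in the trace, assuming the redirect branch builds a URL from a redirect target that a "not modified" response does not carry.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical stand-in for the status dispatch touched by the patch; not Nutch code. */
final class FileStatusSketch {

    /** Illustrative outcomes; the patch itself uses ProtocolStatus.STATUS_GONE and STATUS_NOTMODIFIED. */
    enum Outcome { SUCCESS, GONE, NOT_MODIFIED, REDIRECT, ERROR }

    /** Patched ordering: 404 and 304 are checked before the generic 3xx range. */
    static Outcome classify(int code) {
        if (code == 200) {
            return Outcome.SUCCESS;
        } else if (code == 404) {              // no such file
            return Outcome.GONE;
        } else if (code == 304) {              // not modified since last fetch
            return Outcome.NOT_MODIFIED;
        } else if (code >= 300 && code < 400) {
            return Outcome.REDIRECT;
        }
        return Outcome.ERROR;
    }

    /** Assumed pre-patch ordering: 304 falls into the redirect branch, 404 into the error path. */
    static Outcome classifyUnpatched(int code) {
        if (code == 200) {
            return Outcome.SUCCESS;
        } else if (code >= 300 && code < 400) {
            return Outcome.REDIRECT;
        }
        return Outcome.ERROR;
    }

    /**
     * Assumption: the redirect branch constructs a java.net.URL from a redirect target.
     * A null target makes the URL constructor throw MalformedURLException.
     */
    static boolean redirectTargetIsMalformed(String location) {
        try {
            new URL(location);
            return false;
        } catch (MalformedURLException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("304 patched:   " + classify(304));
        System.out.println("304 unpatched: " + classifyUnpatched(304));
        System.out.println("404 patched:   " + classify(404));
        System.out.println("404 unpatched: " + classifyUnpatched(404));
        System.out.println("null redirect target malformed: " + redirectTargetIsMalformed(null));
    }
}
```

With the two extra branches in place, an unchanged file is classified as NOT_MODIFIED and a vanished one as GONE. Neither reaches the redirect or error path.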