Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/06/18 17:20:28 UTC

[jira] [Comment Edited] (NUTCH-1084) ReadDB url throws exception

    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035797#comment-14035797 ] 

Julien Nioche edited comment on NUTCH-1084 at 6/18/14 3:18 PM:
---------------------------------------------------------------

See [Hadoop In Action|http://books.google.co.uk/books?id=Wu_xeGdU4G8C&lpg=PA163&ots=i8tPPFWcWq&dq=hadoop%20definitive%20guide%20packaging%20a%20job&pg=PA162#v=onepage&q&f=false] for an explanation of HADOOP_CLASSPATH

It happens in distributed mode only because the readdb -url command is not a MapReduce job but runs on the master node. For some reason it finds the class for readdb itself, but deserialization fails to load the class for the underlying serialized objects. Note that this issue occurs only for URLs that have been fetched, probably because the ProtocolStatus is not used otherwise.
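The failure mode can be reduced to a plain classloader lookup (a hypothetical illustration, not Nutch code): when MapWritable deserializes, it has to resolve each serialized class name at runtime, so a class that is not on the JVM's classpath surfaces as exactly this kind of "can't find class" error:

```java
// Hypothetical reduction of the failure: deserializing a MapWritable requires
// resolving class names at runtime, so a class absent from the classpath fails
// the same way readdb -url does in distributed mode.
public class ClasspathDemo {
    // Returns true when the named class is visible on the current classpath.
    static boolean resolvable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On a plain JVM without the Nutch job jar, the Nutch class is missing:
        System.out.println(resolvable("org.apache.nutch.protocol.ProtocolStatus"));
        // ...while core JDK classes resolve fine:
        System.out.println(resolvable("java.lang.String"));
    }
}
```

Putting the Nutch classes on HADOOP_CLASSPATH makes the first lookup succeed, which is all the fix below does.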

Setting HADOOP_CLASSPATH does indeed fix it. I will send a patch for its inclusion in the Nutch script.
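A minimal sketch of what such a change to the Nutch script could look like (the NUTCH_HOME variable and the build paths are assumptions about the install layout, not the actual committed patch):

```shell
# Hypothetical sketch for bin/nutch: make Nutch's own classes visible to
# local (non-MapReduce) tools such as "readdb <crawldb> -url <url>".
# NUTCH_HOME and the path below are assumed, not the committed change.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"

# Prepend the Nutch classes, preserving any pre-existing entries, so the
# deserializer can find org.apache.nutch.protocol.ProtocolStatus at runtime.
export HADOOP_CLASSPATH="$NUTCH_HOME/build/classes:$HADOOP_CLASSPATH"

# The tool is then invoked as before, e.g.:
# "$NUTCH_HOME/bin/nutch" readdb crawl/crawldb -url http://example.com/
```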


was (Author: jnioche):
See [Hadoop In Action|http://books.google.co.uk/books?id=Wu_xeGdU4G8C&lpg=PA163&ots=i8tPPFWcWq&dq=hadoop%20definitive%20guide%20packaging%20a%20job&pg=PA162#v=onepage&q&f=false] for an explanation of HADOOP_CLASSPATH


> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to write the _SUCCESS file; until now that has been the solution implemented for similar issues. I have not been successful in making the Hadoop readers simply skip the file.
> The second issue seems a bit strange and did not happen on a local checkout. I'm not yet sure whether this is a Hadoop issue or something being corrupt in the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
>         at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>         at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
>         at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
>         at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
>         at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)