Posted to user@nutch.apache.org by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/07 01:09:35 UTC

Exception thrown when loading class org.apache.nutch.protocol.Protocol while reading the "crawldb" SequenceFile

Hi,

I'm reading through the Nutch crawldb using a modified
SequenceFileInputFormat class, whose inner workings are essentially
identical to the original SequenceFileInputFormat.

However, I'm seeing an error when a CrawlDatum object is being
de-serialized. This happens after a few records have been successfully
de-serialized, so I know that the input format is "working". I have also
seen this same exception discussed in NUTCH-1084, but the fix seems to
be slated for v1.6, which I cannot wait for.

Here is the exception:
java.io.IOException: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:341)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:133)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1114)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:232)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
        at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1832)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1805)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.qcri.alt.searchengine.nutch.NutchCrawlDBSequenceFileRecordReader.next(NutchCrawlDBSequenceFileRecordReader.java:58)

The file that triggers this error is attached (it is a MapFile). If you
upload it to HDFS and read it with a regular SequenceFileInputFormat, you
will see the same error after a few records have been read. It appears
that when a record contains metadata (a MapWritable) to be de-serialized,
AbstractMapWritable.readFields() fails to load the ProtocolStatus class,
hits a ClassNotFoundException, and rethrows it as an IOException (perhaps
it is some classloader issue when running map-reduce in non-local mode?).
I don't see why that should happen, since all the Nutch jars (from under
nutch/runtime/local/lib) are on the classpath... unless I'm missing
something else that needs to be included and isn't already under the
nutch/runtime/local/lib folder? I've got all the Hadoop jars on the
classpath too.
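For what it's worth, the classloader behaviour I suspect can be reproduced
in plain Java, without Hadoop or Nutch on the classpath at all. A class
that resolves fine through the application classloader is invisible to a
loader whose search path does not include it; this is a sketch of the
hypothesis only, and the class names below are illustrative, not taken
from the Hadoop code:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderDemo {
    public static void main(String[] args) throws Exception {
        // The application classloader can see this class,
        // because it is on the application classpath.
        Class.forName("ClassLoaderDemo", false, ClassLoader.getSystemClassLoader());
        System.out.println("application loader: found ClassLoaderDemo");

        // An isolated loader (no URLs, bootstrap-only parent) cannot see it.
        // The hypothesis is that something analogous happens when
        // AbstractMapWritable resolves "org.apache.nutch.protocol.ProtocolStatus"
        // through a classloader that does not include the Nutch jars.
        try (URLClassLoader isolated = new URLClassLoader(new URL[0], null)) {
            Class.forName("ClassLoaderDemo", false, isolated);
            System.out.println("isolated loader: found ClassLoaderDemo");
        } catch (ClassNotFoundException e) {
            System.out.println("isolated loader: ClassNotFoundException");
        }
    }
}
```

If that is indeed what is going on, then having the jars on the JVM-wide
classpath would not help the tasks that deserialize with a different
loader.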

Basically, I'm trying to find out why this is happening, and whether there
is any jar I can add to the classpath that will resolve this
ClassNotFoundException. I've read that running the map-reduce job in local
mode avoids the issue, but that doesn't work for my project... I need this
to be running in distributed mode.

This is crucial to my being able to use Nutch in my project, so any help
would be greatly appreciated...

Many thanks in advance!

Regards,
Safdar