Posted to dev@nutch.apache.org by "Nadeem Douba (JIRA)" <ji...@apache.org> on 2015/09/12 08:57:46 UTC

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741933#comment-14741933 ] 

Nadeem Douba commented on NUTCH-1084:
-------------------------------------

I think I found the issue, and I don't think it's related to Nutch. AbstractMapWritable calls Class.forName, which throws the ClassNotFoundException (CNFE). Class.forName(String) resolves the name through the class loader that loaded the calling class, which here appears to be the system class loader. Unlike the current thread's context class loader, that loader does not have the job jar on its class path. To test this, I recompiled hadoop-common with the Class.forName call replaced by Thread.currentThread().getContextClassLoader().loadClass(className). This seems to fix the issue.
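
To illustrate the difference between the two lookups, here is a minimal, self-contained sketch. It is not the hadoop-common code or the actual change: the class ContextLoaderSketch, its helper methods, the RecordingLoader stand-in for the job-jar loader, and the class name example.OnlyInJobJar are all hypothetical and exist only for this demonstration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Hadoop or Nutch.
class ContextLoaderSketch {

    /** Resolve a class name through the current thread's context class loader. */
    static Class<?> loadWithContextLoader(String className) throws ClassNotFoundException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // No context loader set: fall back to the loader of this class.
            loader = ContextLoaderSketch.class.getClassLoader();
        }
        return Class.forName(className, true, loader);
    }

    /** Stand-in for a job-jar loader: records every name its parents could not resolve. */
    static final class RecordingLoader extends ClassLoader {
        final List<String> requested = new ArrayList<>();

        RecordingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            requested.add(name);
            throw new ClassNotFoundException(name);
        }
    }

    /** Plain Class.forName(String): uses the loader that loaded this (the calling) class. */
    static boolean plainForNameFinds(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Returns true if a lookup via the context loader reaches the RecordingLoader. */
    static boolean contextLoaderConsulted(String className) {
        Thread thread = Thread.currentThread();
        ClassLoader original = thread.getContextClassLoader();
        RecordingLoader recorder = new RecordingLoader(original);
        thread.setContextClassLoader(recorder);
        try {
            loadWithContextLoader(className);
        } catch (ClassNotFoundException expected) {
            // The hypothetical class exists nowhere; only the lookup path matters here.
        } finally {
            thread.setContextClassLoader(original);
        }
        return recorder.requested.contains(className);
    }

    public static void main(String[] args) {
        String name = "example.OnlyInJobJar"; // hypothetical class name
        System.out.println("Class.forName(String) finds it: " + plainForNameFinds(name));
        System.out.println("context loader was consulted: " + contextLoaderConsulted(name));
    }
}
```

In the sketch, the plain Class.forName(String) call never reaches the RecordingLoader, while the lookup through the context class loader does. That mirrors the situation described above, where only the context class loader knows about the job jar.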

> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1084.patch
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (???)
> The first problem can be remedied by not allowing the injector or updater to write the _SUCCESS file. So far, that is the solution implemented for similar issues. I have not managed to make the Hadoop readers simply skip the file.
> The second issue seems a bit strange and did not happen on a local checkout. I'm not yet sure whether this is a Hadoop issue or something corrupt in the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
>         at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>         at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
>         at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
>         at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
>         at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}


