You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2014/06/18 17:28:24 UTC

[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

     [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1084:
---------------------------------

    Attachment: NUTCH-1084.patch

Could do the export only for the actions that need it (i.e the ones that read entries directly from the MapFiles) but it's probably alright to set it for any task in distributed mode 

> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1084.patch
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop version
> 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to write the _SUCCESS file. Until now that's the solution implemented for similar issues. I've not been successful as to make the Hadoop readers simply skip the file.
> The second issue seems a bit strange and did not happen on a local check out. I'm not yet sure whether this is a Hadoop issue or something being corrupt in the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
>         at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>         at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
>         at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
>         at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
>         at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)