Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/08/19 18:56:27 UTC

[jira] [Created] (NUTCH-1084) ReadDB url throws exception

ReadDB url throws exception
---------------------------

                 Key: NUTCH-1084
                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.4
            Reporter: Markus Jelsma
             Fix For: 1.4


Readdb -url suffers from two problems:
1. it trips over the _SUCCESS file generated by newer Hadoop versions
2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (cause unknown)

The first problem can be remedied by not allowing the injector or updater to write the _SUCCESS file; until now that has been the solution implemented for similar issues. I have not yet managed to make the Hadoop readers simply skip the file.

The second issue seems a bit strange and does not happen on a local checkout. I'm not yet sure whether this is a Hadoop issue or something corrupt in the CrawlDB. Here's the stack trace:


{code}
Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
        at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
        at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
        at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{code}
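For the first problem, a reader-side workaround would be to filter out Hadoop's metadata files when listing the CrawlDb directory. The sketch below is a minimal, self-contained illustration (the class and method names are mine, not part of the Nutch API) of Hadoop's convention that files whose names begin with "_" or "." are job metadata rather than MapFile/SequenceFile data; in real Nutch/Hadoop code this predicate would be wrapped in an org.apache.hadoop.fs.PathFilter passed to FileSystem.listStatus.

{code}
// Illustrative sketch, not Nutch API: a predicate a reader such as
// CrawlDbReader could apply when listing part files.
public class PartFileFilter {

    // Hadoop convention: names starting with "_" or "." (e.g. _SUCCESS,
    // _logs, .crc checksums) are job metadata, not data to deserialize.
    public static boolean isDataFile(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(isDataFile("_SUCCESS"));   // metadata: skip
        System.out.println(isDataFile("part-00000")); // data: read
    }
}
{code}

On the writer side, newer Hadoop versions also appear to allow suppressing the marker entirely via the mapreduce.fileoutputcommitter.marksuccessfuljobs property, though availability depends on the Hadoop version in use.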

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1084:
---------------------------------

    Affects Version/s:     (was: 1.4)
                       1.3
        Fix Version/s:     (was: 1.4)
                       1.5
    

        

[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1084:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                

        

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "Andy Xue (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264660#comment-13264660 ] 

Andy Xue commented on NUTCH-1084:
---------------------------------

I have the same issue when running the "nutch readseg -get" command; it is very inefficient. Dumping the entire segment content instead is not practical either, since the segment is too large and the dump fails with a JVM heap space error every time.
                

        

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "Tal Rotbart (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230919#comment-13230919 ] 

Tal Rotbart commented on NUTCH-1084:
------------------------------------

Having the same issue. Are there any workarounds?
                

        

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "gee (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221762#comment-13221762 ] 

gee commented on NUTCH-1084:
----------------------------

I launched Nutch with JDWP enabled and used Eclipse to debug Nutch on a distributed Hadoop configuration. I was able to dump the classpath of a given class instance using the following expression:
java.util.Arrays.toString(((java.net.URLClassLoader)this.getClass().getClassLoader()).getURLs())

MapWritable(AbstractMapWritable).readFields(DataInput) line: 203	
MapWritable.readFields(DataInput) line: 171	
CrawlDatum.readFields(DataInput) line: 278	
NutchWritable(GenericWritableConfigurable).readFields(DataInput) line: 54	
WritableSerialization$WritableDeserializer.deserialize(Writable) line: 73	
WritableSerialization$WritableDeserializer.deserialize(Object) line: 44	
ReduceTask$ReduceValuesIterator<KEY,VALUE>(Task$ValuesIterator<KEY,VALUE>).readNextValue() line: 1392	
ReduceTask$ReduceValuesIterator<KEY,VALUE>(Task$ValuesIterator<KEY,VALUE>).next() line: 1332	
ReduceTask$ReduceValuesIterator<KEY,VALUE>.moveToNext() line: 212	
ReduceTask$ReduceValuesIterator<KEY,VALUE>.next() line: 208	
IdentityReducer<K,V>.reduce(K, Iterator<V>, OutputCollector<K,V>, Reporter) line: 45	
ReduceTask.runOldReducer(JobConf, TaskUmbilicalProtocol, TaskReporter, RawKeyValueIterator, RawComparator<INKEY>, Class<INKEY>, Class<INVALUE>) line: 448	
ReduceTask.run(JobConf, TaskUmbilicalProtocol) line: 399	
LocalJobRunner$Job.run() line: 442	

The classpath of the instance at CrawlDatum.readFields(DataInput) line 278 was:
'java.lang.String [file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/, file:/D:/nutch/nutch-1.5-SNAPSHOT.job, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/classes/, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/activation-1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/aopalliance-1.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/asm-3.2.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/aspectjrt-1.6.5.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/avro-1.5.3.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/clover-3.0.2.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-beanutils-1.7.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-beanutils-core-1.8.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-cli-1.2.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-codec-1.3.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-codec-1.4.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-collections-3.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-configuration-1.6.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-daemon-1.0.3.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-digester-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-el-1.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-httpclient-3.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-io-1.4.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-lang-2.4.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-logging-1.0.4.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-logging-1.1.1.jar, 
file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-logging-api-1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-math-2.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/commons-net-1.4.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/core-3.1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/geronimo-stax-api_1.0_spec-1.0.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/gmbal-api-only-3.0.0-b023.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/grizzly-framework-2.1.1-tests.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/grizzly-framework-2.1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/grizzly-http-2.1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/grizzly-http-server-2.1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/grizzly-http-servlet-2.1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/grizzly-rcm-2.1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/guava-r09.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/guice-3.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/guice-servlet-3.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/hsqldb-1.8.0.7.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/icu4j-4.0.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jackson-core-asl-1.7.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jackson-jaxrs-1.7.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jackson-mapper-asl-1.8.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jackson-xc-1.7.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jasper-compiler-5.5.23.jar, 
file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jasper-runtime-5.5.23.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/javax.inject-1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/javax.servlet-3.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jaxb-api-2.2.2.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jaxb-impl-2.2.3-1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jdiff-1.0.9.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-client-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-core-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-grizzly2-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-guice-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-json-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-server-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-test-framework-core-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jersey-test-framework-grizzly2-1.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jets3t-0.6.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jettison-1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jetty-6.1.22.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jetty-client-6.1.22.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jetty-sslengine-6.1.22.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jetty-util-6.1.22.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jetty-util5-6.1.22.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jline-0.9.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jline-0.9.94.jar, 
file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/json-simple-1.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/jsp-api-2.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/junit-3.8.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/kfs-0.3.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/log4j-1.2.15.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/lucene-core-3.1.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/lucene-core-3.4.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/mail-1.4.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/mail-1.4.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/management-api-3.0.0-b012.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/netty-3.2.3.Final.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/oro-2.0.8.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/paranamer-2.3.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/protobuf-java-2.4.0a.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/servlet-api-2.5-20081211.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/servlet-api-2.5.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/slf4j-api-1.5.5.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/slf4j-api-1.6.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/slf4j-log4j12-1.5.5.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/slf4j-log4j12-1.6.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/snappy-java-1.0.3.2.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/solr-solrj-3.1.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/solr-solrj-3.4.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/stax-api-1.0.1.jar, 
file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/tika-core-0.10.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/tika-core-1.0.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/wstx-asl-3.2.7.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/xercesImpl-2.9.1.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/xml-apis-1.3.04.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/xmlenc-0.52.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/xmlParserAPIs-2.6.2.jar, file:/D:/nutch/build/test/hadoop-unjar8586001369738345875/lib/zookeeper-3.3.1.jar]' referenced from:	

However, the classpath of the instance at MapWritable(AbstractMapWritable).readFields(DataInput) line 203 (which threw the ClassNotFoundException) was:

'java.lang.String [file:/D:/nutch/D, file:/D:/cygwin/nutch/etc/hadoop, file:/D:/nutch/D, file:/D:/cygwin/nutch/etc/hadoop, file:/D:/nutch/D, file:/D:/cygwin/nutch/etc/hadoop, file:/D:/nutch/share/hadoop/common/lib/activation-1.1.jar, file:/D:/nutch/share/hadoop/common/lib/asm-3.2.jar, file:/D:/nutch/share/hadoop/common/lib/aspectjrt-1.6.5.jar, file:/D:/nutch/share/hadoop/common/lib/avro-1.5.3.jar, file:/D:/nutch/share/hadoop/common/lib/commons-beanutils-1.7.0.jar, file:/D:/nutch/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar, file:/D:/nutch/share/hadoop/common/lib/commons-cli-1.2.jar, file:/D:/nutch/share/hadoop/common/lib/commons-codec-1.4.jar, file:/D:/nutch/share/hadoop/common/lib/commons-collections-3.2.1.jar, file:/D:/nutch/share/hadoop/common/lib/commons-configuration-1.6.jar, file:/D:/nutch/share/hadoop/common/lib/commons-digester-1.8.jar, file:/D:/nutch/share/hadoop/common/lib/commons-el-1.0.jar, file:/D:/nutch/share/hadoop/common/lib/commons-httpclient-3.1.jar, file:/D:/nutch/share/hadoop/common/lib/commons-lang-2.5.jar, file:/D:/nutch/share/hadoop/common/lib/commons-logging-1.1.1.jar, file:/D:/nutch/share/hadoop/common/lib/commons-logging-api-1.1.jar, file:/D:/nutch/share/hadoop/common/lib/commons-math-2.1.jar, file:/D:/nutch/share/hadoop/common/lib/commons-net-1.4.1.jar, file:/D:/nutch/share/hadoop/common/lib/core-3.1.1.jar, file:/D:/nutch/share/hadoop/common/lib/guava-r09.jar, file:/D:/nutch/share/hadoop/common/lib/hadoop-auth-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/common/lib/hsqldb-1.8.0.7.jar, file:/D:/nutch/share/hadoop/common/lib/jackson-core-asl-1.8.8.jar, file:/D:/nutch/share/hadoop/common/lib/jackson-jaxrs-1.8.8.jar, file:/D:/nutch/share/hadoop/common/lib/jackson-mapper-asl-1.8.8.jar, file:/D:/nutch/share/hadoop/common/lib/jackson-xc-1.8.8.jar, file:/D:/nutch/share/hadoop/common/lib/jasper-compiler-5.5.23.jar, file:/D:/nutch/share/hadoop/common/lib/jasper-runtime-5.5.23.jar, 
file:/D:/nutch/share/hadoop/common/lib/jaxb-api-2.2.2.jar, file:/D:/nutch/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar, file:/D:/nutch/share/hadoop/common/lib/jersey-core-1.8.jar, file:/D:/nutch/share/hadoop/common/lib/jersey-json-1.8.jar, file:/D:/nutch/share/hadoop/common/lib/jersey-server-1.8.jar, file:/D:/nutch/share/hadoop/common/lib/jets3t-0.6.1.jar, file:/D:/nutch/share/hadoop/common/lib/jettison-1.1.jar, file:/D:/nutch/share/hadoop/common/lib/jetty-6.1.26.jar, file:/D:/nutch/share/hadoop/common/lib/jetty-util-6.1.26.jar, file:/D:/nutch/share/hadoop/common/lib/json-simple-1.1.jar, file:/D:/nutch/share/hadoop/common/lib/jsp-api-2.1.jar, file:/D:/nutch/share/hadoop/common/lib/kfs-0.3.jar, file:/D:/nutch/share/hadoop/common/lib/log4j-1.2.15.jar, file:/D:/nutch/share/hadoop/common/lib/oro-2.0.8.jar, file:/D:/nutch/share/hadoop/common/lib/paranamer-2.3.jar, file:/D:/nutch/share/hadoop/common/lib/protobuf-java-2.4.0a.jar, file:/D:/nutch/share/hadoop/common/lib/servlet-api-2.5.jar, file:/D:/nutch/share/hadoop/common/lib/slf4j-api-1.6.1.jar, file:/D:/nutch/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar, file:/D:/nutch/share/hadoop/common/lib/snappy-java-1.0.3.2.jar, file:/D:/nutch/share/hadoop/common/lib/stax-api-1.0.1.jar, file:/D:/nutch/share/hadoop/common/lib/xmlenc-0.52.jar, file:/D:/nutch/share/hadoop/common/hadoop-common-0.24.0-SNAPSHOT-sources.jar, file:/D:/nutch/share/hadoop/common/hadoop-common-0.24.0-SNAPSHOT-test-sources.jar, file:/D:/nutch/share/hadoop/common/hadoop-common-0.24.0-SNAPSHOT-tests.jar, file:/D:/nutch/share/hadoop/common/hadoop-common-0.24.0-SNAPSHOT.jar, file:/D:/cygwin/contrib/capacity-scheduler/*.jar, file:/D:/cygwin/contrib/capacity-scheduler/*.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT-sources.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT-test-sources.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT-tests.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT.jar, 
file:/D:/nutch/share/hadoop/hdfs/lib/avro-1.5.3.jar, file:/D:/nutch/share/hadoop/hdfs/lib/commons-daemon-1.0.3.jar, file:/D:/nutch/share/hadoop/hdfs/lib/commons-logging-1.1.1.jar, file:/D:/nutch/share/hadoop/hdfs/lib/jackson-core-asl-1.8.8.jar, file:/D:/nutch/share/hadoop/hdfs/lib/jackson-mapper-asl-1.8.8.jar, file:/D:/nutch/share/hadoop/hdfs/lib/log4j-1.2.15.jar, file:/D:/nutch/share/hadoop/hdfs/lib/paranamer-2.3.jar, file:/D:/nutch/share/hadoop/hdfs/lib/protobuf-java-2.4.0a.jar, file:/D:/nutch/share/hadoop/hdfs/lib/slf4j-api-1.6.1.jar, file:/D:/nutch/share/hadoop/hdfs/lib/slf4j-log4j12-1.6.1.jar, file:/D:/nutch/share/hadoop/hdfs/lib/snappy-java-1.0.3.2.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT-sources.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT-test-sources.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT-tests.jar, file:/D:/nutch/share/hadoop/hdfs/hadoop-hdfs-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/aopalliance-1.0.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/asm-3.2.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/avro-1.5.3.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/clover-3.0.2.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/commons-daemon-1.0.3.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/commons-io-2.1.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/commons-logging-1.1.1.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/guice-3.0.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/hadoop-annotations-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/jackson-core-asl-1.8.8.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.8.8.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/javax.inject-1.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/jdiff-1.0.9.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/jersey-core-1.8.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/jersey-guice-1.8.jar, 
file:/D:/nutch/share/hadoop/mapreduce/lib/jersey-server-1.8.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/junit-4.8.2.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/log4j-1.2.15.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/netty-3.2.3.Final.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/paranamer-2.3.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/protobuf-java-2.4.0a.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/slf4j-api-1.6.1.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/slf4j-log4j12-1.6.1.jar, file:/D:/nutch/share/hadoop/mapreduce/lib/snappy-java-1.0.3.2.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-app-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-common-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-core-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.24.0-SNAPSHOT-tests.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-api-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-common-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-server-common-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-server-nodemanager-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-server-resourcemanager-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-server-web-proxy-0.24.0-SNAPSHOT.jar, file:/D:/nutch/share/hadoop/mapreduce/hadoop-yarn-site-0.24.0-SNAPSHOT.jar, file:/D:/nutch/D, 
file:/D:/cygwin/nutch/share/hadoop/mapreduce/*, file:/D:/nutch/D, file:/D:/cygwin/nutch/share/hadoop/mapreduce/lib/*]' 

From this we can conclude that any deserializer needs the accompanying classpath (classloader) to correctly deserialize the class.
So we either need to instantiate an appropriate classloader, or use the classloader supplied by the submitter of the job.
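The classloader point above can be shown with a minimal, self-contained sketch (the class name here is illustrative, not Nutch code): `Class.forName` only resolves classes visible to the loader it is handed, which is exactly where `AbstractMapWritable.readFields` fails when the effective classpath lacks the jar containing `org.apache.nutch.protocol.ProtocolStatus`.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderDemo {
    public static void main(String[] args) {
        // A loader with no URLs and a null (bootstrap) parent can only see
        // core JDK classes, not application classes.
        ClassLoader restricted = new URLClassLoader(new URL[0], null);
        try {
            Class.forName("ClassLoaderDemo", true, restricted);
            System.out.println("found");
        } catch (ClassNotFoundException e) {
            // Same failure mode as the reported "can't find class" IOException.
            System.out.println("not found: ClassLoaderDemo");
        }
        // The application classloader, which has the class on its path, succeeds.
        try {
            Class.forName("ClassLoaderDemo", true,
                    ClassLoaderDemo.class.getClassLoader());
            System.out.println("found on app classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("unexpected");
        }
    }
}
```

The fix, then, is to make sure the reader resolves classes against a loader that actually has the Nutch jars on its path, rather than whatever loader happens to be current.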

                
> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop version
> 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to write the _SUCCESS file. Until now that's the solution implemented for similar issues. I've not been successful as to make the Hadoop readers simply skip the file.
> The second issue seems a bit strange and did not happen on a local check out. I'm not yet sure whether this is a Hadoop issue or something being corrupt in the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
>         at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>         at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
>         at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
>         at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
>         at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}
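For the first problem described above (the `_SUCCESS` marker), one possible remedy besides patching the injector/updater is to disable the marker in the job configuration. This is a sketch assuming the standard Hadoop `FileOutputCommitter` property; verify the name against your Hadoop version:

```xml
<!-- mapred-site.xml (or the per-job Configuration): stop FileOutputCommitter
     from writing the _SUCCESS marker file into the output directory. -->
<property>
  <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
  <value>false</value>
</property>
```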

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103481#comment-13103481 ] 

Markus Jelsma commented on NUTCH-1084:
--------------------------------------

Has anyone seen anything similar before? I can't pinpoint the issue from this rather cryptic exception message.


        

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "Steven Bedrick (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104103#comment-13104103 ] 

Steven Bedrick commented on NUTCH-1084:
---------------------------------------

Oddly enough, I'm seeing exactly the same thing. I'm still getting the hang of Nutch, and so am writing a little tool to dump the contents of a crawldb (similar to the built-in readdb command) as a way to become familiar with some of the Nutch classes. In my case, all I'm doing is iterating over the crawldb MapFile, and just printing keys and values. It works just fine until it gets to the first CrawlDatum that has metadata, at which point it chokes and dies with an identical stack trace. Interestingly, just running "bin/nutch readdb ... -dump ..." works flawlessly.


        

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125715#comment-13125715 ] 

Markus Jelsma commented on NUTCH-1084:
--------------------------------------

I've checked the write and read methods and it all adds up. It's also not happening when running locally, which makes me think it has to do with the MapWritable holding the metadata.
                

        

[jira] [Assigned] (NUTCH-1084) ReadDB url throws exception

Posted by "Markus Jelsma (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1084:
------------------------------------

    Assignee: Markus Jelsma
    

        

Re: [jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by Markus Jelsma <ma...@openindex.io>.
>     [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162170#comment-13162170 ]
> 
> Marek Bachmann commented on NUTCH-1084:
> ---------------------------------------
> 
> The same Exception is thrown when using the -get option in readseg.
yes

> Is there some workaround yet? It is not efficient to copy the whole seg dir
> to a local drive... :-/
No, unfortunately. I tried to crack the thing but ended up nowhere. I still 
think it has something to do with the writable for the metadata.


[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

Posted by "Marek Bachmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162170#comment-13162170 ] 

Marek Bachmann commented on NUTCH-1084:
---------------------------------------

The same Exception is thrown when using the -get option in readseg. 
Is there some workaround yet? It is not efficient to copy the whole seg dir to a local drive... :-/ 
                