Posted to dev@gora.apache.org by "Alfonso Nishikawa (JIRA)" <ji...@apache.org> on 2017/12/21 09:23:00 UTC
[jira] [Comment Edited] (GORA-476) Nutch 2.X GeneratorJob creates
NullPointerException when using DataFileAvroStore
[ https://issues.apache.org/jira/browse/GORA-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299746#comment-16299746 ]
Alfonso Nishikawa edited comment on GORA-476 at 12/21/17 9:22 AM:
------------------------------------------------------------------
Am I wrong, or is DataFileAvroStore not suitable for use with Nutch?
[DataFileAvroStore|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/avro/store/DataFileAvroStore.java#L58] does not write the keys when saving, so they are retrieved as 'null' and {{#unreverseUrl(key)}} fails with a NullPointerException.
[Here|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/avro/query/AvroResult.java#L56] it can be seen that when retrieving the next record during a scan, only the "persistent" information is filled in (not the key).
DataFileAvroStore does not support indexed (by key) access because there are no keys.
I don't know whether Avro supports any kind of indexed writes, nor how to gracefully add the keys to the written data.
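To illustrate the failure mode described above, here is a minimal, self-contained Java sketch. It does not use any Gora or Nutch classes: {{Row}}, {{scanWithoutKeys}}, {{keyLength}} and {{failsOnNullKey}} are hypothetical names made up for this example, and {{keyLength}} is only a simplified stand-in for a helper that dereferences the key right away. The assumption is simply that a scan over a store which persisted only the values hands back rows whose key is null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeylessScanSketch {

    /** Hypothetical scanned row: a key plus the persisted payload. */
    public static final class Row {
        public final String key;
        public final String payload;

        public Row(String key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    /** Simulates a scan over a store that wrote only the values, so every key comes back null. */
    public static List<Row> scanWithoutKeys(List<String> payloads) {
        List<Row> rows = new ArrayList<>();
        for (String payload : payloads) {
            rows.add(new Row(null, payload)); // the key was never written, so nothing can be restored
        }
        return rows;
    }

    /** Simplified stand-in for a key-consuming helper: it dereferences the key immediately. */
    public static int keyLength(String reversedUrl) {
        return reversedUrl.length(); // throws NullPointerException when the key is null
    }

    /** Returns true if using the row's key triggers a NullPointerException. */
    public static boolean failsOnNullKey(Row row) {
        try {
            keyLength(row.key);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Row> rows = scanWithoutKeys(Arrays.asList("page-a", "page-b"));
        for (Row row : rows) {
            System.out.println(row.payload + " -> fails on key: " + failsOnNullKey(row));
        }
    }
}
```

In this sketch a row built with a real key (for example {{new Row("com.example:http/", "page-a")}}) passes through {{keyLength}} without error, which is the behaviour a mapper relying on the key would need from the store.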
> Nutch 2.X GeneratorJob creates NullPointerException when using DataFileAvroStore
> --------------------------------------------------------------------------------
>
> Key: GORA-476
> URL: https://issues.apache.org/jira/browse/GORA-476
> Project: Apache Gora
> Issue Type: Bug
> Components: avro, gora-core
> Affects Versions: 0.6.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 0.9
>
>
> When running the Nutch 2.X GeneratorJob I get the following:
> {code}
> 2016-05-12 17:27:30,191 INFO crawl.GeneratorJob - GeneratorJob: starting
> 2016-05-12 17:27:30,191 INFO crawl.GeneratorJob - GeneratorJob: filtering: false
> 2016-05-12 17:27:30,191 INFO crawl.GeneratorJob - GeneratorJob: normalizing: false
> 2016-05-12 17:27:30,191 INFO crawl.GeneratorJob - GeneratorJob: topN: 50000
> 2016-05-12 17:27:30,319 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-05-12 17:27:30,333 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2016-05-12 17:27:30,334 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2016-05-12 17:27:30,334 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2016-05-12 17:27:31,012 WARN conf.Configuration - file:/tmp/hadoop-lmcgibbn/mapred/staging/lmcgibbn997854508/.staging/job_local997854508_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2016-05-12 17:27:31,014 WARN conf.Configuration - file:/tmp/hadoop-lmcgibbn/mapred/staging/lmcgibbn997854508/.staging/job_local997854508_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2016-05-12 17:27:31,091 WARN conf.Configuration - file:/tmp/hadoop-lmcgibbn/mapred/local/localRunner/lmcgibbn/job_local997854508_0001/job_local997854508_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2016-05-12 17:27:31,094 WARN conf.Configuration - file:/tmp/hadoop-lmcgibbn/mapred/local/localRunner/lmcgibbn/job_local997854508_0001/job_local997854508_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2016-05-12 17:27:31,309 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2016-05-12 17:27:31,309 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2016-05-12 17:27:31,309 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2016-05-12 17:27:31,381 WARN mapred.LocalJobRunner - job_local997854508_0001
> java.lang.Exception: java.lang.NullPointerException
> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
> Caused by: java.lang.NullPointerException
> at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:88)
> at org.apache.nutch.crawl.GeneratorMapper.map(GeneratorMapper.java:51)
> at org.apache.nutch.crawl.GeneratorMapper.map(GeneratorMapper.java:1)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-05-12 17:27:32,107 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=[test]generate: 1463099249-21154, jobid=job_local997854508_0001
> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:232)
> at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:272)
> at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:343)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:351)
> {code}