Posted to user@spark.apache.org by aaronjosephs <aa...@placeiq.com> on 2014/07/18 22:55:44 UTC

Re: NullPointerException When Reading Avro Sequence Files

I think you probably want to use `AvroSequenceFileInputFormat` with
`newAPIHadoopFile`. I'm not even sure that in plain Hadoop you would use
`SequenceFileInputFormat` to read an Avro sequence file.
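
For what it's worth, a minimal Scala sketch of that read path (untested
here; the path is a placeholder and the key/value types must match whatever
the writer used):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
import org.apache.hadoop.conf.Configuration

// Read an Avro sequence file as (AvroKey, AvroValue) pairs via the new
// Hadoop API.
val pairs = sc.newAPIHadoopFile(
  "hdfs://namenode/path/to/data.seq",  // placeholder path
  classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], AvroValue[GenericRecord]]],
  classOf[AvroKey[GenericRecord]],
  classOf[AvroValue[GenericRecord]],
  new Configuration())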



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Simone Franzini <ca...@gmail.com>.
To me this looks like an internal error in the REPL; I am not sure what is
causing it.
Personally, I never use the REPL. Can you try typing up your program and
running it from an IDE or with spark-submit, and see if you still get the
same error?
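
That said, the stack trace below suggests a likely cause: the REPL calls
toString on every value it prints, and Hadoop's Job.toString (via
ensureState) throws IllegalStateException while the job is still in the
DEFINE state. A minimal sketch of a workaround, assuming the new MapReduce
API, is to keep the Job itself out of the REPL's printed results:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.avro.Schema
import org.apache.avro.mapreduce.AvroJob

// Build the Job inside a helper and return only its Configuration, so the
// REPL never tries to print (and hence call toString on) the Job itself.
def avroReadConf(schema: Schema): Configuration = {
  val job = Job.getInstance()             // non-deprecated replacement for `new Job()`
  AvroJob.setInputKeySchema(job, schema)  // attach the reader schema
  job.getConfiguration
}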

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, Dec 15, 2014 at 4:54 PM, Cristovao Jose Domingues Cordeiro <
cristovao.cordeiro@cern.ch> wrote:
>
>  Sure, thanks:
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>         at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>         at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
>         at
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>         at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>         at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>         at .<init>(<console>:10)
>         at .<clinit>(<console>)
>         at $print(<console>)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
>         at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
>         at
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
>         at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
>         at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
>         at
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
>         at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
>         at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
>         at
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
>         at
> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
>         at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
>         at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
>         at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>         at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>         at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>         at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
>         at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
>         at org.apache.spark.repl.Main$.main(Main.scala:31)
>         at org.apache.spark.repl.Main.main(Main.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
>
> Could something you omitted in your snippet be causing this exception?
>
>  Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
> IT Department - 28/R-018
> CERN
>    ------------------------------
> *From:* Simone Franzini [captainfranz@gmail.com]
> *Sent:* 15 December 2014 16:52
>
> *To:* Cristovao Jose Domingues Cordeiro
> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>
>   Ok, I have no idea what that is. That appears to be an internal Spark
> exception. Maybe if you post the entire stack trace, it will give some
> more detail about what is going on.
>
>  Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
> On Mon, Dec 15, 2014 at 4:50 PM, Cristovao Jose Domingues Cordeiro <
> cristovao.cordeiro@cern.ch> wrote:
>>
>>  Hi,
>>
>> thanks for that.
>> But yeah, the 2nd line throws an exception; jobread is not created.
>>
>>  Cumprimentos / Best regards,
>> Cristóvão José Domingues Cordeiro
>> IT Department - 28/R-018
>> CERN
>>    ------------------------------
>> *From:* Simone Franzini [captainfranz@gmail.com]
>> *Sent:* 15 December 2014 16:39
>>
>> *To:* Cristovao Jose Domingues Cordeiro
>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>
>>    I did not mention the imports needed in my code. I think these are
>> all of them:
>>
>>  import org.apache.hadoop.mapreduce.Job
>> import org.apache.hadoop.io.NullWritable
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>> import org.apache.hadoop.fs.{ FileSystem, Path }
>> import org.apache.avro.{ Schema, SchemaBuilder }
>>  import org.apache.avro.SchemaBuilder._
>> import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat,
>> AvroKeyOutputFormat }
>> import org.apache.avro.mapred.AvroKey
>>
>>  However, what you mentioned is a warning that I think can be ignored. I
>> don't see any exception.
>>
>>  Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>> On Mon, Dec 15, 2014 at 3:10 PM, Cristovao Jose Domingues Cordeiro <
>> cristovao.cordeiro@cern.ch> wrote:
>>>
>>>  Hi Simone,
>>>
>>> I was finally able to get the chill package. However, there is still
>>> something unrelated in your snippet which I cannot run:
>>> val jobread = new Job()
>>>
>>> I get:
>>> warning: there were 1 deprecation warning(s); re-run with -deprecation
>>> for details
>>> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>>>
>>>
>>>  Cumprimentos / Best regards,
>>> Cristóvão José Domingues Cordeiro
>>> IT Department - 28/R-018
>>> CERN
>>>    ------------------------------
>>> *From:* Simone Franzini [captainfranz@gmail.com]
>>> *Sent:* 09 December 2014 17:06
>>>
>>> *To:* Cristovao Jose Domingues Cordeiro; user
>>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>>
>>>    You can use this Maven dependency:
>>>
>>>  <dependency>
>>>     <groupId>com.twitter</groupId>
>>>     <artifactId>chill-avro</artifactId>
>>>     <version>0.4.0</version>
>>> </dependency>
>>>
>>>  Simone Franzini, PhD
>>>
>>> http://www.linkedin.com/in/simonefranzini
>>>
>>> On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro <
>>> cristovao.cordeiro@cern.ch> wrote:
>>>
>>>>  Thanks for the reply!
>>>>
>>>> I have in fact tried your code, but I lack the Twitter chill package and
>>>> cannot find it online. So I am now trying this:
>>>> http://spark.apache.org/docs/latest/tuning.html#data-serialization .
>>>> But in case I can't get that to work, could you tell me where to get the
>>>> Twitter package you used?
>>>>
>>>> Thanks
>>>>
>>>>  Cumprimentos / Best regards,
>>>> Cristóvão José Domingues Cordeiro
>>>> IT Department - 28/R-018
>>>> CERN
>>>>    ------------------------------
>>>> *From:* Simone Franzini [captainfranz@gmail.com]
>>>> *Sent:* 09 December 2014 16:42
>>>> *To:* Cristovao Jose Domingues Cordeiro; user
>>>>
>>>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>>>
>>>>    Hi Cristovao,
>>>>
>>>> I have seen a very similar issue that I have posted about in this
>>>> thread:
>>>>
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
>>>>  I think your main issue here is somewhat similar, in that the
>>>> MapWrapper Scala class is not registered. This gets registered by the
>>>> Twitter chill-scala AllScalaRegistrar class that you are currently not
>>>> using.
>>>>
>>>>  As far as I understand, in order to use Avro with Spark, you also
>>>> have to use Kryo. This means you have to use the Spark KryoSerializer. This
>>>> in turn uses Twitter chill. I posted the basic code that I am using here:
>>>>
>>>>
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491
>>>>
>>>>  Maybe there is a simpler solution to your problem but I am not that
>>>> much of an expert yet. I hope this helps.
>>>>
>>>>  Simone Franzini, PhD
>>>>
>>>> http://www.linkedin.com/in/simonefranzini
>>>>
>>>> On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro <
>>>> cristovao.cordeiro@cern.ch> wrote:
>>>>
>>>>>  Hi Simone,
>>>>>
>>>>> thanks but I don't think that's it.
>>>>> I've tried several libraries with the --jars argument. Some do give what
>>>>> you said, but other times (when I put in the right version, I guess) I get
>>>>> the following:
>>>>> 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0
>>>>> (TID 0)
>>>>> java.io.NotSerializableException:
>>>>> scala.collection.convert.Wrappers$MapWrapper
>>>>>         at
>>>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>>>>>         at
>>>>> java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>>>>>
>>>>>
>>>>> Which is odd, since I am reading an Avro file I wrote... with the same
>>>>> piece of code:
>>>>> https://gist.github.com/MLnick/5864741781b9340cb211
>>>>>
>>>>>  Cumprimentos / Best regards,
>>>>> Cristóvão José Domingues Cordeiro
>>>>> IT Department - 28/R-018
>>>>> CERN
>>>>>    ------------------------------
>>>>> *From:* Simone Franzini [captainfranz@gmail.com]
>>>>> *Sent:* 06 December 2014 15:48
>>>>> *To:* Cristovao Jose Domingues Cordeiro
>>>>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>>>>
>>>>>    java.lang.IncompatibleClassChangeError: Found interface
>>>>> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>>>>
>>>>>  That is a sign that you are mixing up versions of Hadoop. This is
>>>>> particularly an issue when dealing with Avro. If you are using Hadoop 2,
>>>>> you will need to get the Hadoop 2 version of avro-mapred. In Maven you
>>>>> would do this with the <classifier>hadoop2</classifier> tag.
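>>>>>
>>>>>  As a sketch, that dependency would look like the following (the version
>>>>> here is a guess; use whichever Avro release matches your cluster):
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.apache.avro</groupId>
>>>>>     <artifactId>avro-mapred</artifactId>
>>>>>     <version>1.7.7</version>
>>>>>     <classifier>hadoop2</classifier>
>>>>> </dependency>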
>>>>>
>>>>>  Simone Franzini, PhD
>>>>>
>>>>> http://www.linkedin.com/in/simonefranzini
>>>>>
>>>>> On Fri, Dec 5, 2014 at 3:52 AM, cjdc <cr...@cern.ch>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've tried the above example on Gist, but it doesn't work (at least
>>>>>> for me).
>>>>>> Did anyone get this:
>>>>>> 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0
>>>>>> (TID 0)
>>>>>> java.lang.IncompatibleClassChangeError: Found interface
>>>>>> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>>>>>         at
>>>>>>
>>>>>> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>>>>>>         at
>>>>>>
>>>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at
>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>>>>         at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> 14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught
>>>>>> exception
>>>>>> in thread Thread[Executor task launch worker-0,5,main]
>>>>>> java.lang.IncompatibleClassChangeError: Found interface
>>>>>> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>>>>>         at
>>>>>>
>>>>>> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>>>>>>         at
>>>>>>
>>>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at
>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>>>>         at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> 14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1
>>>>>> times;
>>>>>> aborting job
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: NullPointerException When Reading Avro Sequence Files

Posted by Simone Franzini <ca...@gmail.com>.
You can use this Maven dependency:

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>chill-avro</artifactId>
    <version>0.4.0</version>
</dependency>
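
If you are on sbt instead, the equivalent (assuming the artifact is
cross-published for your Scala version) would be:

libraryDependencies += "com.twitter" %% "chill-avro" % "0.4.0"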

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: NullPointerException When Reading Avro Sequence Files

Posted by Simone Franzini <ca...@gmail.com>.
Hi Cristovao,

I have seen a very similar issue that I have posted about in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
I think your main issue here is somewhat similar, in that the MapWrapper
Scala class is not registered. This gets registered by the Twitter
chill-scala AllScalaRegistrar class that you are currently not using.

As far as I understand, in order to use Avro with Spark, you also have to
use Kryo. This means you have to use the Spark KryoSerializer. This in turn
uses Twitter chill. I posted the basic code that I am using here:

http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

Maybe there is a simpler solution to your problem but I am not that much of
an expert yet. I hope this helps.
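
As a hedged illustration of that wiring (assuming chill-avro 0.4.0;
MyAvroRecord is a placeholder for an Avro-generated SpecificRecord class):

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

// Register a chill-avro serializer for the generated record class. Spark's
// KryoSerializer already pulls in chill-scala's AllScalaRegistrar, which
// covers Scala wrappers such as MapWrapper.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyAvroRecord],
      AvroSerializer.SpecificRecordBinarySerializer[MyAvroRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)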

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: NullPointerException When Reading Avro Sequence Files

Posted by cjdc <cr...@cern.ch>.
Hi all,

I've tried the above example on Gist, but it doesn't work (at least for me).
Did anyone get this:
14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job


Thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
For those curious, I used the JavaSparkContext and got access to an
AvroSequenceFile (a wrapper around SequenceFile) using the following:

// Read the Avro sequence file as (AvroKey, AvroValue) pairs.
JavaPairRDD<AvroKey, AvroValue> file = sc.newAPIHadoopFile("<hdfs path to my file>",
        AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
        new Configuration());



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10305.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
Thanks for the gist. I'm just now learning about Avro. I think when you use
a DataFileWriter you are writing an Avro container file (which is different
from an Avro sequence file). I have a system where data was written to an
HDFS sequence file using AvroSequenceFile.Writer (which is a wrapper around
SequenceFile.Writer).

I'll put together an example of the problem so others can better understand
what I'm talking about.
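
In the meantime, a hedged sketch of the writer side (paths and schemas are
placeholders), to show how such a file is produced:

import org.apache.avro.Schema
import org.apache.avro.hadoop.io.AvroSequenceFile
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// An Avro sequence file is a Hadoop SequenceFile whose key/value
// serialization is Avro, as opposed to an Avro container file.
val writer = AvroSequenceFile.createWriter(
  new AvroSequenceFile.Writer.Options()
    .withFileSystem(fs)
    .withConfiguration(conf)
    .withOutputPath(new Path("/tmp/example.seq"))  // placeholder path
    .withKeySchema(Schema.create(Schema.Type.STRING))
    .withValueSchema(Schema.create(Schema.Type.INT)))

writer.append(new AvroKey[CharSequence]("key"), new AvroValue[Integer](42))
writer.close()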



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Nick Pentreath <ni...@gmail.com>.
I got this working locally a little while ago when playing around with
AvroKeyInputFormat: https://gist.github.com/MLnick/5864741781b9340cb211

But not sure about AvroSequenceFile. Any chance you have an example
datafile or records?
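
For reference, a minimal sketch along the lines of that gist (placeholder
path; this reads an Avro container file of GenericRecords, not a sequence
file):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Keys are AvroKey[GenericRecord]; values are NullWritable.
val records = sc.newAPIHadoopFile(
  "hdfs://namenode/path/to/data.avro",  // placeholder path
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// Map to a serializable type before collecting to the driver.
records.map(_._1.datum.toString).take(5).foreach(println)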



On Sat, Jul 19, 2014 at 11:00 AM, Sparky <Gu...@bah.com> wrote:

> To be more specific, I'm working with a system that stores data in
> org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is
> "A wrapper around a Hadoop SequenceFile that also supports reading and
> writing Avro data."
>
> It seems that Spark does not support this out of the box.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
To be more specific, I'm working with a system that stores data in
org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is 
"A wrapper around a Hadoop SequenceFile that also supports reading and
writing Avro data."

It seems that Spark does not support this out of the box.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
I see Spark is using AvroRecordReaderBase, which reads Avro container
files, a format different from sequence files. If anyone is using Avro
sequence files with success and has an example, please let me know.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
Thanks for responding. I tried using the newAPIHadoopFile method and got an
IOException with the message "Not a data file".

If anyone has an example of this working, I'd appreciate your input.

What I entered at the REPL and what I got back are below:

val myAvroSequenceFile = sc.newAPIHadoopFile("hdfs://<my url>",
classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]],
classOf[NullWritable])

scala> myAvroSequenceFile.first()
14/07/18 17:02:38 INFO FileInputFormat: Total input paths to process : 1
14/07/18 17:02:38 INFO SparkContext: Starting job: first at <console>:19
14/07/18 17:02:38 INFO DAGScheduler: Got job 0 (first at <console>:19) with
1 output partitions (allowLocal=true)
14/07/18 17:02:38 INFO DAGScheduler: Final stage: Stage 0(first at
<console>:19)
14/07/18 17:02:38 INFO DAGScheduler: Parents of final stage: List()
14/07/18 17:02:38 INFO DAGScheduler: Missing parents: List()
14/07/18 17:02:38 INFO DAGScheduler: Computing the requested partition
locally
14/07/18 17:02:38 INFO NewHadoopRDD: Input split: hdfs:<my url>
14/07/18 17:02:38 WARN AvroKeyInputFormat: Reader schema was not set. Use
AvroJob.setInputKeySchema() if desired.
14/07/18 17:02:38 INFO AvroKeyInputFormat: Using a reader schema equal to
the writer schema.
14/07/18 17:02:38 INFO DAGScheduler: Failed to run first at <console>:19
org.apache.spark.SparkDriverExecutionException: Execution error
	at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:585)
	at
org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:563)
Caused by: java.io.IOException: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
	at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
	at
org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:180)
	at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:90)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:578)
	... 1 more



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.