Posted to user@spark.apache.org by aaronjosephs <aa...@placeiq.com> on 2014/07/18 22:55:44 UTC

Re: NullPointerException When Reading Avro Sequence Files

I think you probably want to use `AvroSequenceFileInputFormat` with
`newAPIHadoopFile`. I'm not even sure that in plain Hadoop you would use
`SequenceFileInputFormat` to read an Avro sequence file.
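
For what it's worth, a minimal Scala sketch of that read path (untested
here; the path is a placeholder and the key/value types must match whatever
the writer used):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.AvroSequenceFileInputFormat
import org.apache.hadoop.conf.Configuration

// Read an Avro sequence file as (AvroKey, AvroValue) pairs via the new
// Hadoop API.
val pairs = sc.newAPIHadoopFile(
  "hdfs://namenode/path/to/data.seq",  // placeholder path
  classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord], AvroValue[GenericRecord]]],
  classOf[AvroKey[GenericRecord]],
  classOf[AvroValue[GenericRecord]],
  new Configuration())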



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Simone Franzini <ca...@gmail.com>.
To me this looks like an internal error in the REPL; I am not sure what is
causing it.
Personally, I never use the REPL. Can you try typing up your program and
running it from an IDE or with spark-submit, and see if you still get the
same error?
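
That said, the stack trace below suggests a likely cause: the REPL calls
toString on every value it prints, and Hadoop's Job.toString (via
ensureState) throws IllegalStateException while the job is still in the
DEFINE state. A minimal sketch of a workaround, assuming the new MapReduce
API, is to keep the Job itself out of the REPL's printed results:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.avro.Schema
import org.apache.avro.mapreduce.AvroJob

// Build the Job inside a helper and return only its Configuration, so the
// REPL never tries to print (and hence call toString on) the Job itself.
def avroReadConf(schema: Schema): Configuration = {
  val job = Job.getInstance()             // non-deprecated replacement for `new Job()`
  AvroJob.setInputKeySchema(job, schema)  // attach the reader schema
  job.getConfiguration
}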

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, Dec 15, 2014 at 4:54 PM, Cristovao Jose Domingues Cordeiro <
cristovao.cordeiro@cern.ch> wrote:
>
>  Sure, thanks:
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>         at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>         at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
>         at
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>         at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>         at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>         at .<init>(<console>:10)
>         at .<clinit>(<console>)
>         at $print(<console>)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
>         at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
>         at
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
>         at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
>         at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
>         at
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
>         at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
>         at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
>         at
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
>         at
> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
>         at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
>         at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
>         at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>         at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>         at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>         at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
>         at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
>         at org.apache.spark.repl.Main$.main(Main.scala:31)
>         at org.apache.spark.repl.Main.main(Main.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
>
> Could something you omitted in your snippet be causing this exception?
>
>  Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
> IT Department - 28/R-018
> CERN
>    ------------------------------
> *From:* Simone Franzini [captainfranz@gmail.com]
> *Sent:* 15 December 2014 16:52
>
> *To:* Cristovao Jose Domingues Cordeiro
> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>
>   Ok, I have no idea what that is. That appears to be an internal Spark
> exception. Maybe if you post the entire stack trace, it will give some
> more detail about what is going on.
>
>  Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
> On Mon, Dec 15, 2014 at 4:50 PM, Cristovao Jose Domingues Cordeiro <
> cristovao.cordeiro@cern.ch> wrote:
>>
>>  Hi,
>>
>> thanks for that.
>> But yeah, the 2nd line throws an exception; jobread is not created.
>>
>>  Cumprimentos / Best regards,
>> Cristóvão José Domingues Cordeiro
>> IT Department - 28/R-018
>> CERN
>>    ------------------------------
>> *From:* Simone Franzini [captainfranz@gmail.com]
>> *Sent:* 15 December 2014 16:39
>>
>> *To:* Cristovao Jose Domingues Cordeiro
>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>
>>    I did not mention the imports needed in my code. I think these are
>> all of them:
>>
>>  import org.apache.hadoop.mapreduce.Job
>> import org.apache.hadoop.io.NullWritable
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>> import org.apache.hadoop.fs.{ FileSystem, Path }
>> import org.apache.avro.{ Schema, SchemaBuilder }
>>  import org.apache.avro.SchemaBuilder._
>> import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat,
>> AvroKeyOutputFormat }
>> import org.apache.avro.mapred.AvroKey
>>
>>  However, what you mentioned is a warning that I think can be ignored. I
>> don't see any exception.
>>
>>  Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>> On Mon, Dec 15, 2014 at 3:10 PM, Cristovao Jose Domingues Cordeiro <
>> cristovao.cordeiro@cern.ch> wrote:
>>>
>>>  Hi Simone,
>>>
>>> I was finally able to get the chill package. However, there is still
>>> something unrelated in your snippet which I cannot run:
>>> val jobread = new Job()
>>>
>>> I get:
>>> warning: there were 1 deprecation warning(s); re-run with -deprecation
>>> for details
>>> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>>>
>>>
>>>  Cumprimentos / Best regards,
>>> Cristóvão José Domingues Cordeiro
>>> IT Department - 28/R-018
>>> CERN
>>>    ------------------------------
>>> *From:* Simone Franzini [captainfranz@gmail.com]
>>> *Sent:* 09 December 2014 17:06
>>>
>>> *To:* Cristovao Jose Domingues Cordeiro; user
>>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>>
>>>    You can use this Maven dependency:
>>>
>>>  <dependency>
>>>     <groupId>com.twitter</groupId>
>>>     <artifactId>chill-avro</artifactId>
>>>     <version>0.4.0</version>
>>> </dependency>
>>>
>>>  Simone Franzini, PhD
>>>
>>> http://www.linkedin.com/in/simonefranzini
>>>
>>> On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro <
>>> cristovao.cordeiro@cern.ch> wrote:
>>>
>>>>  Thanks for the reply!
>>>>
>>>> I have in fact tried your code, but I lack the Twitter chill package and
>>>> cannot find it online. So I am now trying this:
>>>> http://spark.apache.org/docs/latest/tuning.html#data-serialization .
>>>> But in case I can't get that to work, could you tell me where to get the
>>>> Twitter package you used?
>>>>
>>>> Thanks
>>>>
>>>>  Cumprimentos / Best regards,
>>>> Cristóvão José Domingues Cordeiro
>>>> IT Department - 28/R-018
>>>> CERN
>>>>    ------------------------------
>>>> *From:* Simone Franzini [captainfranz@gmail.com]
>>>> *Sent:* 09 December 2014 16:42
>>>> *To:* Cristovao Jose Domingues Cordeiro; user
>>>>
>>>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>>>
>>>>    Hi Cristovao,
>>>>
>>>> I have seen a very similar issue that I have posted about in this
>>>> thread:
>>>>
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
>>>>  I think your main issue here is somewhat similar, in that the
>>>> MapWrapper Scala class is not registered. This gets registered by the
>>>> Twitter chill-scala AllScalaRegistrar class that you are currently not
>>>> using.
>>>>
>>>>  As far as I understand, in order to use Avro with Spark, you also
>>>> have to use Kryo. This means you have to use the Spark KryoSerializer. This
>>>> in turn uses Twitter chill. I posted the basic code that I am using here:
>>>>
>>>>
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491
>>>>
>>>>  Maybe there is a simpler solution to your problem but I am not that
>>>> much of an expert yet. I hope this helps.
>>>>
>>>>  Simone Franzini, PhD
>>>>
>>>> http://www.linkedin.com/in/simonefranzini
>>>>
>>>> On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro <
>>>> cristovao.cordeiro@cern.ch> wrote:
>>>>
>>>>>  Hi Simone,
>>>>>
>>>>> thanks but I don't think that's it.
>>>>> I've tried several libraries with the --jars argument. Some do give what
>>>>> you said, but other times (when I put in the right version, I guess) I get
>>>>> the following:
>>>>> 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0
>>>>> (TID 0)
>>>>> java.io.NotSerializableException:
>>>>> scala.collection.convert.Wrappers$MapWrapper
>>>>>         at
>>>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>>>>>         at
>>>>> java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
>>>>>
>>>>>
>>>>> Which is odd, since I am reading an Avro file I wrote... with the same
>>>>> piece of code:
>>>>> https://gist.github.com/MLnick/5864741781b9340cb211
>>>>>
>>>>>  Cumprimentos / Best regards,
>>>>> Cristóvão José Domingues Cordeiro
>>>>> IT Department - 28/R-018
>>>>> CERN
>>>>>    ------------------------------
>>>>> *From:* Simone Franzini [captainfranz@gmail.com]
>>>>> *Sent:* 06 December 2014 15:48
>>>>> *To:* Cristovao Jose Domingues Cordeiro
>>>>> *Subject:* Re: NullPointerException When Reading Avro Sequence Files
>>>>>
>>>>>    java.lang.IncompatibleClassChangeError: Found interface
>>>>> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>>>>
>>>>>  That is a sign that you are mixing up versions of Hadoop. This is
>>>>> particularly an issue when dealing with Avro. If you are using Hadoop 2,
>>>>> you will need to get the Hadoop 2 version of avro-mapred. In Maven you
>>>>> would do this with the <classifier>hadoop2</classifier> tag.
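>>>>>
>>>>>  As a sketch, that dependency would look like the following (the version
>>>>> here is a guess; use whichever Avro release matches your cluster):
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.apache.avro</groupId>
>>>>>     <artifactId>avro-mapred</artifactId>
>>>>>     <version>1.7.7</version>
>>>>>     <classifier>hadoop2</classifier>
>>>>> </dependency>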
>>>>>
>>>>>  Simone Franzini, PhD
>>>>>
>>>>> http://www.linkedin.com/in/simonefranzini
>>>>>
>>>>> On Fri, Dec 5, 2014 at 3:52 AM, cjdc <cr...@cern.ch>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've tried the above example on Gist, but it doesn't work (at least
>>>>>> for me).
>>>>>> Did anyone get this:
>>>>>> 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0
>>>>>> (TID 0)
>>>>>> java.lang.IncompatibleClassChangeError: Found interface
>>>>>> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>>>>>         at
>>>>>>
>>>>>> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>>>>>>         at
>>>>>>
>>>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at
>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>>>>         at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> 14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught
>>>>>> exception
>>>>>> in thread Thread[Executor task launch worker-0,5,main]
>>>>>> java.lang.IncompatibleClassChangeError: Found interface
>>>>>> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>>>>>         at
>>>>>>
>>>>>> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>>>>>>         at
>>>>>>
>>>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
>>>>>>         at
>>>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>>>>         at
>>>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>>>>         at
>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>>>>         at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>>
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> 14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1
>>>>>> times;
>>>>>> aborting job
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: NullPointerException When Reading Avro Sequence Files

Posted by Simone Franzini <ca...@gmail.com>.
You can use this Maven dependency:

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>chill-avro</artifactId>
    <version>0.4.0</version>
</dependency>
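
If you are on sbt instead, the equivalent (assuming the artifact is
cross-published for your Scala version) would be:

libraryDependencies += "com.twitter" %% "chill-avro" % "0.4.0"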

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: NullPointerException When Reading Avro Sequence Files

Posted by Simone Franzini <ca...@gmail.com>.
Hi Cristovao,

I have seen a very similar issue that I have posted about in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
I think your main issue here is somewhat similar, in that the MapWrapper
Scala class is not registered. This gets registered by the Twitter
chill-scala AllScalaRegistrar class that you are currently not using.

As far as I understand, in order to use Avro with Spark, you also have to
use Kryo. This means you have to use the Spark KryoSerializer. This in turn
uses Twitter chill. I posted the basic code that I am using here:

http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

Maybe there is a simpler solution to your problem but I am not that much of
an expert yet. I hope this helps.
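
As a hedged illustration of that wiring (assuming chill-avro 0.4.0;
MyAvroRecord is a placeholder for an Avro-generated SpecificRecord class):

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

// Register a chill-avro serializer for the generated record class. Spark's
// KryoSerializer already pulls in chill-scala's AllScalaRegistrar, which
// covers Scala wrappers such as MapWrapper.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyAvroRecord],
      AvroSerializer.SpecificRecordBinarySerializer[MyAvroRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)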

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: NullPointerException When Reading Avro Sequence Files

Posted by cjdc <cr...@cern.ch>.
Hi all,

I've tried the above example on Gist, but it doesn't work (at least for me).
Did anyone get this:
14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job


Thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
For those curious, I used the JavaSparkContext and got access to an
AvroSequenceFile (a wrapper around SequenceFile) using the following:

// Read the Avro sequence file as (AvroKey, AvroValue) pairs.
JavaPairRDD<AvroKey, AvroValue> file = sc.newAPIHadoopFile("<hdfs path to my file>",
        AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
        new Configuration());



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10305.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
Thanks for the gist. I'm just now learning about Avro. I think when you use
a DataFileWriter you are writing an Avro container file (which is different
from an Avro sequence file). I have a system where data was written to an
HDFS sequence file using AvroSequenceFile.Writer (which is a wrapper around
SequenceFile.Writer).

I'll put together an example of the problem so others can better understand
what I'm talking about.
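
In the meantime, a hedged sketch of the writer side (paths and schemas are
placeholders), to show how such a file is produced:

import org.apache.avro.Schema
import org.apache.avro.hadoop.io.AvroSequenceFile
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// An Avro sequence file is a Hadoop SequenceFile whose key/value
// serialization is Avro, as opposed to an Avro container file.
val writer = AvroSequenceFile.createWriter(
  new AvroSequenceFile.Writer.Options()
    .withFileSystem(fs)
    .withConfiguration(conf)
    .withOutputPath(new Path("/tmp/example.seq"))  // placeholder path
    .withKeySchema(Schema.create(Schema.Type.STRING))
    .withValueSchema(Schema.create(Schema.Type.INT)))

writer.append(new AvroKey[CharSequence]("key"), new AvroValue[Integer](42))
writer.close()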



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Nick Pentreath <ni...@gmail.com>.
I got this working locally a little while ago when playing around with
AvroKeyInputFormat: https://gist.github.com/MLnick/5864741781b9340cb211

But not sure about AvroSequenceFile. Any chance you have an example
datafile or records?
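
For reference, a minimal sketch along the lines of that gist (placeholder
path; this reads an Avro container file of GenericRecords, not a sequence
file):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Keys are AvroKey[GenericRecord]; values are NullWritable.
val records = sc.newAPIHadoopFile(
  "hdfs://namenode/path/to/data.avro",  // placeholder path
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// Map to a serializable type before collecting to the driver.
records.map(_._1.datum.toString).take(5).foreach(println)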



On Sat, Jul 19, 2014 at 11:00 AM, Sparky <Gu...@bah.com> wrote:

> To be more specific, I'm working with a system that stores data in
> org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is
> "A wrapper around a Hadoop SequenceFile that also supports reading and
> writing Avro data."
>
> It seems that Spark does not support this out of the box.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
To be more specific, I'm working with a system that stores data in
org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is 
"A wrapper around a Hadoop SequenceFile that also supports reading and
writing Avro data."

It seems that Spark does not support this out of the box.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
I see Spark is using AvroRecordReaderBase, which reads Avro container
files, a format different from sequence files. If anyone is using Avro
sequence files with success and has an example, please let me know.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: NullPointerException When Reading Avro Sequence Files

Posted by Sparky <Gu...@bah.com>.
Thanks for responding. I tried using the newAPIHadoopFile method and got an
IOException with the message "Not a data file".

If anyone has an example of this working, I'd appreciate your input.

What I entered at the REPL and what I got back are below:

val myAvroSequenceFile = sc.newAPIHadoopFile("hdfs://<my url>",
classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]],
classOf[NullWritable])

scala> myAvroSequenceFile.first()
14/07/18 17:02:38 INFO FileInputFormat: Total input paths to process : 1
14/07/18 17:02:38 INFO SparkContext: Starting job: first at <console>:19
14/07/18 17:02:38 INFO DAGScheduler: Got job 0 (first at <console>:19) with
1 output partitions (allowLocal=true)
14/07/18 17:02:38 INFO DAGScheduler: Final stage: Stage 0(first at
<console>:19)
14/07/18 17:02:38 INFO DAGScheduler: Parents of final stage: List()
14/07/18 17:02:38 INFO DAGScheduler: Missing parents: List()
14/07/18 17:02:38 INFO DAGScheduler: Computing the requested partition
locally
14/07/18 17:02:38 INFO NewHadoopRDD: Input split: hdfs:<my url>
14/07/18 17:02:38 WARN AvroKeyInputFormat: Reader schema was not set. Use
AvroJob.setInputKeySchema() if desired.
14/07/18 17:02:38 INFO AvroKeyInputFormat: Using a reader schema equal to
the writer schema.
14/07/18 17:02:38 INFO DAGScheduler: Failed to run first at <console>:19
org.apache.spark.SparkDriverExecutionException: Execution error
	at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:585)
	at
org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:563)
Caused by: java.io.IOException: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
	at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
	at
org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:180)
	at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:90)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:578)
	... 1 more



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.