Posted to user@spark.apache.org by Uri Laserson <la...@cloudera.com> on 2014/02/06 04:44:24 UTC

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

My Spark is 0.9.0-SNAPSHOT, built from wherever master was at the time
(like a week or two ago).

If you're referring to the cloneRecords parameter, it appears to default to
true, but even when I add it explicitly, I get the same error.
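
Just for reference, this is roughly what I mean by adding it explicitly (a
sketch only: it assumes a 0.9.0-SNAPSHOT build whose newAPIHadoopFile still
accepts a trailing cloneRecords boolean, which later builds dropped; the path
and read-support setup are from my earlier message, and sc is the
SparkContext that spark-shell provides):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

val job = new Job(sc.hadoopConfiguration)
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

// The trailing boolean is the cloneRecords flag; it exists only on some
// snapshot builds and is not part of the released 0.9.0 API.
val records1 = sc.newAPIHadoopFile(
  "/Users/laserson/temp/test-parquet/alltypeuri",
  classOf[ParquetInputFormat[GenericRecord]],
  classOf[Void],
  classOf[GenericRecord],
  job.getConfiguration,
  true)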


On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft
<fn...@berkeley.edu>wrote:

> Uri,
>
> Which version of Spark are you running? If it is >0.9.0, you need to add
> an optional true argument at the end of the sc.newAPIHadoopFile(…) call to
> read Parquet data.
>
> Regards,
>
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466
>
> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com> wrote:
>
> I am cross-posting on the parquet mailing list.  Short recap: I am trying
> to read Parquet data from the spark interactive shell.
>
> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>
> export
> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>
> From the spark-shell, I run:
>
> val job = new Job(sc.hadoopConfiguration)
> ParquetInputFormat.setReadSupportClass(job,
> classOf[AvroReadSupport[GenericRecord]])
> val records1 =
> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
> classOf[GenericRecord], job.getConfiguration)
>
> Then I try
>
> records1.count
>
> Which gives the following error:
>
> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
> java.lang.NoSuchMethodError:
> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>  at
> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
> at parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>  at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
> at
> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>  at
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
> at
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>  at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>  at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>  at
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:744)
>
>
> My hypothesis is that this is a shading problem.  It appears that the code is
> trying to call a constructor that looks like this:
>
> Schema.Field(String, Schema, String, parquet.org.codehaus.jackson.JsonNode)
>
> but the signature from the spark-assembly jar is
>
> public org.apache.avro.Schema$Field(java.lang.String,
> org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode);
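>
> A quick way to double-check which constructors are actually on the shell's
> classpath is plain reflection from the same spark-shell session, e.g.:
>
> classOf[org.apache.avro.Schema.Field].getConstructors.foreach(println)
>
> which prints each public constructor signature it finds.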
>
> Where do I go from here?
>
> Uri
>
>
>
>
>
>
>
> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <la...@cloudera.com>wrote:
>
>> Yep, I did not include that jar in the class path.  Now I've got some
>> "real" errors to try to work through.  Thanks!
>>
>>
>>  On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu>wrote:
>>
>>> Hi Uri,
>>>
>>> Could you try adding the parquet-jackson JAR to your classpath? There
>>> may possibly be other parquet-avro dependencies that are missing too.
>>>
>>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>>>
>>> -Jey
>>>
>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <la...@cloudera.com>
>>> wrote:
>>> > Yes, of course.  That class is a jackson class, and I'm not sure why
>>> it's
>>> > being referred to as
>>> parquet.org.codehaus.jackson.JsonGenerationException.
>>> >
>>> > org.codehaus.jackson.JsonGenerationException is on the classpath.  But
>>> not
>>> > when it's prefixed by parquet.
>>> >
>>> >
>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <an...@andrewash.com>
>>> wrote:
>>> >>
>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to confirm
>>> that
>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class exists in
>>> one of
>>> >> them?
>>> >>
>>> >>
>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson <la...@cloudera.com>
>>> >> wrote:
>>> >>>
>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>>> GenericRecords
>>> >>> from a Parquet file. I'm having a bit of trouble with respect to
>>> >>> dependencies.  My latest attempt looks like this:
>>> >>>
>>> >>> export
>>> >>>
>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>>> >>>
>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>> >>>
>>> >>> Then in the shell:
>>> >>>
>>> >>> val records1 =
>>> >>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>>> classOf[IndexedRecord],
>>> >>> sc.hadoopConfiguration)
>>> >>> records1.collect
>>> >>>
>>> >>> At which point it barfs:
>>> >>>
>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to process
>>> : 3
>>> >>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
>>> further
>>> >>> details.
>>> >>> java.io.IOException: Could not read footer:
>>> >>> java.lang.NoClassDefFoundError:
>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>> >>> at
>>> >>>
>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>>> >>> at
>>> >>>
>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>>> >>> at
>>> >>>
>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>>> >>> at
>>> >>>
>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>>> >>> at
>>> >>>
>>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>>> >>> at
>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>>> >>> at
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>>> >>> at
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>>> >>> at scala.Option.getOrElse(Option.scala:120)
>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>>> >>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>>> >>> at $iwC$$iwC.<init>(<console>:27)
>>> >>> at $iwC.<init>(<console>:29)
>>> >>> at <init>(<console>:31)
>>> >>> at .<init>(<console>:35)
>>> >>> at .<clinit>(<console>)
>>> >>> at .<init>(<console>:7)
>>> >>> at .<clinit>(<console>)
>>> >>> at $print(<console>)
>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >>> at
>>> >>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> >>> at
>>> >>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>>> >>> at
>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>>> >>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>>> >>> at
>>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>>> >>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>>> >>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>> >>> at
>>> >>>
>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>> >>> at
>>> >>>
>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>>> >>> Caused by: java.lang.NoClassDefFoundError:
>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>> >>> at
>>> >>>
>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>>> >>> at
>>> >>>
>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>>> >>> at
>>> >>>
>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>>> >>> at
>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>>> >>> at
>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> >>> at
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >>> at
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>> at java.lang.Thread.run(Thread.java:744)
>>> >>> Caused by: java.lang.ClassNotFoundException:
>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> >>> at java.security.AccessController.doPrivileged(Native Method)
>>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>> >>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>> >>> ... 9 more
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Uri Laserson, PhD
>>> >>> Data Scientist, Cloudera
>>> >>> Twitter/GitHub: @laserson
>>> >>> +1 617 910 0447
>>> >>> laserson@cloudera.com
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Uri Laserson, PhD
>>> > Data Scientist, Cloudera
>>> > Twitter/GitHub: @laserson
>>> > +1 617 910 0447
>>> > laserson@cloudera.com
>>>
>>
>>
>>
>> --
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
>>
>
>
>
> --
> Uri Laserson, PhD
> Data Scientist, Cloudera
> Twitter/GitHub: @laserson
> +1 617 910 0447
> laserson@cloudera.com
>
>
>  --
> http://parquet.github.com/
> ---
> You received this message because you are subscribed to the Google Groups
> "Parquet" group.
> To post to this group, send email to parquet-dev@googlegroups.com.
>



-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Julien Le Dem <ju...@twitter.com>.
(sending again now that I'm subscribed to the Spark user mailing list)
Hi Uri,
Parquet shades Jackson to avoid dependency conflicts with Hadoop. Hadoop
depends on an ancient version of Jackson, and Parquet has to work with
several versions of Hadoop independently of whichever Jackson version they
pull in.

It appears that this creates a problem in parquet-avro here:
https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L191

Notice that NullNode here is an org.codehaus.jackson.node.NullNode, which
looks weird to me since we are building an Avro schema. Why would we use a
Jackson type there?
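
To make the mismatch concrete, here is a minimal sketch (assuming Avro 1.7.x
and Jackson 1.x on the classpath; the field name is made up) of the kind of
call AvroSchemaConverter ends up making. Avro's Schema.Field constructor
takes an org.codehaus.jackson.JsonNode as the default value, so the node has
to come from the same unshaded Jackson that Avro was compiled against;
passing a relocated parquet.org.codehaus.jackson.node.NullNode makes the JVM
look for a constructor overload that does not exist, hence the
NoSuchMethodError reported above.

import org.apache.avro.Schema
import org.codehaus.jackson.node.NullNode

// Links only when NullNode is the unshaded Jackson class Avro itself uses.
val field = new Schema.Field(
  "example",                          // field name (made up for illustration)
  Schema.create(Schema.Type.STRING),  // field type
  null,                               // doc string
  NullNode.getInstance())             // default value as a Jackson JsonNode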

I see 2 solutions:
 - parquet-avro should not shade Jackson (but really I don't see why we
depend on Jackson at all here)
 - AvroSchemaConverter should not depend on jackson.

Do you know why the Avro abstraction is leaking jackson here?


On Thu, Feb 6, 2014 at 9:13 AM, Julien Le Dem <ju...@twitter.com> wrote:

> Hi Uri,
> Parquet shades Jackson to avoid dependency conflicts with Hadoop. Hadoop
> depends on an ancient version of Jackson, also Parquet works with several
> versions of Hadoop independently of what jackson version they pull.
>
> It appears that this creates a problem in parquet-avro here:
>
> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L191
>
> notice that NullNode here is a org.codehaus.jackson.node.NullNode which
> looks weird to me as we are building an Avro schema. Why would we use a
> Jackson type in there?
>
> I see 2 solutions:
>  - parquet-avro should not shade Jackson (but really I don't see why we
> depend on Jackson at all here)
>  - AvroSchemaConverter should not depend on jackson.
>
> Do you know why the Avro abstraction is leaking jackson here?
>
>
>
>
> On Thu, Feb 6, 2014 at 12:40 AM, Uri Laserson <la...@cloudera.com>wrote:
>
>> I am skeptical that will solve my problem, though.  Either way, I just
>> pulled the latest master and built that, and the same problem remains.
>>
>>
>> On Wed, Feb 5, 2014 at 7:50 PM, Prashant Sharma <sc...@gmail.com>wrote:
>>
>>> That cloneRecords parameter is gone, so either use the released 0.9.0 or
>>> the current master.
>>>
>>>
>>> On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft <
>>> fnothaft@berkeley.edu> wrote:
>>>
>>>> Uri,
>>>>
>>>> Er, yes, it is the cloneRecords, and when I said true, I meant false...
>>>> Apologies for the misdirection there.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>> fnothaft@berkeley.edu
>>>> fnothaft@eecs.berkeley.edu
>>>> 202-340-0466
>>>>
>>>> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com> wrote:
>>>>
>>>> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the time
>>>> (like a week or two ago).
>>>>
>>>> If you're referring to the cloneRecords parameter, it appears to
>>>> default to true, but even when I add it explicitly, I get the same error.
>>>>
>>>>
>>>> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft <
>>>> fnothaft@berkeley.edu> wrote:
>>>>
>>>>> Uri,
>>>>>
>>>>> Which version of Spark are you running? If it is >0.9.0, you need to
>>>>> add an optional true argument at the end of the sc.newApiHadoopFile(...) call
>>>>> to read Parquet data.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Frank Austin Nothaft
>>>>> fnothaft@berkeley.edu
>>>>> fnothaft@eecs.berkeley.edu
>>>>> 202-340-0466
>>>>>
>>>>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>> I am cross-posting on the parquet mailing list.  Short recap: I am
>>>>> trying to read Parquet data from the spark interactive shell.
>>>>>
>>>>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>>>>>
>>>>> export
>>>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>>>>>
>>>>> From the spark-shell, I run:
>>>>>
>>>>> val job = new Job(sc.hadoopConfiguration)
>>>>> ParquetInputFormat.setReadSupportClass(job,
>>>>> classOf[AvroReadSupport[GenericRecord]])
>>>>> val records1 =
>>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>>>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>>>>> classOf[GenericRecord], job.getConfiguration)
>>>>>
>>>>> Then I try
>>>>>
>>>>> records1.count
>>>>>
>>>>> Which gives the following error:
>>>>>
>>>>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>>>>> java.lang.NoSuchMethodError:
>>>>> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>>>>>  at
>>>>> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>>>>> at
>>>>> parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>>>>>  at
>>>>> parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>>>>> at
>>>>> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>>>>>  at
>>>>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>>>>> at
>>>>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>>>>>  at
>>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>>>>>  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>>>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>>>>>  at org.apache.spark.scheduler.Task.run(Task.scala:53)
>>>>> at
>>>>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>>>>  at
>>>>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>>>> at
>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>>>>  at
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>> at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>  at java.lang.Thread.run(Thread.java:744)
>>>>>
>>>>>
>>>>> My hypothesis is that this is a shading problem.  It appears that the
>>>>> code is trying to call a constructor that looks like this:
>>>>>
>>>>> Schema.Field(String, Schema, String, parquet.org.codehaus.jackson.JsonNode)
>>>>>
>>>>> but the signature from the spark-assembly jar is
>>>>>
>>>>> public org.apache.avro.Schema$Field(java.lang.String,
>>>>> org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode);
>>>>>
>>>>> Where do I go from here?
>>>>>
>>>>> Uri
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <la...@cloudera.com>wrote:
>>>>>
>>>>>> Yep, I did not include that jar in the class path.  Now I've got some
>>>>>> "real" errors to try to work through.  Thanks!
>>>>>>
>>>>>>
>>>>>>  On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu>wrote:
>>>>>>
>>>>>>> Hi Uri,
>>>>>>>
>>>>>>> Could you try adding the parquet-jackson JAR to your classpath? There
>>>>>>> may possibly be other parquet-avro dependencies that are missing too.
>>>>>>>
>>>>>>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>>>>>>>
>>>>>>> -Jey
>>>>>>>
>>>>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <la...@cloudera.com>
>>>>>>> wrote:
>>>>>>> > Yes, of course.  That class is a jackson class, and I'm not sure
>>>>>>> why it's
>>>>>>> > being referred to as
>>>>>>> parquet.org.codehaus.jackson.JsonGenerationException.
>>>>>>> >
>>>>>>> > org.codehaus.jackson.JsonGenerationException is on the classpath.
>>>>>>>  But not
>>>>>>> > when it's prefixed by parquet.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <an...@andrewash.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to
>>>>>>> confirm that
>>>>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class exists
>>>>>>> in one of
>>>>>>> >> them?
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson <
>>>>>>> laserson@cloudera.com>
>>>>>>> >> wrote:
>>>>>>> >>>
>>>>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>>>>>>> GenericRecords
>>>>>>> >>> from a Parquet file. I'm having a bit of trouble with respect to
>>>>>>> >>> dependencies.  My latest attempt looks like this:
>>>>>>> >>>
>>>>>>> >>> export
>>>>>>> >>>
>>>>>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>>>>>>> >>>
>>>>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>>>>>> >>>
>>>>>>> >>> Then in the shell:
>>>>>>> >>>
>>>>>>> >>> val records1 =
>>>>>>> >>>
>>>>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>>>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>>>>>>> classOf[IndexedRecord],
>>>>>>> >>> sc.hadoopConfiguration)
>>>>>>> >>> records1.collect
>>>>>>> >>>
>>>>>>> >>> At which point it barfs:
>>>>>>> >>>
>>>>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
>>>>>>> process : 3
>>>>>>> >>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>>>>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
>>>>>>> >>> details.
>>>>>>> >>> java.io.IOException: Could not read footer:
>>>>>>> >>> java.lang.NoClassDefFoundError:
>>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>>>>>>> >>> at
>>>>>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>>>>>>> >>> at
>>>>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>>>>>>> >>> at
>>>>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>>>>>>> >>> at scala.Option.getOrElse(Option.scala:120)
>>>>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>>>>>>> >>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>>>>>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>>>>>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>>>>>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>>>>>>> >>> at $iwC$$iwC.<init>(<console>:27)
>>>>>>> >>> at $iwC.<init>(<console>:29)
>>>>>>> >>> at <init>(<console>:31)
>>>>>>> >>> at .<init>(<console>:35)
>>>>>>> >>> at .<clinit>(<console>)
>>>>>>> >>> at .<init>(<console>:7)
>>>>>>> >>> at .<clinit>(<console>)
>>>>>>> >>> at $print(<console>)
>>>>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>>>>>>> >>> at
>>>>>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>>>>>>> >>> at
>>>>>>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>>>>>>> >>> at
>>>>>>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>>>>>>> >>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>>>>>>> >>> at
>>>>>>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>>>>>>> >>> at
>>>>>>> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>>>>>>> >>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>>>>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>>>>>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>>>>>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>>>>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>>>>>>> >>> Caused by: java.lang.NoClassDefFoundError:
>>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>>>>>>> >>> at
>>>>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>>>>>>> >>> at
>>>>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>>>>>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>> >>> at
>>>>>>> >>>
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>> >>> at java.lang.Thread.run(Thread.java:744)
>>>>>>> >>> Caused by: java.lang.ClassNotFoundException:
>>>>>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>>>>> >>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>>>>> >>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>>>>> >>> ... 9 more
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> Uri Laserson, PhD
>>>>>>> >>> Data Scientist, Cloudera
>>>>>>> >>> Twitter/GitHub: @laserson
>>>>>>> >>> +1 617 910 0447
>>>>>>> >>> laserson@cloudera.com
>>>>>>> >>
>>>>>>> >>
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Uri Laserson, PhD
>>>>>>> > Data Scientist, Cloudera
>>>>>>> > Twitter/GitHub: @laserson
>>>>>>> > +1 617 910 0447
>>>>>>> > laserson@cloudera.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Uri Laserson, PhD
>>>>>> Data Scientist, Cloudera
>>>>>> Twitter/GitHub: @laserson
>>>>>> +1 617 910 0447
>>>>>> laserson@cloudera.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Uri Laserson, PhD
>>>>> Data Scientist, Cloudera
>>>>> Twitter/GitHub: @laserson
>>>>> +1 617 910 0447
>>>>> laserson@cloudera.com
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://parquet.github.com/
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Parquet" group.
>>>>> To post to this group, send email to parquet-dev@googlegroups.com.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Uri Laserson, PhD
>>>> Data Scientist, Cloudera
>>>> Twitter/GitHub: @laserson
>>>> +1 617 910 0447
>>>> laserson@cloudera.com
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Prashant
>>>
>>
>>
>>
>> --
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
>>
>> --
>> http://parquet.github.com/
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Parquet" group.
>> To post to this group, send email to parquet-dev@googlegroups.com.
>>
>
>

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Andrew Ash <an...@andrewash.com>.
There has been talk on the Spark mailing list of having Spark's Kryo
serializer delegate to Avro serialization for Avro objects, and an
enhancement request was filed in Spark's Jira, but it hasn't happened yet.

https://spark-project.atlassian.net/browse/SPARK-746

The right thing is to put Kryo->Avro delegation in Chill so other projects
besides Spark get the improvement too.
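
Until something like that lands, one stopgap is a hand-rolled registrator.
The sketch below is purely illustrative (it assumes the Kryo 2.x and Avro
1.7.x APIs that ship with Spark 0.9, writes the schema alongside every
record, and the class names are made up):

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.spark.serializer.KryoRegistrator

// Serializes a GenericData.Record as (schema JSON, body length, Avro binary body).
// Writing the schema with every record is wasteful but keeps the sketch self-contained.
class GenericRecordSerializer extends Serializer[GenericData.Record] {
  override def write(kryo: Kryo, output: Output, record: GenericData.Record): Unit = {
    output.writeString(record.getSchema.toString)
    val bytes = new ByteArrayOutputStream()
    val writer = new GenericDatumWriter[GenericRecord](record.getSchema)
    val encoder = EncoderFactory.get.binaryEncoder(bytes, null)
    writer.write(record, encoder)
    encoder.flush()
    val body = bytes.toByteArray
    output.writeInt(body.length, true)
    output.writeBytes(body)
  }

  override def read(kryo: Kryo, input: Input, cls: Class[GenericData.Record]): GenericData.Record = {
    val schema = new Schema.Parser().parse(input.readString())
    val body = input.readBytes(input.readInt(true))
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get.binaryDecoder(body, null)
    reader.read(null, decoder).asInstanceOf[GenericData.Record]
  }
}

// Wire this up via the spark.kryo.registrator property (see further down the thread).
class AvroGenericRecordRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[GenericData.Record], new GenericRecordSerializer)
  }
}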


On Thu, Feb 6, 2014 at 10:43 AM, Julien Le Dem <ju...@twitter.com> wrote:

> Spark uses Kryo under the hood for serializing objects, which may need
> configuration to handle Avro records automatically.
> The fine Scalding authors made chill, a library that adds a bunch of
> custom serializations to Kryo.
> https://github.com/twitter/chill
> Which Spark is using now:
> https://github.com/mesos/spark/pull/732
>
> So possibly this is the way to investigate and add Avro support?
>
>
>
>
> On Thu, Feb 6, 2014 at 10:31 AM, Uri Laserson <la...@cloudera.com>wrote:
>
>> You're a good man, Julien.  Commenting out those lines from the pom.xml
>> fixed the problem.  I can now create an RDD of GenericRecord objects, map
>> them to strings, and spit out the corresponding JSON from the interactive
>> spark-shell.
>>
>> Just to recap how it works for me:
>>
>> Set the SPARK_CLASSPATH to include the necessary Parquet jars:
>> export
>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-encoding/target/parquet-encoding-1.3.3-SNAPSHOT.jar"
>>
>> Run the Spark shell:
>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>
>> Then:
>>
>> val job = new Job(sc.hadoopConfiguration)
>>  ParquetInputFormat.setReadSupportClass(job,
>> classOf[AvroReadSupport[GenericRecord]])
>> val records1 =
>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>> classOf[GenericRecord], job.getConfiguration)
>> val records2 = records1.map(p => p._2)
>> val records3 = records2.map(p => p.toString)
>>
>> One thing that I now need to work out is that while records3 results in
>> an RDD of JSON strings, records2 gives me java.io.NotSerializableException:
>> org.apache.avro.generic.GenericData$Record, which I find surprising.  Is
>> this expected behavior?
>>
>> Thanks for all the help!
>> Uri
>>
>>
>>
>>
>> On Thu, Feb 6, 2014 at 9:30 AM, Julien Le Dem <ju...@twitter.com> wrote:
>>
>>> Then probably just removing those 4 lines would fix it as parquet-avro
>>> does not use jackson outside of that (I think):
>>>
>>> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/pom.xml#L101
>>> Can you try it out?
>>> Thanks
>>>
>>>
>>> On Thu, Feb 6, 2014 at 9:25 AM, Tom White <to...@cloudera.com> wrote:
>>>
>>>> On Thu, Feb 6, 2014 at 5:13 PM, Julien Le Dem <ju...@twitter.com>
>>>> wrote:
>>>> > Hi Uri,
>>>> > Parquet shades Jackson to avoid dependency conflicts with Hadoop.
>>>> Hadoop
>>>> > depends on an ancient version of Jackson, also Parquet works with
>>>> several
>>>> > versions of Hadoop independently of what jackson version they pull.
>>>> >
>>>> > It appears that this creates a problem in parquet-avro here:
>>>> >
>>>> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L191
>>>> >
>>>> > notice that NullNode here is a org.codehaus.jackson.node.NullNode
>>>> which
>>>> > looks weird to me as we are building an Avro schema. Why would we use
>>>> a
>>>> > Jackson type in there?
>>>>
>>>> Avro uses Jackson to construct default values since they are expressed
>>>> as JSON objects.
>>>>
>>>> >
>>>> > I see 2 solutions:
>>>> >  - parquet-avro should not shade Jackson (but really I don't see why
>>>> we
>>>> > depend on Jackson at all here)
>>>>
>>>> This is probably the best solution here, assuming Hadoop, Spark etc
>>>> all use the same (or at least compatible) versions of Jackson.
>>>>
>>>> >  - AvroSchemaConverter should not depend on jackson.
>>>> >
>>>> > Do you know why the Avro abstraction is leaking jackson here?
>>>>
>>>> Unfortunately Avro does leak the Jackson dependency. There's been a
>>>> bit of discussion about avoiding this
>>>> (https://issues.apache.org/jira/browse/AVRO-1126), but no patches
>>>> yet.
>>>>
>>>> Tom
>>>>
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Thu, Feb 6, 2014 at 12:40 AM, Uri Laserson <la...@cloudera.com>
>>>> wrote:
>>>> >>
>>>> >> I am skeptical that will solve my problem, though.  Either way, I
>>>> just
>>>> >> pulled the latest master and built that, and the same problem
>>>> remains.
>>>> >>
>>>> >>
>>>> >> On Wed, Feb 5, 2014 at 7:50 PM, Prashant Sharma <
>>>> scrapcodes@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> That cloneRecords parameter is gone, so either use the released
>>>> 0.9.0 or
>>>> >>> the current master.
>>>> >>>
>>>> >>>
>>>> >>> On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft
>>>> >>> <fn...@berkeley.edu> wrote:
>>>> >>>>
>>>> >>>> Uri,
>>>> >>>>
>>>> >>>> Er, yes, it is the cloneRecords, and when I said true, I meant
>>>> false...
>>>> >>>> Apologies for the misdirection there.
>>>> >>>>
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>>
>>>> >>>> Frank Austin Nothaft
>>>> >>>> fnothaft@berkeley.edu
>>>> >>>> fnothaft@eecs.berkeley.edu
>>>> >>>> 202-340-0466
>>>> >>>>
>>>> >>>> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the
>>>> time
>>>> >>>> (like a week or two ago).
>>>> >>>>
>>>> >>>> If you're referring to the cloneRecords parameter, it appears to
>>>> default
>>>> >>>> to true, but even when I add it explicitly, I get the same error.
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft
>>>> >>>> <fn...@berkeley.edu> wrote:
>>>> >>>>>
>>>> >>>>> Uri,
>>>> >>>>>
>>>> >>>>> Which version of Spark are you running? If it is >0.9.0, you need
>>>> to
>>>> >>>>> add an optional true argument at the end of the
>>>> sc.newApiHadoopFile(...) call
>>>> >>>>> to read Parquet data.
>>>> >>>>>
>>>> >>>>> Regards,
>>>> >>>>>
>>>> >>>>> Frank Austin Nothaft
>>>> >>>>> fnothaft@berkeley.edu
>>>> >>>>> fnothaft@eecs.berkeley.edu
>>>> >>>>> 202-340-0466
>>>> >>>>>
>>>> >>>>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> I am cross-posting on the parquet mailing list.  Short recap: I am
>>>> >>>>> trying to read Parquet data from the spark interactive shell.
>>>> >>>>>
>>>> >>>>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>>>> >>>>>
>>>> >>>>> export
>>>> >>>>>
>>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>>>> >>>>>
>>>> >>>>> From the spark-shell, I run:
>>>> >>>>>
>>>> >>>>> val job = new Job(sc.hadoopConfiguration)
>>>> >>>>> ParquetInputFormat.setReadSupportClass(job,
>>>> >>>>> classOf[AvroReadSupport[GenericRecord]])
>>>> >>>>> val records1 =
>>>> >>>>>
>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>>> >>>>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>>>> >>>>> classOf[GenericRecord], job.getConfiguration)
>>>> >>>>>
>>>> >>>>> Then I try
>>>> >>>>>
>>>> >>>>> records1.count
>>>> >>>>>
>>>> >>>>> Which gives the following error:
>>>> >>>>>
>>>> >>>>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>>>> >>>>> java.lang.NoSuchMethodError:
>>>> >>>>>
>>>> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>>>> >>>>> at
>>>> >>>>>
>>>> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>>>> >>>>> at
>>>> >>>>>
>>>> parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>>>> >>>>> at
>>>> parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>>>> >>>>> at
>>>> >>>>>
>>>> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>>>> >>>>> at
>>>> >>>>>
>>>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>>>> >>>>> at
>>>> >>>>>
>>>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>>>> >>>>> at
>>>> >>>>>
>>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>>>> >>>>> at
>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>>>> >>>>> at
>>>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>>>> >>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>>> >>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>>> >>>>> at
>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>>>> >>>>> at org.apache.spark.scheduler.Task.run(Task.scala:53)
>>>> >>>>> at
>>>> >>>>>
>>>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>>> >>>>> at
>>>> >>>>>
>>>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>>> >>>>> at
>>>> >>>>>
>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>>> >>>>> at
>>>> >>>>>
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >>>>> at
>>>> >>>>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >>>>> at java.lang.Thread.run(Thread.java:744)
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> My hypothesis is that this is a shading problem.  It appears that
>>>> the code
>>>> >>>>> is trying to call a constructor that looks like this:
>>>> >>>>>
>>>> >>>>> Schema.Field(String, Schema, String,
>>>> >>>>> parquet.org.codehaus.jackson.JsonNode)
>>>> >>>>>
>>>> >>>>> but the signature from the spark-assembly jar is
>>>> >>>>>
>>>> >>>>> public org.apache.avro.Schema$Field(java.lang.String,
>>>> >>>>> org.apache.avro.Schema, java.lang.String,
>>>> org.codehaus.jackson.JsonNode);
>>>> >>>>>
>>>> >>>>> Where do I go from here?
>>>> >>>>>
>>>> >>>>> Uri
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <
>>>> laserson@cloudera.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> Yep, I did not include that jar in the class path.  Now I've got
>>>> some
>>>> >>>>>> "real" errors to try to work through.  Thanks!
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <
>>>> jey@cs.berkeley.edu>
>>>> >>>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi Uri,
>>>> >>>>>>>
>>>> >>>>>>> Could you try adding the parquet-jackson JAR to your classpath?
>>>> There
>>>> >>>>>>> may possibly be other parquet-avro dependencies that are
>>>> missing too.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>>>> >>>>>>>
>>>> >>>>>>> -Jey
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <
>>>> laserson@cloudera.com>
>>>> >>>>>>> wrote:
>>>> >>>>>>> > Yes, of course.  That class is a jackson class, and I'm not
>>>> sure
>>>> >>>>>>> > why it's
>>>> >>>>>>> > being referred to as
>>>> >>>>>>> > parquet.org.codehaus.jackson.JsonGenerationException.
>>>> >>>>>>> >
>>>> >>>>>>> > org.codehaus.jackson.JsonGenerationException is on the
>>>> classpath.
>>>> >>>>>>> > But not
>>>> >>>>>>> > when it's prefixed by parquet.
>>>> >>>>>>> >
>>>> >>>>>>> >
>>>> >>>>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <
>>>> andrew@andrewash.com>
>>>> >>>>>>> > wrote:
>>>> >>>>>>> >>
>>>> >>>>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to
>>>> >>>>>>> >> confirm that
>>>> >>>>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class
>>>> exists
>>>> >>>>>>> >> in one of
>>>> >>>>>>> >> them?
>>>> >>>>>>> >>
>>>> >>>>>>> >>
>>>> >>>>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson
>>>> >>>>>>> >> <la...@cloudera.com>
>>>> >>>>>>> >> wrote:
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>>>> >>>>>>> >>> GenericRecords
>>>> >>>>>>> >>> from a Parquet file. I'm having a bit of trouble with
>>>> respect to
>>>> >>>>>>> >>> dependencies.  My latest attempt looks like this:
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> export
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> Then in the shell:
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> val records1 =
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>>> >>>>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>>>> >>>>>>> >>> classOf[IndexedRecord],
>>>> >>>>>>> >>> sc.hadoopConfiguration)
>>>> >>>>>>> >>> records1.collect
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> At which point it barfs:
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
>>>> >>>>>>> >>> process : 3
>>>> >>>>>>> >>> SLF4J: Failed to load class
>>>> "org.slf4j.impl.StaticLoggerBinder".
>>>> >>>>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger
>>>> implementation
>>>> >>>>>>> >>> SLF4J: See
>>>> http://www.slf4j.org/codes.html#StaticLoggerBinder for
>>>> >>>>>>> >>> further
>>>> >>>>>>> >>> details.
>>>> >>>>>>> >>> java.io.IOException: Could not read footer:
>>>> >>>>>>> >>> java.lang.NoClassDefFoundError:
>>>> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>>>> >>>>>>> >>> at scala.Option.getOrElse(Option.scala:120)
>>>> >>>>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>>>> >>>>>>> >>> at
>>>> org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>>>> >>>>>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>>>> >>>>>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>>>> >>>>>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>>>> >>>>>>> >>> at $iwC$$iwC.<init>(<console>:27)
>>>> >>>>>>> >>> at $iwC.<init>(<console>:29)
>>>> >>>>>>> >>> at <init>(<console>:31)
>>>> >>>>>>> >>> at .<init>(<console>:35)
>>>> >>>>>>> >>> at .<clinit>(<console>)
>>>> >>>>>>> >>> at .<init>(<console>:7)
>>>> >>>>>>> >>> at .<clinit>(<console>)
>>>> >>>>>>> >>> at $print(<console>)
>>>> >>>>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>> Method)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> >>>>>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>>>> >>>>>>> >>> at
>>>> org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>>>> >>>>>>> >>> at
>>>> org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>>> >>>>>>> >>> at
>>>> org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>>>> >>>>>>> >>> at
>>>> org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>>>> >>>>>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>>> >>>>>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>>>> >>>>>>> >>> Caused by: java.lang.NoClassDefFoundError:
>>>> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>>>> >>>>>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >>>>>>> >>> at
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >>>>>>> >>> at java.lang.Thread.run(Thread.java:744)
>>>> >>>>>>> >>> Caused by: java.lang.ClassNotFoundException:
>>>> >>>>>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>>>> >>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>> >>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>> >>>>>>> >>> at java.security.AccessController.doPrivileged(Native
>>>> Method)
>>>> >>>>>>> >>> at
>>>> java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>> >>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>> >>>>>>> >>> at
>>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>>> >>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>> >>>>>>> >>> ... 9 more
>>>> >>>>>>> >>>
>>>> >>>>>>> >>>
>>>> >>>>>>> >>> --
>>>> >>>>>>> >>> Uri Laserson, PhD
>>>> >>>>>>> >>> Data Scientist, Cloudera
>>>> >>>>>>> >>> Twitter/GitHub: @laserson
>>>> >>>>>>> >>> +1 617 910 0447
>>>> >>>>>>> >>> laserson@cloudera.com
>>>> >>>>>>> >>
>>>> >>>>>>> >>
>>>> >>>>>>> >
>>>> >>>>>>> >
>>>> >>>>>>> >
>>>> >>>>>>> > --
>>>> >>>>>>> > Uri Laserson, PhD
>>>> >>>>>>> > Data Scientist, Cloudera
>>>> >>>>>>> > Twitter/GitHub: @laserson
>>>> >>>>>>> > +1 617 910 0447
>>>> >>>>>>> > laserson@cloudera.com
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> --
>>>> >>>>>> Uri Laserson, PhD
>>>> >>>>>> Data Scientist, Cloudera
>>>> >>>>>> Twitter/GitHub: @laserson
>>>> >>>>>> +1 617 910 0447
>>>> >>>>>> laserson@cloudera.com
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Uri Laserson, PhD
>>>> >>>>> Data Scientist, Cloudera
>>>> >>>>> Twitter/GitHub: @laserson
>>>> >>>>> +1 617 910 0447
>>>> >>>>> laserson@cloudera.com
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> http://parquet.github.com/
>>>> >>>>> ---
>>>> >>>>> You received this message because you are subscribed to the Google
>>>> >>>>> Groups "Parquet" group.
>>>> >>>>> To post to this group, send email to parquet-dev@googlegroups.com
>>>> .
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Uri Laserson, PhD
>>>> >>>> Data Scientist, Cloudera
>>>> >>>> Twitter/GitHub: @laserson
>>>> >>>> +1 617 910 0447
>>>> >>>> laserson@cloudera.com
>>>> >>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Prashant
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Uri Laserson, PhD
>>>> >> Data Scientist, Cloudera
>>>> >> Twitter/GitHub: @laserson
>>>> >> +1 617 910 0447
>>>> >> laserson@cloudera.com
>>>> >>
>>>> >> --
>>>> >> http://parquet.github.com/
>>>> >> ---
>>>> >> You received this message because you are subscribed to the Google
>>>> Groups
>>>> >> "Parquet" group.
>>>> >> To post to this group, send email to parquet-dev@googlegroups.com.
>>>> >
>>>> >
>>>> > --
>>>> > http://parquet.github.com/
>>>> > ---
>>>> > You received this message because you are subscribed to the Google
>>>> Groups
>>>> > "Parquet" group.
>>>> > To post to this group, send email to parquet-dev@googlegroups.com.
>>>>
>>>
>>>
>>
>>
>> --
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
>>
>> --
>> http://parquet.github.com/
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Parquet" group.
>> To post to this group, send email to parquet-dev@googlegroups.com.
>>
>
>

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Julien Le Dem <ju...@twitter.com>.
Spark uses Kryo under the hood for serializing objects, which may need
configuration to handle Avro records automatically.
The fine Scalding authors made chill, a library that adds a bunch of
custom serializations to Kryo:
https://github.com/twitter/chill
Which Spark is using now:
https://github.com/mesos/spark/pull/732

So possibly this is the way to investigate and add Avro support?
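
For anyone who wants to experiment with that route before it lands anywhere,
wiring a custom registrator into a standalone 0.9-era job looks roughly like
this (a sketch; com.example.AvroGenericRecordRegistrator stands in for
whatever registrator class you write yourself, and both properties must be
set before the SparkContext is created):

import org.apache.spark.SparkContext

// Spark 0.9 picks these up from Java system properties at context creation.
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "com.example.AvroGenericRecordRegistrator")

val sc = new SparkContext("local", "parquet-avro-test")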




On Thu, Feb 6, 2014 at 10:31 AM, Uri Laserson <la...@cloudera.com> wrote:

> You're a good man, Julien.  Commenting out those lines from the pom.xml
> fixed the problem.  I can now create an RDD of GenericRecord objects, map
> them to strings, and spit out the corresponding JSON from the interactive
> spark-shell.
>
> Just to recap how it works for me:
>
> Set the SPARK_CLASSPATH to include the necessary Parquet jars:
> export
> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-encoding/target/parquet-encoding-1.3.3-SNAPSHOT.jar"
>
> Run the Spark shell:
> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>
> Then:
>
> val job = new Job(sc.hadoopConfiguration)
> ParquetInputFormat.setReadSupportClass(job,
> classOf[AvroReadSupport[GenericRecord]])
> val records1 =
> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
> classOf[GenericRecord], job.getConfiguration)
> val records2 = records1.map(p => p._2)
> val records3 = records2.map(p => p.toString)
>
> One thing that I now need to work out is that while records3 results in an
> RDD of JSON strings, records2 gives me java.io.NotSerializableException:
> org.apache.avro.generic.GenericData$Record, which I find surprising.  Is
> this expected behavior?
>
> Thanks for all the help!
> Uri
>
>
>
>
> On Thu, Feb 6, 2014 at 9:30 AM, Julien Le Dem <ju...@twitter.com> wrote:
>
>> Then probably just removing those 4 lines would fix it as parquet-avro
>> does not use jackson outside of that (I think):
>>
>> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/pom.xml#L101
>> Can you try it out?
>> Thanks
>>
>>
>> On Thu, Feb 6, 2014 at 9:25 AM, Tom White <to...@cloudera.com> wrote:
>>
>>> On Thu, Feb 6, 2014 at 5:13 PM, Julien Le Dem <ju...@twitter.com>
>>> wrote:
>>> > Hi Uri,
>>> > Parquet shades Jackson to avoid dependency conflicts with Hadoop.
>>> Hadoop
>>> > depends on an ancient version of Jackson, also Parquet works with
>>> several
>>> > versions of Hadoop independently of what jackson version they pull.
>>> >
>>> > It appears that this creates a problem in parquet-avro here:
>>> >
>>> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L191
>>> >
>>> > notice that NullNode here is a org.codehaus.jackson.node.NullNode which
>>> > looks weird to me as we are building an Avro schema. Why would we use a
>>> > Jackson type in there?
>>>
>>> Avro uses Jackson to construct default values since they are expressed
>>> as JSON objects.
>>>
>>> >
>>> > I see 2 solutions:
>>> >  - parquet-avro should not shade Jackson (but really I don't see why we
>>> > depend on Jackson at all here)
>>>
>>> This is probably the best solution here, assuming Hadoop, Spark etc
>>> all use the same (or at least compatible) versions of Jackson.
>>>
>>> >  - AvroSchemaConverter should not depend on jackson.
>>> >
>>> > Do you know why the Avro abstraction is leaking jackson here?
>>>
>>> Unfortunately Avro does leak the Jackson dependency. There's been a
>>> bit of discussion about avoiding this
>>> (https://issues.apache.org/jira/browse/AVRO-1126), but not patches
>>> yet.
>>>
>>> Tom
>>>
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Feb 6, 2014 at 12:40 AM, Uri Laserson <la...@cloudera.com>
>>> wrote:
>>> >>
>>> >> I am skeptical that will solve my problem, though.  Either way, I just
>>> >> pulled the latest master and built that, and the same problem remains.
>>> >>
>>> >>
>>> >> On Wed, Feb 5, 2014 at 7:50 PM, Prashant Sharma <scrapcodes@gmail.com
>>> >
>>> >> wrote:
>>> >>>
>>> >>> That cloneRecords parameter is gone, so either use the released
>>> 0.9.0 or
>>> >>> the current master.
>>> >>>
>>> >>>
>>> >>> On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft
>>> >>> <fn...@berkeley.edu> wrote:
>>> >>>>
>>> >>>> Uri,
>>> >>>>
>>> >>>> Er, yes, it is the cloneRecords, and when I said true, I meant
>>> false...
>>> >>>> Apologies for the misdirection there.
>>> >>>>
>>> >>>>
>>> >>>> Regards,
>>> >>>>
>>> >>>> Frank Austin Nothaft
>>> >>>> fnothaft@berkeley.edu
>>> >>>> fnothaft@eecs.berkeley.edu
>>> >>>> 202-340-0466
>>> >>>>
>>> >>>> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com>
>>> wrote:
>>> >>>>
>>> >>>> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the
>>> time
>>> >>>> (like a week or two ago).
>>> >>>>
>>> >>>> If you're referring to the cloneRecords parameter, it appears to
>>> default
>>> >>>> to true, but even when I add it explicitly, I get the same error.
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft
>>> >>>> <fn...@berkeley.edu> wrote:
>>> >>>>>
>>> >>>>> Uri,
>>> >>>>>
>>> >>>>> Which version of Spark are you running? If it is >0.9.0, you need
>>> to
>>> >>>>> add an optional true argument at the end of the
>>> sc.newApiHadoopFile(...) call
>>> >>>>> to read Parquet data.
>>> >>>>>
>>> >>>>> Regards,
>>> >>>>>
>>> >>>>> Frank Austin Nothaft
>>> >>>>> fnothaft@berkeley.edu
>>> >>>>> fnothaft@eecs.berkeley.edu
>>> >>>>> 202-340-0466
>>> >>>>>
>>> >>>>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com>
>>> wrote:
>>> >>>>>
>>> >>>>> I am cross-posting on the parquet mailing list.  Short recap: I am
>>> >>>>> trying to read Parquet data from the spark interactive shell.
>>> >>>>>
>>> >>>>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>>> >>>>>
>>> >>>>> export
>>> >>>>>
>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>>> >>>>>
>>> >>>>> From the spark-shell, I run:
>>> >>>>>
>>> >>>>> val job = new Job(sc.hadoopConfiguration)
>>> >>>>> ParquetInputFormat.setReadSupportClass(job,
>>> >>>>> classOf[AvroReadSupport[GenericRecord]])
>>> >>>>> val records1 =
>>> >>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>> >>>>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>>> >>>>> classOf[GenericRecord], job.getConfiguration)
>>> >>>>>
>>> >>>>> Then I try
>>> >>>>>
>>> >>>>> records1.count
>>> >>>>>
>>> >>>>> Which gives the following error:
>>> >>>>>
>>> >>>>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>>> >>>>> java.lang.NoSuchMethodError:
>>> >>>>>
>>> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>>> >>>>> at
>>> >>>>>
>>> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>>> >>>>> at
>>> >>>>>
>>> parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>>> >>>>> at
>>> parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>>> >>>>> at
>>> >>>>>
>>> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>>> >>>>> at
>>> >>>>>
>>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>>> >>>>> at
>>> >>>>>
>>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>>> >>>>> at
>>> >>>>>
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>>> >>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>>> >>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>>> >>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>> >>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>> >>>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>>> >>>>> at org.apache.spark.scheduler.Task.run(Task.scala:53)
>>> >>>>> at
>>> >>>>>
>>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>> >>>>> at
>>> >>>>>
>>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>> >>>>> at
>>> >>>>>
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>> >>>>> at
>>> >>>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >>>>> at
>>> >>>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>>>> at java.lang.Thread.run(Thread.java:744)
>>> >>>>>
>>> >>>>>
>>> >>>>> My hypothesis is that this a shading problem.  It appears that the
>>> code
>>> >>>>> is trying to call a constructor that looks like this:
>>> >>>>>
>>> >>>>> Schema.Field(String, Schema, String,
>>> >>>>> parquet.org.codehaus.jackson.JsonNode)
>>> >>>>>
>>> >>>>> but the signature from the spark-assembly jar is
>>> >>>>>
>>> >>>>> public org.apache.avro.Schema$Field(java.lang.String,
>>> >>>>> org.apache.avro.Schema, java.lang.String,
>>> org.codehaus.jackson.JsonNode);
>>> >>>>>
>>> >>>>> Where do I go from here?
>>> >>>>>
>>> >>>>> Uri
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <
>>> laserson@cloudera.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> Yep, I did not include that jar in the class path.  Now I've got
>>> some
>>> >>>>>> "real" errors to try to work through.  Thanks!
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <jey@cs.berkeley.edu
>>> >
>>> >>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi Uri,
>>> >>>>>>>
>>> >>>>>>> Could you try adding the parquet-jackson JAR to your classpath?
>>> There
>>> >>>>>>> may possibly be other parquet-avro dependencies that are missing
>>> too.
>>> >>>>>>>
>>> >>>>>>>
>>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>>> >>>>>>>
>>> >>>>>>> -Jey
>>> >>>>>>>
>>> >>>>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <
>>> laserson@cloudera.com>
>>> >>>>>>> wrote:
>>> >>>>>>> > Yes, of course.  That class is a jackson class, and I'm not
>>> sure
>>> >>>>>>> > why it's
>>> >>>>>>> > being referred to as
>>> >>>>>>> > parquet.org.codehaus.jackson.JsonGenerationException.
>>> >>>>>>> >
>>> >>>>>>> > org.codehaus.jackson.JsonGenerationException is on the
>>> classpath.
>>> >>>>>>> > But not
>>> >>>>>>> > when it's prefixed by parquet.
>>> >>>>>>> >
>>> >>>>>>> >
>>> >>>>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <
>>> andrew@andrewash.com>
>>> >>>>>>> > wrote:
>>> >>>>>>> >>
>>> >>>>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to
>>> >>>>>>> >> confirm that
>>> >>>>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class
>>> exists
>>> >>>>>>> >> in one of
>>> >>>>>>> >> them?
>>> >>>>>>> >>
>>> >>>>>>> >>
>>> >>>>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson
>>> >>>>>>> >> <la...@cloudera.com>
>>> >>>>>>> >> wrote:
>>> >>>>>>> >>>
>>> >>>>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>>> >>>>>>> >>> GenericRecords
>>> >>>>>>> >>> from a Parquet file. I'm having a bit of trouble with
>>> respect to
>>> >>>>>>> >>> dependencies.  My latest attempt looks like this:
>>> >>>>>>> >>>
>>> >>>>>>> >>> export
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>>> >>>>>>> >>>
>>> >>>>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>> >>>>>>> >>>
>>> >>>>>>> >>> Then in the shell:
>>> >>>>>>> >>>
>>> >>>>>>> >>> val records1 =
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>> >>>>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>>> >>>>>>> >>> classOf[IndexedRecord],
>>> >>>>>>> >>> sc.hadoopConfiguration)
>>> >>>>>>> >>> records1.collect
>>> >>>>>>> >>>
>>> >>>>>>> >>> At which point it barfs:
>>> >>>>>>> >>>
>>> >>>>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
>>> >>>>>>> >>> process : 3
>>> >>>>>>> >>> SLF4J: Failed to load class
>>> "org.slf4j.impl.StaticLoggerBinder".
>>> >>>>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>> >>>>>>> >>> SLF4J: See
>>> http://www.slf4j.org/codes.html#StaticLoggerBinder for
>>> >>>>>>> >>> further
>>> >>>>>>> >>> details.
>>> >>>>>>> >>> java.io.IOException: Could not read footer:
>>> >>>>>>> >>> java.lang.NoClassDefFoundError:
>>> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>>> >>>>>>> >>> at scala.Option.getOrElse(Option.scala:120)
>>> >>>>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>>> >>>>>>> >>> at
>>> org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>>> >>>>>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>>> >>>>>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>>> >>>>>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>>> >>>>>>> >>> at $iwC$$iwC.<init>(<console>:27)
>>> >>>>>>> >>> at $iwC.<init>(<console>:29)
>>> >>>>>>> >>> at <init>(<console>:31)
>>> >>>>>>> >>> at .<init>(<console>:35)
>>> >>>>>>> >>> at .<clinit>(<console>)
>>> >>>>>>> >>> at .<init>(<console>:7)
>>> >>>>>>> >>> at .<clinit>(<console>)
>>> >>>>>>> >>> at $print(<console>)
>>> >>>>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >>>>>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>>> >>>>>>> >>> at
>>> org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>>> >>>>>>> >>> at
>>> org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>> >>>>>>> >>> at
>>> org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>>> >>>>>>> >>> at
>>> org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>>> >>>>>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>> >>>>>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>>> >>>>>>> >>> Caused by: java.lang.NoClassDefFoundError:
>>> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>>> >>>>>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >>>>>>> >>> at
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>>>>>> >>> at java.lang.Thread.run(Thread.java:744)
>>> >>>>>>> >>> Caused by: java.lang.ClassNotFoundException:
>>> >>>>>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>>> >>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> >>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> >>>>>>> >>> at java.security.AccessController.doPrivileged(Native Method)
>>> >>>>>>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> >>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>> >>>>>>> >>> at
>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>> >>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>> >>>>>>> >>> ... 9 more
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> >>>>>>> >>> --
>>> >>>>>>> >>> Uri Laserson, PhD
>>> >>>>>>> >>> Data Scientist, Cloudera
>>> >>>>>>> >>> Twitter/GitHub: @laserson
>>> >>>>>>> >>> +1 617 910 0447
>>> >>>>>>> >>> laserson@cloudera.com
>>> >>>>>>> >>
>>> >>>>>>> >>
>>> >>>>>>> >
>>> >>>>>>> >
>>> >>>>>>> >
>>> >>>>>>> > --
>>> >>>>>>> > Uri Laserson, PhD
>>> >>>>>>> > Data Scientist, Cloudera
>>> >>>>>>> > Twitter/GitHub: @laserson
>>> >>>>>>> > +1 617 910 0447
>>> >>>>>>> > laserson@cloudera.com
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Uri Laserson, PhD
>>> >>>>>> Data Scientist, Cloudera
>>> >>>>>> Twitter/GitHub: @laserson
>>> >>>>>> +1 617 910 0447
>>> >>>>>> laserson@cloudera.com
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> Uri Laserson, PhD
>>> >>>>> Data Scientist, Cloudera
>>> >>>>> Twitter/GitHub: @laserson
>>> >>>>> +1 617 910 0447
>>> >>>>> laserson@cloudera.com
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> http://parquet.github.com/
>>> >>>>> ---
>>> >>>>> You received this message because you are subscribed to the Google
>>> >>>>> Groups "Parquet" group.
>>> >>>>> To post to this group, send email to parquet-dev@googlegroups.com.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Uri Laserson, PhD
>>> >>>> Data Scientist, Cloudera
>>> >>>> Twitter/GitHub: @laserson
>>> >>>> +1 617 910 0447
>>> >>>> laserson@cloudera.com
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Prashant
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Uri Laserson, PhD
>>> >> Data Scientist, Cloudera
>>> >> Twitter/GitHub: @laserson
>>> >> +1 617 910 0447
>>> >> laserson@cloudera.com
>>> >>
>>> >> --
>>> >> http://parquet.github.com/
>>> >> ---
>>> >> You received this message because you are subscribed to the Google
>>> Groups
>>> >> "Parquet" group.
>>> >> To post to this group, send email to parquet-dev@googlegroups.com.
>>> >
>>> >
>>> > --
>>> > http://parquet.github.com/
>>> > ---
>>> > You received this message because you are subscribed to the Google
>>> Groups
>>> > "Parquet" group.
>>> > To post to this group, send email to parquet-dev@googlegroups.com.
>>>
>>
>>
>
>
> --
> Uri Laserson, PhD
> Data Scientist, Cloudera
> Twitter/GitHub: @laserson
> +1 617 910 0447
> laserson@cloudera.com
>
> --
> http://parquet.github.com/
> ---
> You received this message because you are subscribed to the Google Groups
> "Parquet" group.
> To post to this group, send email to parquet-dev@googlegroups.com.
>

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Uri Laserson <la...@cloudera.com>.
You're a good man, Julien.  Commenting out those lines from the pom.xml
fixed the problem.  I can now create an RDD of GenericRecord objects, map
them to strings, and spit out the corresponding JSON from the interactive
spark-shell.

Just to recap how it works for me:

Set the SPARK_CLASSPATH to include the necessary Parquet jars:
export
SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-encoding/target/parquet-encoding-1.3.3-SNAPSHOT.jar"

Run the Spark shell:
MASTER=local ~/repos/incubator-spark/bin/spark-shell

Then:

val job = new Job(sc.hadoopConfiguration)
ParquetInputFormat.setReadSupportClass(job,
classOf[AvroReadSupport[GenericRecord]])
val records1 =
sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
classOf[GenericRecord], job.getConfiguration)
val records2 = records1.map(p => p._2)
val records3 = records2.map(p => p.toString)

One thing I still need to work out: while records3 results in an RDD of JSON
strings, records2 gives me java.io.NotSerializableException:
org.apache.avro.generic.GenericData$Record, which I find surprising.  Is
this expected behavior?
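
A minimal sketch of a possible workaround, assuming the records only need to
leave the executors as plain, Java-serializable values (the field name below is
hypothetical):

// Convert to something serializable inside the transformations, as records3
// already does with toString; here pulling out a single (hypothetical) field.
val fieldValues = records2.map(rec => String.valueOf(rec.get("some_field")))
fieldValues.take(10).foreach(println)

// The more general fix would be Kryo serialization with a registrator for
// GenericData.Record (see the registrator sketch alongside Julien's chill
// suggestion), enabled before the shell starts, e.g. in Spark 0.9:
//   System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   System.setProperty("spark.kryo.registrator", "mypackage.AvroKryoRegistrator")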

Thanks for all the help!
Uri




On Thu, Feb 6, 2014 at 9:30 AM, Julien Le Dem <ju...@twitter.com> wrote:

> Then probably just removing those 4 lines would fix it as parquet-avro
> does not use jackson outside of that (I think):
>
> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/pom.xml#L101
> Can you try it out?
> Thanks
>
>
> On Thu, Feb 6, 2014 at 9:25 AM, Tom White <to...@cloudera.com> wrote:
>
>> On Thu, Feb 6, 2014 at 5:13 PM, Julien Le Dem <ju...@twitter.com> wrote:
>> > Hi Uri,
>> > Parquet shades Jackson to avoid dependency conflicts with Hadoop. Hadoop
>> > depends on an ancient version of Jackson, also Parquet works with
>> several
>> > versions of Hadoop independently of what jackson version they pull.
>> >
>> > It appears that this creates a problem in parquet-avro here:
>> >
>> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L191
>> >
>> > notice that NullNode here is a org.codehaus.jackson.node.NullNode which
>> > looks weird to me as we are building an Avro schema. Why would we use a
>> > Jackson type in there?
>>
>> Avro uses Jackson to construct default values since they are expressed
>> as JSON objects.
>>
>> >
>> > I see 2 solutions:
>> >  - parquet-avro should not shade Jackson (but really I don't see why we
>> > depend on Jackson at all here)
>>
>> This is probably the best solution here, assuming Hadoop, Spark etc
>> all use the same (or at least compatible) versions of Jackson.
>>
>> >  - AvroSchemaConverter should not depend on jackson.
>> >
>> > Do you know why the Avro abstraction is leaking jackson here?
>>
>> Unfortunately Avro does leak the Jackson dependency. There's been a
>> bit of discussion about avoiding this
>> (https://issues.apache.org/jira/browse/AVRO-1126), but not patches
>> yet.
>>
>> Tom
>>
>> >
>> >
>> >
>> >
>> > On Thu, Feb 6, 2014 at 12:40 AM, Uri Laserson <la...@cloudera.com>
>> wrote:
>> >>
>> >> I am skeptical that will solve my problem, though.  Either way, I just
>> >> pulled the latest master and built that, and the same problem remains.
>> >>
>> >>
>> >> On Wed, Feb 5, 2014 at 7:50 PM, Prashant Sharma <sc...@gmail.com>
>> >> wrote:
>> >>>
>> >>> That cloneRecords parameter is gone, so either use the released 0.9.0
>> or
>> >>> the current master.
>> >>>
>> >>>
>> >>> On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft
>> >>> <fn...@berkeley.edu> wrote:
>> >>>>
>> >>>> Uri,
>> >>>>
>> >>>> Er, yes, it is the cloneRecords, and when I said true, I meant
>> false...
>> >>>> Apologies for the misdirection there.
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> Frank Austin Nothaft
>> >>>> fnothaft@berkeley.edu
>> >>>> fnothaft@eecs.berkeley.edu
>> >>>> 202-340-0466
>> >>>>
>> >>>> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com>
>> wrote:
>> >>>>
>> >>>> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the
>> time
>> >>>> (like a week or two ago).
>> >>>>
>> >>>> If you're referring to the cloneRecords parameter, it appears to
>> default
>> >>>> to true, but even when I add it explicitly, I get the same error.
>> >>>>
>> >>>>
>> >>>> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft
>> >>>> <fn...@berkeley.edu> wrote:
>> >>>>>
>> >>>>> Uri,
>> >>>>>
>> >>>>> Which version of Spark are you running? If it is >0.9.0, you need to
>> >>>>> add an optional true argument at the end of the
>> sc.newApiHadoopFile(...) call
>> >>>>> to read Parquet data.
>> >>>>>
>> >>>>> Regards,
>> >>>>>
>> >>>>> Frank Austin Nothaft
>> >>>>> fnothaft@berkeley.edu
>> >>>>> fnothaft@eecs.berkeley.edu
>> >>>>> 202-340-0466
>> >>>>>
>> >>>>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com>
>> wrote:
>> >>>>>
>> >>>>> I am cross-posting on the parquet mailing list.  Short recap: I am
>> >>>>> trying to read Parquet data from the spark interactive shell.
>> >>>>>
>> >>>>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>> >>>>>
>> >>>>> export
>> >>>>>
>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>> >>>>>
>> >>>>> From the spark-shell, I run:
>> >>>>>
>> >>>>> val job = new Job(sc.hadoopConfiguration)
>> >>>>> ParquetInputFormat.setReadSupportClass(job,
>> >>>>> classOf[AvroReadSupport[GenericRecord]])
>> >>>>> val records1 =
>> >>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>> >>>>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>> >>>>> classOf[GenericRecord], job.getConfiguration)
>> >>>>>
>> >>>>> Then I try
>> >>>>>
>> >>>>> records1.count
>> >>>>>
>> >>>>> Which gives the following error:
>> >>>>>
>> >>>>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>> >>>>> java.lang.NoSuchMethodError:
>> >>>>>
>> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>> >>>>> at
>> >>>>>
>> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>> >>>>> at
>> >>>>>
>> parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>> >>>>> at
>> parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>> >>>>> at
>> >>>>>
>> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>> >>>>> at
>> >>>>>
>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>> >>>>> at
>> >>>>>
>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>> >>>>> at
>> >>>>>
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>> >>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>> >>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>> >>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>> >>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>> >>>>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>> >>>>> at org.apache.spark.scheduler.Task.run(Task.scala:53)
>> >>>>> at
>> >>>>>
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>> >>>>> at
>> >>>>>
>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>> >>>>> at
>> >>>>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> >>>>> at
>> >>>>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>>>> at
>> >>>>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>>>> at java.lang.Thread.run(Thread.java:744)
>> >>>>>
>> >>>>>
>> >>>>> My hypothesis is that this a shading problem.  It appears that the
>> code
>> >>>>> is trying to call a constructor that looks like this:
>> >>>>>
>> >>>>> Schema.Field(String, Schema, String,
>> >>>>> parquet.org.codehaus.jackson.JsonNode)
>> >>>>>
>> >>>>> but the signature from the spark-assembly jar is
>> >>>>>
>> >>>>> public org.apache.avro.Schema$Field(java.lang.String,
>> >>>>> org.apache.avro.Schema, java.lang.String,
>> org.codehaus.jackson.JsonNode);
>> >>>>>
>> >>>>> Where do I go from here?
>> >>>>>
>> >>>>> Uri
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <laserson@cloudera.com
>> >
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Yep, I did not include that jar in the class path.  Now I've got
>> some
>> >>>>>> "real" errors to try to work through.  Thanks!
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> Hi Uri,
>> >>>>>>>
>> >>>>>>> Could you try adding the parquet-jackson JAR to your classpath?
>> There
>> >>>>>>> may possibly be other parquet-avro dependencies that are missing
>> too.
>> >>>>>>>
>> >>>>>>>
>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>> >>>>>>>
>> >>>>>>> -Jey
>> >>>>>>>
>> >>>>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <
>> laserson@cloudera.com>
>> >>>>>>> wrote:
>> >>>>>>> > Yes, of course.  That class is a jackson class, and I'm not sure
>> >>>>>>> > why it's
>> >>>>>>> > being referred to as
>> >>>>>>> > parquet.org.codehaus.jackson.JsonGenerationException.
>> >>>>>>> >
>> >>>>>>> > org.codehaus.jackson.JsonGenerationException is on the
>> classpath.
>> >>>>>>> > But not
>> >>>>>>> > when it's prefixed by parquet.
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <
>> andrew@andrewash.com>
>> >>>>>>> > wrote:
>> >>>>>>> >>
>> >>>>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to
>> >>>>>>> >> confirm that
>> >>>>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class
>> exists
>> >>>>>>> >> in one of
>> >>>>>>> >> them?
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson
>> >>>>>>> >> <la...@cloudera.com>
>> >>>>>>> >> wrote:
>> >>>>>>> >>>
>> >>>>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>> >>>>>>> >>> GenericRecords
>> >>>>>>> >>> from a Parquet file. I'm having a bit of trouble with respect
>> to
>> >>>>>>> >>> dependencies.  My latest attempt looks like this:
>> >>>>>>> >>>
>> >>>>>>> >>> export
>> >>>>>>> >>>
>> >>>>>>> >>>
>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>> >>>>>>> >>>
>> >>>>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>> >>>>>>> >>>
>> >>>>>>> >>> Then in the shell:
>> >>>>>>> >>>
>> >>>>>>> >>> val records1 =
>> >>>>>>> >>>
>> >>>>>>> >>>
>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>> >>>>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>> >>>>>>> >>> classOf[IndexedRecord],
>> >>>>>>> >>> sc.hadoopConfiguration)
>> >>>>>>> >>> records1.collect
>> >>>>>>> >>>
>> >>>>>>> >>> At which point it barfs:
>> >>>>>>> >>>
>> >>>>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
>> >>>>>>> >>> process : 3
>> >>>>>>> >>> SLF4J: Failed to load class
>> "org.slf4j.impl.StaticLoggerBinder".
>> >>>>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> >>>>>>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
>> >>>>>>> >>> further details.
>> >>>>>>> >>> java.io.IOException: Could not read footer:
>> >>>>>>> >>> java.lang.NoClassDefFoundError:
>> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>> >>>>>>> >>> at scala.Option.getOrElse(Option.scala:120)
>> >>>>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>> >>>>>>> >>> at
>> org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>> >>>>>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>> >>>>>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>> >>>>>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>> >>>>>>> >>> at $iwC$$iwC.<init>(<console>:27)
>> >>>>>>> >>> at $iwC.<init>(<console>:29)
>> >>>>>>> >>> at <init>(<console>:31)
>> >>>>>>> >>> at .<init>(<console>:35)
>> >>>>>>> >>> at .<clinit>(<console>)
>> >>>>>>> >>> at .<init>(<console>:7)
>> >>>>>>> >>> at .<clinit>(<console>)
>> >>>>>>> >>> at $print(<console>)
>> >>>>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>>>>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>> >>>>>>> >>> at
>> org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>> >>>>>>> >>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>> >>>>>>> >>> at
>> org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>> >>>>>>> >>> at
>> org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>> >>>>>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>> >>>>>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>> >>>>>>> >>> Caused by: java.lang.NoClassDefFoundError:
>> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>> >>>>>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>>>>>> >>> at
>> >>>>>>> >>>
>> >>>>>>> >>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>>>>>> >>> at java.lang.Thread.run(Thread.java:744)
>> >>>>>>> >>> Caused by: java.lang.ClassNotFoundException:
>> >>>>>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>> >>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> >>>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> >>>>>>> >>> at java.security.AccessController.doPrivileged(Native Method)
>> >>>>>>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> >>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> >>>>>>> >>> at
>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> >>>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> >>>>>>> >>> ... 9 more
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>> --
>> >>>>>>> >>> Uri Laserson, PhD
>> >>>>>>> >>> Data Scientist, Cloudera
>> >>>>>>> >>> Twitter/GitHub: @laserson
>> >>>>>>> >>> +1 617 910 0447
>> >>>>>>> >>> laserson@cloudera.com
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > Uri Laserson, PhD
>> >>>>>>> > Data Scientist, Cloudera
>> >>>>>>> > Twitter/GitHub: @laserson
>> >>>>>>> > +1 617 910 0447
>> >>>>>>> > laserson@cloudera.com
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Uri Laserson, PhD
>> >>>>>> Data Scientist, Cloudera
>> >>>>>> Twitter/GitHub: @laserson
>> >>>>>> +1 617 910 0447
>> >>>>>> laserson@cloudera.com
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Uri Laserson, PhD
>> >>>>> Data Scientist, Cloudera
>> >>>>> Twitter/GitHub: @laserson
>> >>>>> +1 617 910 0447
>> >>>>> laserson@cloudera.com
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> http://parquet.github.com/
>> >>>>> ---
>> >>>>> You received this message because you are subscribed to the Google
>> >>>>> Groups "Parquet" group.
>> >>>>> To post to this group, send email to parquet-dev@googlegroups.com.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Uri Laserson, PhD
>> >>>> Data Scientist, Cloudera
>> >>>> Twitter/GitHub: @laserson
>> >>>> +1 617 910 0447
>> >>>> laserson@cloudera.com
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Prashant
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Uri Laserson, PhD
>> >> Data Scientist, Cloudera
>> >> Twitter/GitHub: @laserson
>> >> +1 617 910 0447
>> >> laserson@cloudera.com
>> >>
>> >> --
>> >> http://parquet.github.com/
>> >> ---
>> >> You received this message because you are subscribed to the Google
>> Groups
>> >> "Parquet" group.
>> >> To post to this group, send email to parquet-dev@googlegroups.com.
>> >
>> >
>> > --
>> > http://parquet.github.com/
>> > ---
>> > You received this message because you are subscribed to the Google
>> Groups
>> > "Parquet" group.
>> > To post to this group, send email to parquet-dev@googlegroups.com.
>>
>
>


-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Julien Le Dem <ju...@twitter.com>.
Then probably just removing those four lines would fix it, since parquet-avro
does not use Jackson outside of that (I think):
https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/pom.xml#L101
Can you try it out?
Thanks


On Thu, Feb 6, 2014 at 9:25 AM, Tom White <to...@cloudera.com> wrote:

> On Thu, Feb 6, 2014 at 5:13 PM, Julien Le Dem <ju...@twitter.com> wrote:
> > Hi Uri,
> > Parquet shades Jackson to avoid dependency conflicts with Hadoop. Hadoop
> > depends on an ancient version of Jackson, also Parquet works with several
> > versions of Hadoop independently of what jackson version they pull.
> >
> > It appears that this creates a problem in parquet-avro here:
> >
> https://github.com/Parquet/parquet-mr/blob/137b1e292eacbccb06c9723e9b86d2259045b860/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L191
> >
> > notice that NullNode here is a org.codehaus.jackson.node.NullNode which
> > looks weird to me as we are building an Avro schema. Why would we use a
> > Jackson type in there?
>
> Avro uses Jackson to construct default values since they are expressed
> as JSON objects.
>
> >
> > I see 2 solutions:
> >  - parquet-avro should not shade Jackson (but really I don't see why we
> > depend on Jackson at all here)
>
> This is probably the best solution here, assuming Hadoop, Spark etc
> all use the same (or at least compatible) versions of Jackson.
>
> >  - AvroSchemaConverter should not depend on jackson.
> >
> > Do you know why the Avro abstraction is leaking jackson here?
>
> Unfortunately Avro does leak the Jackson dependency. There's been a
> bit of discussion about avoiding this
> (https://issues.apache.org/jira/browse/AVRO-1126), but not patches
> yet.
>
> Tom
>
> >
> >
> >
> >
> > On Thu, Feb 6, 2014 at 12:40 AM, Uri Laserson <la...@cloudera.com>
> wrote:
> >>
> >> I am skeptical that will solve my problem, though.  Either way, I just
> >> pulled the latest master and built that, and the same problem remains.
> >>
> >>
> >> On Wed, Feb 5, 2014 at 7:50 PM, Prashant Sharma <sc...@gmail.com>
> >> wrote:
> >>>
> >>> That cloneRecords parameter is gone, so either use the released 0.9.0
> or
> >>> the current master.
> >>>
> >>>
> >>> On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft
> >>> <fn...@berkeley.edu> wrote:
> >>>>
> >>>> Uri,
> >>>>
> >>>> Er, yes, it is the cloneRecords, and when I said true, I meant
> false...
> >>>> Apologies for the misdirection there.
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Frank Austin Nothaft
> >>>> fnothaft@berkeley.edu
> >>>> fnothaft@eecs.berkeley.edu
> >>>> 202-340-0466
> >>>>
> >>>> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com>
> wrote:
> >>>>
> >>>> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the time
> >>>> (like a week or two ago).
> >>>>
> >>>> If you're referring to the cloneRecords parameter, it appears to
> default
> >>>> to true, but even when I add it explicitly, I get the same error.
> >>>>
> >>>>
> >>>> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft
> >>>> <fn...@berkeley.edu> wrote:
> >>>>>
> >>>>> Uri,
> >>>>>
> >>>>> Which version of Spark are you running? If it is >0.9.0, you need to
> >>>>> add an optional true argument at the end of the
> sc.newApiHadoopFile(...) call
> >>>>> to read Parquet data.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Frank Austin Nothaft
> >>>>> fnothaft@berkeley.edu
> >>>>> fnothaft@eecs.berkeley.edu
> >>>>> 202-340-0466
> >>>>>
> >>>>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com>
> wrote:
> >>>>>
> >>>>> I am cross-posting on the parquet mailing list.  Short recap: I am
> >>>>> trying to read Parquet data from the spark interactive shell.
> >>>>>
> >>>>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
> >>>>>
> >>>>> export
> >>>>>
> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
> >>>>>
> >>>>> From the spark-shell, I run:
> >>>>>
> >>>>> val job = new Job(sc.hadoopConfiguration)
> >>>>> ParquetInputFormat.setReadSupportClass(job,
> >>>>> classOf[AvroReadSupport[GenericRecord]])
> >>>>> val records1 =
> >>>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
> >>>>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
> >>>>> classOf[GenericRecord], job.getConfiguration)
> >>>>>
> >>>>> Then I try
> >>>>>
> >>>>> records1.count
> >>>>>
> >>>>> Which gives the following error:
> >>>>>
> >>>>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
> >>>>> java.lang.NoSuchMethodError:
> >>>>>
> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
> >>>>> at
> >>>>>
> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
> >>>>> at
> >>>>>
> parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
> >>>>> at
> parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
> >>>>> at
> >>>>>
> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
> >>>>> at
> >>>>>
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
> >>>>> at
> >>>>>
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
> >>>>> at
> >>>>>
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
> >>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
> >>>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
> >>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> >>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> >>>>> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
> >>>>> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> >>>>> at
> >>>>>
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> >>>>> at
> >>>>>
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
> >>>>> at
> >>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> >>>>> at
> >>>>>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>>>> at
> >>>>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>>> at java.lang.Thread.run(Thread.java:744)
> >>>>>
> >>>>>
> >>>>> My hypothesis is that this a shading problem.  It appears that the
> code
> >>>>> is trying to call a constructor that looks like this:
> >>>>>
> >>>>> Schema.Field(String, Schema, String,
> >>>>> parquet.org.codehaus.jackson.JsonNode)
> >>>>>
> >>>>> but the signature from the spark-assembly jar is
> >>>>>
> >>>>> public org.apache.avro.Schema$Field(java.lang.String,
> >>>>> org.apache.avro.Schema, java.lang.String,
> org.codehaus.jackson.JsonNode);
> >>>>>
> >>>>> Where do I go from here?
> >>>>>
> >>>>> Uri
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <la...@cloudera.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Yep, I did not include that jar in the class path.  Now I've got
> some
> >>>>>> "real" errors to try to work through.  Thanks!
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi Uri,
> >>>>>>>
> >>>>>>> Could you try adding the parquet-jackson JAR to your classpath?
> There
> >>>>>>> may possibly be other parquet-avro dependencies that are missing
> too.
> >>>>>>>
> >>>>>>>
> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
> >>>>>>>
> >>>>>>> -Jey
> >>>>>>>
> >>>>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <
> laserson@cloudera.com>
> >>>>>>> wrote:
> >>>>>>> > Yes, of course.  That class is a jackson class, and I'm not sure
> >>>>>>> > why it's
> >>>>>>> > being referred to as
> >>>>>>> > parquet.org.codehaus.jackson.JsonGenerationException.
> >>>>>>> >
> >>>>>>> > org.codehaus.jackson.JsonGenerationException is on the classpath.
> >>>>>>> > But not
> >>>>>>> > when it's prefixed by parquet.
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <
> andrew@andrewash.com>
> >>>>>>> > wrote:
> >>>>>>> >>
> >>>>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to
> >>>>>>> >> confirm that
> >>>>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class
> exists
> >>>>>>> >> in one of
> >>>>>>> >> them?
> >>>>>>> >>
> >>>>>>> >>
> >>>>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson
> >>>>>>> >> <la...@cloudera.com>
> >>>>>>> >> wrote:
> >>>>>>> >>>
> >>>>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
> >>>>>>> >>> GenericRecords
> >>>>>>> >>> from a Parquet file. I'm having a bit of trouble with respect
> to
> >>>>>>> >>> dependencies.  My latest attempt looks like this:
> >>>>>>> >>>
> >>>>>>> >>> export
> >>>>>>> >>>
> >>>>>>> >>>
> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
> >>>>>>> >>>
> >>>>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
> >>>>>>> >>>
> >>>>>>> >>> Then in the shell:
> >>>>>>> >>>
> >>>>>>> >>> val records1 =
> >>>>>>> >>>
> >>>>>>> >>>
> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
> >>>>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
> >>>>>>> >>> classOf[IndexedRecord],
> >>>>>>> >>> sc.hadoopConfiguration)
> >>>>>>> >>> records1.collect
> >>>>>>> >>>
> >>>>>>> >>> At which point it barfs:
> >>>>>>> >>>
> >>>>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
> >>>>>>> >>> process : 3
> >>>>>>> >>> SLF4J: Failed to load class
> "org.slf4j.impl.StaticLoggerBinder".
> >>>>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
> >>>>>>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
> >>>>>>> >>> further details.
> >>>>>>> >>> java.io.IOException: Could not read footer:
> >>>>>>> >>> java.lang.NoClassDefFoundError:
> >>>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
> >>>>>>> >>> at
> >>>>>>> >>>
> >>>>>>> >>>
> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
> >>>>>>> >>> at
> >>>>>>> >>>
> >>>>>>> >>>
> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
> >>>>>>> >>> at
> >>>>>>> >>>
> >>>>>>> >>>
> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
> >>>>>>> >>> at
> >>>>>>> >>>
> >>>>>>> >>>
> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
> >>>>>>> >>> at
> >>>>>>> >>>
> >>>>>>> >>>
> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
> >>>>>>> >>> at
> >>>>>>> >>>
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
> >>>>>>> >>> at
> >>>>>>> >>>
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
> >>>>>>> >>> at
> >>>>>>> >>>
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> >>>>>>> >>> at scala.Option.getOrElse(Option.scala:120)
> >>>>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Uri Laserson <la...@cloudera.com>.
I am skeptical that it will solve my problem, though.  Either way, I just
pulled the latest master and built that, and the same problem remains.


On Wed, Feb 5, 2014 at 7:50 PM, Prashant Sharma <sc...@gmail.com>wrote:

> That cloneRecords parameter is gone, so either use the released 0.9.0 or
> the current master.
>
>
> On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft <
> fnothaft@berkeley.edu> wrote:
>
>> Uri,
>>
>> Er, yes, it is the cloneRecords, and when I said true, I meant false…
>> Apologies for the misdirection there.
>>
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466
>>
>> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com> wrote:
>>
>> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the time
>> (like a week or two ago).
>>
>> If you're referring to the cloneRecords parameter, it appears to default
>> to true, but even when I add it explicitly, I get the same error.
>>
>>
>> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft <
>> fnothaft@berkeley.edu> wrote:
>>
>>> Uri,
>>>
>>> Which version of Spark are you running? If it is >0.9.0, you need to add
>>> an optional true argument at the end of the sc.newApiHadoopFile(…) call to
>>> read Parquet data.
>>>
>>> Regards,
>>>
>>> Frank Austin Nothaft
>>> fnothaft@berkeley.edu
>>> fnothaft@eecs.berkeley.edu
>>> 202-340-0466
>>>
>>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com> wrote:
>>>
>>> I am cross-posting on the parquet mailing list.  Short recap: I am
>>> trying to read Parquet data from the spark interactive shell.
>>>
>>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>>>
>>> export
>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>>>
>>> From the spark-shell, I run:
>>>
>>> val job = new Job(sc.hadoopConfiguration)
>>> ParquetInputFormat.setReadSupportClass(job,
>>> classOf[AvroReadSupport[GenericRecord]])
>>> val records1 =
>>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>>> classOf[GenericRecord], job.getConfiguration)
>>>
>>> Then I try
>>>
>>> records1.count
>>>
>>> Which gives the following error:
>>>
>>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>>> java.lang.NoSuchMethodError:
>>> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>>>  at
>>> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>>> at parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>>>  at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>>> at
>>> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>>>  at
>>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>>> at
>>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>>>  at
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>>>  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>>>  at org.apache.spark.scheduler.Task.run(Task.scala:53)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>>  at
>>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>>  at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>  at java.lang.Thread.run(Thread.java:744)
>>>
>>>
>>> My hypothesis is that this is a shading problem.  It appears that the code
>>> is trying to call a constructor that looks like this:
>>>
>>> Schema.Field(String, Schema, String, parquet.org.codehaus.jackson.JsonNode)
>>>
>>> but the signature from the spark-assembly jar is
>>>
>>> public org.apache.avro.Schema$Field(java.lang.String,
>>> org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode);
>>>
>>> Where do I go from here?
>>>
>>> Uri
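
For what it's worth, one way to confirm the mismatch described above is to dump the Schema$Field class from each jar with javap. The jar names below are illustrative stand-ins for whatever spark-assembly and parquet-avro builds are actually on the classpath:

# constructors of Schema.Field as bundled in the Spark assembly
javap -classpath spark-assembly-0.9.0-SNAPSHOT.jar 'org.apache.avro.Schema$Field'

# disassemble parquet-avro's converter and look for references to the relocated
# parquet.org.codehaus.jackson.JsonNode type around the Schema$Field constructor call
javap -c -classpath parquet-avro-1.3.3-SNAPSHOT.jar parquet.avro.AvroSchemaConverter

If the assembly only exposes constructors taking org.codehaus.jackson.JsonNode while parquet-avro references the parquet.-prefixed type, the NoSuchMethodError above follows directly.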
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <la...@cloudera.com>wrote:
>>>
>>>> Yep, I did not include that jar in the class path.  Now I've got some
>>>> "real" errors to try to work through.  Thanks!
>>>>
>>>>
>>>>  On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu>wrote:
>>>>
>>>>> Hi Uri,
>>>>>
>>>>> Could you try adding the parquet-jackson JAR to your classpath? There
>>>>> may possibly be other parquet-avro dependencies that are missing too.
>>>>>
>>>>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>>>>>
>>>>> -Jey
>>>>>
>>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <la...@cloudera.com>
>>>>> wrote:
>>>>> > Yes, of course.  That class is a jackson class, and I'm not sure why
>>>>> it's
>>>>> > being referred to as
>>>>> parquet.org.codehaus.jackson.JsonGenerationException.
>>>>> >
>>>>> > org.codehaus.jackson.JsonGenerationException is on the classpath.
>>>>>  But not
>>>>> > when it's prefixed by parquet.
>>>>> >
>>>>> >
>>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <an...@andrewash.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to confirm
>>>>> that
>>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class exists
>>>>> in one of
>>>>> >> them?
>>>>> >>
>>>>> >>
>>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson <
>>>>> laserson@cloudera.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>>>>> GenericRecords
>>>>> >>> from a Parquet file. I'm having a bit of trouble with respect to
>>>>> >>> dependencies.  My latest attempt looks like this:
>>>>> >>>
>>>>> >>> export
>>>>> >>>
>>>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>>>>> >>>
>>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>>>> >>>
>>>>> >>> Then in the shell:
>>>>> >>>
>>>>> >>> val records1 =
>>>>> >>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>>>>> classOf[IndexedRecord],
>>>>> >>> sc.hadoopConfiguration)
>>>>> >>> records1.collect
>>>>> >>>
>>>>> >>> At which point it barfs:
>>>>> >>>
>>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
>>>>> process : 3
>>>>> >>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
>>>>> further
>>>>> >>> details.
>>>>> >>> java.io.IOException: Could not read footer:
>>>>> >>> java.lang.NoClassDefFoundError:
>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>>> >>> at
>>>>> >>>
>>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>>>>> >>> at
>>>>> >>>
>>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>>>>> >>> at
>>>>> >>>
>>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>>>>> >>> at
>>>>> >>>
>>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>>>>> >>> at
>>>>> >>>
>>>>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>>>>> >>> at
>>>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>>>>> >>> at
>>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>>>>> >>> at
>>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>>>>> >>> at scala.Option.getOrElse(Option.scala:120)
>>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>>>>> >>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>>>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>>>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>>>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>>>>> >>> at $iwC$$iwC.<init>(<console>:27)
>>>>> >>> at $iwC.<init>(<console>:29)
>>>>> >>> at <init>(<console>:31)
>>>>> >>> at .<init>(<console>:35)
>>>>> >>> at .<clinit>(<console>)
>>>>> >>> at .<init>(<console>:7)
>>>>> >>> at .<clinit>(<console>)
>>>>> >>> at $print(<console>)
>>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> >>> at
>>>>> >>>
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> >>> at
>>>>> >>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>>>>> >>> at
>>>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>>>>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>>>>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>>>>> >>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>>>>> >>> at
>>>>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>>>>> >>> at
>>>>> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>>>>> >>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>>> >>> at
>>>>> >>>
>>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>>> >>> at
>>>>> >>>
>>>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>>>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>>>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>>>>> >>> Caused by: java.lang.NoClassDefFoundError:
>>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>>> >>> at
>>>>> >>>
>>>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>>>>> >>> at
>>>>> >>>
>>>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>>>>> >>> at
>>>>> >>>
>>>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>>>>> >>> at
>>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>>>>> >>> at
>>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>>>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>> >>> at
>>>>> >>>
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>> >>> at
>>>>> >>>
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>> >>> at java.lang.Thread.run(Thread.java:744)
>>>>> >>> Caused by: java.lang.ClassNotFoundException:
>>>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>>> >>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>>> >>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>>> >>> ... 9 more
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Uri Laserson, PhD
>>>>> >>> Data Scientist, Cloudera
>>>>> >>> Twitter/GitHub: @laserson
>>>>> >>> +1 617 910 0447
>>>>> >>> laserson@cloudera.com
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Uri Laserson, PhD
>>>>> > Data Scientist, Cloudera
>>>>> > Twitter/GitHub: @laserson
>>>>> > +1 617 910 0447
>>>>> > laserson@cloudera.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Uri Laserson, PhD
>>>> Data Scientist, Cloudera
>>>> Twitter/GitHub: @laserson
>>>> +1 617 910 0447
>>>> laserson@cloudera.com
>>>>
>>>
>>>
>>>
>>> --
>>> Uri Laserson, PhD
>>> Data Scientist, Cloudera
>>> Twitter/GitHub: @laserson
>>> +1 617 910 0447
>>> laserson@cloudera.com
>>>
>>>
>>>
>>> --
>>> http://parquet.github.com/
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "Parquet" group.
>>> To post to this group, send email to parquet-dev@googlegroups.com.
>>>
>>
>>
>>
>> --
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
>>
>>
>>
>
>
> --
> Prashant
>



-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Prashant Sharma <sc...@gmail.com>.
That cloneRecords parameter is gone, so either use the released 0.9.0 or
the current master.
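
For reference, on the released 0.9.0 the read is just the plain five-argument newAPIHadoopFile call. The sketch below reuses the AvroReadSupport setup and test path from earlier in the thread, with the imports spelled out so it can be pasted into a fresh spark-shell:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

val job = new Job(sc.hadoopConfiguration)
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

// (path, inputFormatClass, keyClass, valueClass, conf) -- no cloneRecords argument here
val records = sc.newAPIHadoopFile(
  "/Users/laserson/temp/test-parquet/alltypeuri",
  classOf[ParquetInputFormat[GenericRecord]],
  classOf[Void],
  classOf[GenericRecord],
  job.getConfiguration)

records.count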


On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft
<fn...@berkeley.edu>wrote:

> Uri,
>
> Er, yes, it is the cloneRecords, and when I said true, I meant false...
> Apologies for the misdirection there.
>
>
> Regards,
>
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466
>
> On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com> wrote:
>
> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the time
> (like a week or two ago).
>
> If you're referring to the cloneRecords parameter, it appears to default
> to true, but even when I add it explicitly, I get the same error.
>
>
> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft <
> fnothaft@berkeley.edu> wrote:
>
>> Uri,
>>
>> Which version of Spark are you running? If it is >0.9.0, you need to add
>> an optional true argument at the end of the sc.newApiHadoopFile(...) call to
>> read Parquet data.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466
>>
>> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com> wrote:
>>
>> I am cross-posting on the parquet mailing list.  Short recap: I am trying
>> to read Parquet data from the spark interactive shell.
>>
>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>>
>> export
>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>>
>> From the spark-shell, I run:
>>
>> val job = new Job(sc.hadoopConfiguration)
>> ParquetInputFormat.setReadSupportClass(job,
>> classOf[AvroReadSupport[GenericRecord]])
>> val records1 =
>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>> classOf[ParquetInputFormat[GenericRecord]], classOf[Void],
>> classOf[GenericRecord], job.getConfiguration)
>>
>> Then I try
>>
>> records1.count
>>
>> Which gives the following error:
>>
>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>> java.lang.NoSuchMethodError:
>> org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>>  at
>> parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>> at parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>>  at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>> at
>> parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>>  at
>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>> at
>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>>  at
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>>  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:53)
>> at
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>  at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>  at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:744)
>>
>>
>> My hypothesis is that this is a shading problem.  It appears that the code
>> is trying to call a constructor that looks like this:
>>
>> Schema.Field(String, Schema, String, parquet.org.codehaus.jackson.JsonNode)
>>
>> but the signature from the spark-assembly jar is
>>
>> public org.apache.avro.Schema$Field(java.lang.String,
>> org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode);
>>
>> Where do I go from here?
>>
>> Uri
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <la...@cloudera.com>wrote:
>>
>>> Yep, I did not include that jar in the class path.  Now I've got some
>>> "real" errors to try to work through.  Thanks!
>>>
>>>
>>>  On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu>wrote:
>>>
>>>> Hi Uri,
>>>>
>>>> Could you try adding the parquet-jackson JAR to your classpath? There
>>>> may possibly be other parquet-avro dependencies that are missing too.
>>>>
>>>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>>>>
>>>> -Jey
>>>>
>>>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <la...@cloudera.com>
>>>> wrote:
>>>> > Yes, of course.  That class is a jackson class, and I'm not sure why
>>>> it's
>>>> > being referred to as
>>>> parquet.org.codehaus.jackson.JsonGenerationException.
>>>> >
>>>> > org.codehaus.jackson.JsonGenerationException is on the classpath.
>>>>  But not
>>>> > when it's prefixed by parquet.
>>>> >
>>>> >
>>>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <an...@andrewash.com>
>>>> wrote:
>>>> >>
>>>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to confirm
>>>> that
>>>> >> parquet/org/codehaus/jackson/JsonGenerationException.class exists in
>>>> one of
>>>> >> them?
>>>> >>
>>>> >>
>>>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson <laserson@cloudera.com
>>>> >
>>>> >> wrote:
>>>> >>>
>>>> >>> Has anyone tried this?  I'd like to read a bunch of Avro
>>>> GenericRecords
>>>> >>> from a Parquet file. I'm having a bit of trouble with respect to
>>>> >>> dependencies.  My latest attempt looks like this:
>>>> >>>
>>>> >>> export
>>>> >>>
>>>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>>>> >>>
>>>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>>>> >>>
>>>> >>> Then in the shell:
>>>> >>>
>>>> >>> val records1 =
>>>> >>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>>>> >>> classOf[AvroParquetInputFormat], classOf[Void],
>>>> classOf[IndexedRecord],
>>>> >>> sc.hadoopConfiguration)
>>>> >>> records1.collect
>>>> >>>
>>>> >>> At which point it barfs:
>>>> >>>
>>>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to
>>>> process : 3
>>>> >>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
>>>> further
>>>> >>> details.
>>>> >>> java.io.IOException: Could not read footer:
>>>> >>> java.lang.NoClassDefFoundError:
>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>> >>> at
>>>> >>>
>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>>>> >>> at
>>>> >>>
>>>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>>>> >>> at
>>>> >>>
>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>>>> >>> at
>>>> >>>
>>>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>>>> >>> at
>>>> >>>
>>>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>>>> >>> at
>>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>>>> >>> at
>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>>>> >>> at
>>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>>>> >>> at scala.Option.getOrElse(Option.scala:120)
>>>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>>>> >>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>>>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>>>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>>>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>>>> >>> at $iwC$$iwC.<init>(<console>:27)
>>>> >>> at $iwC.<init>(<console>:29)
>>>> >>> at <init>(<console>:31)
>>>> >>> at .<init>(<console>:35)
>>>> >>> at .<clinit>(<console>)
>>>> >>> at .<init>(<console>:7)
>>>> >>> at .<clinit>(<console>)
>>>> >>> at $print(<console>)
>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> >>> at
>>>> >>>
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> >>> at
>>>> >>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>>>> >>> at
>>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>>>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>>>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>>>> >>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>>>> >>> at
>>>> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>>>> >>> at
>>>> org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>>>> >>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>> >>> at
>>>> >>>
>>>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>>>> >>> at
>>>> >>>
>>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>>>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>>>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>>>> >>> Caused by: java.lang.NoClassDefFoundError:
>>>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>>>> >>> at
>>>> >>>
>>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>>>> >>> at
>>>> >>>
>>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>>>> >>> at
>>>> >>>
>>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>>>> >>> at
>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>>>> >>> at
>>>> parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>>>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>> >>> at
>>>> >>>
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >>> at
>>>> >>>
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >>> at java.lang.Thread.run(Thread.java:744)
>>>> >>> Caused by: java.lang.ClassNotFoundException:
>>>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>> >>> at java.security.AccessController.doPrivileged(Native Method)
>>>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>> >>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>> >>> ... 9 more
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Uri Laserson, PhD
>>>> >>> Data Scientist, Cloudera
>>>> >>> Twitter/GitHub: @laserson
>>>> >>> +1 617 910 0447
>>>> >>> laserson@cloudera.com
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Uri Laserson, PhD
>>>> > Data Scientist, Cloudera
>>>> > Twitter/GitHub: @laserson
>>>> > +1 617 910 0447
>>>> > laserson@cloudera.com
>>>>
>>>
>>>
>>>
>>> --
>>> Uri Laserson, PhD
>>> Data Scientist, Cloudera
>>> Twitter/GitHub: @laserson
>>> +1 617 910 0447
>>> laserson@cloudera.com
>>>
>>
>>
>>
>> --
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
>>
>>
>>
>> --
>> http://parquet.github.com/
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Parquet" group.
>> To post to this group, send email to parquet-dev@googlegroups.com.
>>
>
>
>
> --
> Uri Laserson, PhD
> Data Scientist, Cloudera
> Twitter/GitHub: @laserson
> +1 617 910 0447
> laserson@cloudera.com
>
>
>


-- 
Prashant

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Uri,

Er, yes, it is the cloneRecords, and when I said true, I meant false… Apologies for the misdirection there.
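
For anyone still on one of the snapshot builds that exposed the flag, the corrected call would look roughly like the sketch below; job is the same Job carrying the AvroReadSupport configuration from the earlier messages, and the released 0.9.0 and current master drop the parameter again, so there the plain five-argument form applies.

// only on 0.9.0-SNAPSHOT builds that still take cloneRecords as a trailing argument
val records = sc.newAPIHadoopFile(
  "/Users/laserson/temp/test-parquet/alltypeuri",
  classOf[ParquetInputFormat[GenericRecord]],
  classOf[Void],
  classOf[GenericRecord],
  job.getConfiguration,
  false)  // cloneRecords = false, per the correction above

records.count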

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Feb 5, 2014, at 7:44 PM, Uri Laserson <la...@cloudera.com> wrote:

> My spark is 0.9.0-SNAPSHOT, built from wherever master was at the time (like a week or two ago).
> 
> If you're referring to the cloneRecords parameter, it appears to default to true, but even when I add it explicitly, I get the same error.
> 
> 
> On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft <fn...@berkeley.edu> wrote:
> Uri,
> 
> Which version of Spark are you running? If it is >0.9.0, you need to add an optional true argument at the end of the sc.newApiHadoopFile(…) call to read Parquet data.
> 
> Regards,
> 
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466
> 
> On Feb 5, 2014, at 7:14 PM, Uri Laserson <la...@cloudera.com> wrote:
> 
>> I am cross-posting on the parquet mailing list.  Short recap: I am trying to read Parquet data from the spark interactive shell.
>> 
>> I have added all the necessary parquet jars to SPARK_CLASSPATH:
>> 
>> export SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
>> 
>> From the spark-shell, I run:
>> 
>> val job = new Job(sc.hadoopConfiguration)
>> ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])
>> val records1 = sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri", classOf[ParquetInputFormat[GenericRecord]], classOf[Void], classOf[GenericRecord], job.getConfiguration)
>> 
>> Then I try
>> 
>> records1.count
>> 
>> Which gives the following error:
>> 
>> 14/02/05 18:42:22 ERROR Executor: Exception in task ID 1
>> java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lparquet/org/codehaus/jackson/JsonNode;)V
>> 	at parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:191)
>> 	at parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:177)
>> 	at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:86)
>> 	at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>> 	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:106)
>> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:94)
>> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>> 	at org.apache.spark.scheduler.Task.run(Task.scala:53)
>> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>> 	at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> 	at java.lang.Thread.run(Thread.java:744)
>> 
>> 
>> My hypothesis is that this is a shading problem.  It appears that the code is trying to call a constructor that looks like this:
>> 
>> Schema.Field(String, Schema, String, parquet.org.codehaus.jackson.JsonNode)
>> 
>> but the signature from the spark-assembly jar is
>> 
>> public org.apache.avro.Schema$Field(java.lang.String, org.apache.avro.Schema, java.lang.String, org.codehaus.jackson.JsonNode);
>> 
>> Where do I go from here?
>> 
>> Uri
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Feb 5, 2014 at 5:02 PM, Uri Laserson <la...@cloudera.com> wrote:
>> Yep, I did not include that jar in the class path.  Now I've got some "real" errors to try to work through.  Thanks!
>> 
>> 
>> On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam <je...@cs.berkeley.edu> wrote:
>> Hi Uri,
>> 
>> Could you try adding the parquet-jackson JAR to your classpath? There
>> may possibly be other parquet-avro dependencies that are missing too.
>> 
>> http://mvnrepository.com/artifact/com.twitter/parquet-jackson/1.3.2
>> 
>> -Jey
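
Concretely, that means appending the parquet-jackson jar to the SPARK_CLASSPATH export; the local snapshot path below mirrors the one in the full export quoted elsewhere in this thread, so substitute whichever build is actually present:

export SPARK_CLASSPATH="$SPARK_CLASSPATH:/Users/laserson/repos/parquet-mr/parquet-jackson/target/parquet-jackson-1.3.3-SNAPSHOT.jar"
# parquet-jackson carries the relocated parquet.org.codehaus.jackson classes that
# parquet-hadoop's footer reader fails to load in the trace below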
>> 
>> On Wed, Feb 5, 2014 at 3:02 PM, Uri Laserson <la...@cloudera.com> wrote:
>> > Yes, of course.  That class is a jackson class, and I'm not sure why it's
>> > being referred to as parquet.org.codehaus.jackson.JsonGenerationException.
>> >
>> > org.codehaus.jackson.JsonGenerationException is on the classpath.  But not
>> > when it's prefixed by parquet.
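
A quick sanity check from the running spark-shell (just a sketch) is to ask the JVM for the relocated class directly; it keeps throwing until a jar that actually contains the parquet.-prefixed classes, i.e. parquet-jackson, is on the classpath:

// throws ClassNotFoundException while the relocated Jackson classes are missing
Class.forName("parquet.org.codehaus.jackson.JsonGenerationException")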
>> >
>> >
>> > On Wed, Feb 5, 2014 at 12:06 PM, Andrew Ash <an...@andrewash.com> wrote:
>> >>
>> >> I'm assuming you checked all the jars in SPARK_CLASSPATH to confirm that
>> >> parquet/org/codehaus/jackson/JsonGenerationException.class exists in one of
>> >> them?
>> >>
>> >>
>> >> On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson <la...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> Has anyone tried this?  I'd like to read a bunch of Avro GenericRecords
>> >>> from a Parquet file. I'm having a bit of trouble with respect to
>> >>> dependencies.  My latest attempt looks like this:
>> >>>
>> >>> export
>> >>> SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-common/target/parquet-common-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-column/target/parquet-column-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-format/target/parquet-format-2.0.1-SNAPSHOT.jar"
>> >>>
>> >>> MASTER=local ~/repos/incubator-spark/bin/spark-shell
>> >>>
>> >>> Then in the shell:
>> >>>
>> >>> val records1 =
>> >>> sc.newAPIHadoopFile("/Users/laserson/temp/test-parquet/alltypeuri",
>> >>> classOf[AvroParquetInputFormat], classOf[Void], classOf[IndexedRecord],
>> >>> sc.hadoopConfiguration)
>> >>> records1.collect
>> >>>
>> >>> At which point it barfs:
>> >>>
>> >>> 14/02/05 12:02:32 INFO FileInputFormat: Total input paths to process : 3
>> >>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> >>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> >>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
>> >>> details.
>> >>> java.io.IOException: Could not read footer:
>> >>> java.lang.NoClassDefFoundError:
>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>> >>> at
>> >>> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:189)
>> >>> at
>> >>> parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:145)
>> >>> at
>> >>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:354)
>> >>> at
>> >>> parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:339)
>> >>> at
>> >>> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:246)
>> >>> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:85)
>> >>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>> >>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>> >>> at scala.Option.getOrElse(Option.scala:120)
>> >>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>> >>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:863)
>> >>> at org.apache.spark.rdd.RDD.collect(RDD.scala:602)
>> >>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>> >>> at $iwC$$iwC$$iwC.<init>(<console>:25)
>> >>> at $iwC$$iwC.<init>(<console>:27)
>> >>> at $iwC.<init>(<console>:29)
>> >>> at <init>(<console>:31)
>> >>> at .<init>(<console>:35)
>> >>> at .<clinit>(<console>)
>> >>> at .<init>(<console>:7)
>> >>> at .<clinit>(<console>)
>> >>> at $print(<console>)
>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>> at
>> >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >>> at
>> >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>> at java.lang.reflect.Method.invoke(Method.java:606)
>> >>> at
>> >>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>> >>> at
>> >>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>> >>> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>> >>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>> >>> at
>> >>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
>> >>> at
>> >>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
>> >>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
>> >>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
>> >>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
>> >>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
>> >>> at
>> >>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
>> >>> at
>> >>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>> >>> at
>> >>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
>> >>> at
>> >>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
>> >>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
>> >>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>> >>> at org.apache.spark.repl.Main.main(Main.scala)
>> >>> Caused by: java.lang.NoClassDefFoundError:
>> >>> parquet/org/codehaus/jackson/JsonGenerationException
>> >>> at
>> >>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:359)
>> >>> at
>> >>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:312)
>> >>> at
>> >>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:295)
>> >>> at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:179)
>> >>> at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:175)
>> >>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >>> at
>> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>> at
>> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>> at java.lang.Thread.run(Thread.java:744)
>> >>> Caused by: java.lang.ClassNotFoundException:
>> >>> parquet.org.codehaus.jackson.JsonGenerationException
>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> >>> at java.security.AccessController.doPrivileged(Native Method)
>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> >>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> >>> ... 9 more
>> >>>
>> >>>
>> >>> --
>> >>> Uri Laserson, PhD
>> >>> Data Scientist, Cloudera
>> >>> Twitter/GitHub: @laserson
>> >>> +1 617 910 0447
>> >>> laserson@cloudera.com
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Uri Laserson, PhD
>> > Data Scientist, Cloudera
>> > Twitter/GitHub: @laserson
>> > +1 617 910 0447
>> > laserson@cloudera.com
>> 
>> 
>> 
>> -- 
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
>> 
>> 
>> 
>> -- 
>> Uri Laserson, PhD
>> Data Scientist, Cloudera
>> Twitter/GitHub: @laserson
>> +1 617 910 0447
>> laserson@cloudera.com
> 
> 
> -- 
> http://parquet.github.com/
> --- 
> You received this message because you are subscribed to the Google Groups "Parquet" group.
> To post to this group, send email to parquet-dev@googlegroups.com.
> 
> 
> 
> -- 
> Uri Laserson, PhD
> Data Scientist, Cloudera
> Twitter/GitHub: @laserson
> +1 617 910 0447
> laserson@cloudera.com