Posted to user@spark.apache.org by Alex Nastetsky <al...@vervemobile.com> on 2015/10/19 20:14:34 UTC

writing avro parquet

Using Spark 1.5.1, Parquet 1.7.0.

I'm trying to write Avro/Parquet files. I have this code:

sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
  classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
myDF.write.parquet(outputPath)

The problem is that the write support class gets overwritten in
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation#prepareJobForWrite:

val writeSupportClass =
  if (dataSchema.map(_.dataType).forall(ParquetTypesConverter.isPrimitiveType)) {
    classOf[MutableRowWriteSupport]
  } else {
    classOf[RowWriteSupport]
  }
ParquetOutputFormat.setWriteSupportClass(job, writeSupportClass)

So it doesn't seem to actually write Avro data. When I look at the metadata
of the Parquet files it writes, it looks like this:

extra:             org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"foo","type":"string","nullable":true,"metadata":{}},{"name":"bar","type":"long","nullable":true,"metadata":{}}]}

I would expect to see something like "extra:  avro.schema" instead.
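
For reference, here is a minimal sketch of one way to dump that footer
key/value metadata from Scala with the Parquet 1.7.0 API (partFilePath is
just a placeholder for one of the written part files):

import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Read the footer of a single part file and print its key/value metadata.
// A file written through AvroWriteSupport should show an "avro.schema" entry here.
val footer = ParquetFileReader.readFooter(sc.hadoopConfiguration, new Path(partFilePath))
footer.getFileMetaData.getKeyValueMetaData.asScala.foreach {
  case (key, value) => println(s"$key = $value")
}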

Re: writing avro parquet

Posted by Alex Nastetsky <al...@vervemobile.com>.
Figured it out ... I needed to use saveAsNewAPIHadoopFile, but I was trying to
call it on myDF.rdd directly instead of converting it to a PairRDD first.
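
For anyone who finds this later, here is a minimal sketch of that approach.
It assumes MyClass is the Avro class generated for the foo/bar schema above,
and that newBuilder/setFoo/setBar are its generated builder methods (those
names are not from the original post):

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat

// AvroParquetOutputFormat installs AvroWriteSupport itself, so only the
// Avro schema needs to be set on the job.
val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

// saveAsNewAPIHadoopFile is only available on pair RDDs, so map each Row
// to a (key, value) pair first; ParquetOutputFormat ignores the key.
myDF.rdd
  .map { row =>
    val record = MyClass.newBuilder()
      .setFoo(row.getAs[String]("foo"))
      .setBar(row.getAs[Long]("bar"))
      .build()
    (null, record)
  }
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[Void],
    classOf[MyClass],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)

Files written this way should carry the avro.schema entry in their footer,
since the records go through AvroWriteSupport rather than Spark's row write
support.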
