Posted to issues@spark.apache.org by "swetha k (JIRA)" <ji...@apache.org> on 2015/12/01 22:51:10 UTC

[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

    [ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650 ] 

swetha k edited comment on SPARK-11620 at 12/1/15 9:50 PM:
-----------------------------------------------------------

[~hyukjin.kwon]

I have the following code, based on the GitHub link at the end, that saves the Parquet files
from my hourly batch to HDFS. The WARNING message I get is shown in the previous comments. Any idea why this is happening?

        // Imports for the pre-org.apache.parquet ("parquet.*") package layout used in this issue
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.mapreduce.Job
        import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
        import parquet.hadoop.ParquetOutputFormat

        val job = Job.getInstance()
        val filePath = "path"
        val metricsPath: Path = new Path(filePath)

        // Delete the output path if it already exists
        val fs: FileSystem = FileSystem.get(job.getConfiguration)
        if (fs.exists(metricsPath)) {
          fs.delete(metricsPath, true)
        }

        // Configure the ParquetOutputFormat to use Avro as the serialization format
        ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
        // You need to pass the schema to AvroParquet when you are writing objects,
        // but not when you are reading them. The schema is saved in the Parquet file
        // for future readers to use.
        AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

        // Create a PairRDD with all keys set to null and wrap each Metrics
        // in a serializable object
        val metricsToBeSaved = metrics.map(metricRecord =>
          (null, new SerializableMetrics(
            new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

        // coalesce() returns a new RDD, so keep the result instead of discarding it
        val coalescedMetrics = metricsToBeSaved.coalesce(1500)

        // Save the RDD to a Parquet file in our temporary output directory
        coalescedMetrics.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
          classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
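
For reference, since the comment in the code notes that the schema only has to be passed when writing, here is a minimal read-back sketch in the spirit of the linked example; it assumes sc is the SparkContext and reuses filePath and the generated Metrics Avro class from the code above:

        import org.apache.hadoop.mapreduce.Job
        import parquet.avro.AvroReadSupport
        import parquet.hadoop.ParquetInputFormat

        val readJob = Job.getInstance()
        // Materialize records through Avro; no schema is passed here because
        // readers pick it up from the Parquet file footers.
        ParquetInputFormat.setReadSupportClass(readJob, classOf[AvroReadSupport[Metrics]])

        val readBack = sc.newAPIHadoopFile(
          filePath,
          classOf[ParquetInputFormat[Metrics]],
          classOf[Void],
          classOf[Metrics],
          readJob.getConfiguration)

        // Print a few records to verify the round trip
        readBack.map(_._2).take(5).foreach(println)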


> parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11620
>                 URL: https://issues.apache.org/jira/browse/SPARK-11620
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org