Posted to issues@spark.apache.org by "swetha k (JIRA)" <ji...@apache.org> on 2015/12/01 22:50:10 UTC

[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

    [ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650 ] 

swetha k commented on SPARK-11620:
----------------------------------

[~hyukjin.kwon]

I have the following code that saves the Parquet files from my hourly batch to
HDFS. It is based on the GitHub example linked at the end.

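        // Imports the snippet relies on. The parquet.* package names are an assumption:
        // they match the pre-1.7 parquet-mr classes named in the issue title.
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.mapreduce.Job
        import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
        import parquet.hadoop.ParquetOutputFormat

        val job = Job.getInstance()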
        val filePath = "path"
        val metricsPath: Path = new Path(filePath)
        // Delete the output path if it already exists
        val fs: FileSystem = FileSystem.get(job.getConfiguration)

        if (fs.exists(metricsPath)) {
          fs.delete(metricsPath, true)
        }

        // Configure the ParquetOutputFormat to use Avro as the serialization format
        ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
        // You need to pass the schema to AvroParquet when you are writing objects but not when you
        // are reading them. The schema is saved in Parquet file for future readers to use.
        AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)


        // Create a PairRDD with all keys set to null and wrap each Metrics in serializable objects
        val metricsToBeSaved = metrics.map(metricRecord => (null,
          new SerializableMetrics(new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

        // coalesce returns a new RDD rather than modifying this one, so keep its result
        val coalescedMetrics = metricsToBeSaved.coalesce(1500)
        // Save the RDD to a Parquet file in our temporary output directory
        coalescedMetrics.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
          classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11620
>                 URL: https://issues.apache.org/jira/browse/SPARK-11620
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: swetha k
>



