Posted to issues@spark.apache.org by "swetha k (JIRA)" <ji...@apache.org> on 2015/12/01 22:50:10 UTC
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650 ]
swetha k commented on SPARK-11620:
----------------------------------
[~hyukjin.kwon]
I have the following code, which saves Parquet files to HDFS in my hourly
batch. It is based on the GitHub example linked at the end.
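import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapreduce.Job
import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
import parquet.hadoop.ParquetOutputFormat

val job = Job.getInstance()
val filePath = "path"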
val metricsPath: Path = new Path(filePath)
// Delete the output path if it already exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)
if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}
// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
// You need to pass the schema to AvroParquet when you are writing objects,
// but not when you are reading them. The schema is saved in the Parquet file
// for future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)
// Create a PairRDD with all keys set to null and wrap each Metrics in a
// serializable object. coalesce returns a new RDD, so it is chained here
// rather than called as a separate statement whose result would be discarded.
val metricsToBeSaved = metrics
  .map(metricRecord => (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))
  .coalesce(1500)
// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
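A side note on the coalesce step in the code above: RDD.coalesce returns a new RDD and leaves the RDD it is called on unchanged, so the returned value is the one that has to be saved. Without shuffle = true it also cannot increase the number of partitions. The following is only a minimal sketch of that behaviour, assuming a local Spark context; the object name CoalesceSketch, the method partitionCounts and the sample data are made up for illustration and are not part of my job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns (partition count of the original RDD after a discarded coalesce,
  //          partition count of the RDD that coalesce returned).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, 8) // start with 8 partitions
    rdd.coalesce(2)                        // return value discarded: rdd is unchanged
    val coalesced = rdd.coalesce(2)        // keep the returned RDD instead
    (rdd.partitions.length, coalesced.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (original, coalesced) = partitionCounts(sc)
      println(s"original: $original, coalesced: $coalesced")
    } finally {
      sc.stop()
    }
  }
}
```

In the job above this would mean calling saveAsNewAPIHadoopFile on the RDD that coalesce returns, not on the RDD it was called on.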
> parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: swetha k
>