Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/06/05 11:31:00 UTC
[jira] [Commented] (SPARK-8121) When using with Hadoop 1.x,
"spark.sql.parquet.output.committer.class" is overridden by
"spark.sql.sources.outputCommitterClass"
[ https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574198#comment-14574198 ]
Apache Spark commented on SPARK-8121:
-------------------------------------
User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6669
> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and {{spark.sql.sources.outputCommitterClass}} is configured, {{spark.sql.parquet.output.committer.class}} will be overridden.
> For example, if {{spark.sql.parquet.output.committer.class}} is set to {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor {{_common_metadata}} will be written because {{FileOutputCommitter}} overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the {{TaskAttemptContext}} before calling {{ParquetRelation2.prepareForWriteJob()}}, which is what sets up the Parquet output committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} constructor clones the job configuration, so the context doesn't share the job configuration later passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}; the Parquet summary files ({{_metadata}} and {{_common_metadata}}) are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-00001.gz.parquet
> ├── part-r-00002.gz.parquet
> ├── part-r-00003.gz.parquet
> ├── part-r-00004.gz.parquet
> ├── part-r-00005.gz.parquet
> ├── part-r-00006.gz.parquet
> ├── part-r-00007.gz.parquet
> └── part-r-00008.gz.parquet
> {noformat}
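The clone-before-configure ordering described above can be illustrated with a minimal, self-contained Scala sketch. The {{Conf}} class below is a hypothetical stand-in for Hadoop's {{Configuration}} (it is not the real Hadoop API); it only models the relevant behavior, namely that in Hadoop 1.x constructing a {{TaskAttemptContext}} snapshots the job configuration, so properties set on the original configuration afterwards never reach the task side:

```scala
// Hypothetical stand-in for Hadoop's Configuration: cloning takes a
// snapshot of the current entries, as the Hadoop 1.x
// TaskAttemptContext constructor effectively does.
class Conf(init: Map[String, String] = Map.empty) {
  private var entries: Map[String, String] = init
  def set(k: String, v: String): Unit = entries += (k -> v)
  def get(k: String): Option[String] = entries.get(k)
  def cloneConf(): Conf = new Conf(entries)
}

val jobConf = new Conf()

// Buggy order: the task-side configuration is cloned first
// (this models initializing the TaskAttemptContext)...
val taskConf = jobConf.cloneConf()

// ...and only then does prepareForWriteJob() set the Parquet
// committer class on the job configuration.
jobConf.set("spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// The cloned task-side configuration never sees the setting,
// so the default committer wins.
assert(jobConf.get("spark.sql.parquet.output.committer.class").isDefined)
assert(taskConf.get("spark.sql.parquet.output.committer.class").isEmpty)
```

Swapping the two steps (configure first, then clone) makes the setting visible on the task side, which is exactly what reordering the two lines in {{commands.scala}} achieves.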
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org