Posted to dev@parquet.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/08/16 10:24:46 UTC

[jira] [Created] (PARQUET-359) Existing _common_metadata should be deleted when ParquetOutputCommitter fails to write summary files

Cheng Lian created PARQUET-359:
----------------------------------

             Summary: Existing _common_metadata should be deleted when ParquetOutputCommitter fails to write summary files
                 Key: PARQUET-359
                 URL: https://issues.apache.org/jira/browse/PARQUET-359
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.0, 1.7.0, 1.6.0
            Reporter: Cheng Lian


{{ParquetOutputCommitter}} only deletes {{_metadata}} when it fails to write summary files. This may leave an inconsistent existing {{_common_metadata}} file behind.

This issue can be reproduced via the following Spark shell snippet:
{noformat}
import sqlContext.implicits._

val path = "file:///tmp/foo"
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
{noformat}
The second write job fails to write the summary files because the two written Parquet files contain different user-defined metadata (the Spark SQL schema). Afterwards, we can see that a stale {{_common_metadata}} file is left behind:
{noformat}
$ tree /tmp/foo
/tmp/foo
├── _SUCCESS
├── _common_metadata
├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
└── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet
{noformat}
Checking its schema, the nested group contains only the two fields from the first write job, which is wrong:
{noformat}
$ parquet-schema /tmp/foo/_common_metadata
message root {
  optional group _1 {
    optional binary _1 (UTF8);
    optional binary _2 (UTF8);
  }
}
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)