Posted to issues@spark.apache.org by "Jeremy Smith (JIRA)" <ji...@apache.org> on 2016/11/01 15:09:59 UTC

[jira] [Created] (SPARK-18199) Support appending to Parquet files

Jeremy Smith created SPARK-18199:
------------------------------------

             Summary: Support appending to Parquet files
                 Key: SPARK-18199
                 URL: https://issues.apache.org/jira/browse/SPARK-18199
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Jeremy Smith


Currently, appending to a Parquet directory simply creates new Parquet files in the directory. With many small appends (for example, in a streaming job with a short batch duration), this leads to an unbounded number of small Parquet files accumulating. These must be cleaned up periodically by removing them all and writing a new file containing all the rows.

It would be far better if Spark supported appending to the Parquet files themselves. HDFS supports this, as does Parquet:

* The Parquet footer can be read in order to obtain necessary metadata.
* The new rows can then be appended to the Parquet file as a row group.
* A new footer can then be appended containing the metadata and referencing the new row groups as well as the previously existing row groups.
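
The steps above can be sketched on a toy file layout. This is only an illustration, not Spark or Parquet library code: the footer and row groups are encoded as JSON as a stand-in for Parquet's real Thrift-encoded metadata and columnar pages, and the function names (create, append_row_group, read_footer, read_all) are hypothetical. What it borrows from the real format is the framing: the "PAR1" magic at both ends of the file and the 4-byte little-endian footer length just before the trailing magic.

```python
import io
import json
import struct

MAGIC = b"PAR1"  # real Parquet files begin and end with these 4 bytes


def write_footer(f, row_groups):
    """Write a footer listing every row group, then its length and the magic."""
    # Assumption: JSON stands in for Parquet's Thrift-encoded FileMetaData.
    footer = json.dumps({"row_groups": row_groups}).encode("utf-8")
    f.write(footer)
    f.write(struct.pack("<I", len(footer)))  # 4-byte little-endian length
    f.write(MAGIC)


def read_footer(f):
    """Step 1: read the most recent footer from the end of the file."""
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    (length,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic")
    f.seek(-(8 + length), io.SEEK_END)
    return json.loads(f.read(length).decode("utf-8"))["row_groups"]


def append_row_group(f, rows, existing=None):
    """Steps 2 and 3: append a row group, then a footer covering all groups."""
    if existing is None:
        existing = read_footer(f)
    f.seek(0, io.SEEK_END)  # append-only: earlier bytes are never rewritten
    offset = f.tell()
    data = json.dumps(rows).encode("utf-8")  # toy row-group encoding
    f.write(data)
    write_footer(f, existing + [{"offset": offset, "length": len(data)}])


def create(f, rows):
    """Start a new file with the leading magic and a first row group."""
    f.write(MAGIC)
    append_row_group(f, rows, existing=[])


def read_all(f):
    """Read every row by following the row-group entries in the last footer."""
    rows = []
    for group in read_footer(f):
        f.seek(group["offset"])
        rows.extend(json.loads(f.read(group["length"]).decode("utf-8")))
    return rows
```

Calling create on an empty io.BytesIO() and then append_row_group on the same buffer leaves the earlier bytes untouched: the old footer stays in the file as dead bytes, and a reader that starts from the end of the file only ever sees the newest footer.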

This would result in a small amount of bloat in the file as new row groups are added (since each superseded footer stays in the file, duplicate metadata would accumulate), but it's hugely preferable to accumulating small files, which is bad for HDFS health and also eventually leads to Spark being unable to read the Parquet directory at all. Periodic rewriting of the file could still be performed in order to remove the duplicate metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org