Posted to dev@spark.apache.org by Gary Malouf <ma...@gmail.com> on 2014/10/31 21:49:56 UTC
Parquet Migrations
Outside of what is discussed here
<https://issues.apache.org/jira/browse/SPARK-3851> as a future solution, is
there any path for being able to modify a Parquet schema once some data has
been written? This seems like the kind of thing that should make people
pause when considering whether or not to use Parquet+Spark...
Re: Parquet Migrations
Posted by Michael Armbrust <mi...@databricks.com>.
You can't change a Parquet schema without re-encoding the data, since the
footer index data needs to be recalculated. However, you can manually do
today what SPARK-3851
<https://issues.apache.org/jira/browse/SPARK-3851> is going to do
automatically.
Consider two schemas:
Old Schema: (a: Int, b: String)
New Schema, where I've dropped and added a column: (a: Int, c: Long)
parquetFile(old).registerTempTable("old")
parquetFile(new).registerTempTable("new")
sql("""
SELECT a, b, CAST(null AS LONG) AS c FROM old UNION ALL
SELECT a, CAST(null AS STRING) AS b, c FROM new
""").registerTempTable("unifiedData")
Because of filter/column pushdown past UNIONs, this should execute as
desired even if you write more complicated queries on top of
"unifiedData". It's a little onerous, but it should work for now. This
approach can also support things like column renaming, which would be much
harder to do automatically.
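For anyone who wants to see the mechanics outside of Spark, the
unification trick above can be sketched in plain Python: each side is
padded with nulls for the columns it lacks, then the rows are
concatenated. The column names and sample rows here are illustrative, not
from the thread.

```python
def unify(rows_old, rows_new, unified_columns):
    """Concatenate two row sets, padding missing columns with None.

    Mirrors the UNION ALL with CAST(null AS ...) pattern: every output
    row carries the full unified schema.
    """
    out = []
    for rows in (rows_old, rows_new):
        for row in rows:
            # dict.get returns None for columns absent from this side,
            # which plays the role of CAST(null AS <type>) in the SQL.
            out.append({col: row.get(col) for col in unified_columns})
    return out

old = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]  # old schema: (a, b)
new = [{"a": 3, "c": 10}]                       # new schema: (a, c)

unified = unify(old, new, ["a", "b", "c"])
# each row now has keys a, b, and c; values missing on one side are None
```

This is only the logical shape of the query; in Spark the nulls must be
cast to the correct column types so both sides of the UNION agree.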
On Fri, Oct 31, 2014 at 1:49 PM, Gary Malouf <ma...@gmail.com> wrote:
> Outside of what is discussed here
> <https://issues.apache.org/jira/browse/SPARK-3851> as a future solution,
> is
> there any path for being able to modify a Parquet schema once some data has
> been written? This seems like the kind of thing that should make people
> pause when considering whether or not to use Parquet+Spark...
>