Posted to dev@arrow.apache.org by Dave Birdsall <da...@esgyn.com> on 2019/01/28 19:57:08 UTC

Support for schema evolution of Parquet files?

Hi,

In Hive, it is possible to evolve a table's schema using ALTER TABLE ADD COLUMNS and/or ALTER TABLE REPLACE COLUMNS. These commands change the metadata for the Hive table as a whole but do not rewrite the existing files that make up the table. So, for example, if I create a Parquet table, insert some rows, do an ALTER TABLE ADD COLUMNS, and then insert some more rows, the table will consist of at least two files: the added column will be absent from at least one of them and present in at least one other.

Within the Hive shell, if one does a SELECT * FROM t on such a table, rows from files that lack the added column come back with NULL for that column.

I have tried doing the equivalent with the Arrow C++ reader, but it dumps core in FromParquetSchema (in arrow_schema.cc) because a column index runs off the end of the leaves_ vector of ColumnDescriptor objects.
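
For reference, the read path I'm exercising is essentially the following. This is a sketch against the Result-returning flavor of the Arrow C++ API found in recent releases, not my exact build; ReadOneFile is my own name, and the path argument is whatever Parquet file is being read:

    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>

    // Read a single Parquet file into an Arrow table. The Parquet-to-Arrow
    // schema conversion (the code path that reaches FromParquetSchema)
    // happens inside ReadTable.
    arrow::Result<std::shared_ptr<arrow::Table>> ReadOneFile(const std::string& path) {
      ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
      ARROW_ASSIGN_OR_RAISE(
          auto reader, parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));
      std::shared_ptr<arrow::Table> table;
      ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
      return table;
    }

Reading the pre-ALTER file on its own works; it is reading it against the evolved table metadata, where the table schema has more columns than the file, that crashes.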

So, my conclusion is that Arrow at this time does not support such schema evolution.

Is this a correct conclusion?

I was thinking about how one might add such support. I'm extremely new to the Arrow code base; I've had just a few hours of exposure to it.

Some questions that come to my mind (apologies if these are naïve):


  1.  Does the Arrow reader assume all files in a Parquet table have the same metadata? (Put another way, does it read the metadata of the first file it encounters and then reuse that when reading subsequent files?)
  2.  Is there an appropriate layer in the reader where one could simply materialize null values if a column isn't in Arrow's copy of the metadata? (A sketch of the idea follows this list. Just for fun, I'm experimenting with changing FromParquetSchema to simply ignore column indexes that run off the end of leaves_; I expect it will fail somewhere else, and I'll learn a bit more about the Arrow reader in the process.)
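
To make question 2 concrete, here is the kind of null-materializing I have in mind, sketched against a newer Arrow C++ API than the code I am debugging (arrow::MakeArrayOfNull and the Result-returning Table::AddColumn are from recent releases; PadWithNullColumn is my own name):

    #include <memory>

    #include <arrow/api.h>

    // Given a table read from a file that predates an ALTER TABLE ADD
    // COLUMNS, append an all-null column of the right type and length so
    // the table matches the evolved schema.
    arrow::Result<std::shared_ptr<arrow::Table>> PadWithNullColumn(
        const std::shared_ptr<arrow::Table>& table,
        const std::shared_ptr<arrow::Field>& missing_field) {
      ARROW_ASSIGN_OR_RAISE(
          auto nulls, arrow::MakeArrayOfNull(missing_field->type(), table->num_rows()));
      return table->AddColumn(table->num_columns(), missing_field,
                              std::make_shared<arrow::ChunkedArray>(nulls));
    }

Something like this, applied per file after the read, would give the Hive-style NULL behavior without rewriting any files.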

Thanks and kind regards,

Dave

Re: Support for schema evolution of Parquet files?

Posted by Wes McKinney <we...@gmail.com>.
Schema evolution is not accounted for at all yet in the library. One
JIRA tracking this for the C++ project is

https://issues.apache.org/jira/browse/PARQUET-810

Currently the dataset handling for multiple Parquet files is
implemented in Python. I am interested in porting this to C++ so that
it can be used from C, MATLAB, R, and Ruby. That would be one
appropriate time to handle schema evolution. So, to answer your
questions:

1) Yes.
2) You'd probably want to implement this near the surface of
parquet::arrow::FileReader. I haven't thought much about what the API
should look like. Mechanically, adding a field with all nulls is not
too difficult. If you have a proposal for the API, I would be happy
to take a look.
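
As a strawman, a dataset-level API could take roughly this shape. The names here are hypothetical at the time of this thread; recent Arrow C++ releases do ship a ConcatenateTables with a unify_schemas option of exactly this shape:

    #include <memory>
    #include <vector>

    #include <arrow/api.h>

    // Combine tables read from the individual files of an evolved dataset.
    // With unify_schemas = true, the schemas are unified first, and any
    // column missing from one table's schema comes back as all nulls.
    arrow::Result<std::shared_ptr<arrow::Table>> CombineEvolvedTables(
        const std::vector<std::shared_ptr<arrow::Table>>& per_file_tables) {
      auto options = arrow::ConcatenateTablesOptions::Defaults();
      options.unify_schemas = true;  // pad missing fields with nulls
      return arrow::ConcatenateTables(per_file_tables, options);
    }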
