Posted to dev@spark.apache.org by Cody Koeninger <co...@koeninger.org> on 2014/10/03 22:33:40 UTC

Parquet schema migrations

Wondering if anyone has thoughts on a path forward for parquet schema
migrations, especially for people (like us) that are using raw parquet
files rather than Hive.

So far we've gotten away with reading old files, converting, and writing to
new directories, but that obviously becomes problematic above a certain
data size.
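
For concreteness, a rough sketch of that read-convert-rewrite step (paths
and column names here are invented, and it assumes the Spark 1.1 SchemaRDD
API; the CAST just gives the padded column an explicit type):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// read a directory written under the old schema (id, name)
val old = sqlContext.parquetFile("/data/events/v1")
old.registerTempTable("events_v1")

// convert to the current schema by padding the missing nullable column
val converted = sqlContext.sql(
  "SELECT id, name, CAST(NULL AS DOUBLE) AS score FROM events_v1")

// rewrite everything into a new directory; this is the step that stops
// scaling as the amount of old data grows
converted.saveAsParquetFile("/data/events/v1_rewritten")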

Re: Parquet schema migrations

Posted by Gary Malouf <ma...@gmail.com>.
Hi Michael,

Does this affect people who use Hive for their metadata store as well?  I'm
wondering if the issue is as bad as I think it is: namely, that if you
build up a year's worth of data, adding a field forces you to migrate
that entire year's data.

Gary

On Wed, Oct 8, 2014 at 5:08 PM, Cody Koeninger <co...@koeninger.org> wrote:

> On Wed, Oct 8, 2014 at 3:19 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> >
> > I was proposing you manually convert each different format into one
> > unified format  (by adding literal nulls and such for missing columns)
> and
> > then union these converted datasets.  It would be weird to have union all
> > try and do this automatically.
> >
>
>
> Sure, I was just musing on what an API for doing the merging without manual
> user input should look like / do.  I'll comment on the ticket; thanks for
> making it.
>

Re: Parquet schema migrations

Posted by Cody Koeninger <co...@koeninger.org>.
On Wed, Oct 8, 2014 at 3:19 PM, Michael Armbrust <mi...@databricks.com>
wrote:

>
> I was proposing you manually convert each different format into one
> unified format  (by adding literal nulls and such for missing columns) and
> then union these converted datasets.  It would be weird to have union all
> try and do this automatically.
>


Sure, I was just musing on what an API for doing the merging without manual
user input should look like / do.  I'll comment on the ticket; thanks for
making it.

Re: Parquet schema migrations

Posted by Michael Armbrust <mi...@databricks.com>.
>
> The kind of change we've made that probably makes most sense to support
> is adding a nullable column. I think that also implies supporting
> "removing" a nullable column, as long as you don't end up with columns of
> the same name but different types.
>

Filed here: https://issues.apache.org/jira/browse/SPARK-3851


> I'm not sure semantically that it makes sense to do schema merging as part
> of union all, and it definitely doesn't make sense to do it by default.  I
> wouldn't want two accidentally compatible schemas to get merged without
> warning.  It's also a little odd since, unlike in a normal SQL database,
> union all can happen before there are any projections or filters... e.g.
> what order do columns come back in if someone does select *?
>

I was proposing that you manually convert each different format into one
unified format (by adding literal nulls and such for missing columns) and
then union these converted datasets.  It would be weird to have union all
try to do this automatically.
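
A rough sketch of that approach against raw parquet directories (paths and
column names are invented; this assumes the Spark 1.1 SchemaRDD API and a
SQL dialect that accepts CAST(NULL AS ...)):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// old files have (id, name); newer files added a nullable score column
sqlContext.parquetFile("/data/events/v1").registerTempTable("events_v1")
sqlContext.parquetFile("/data/events/v2").registerTempTable("events_v2")

// convert each layout to the unified format, padding missing columns
// with typed null literals
val v1 = sqlContext.sql(
  "SELECT id, name, CAST(NULL AS DOUBLE) AS score FROM events_v1")
val v2 = sqlContext.sql("SELECT id, name, score FROM events_v2")

// union the converted datasets; nothing on disk gets rewritten, and
// filters/projections on the result can still push down past the union
val events = v1.unionAll(v2)
events.registerTempTable("events")
sqlContext.sql("SELECT name, score FROM events WHERE id > 100").collect()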

Re: Parquet schema migrations

Posted by Cody Koeninger <co...@koeninger.org>.
Sorry, by "raw parquet" I just meant there is no external metadata store,
only the schema written as part of the parquet format.

We've done several different kinds of changes, including column rename and
widening the data type of an existing column.  I don't think it's feasible
to support those.

The kind of change we've made that probably makes most sense to support
is adding a nullable column. I think that also implies supporting
"removing" a nullable column, as long as you don't end up with columns of
the same name but different types.

I'm not sure semantically that it makes sense to do schema merging as part
of union all, and it definitely doesn't make sense to do it by default.  I
wouldn't want two accidentally compatible schemas to get merged without
warning.  It's also a little odd since, unlike in a normal SQL database,
union all can happen before there are any projections or filters... e.g.
what order do columns come back in if someone does select *?

It seems like there should be either a separate API call or an optional
argument to union all.
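
To make those semantics concrete, a hypothetical helper along those lines
might behave like the sketch below: take the union of the two schemas,
refuse to merge columns that share a name but not a type, and pad each side
with typed nulls for the columns it lacks.  (This is not an existing Spark
API; the name is made up, and it's written against the newer DataFrame API
only because that keeps the null padding compact.)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DataType

def unionWithMergedSchema(left: DataFrame, right: DataFrame): DataFrame = {
  val leftTypes  = left.schema.fields.map(f => f.name -> f.dataType).toMap
  val rightTypes = right.schema.fields.map(f => f.name -> f.dataType).toMap

  // refuse to merge columns of the same name but different type
  val conflicts = leftTypes.keySet.intersect(rightTypes.keySet)
    .filter(name => leftTypes(name) != rightTypes(name))
  require(conflicts.isEmpty, s"Same column name, different types: $conflicts")

  // fix a single column order so select * on the result is well defined
  val merged = left.schema.fields ++
    right.schema.fields.filterNot(f => leftTypes.contains(f.name))

  // pad whichever side is missing a column with a typed null literal
  def pad(df: DataFrame, present: Map[String, DataType]): DataFrame =
    df.select(merged.map { f =>
      if (present.contains(f.name)) col(f.name)
      else lit(null).cast(f.dataType).as(f.name)
    }: _*)

  pad(left, leftTypes).unionAll(pad(right, rightTypes))
}

Adding a nullable column and "removing" one then both reduce to the same
padding step, just applied to the other side.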

As far as resources go, I can probably put some personal time into this if
we come up with a plan that makes sense.


On Sun, Oct 5, 2014 at 7:36 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Hi Cody,
>
> Assuming you are talking about 'safe' changes to the schema (i.e. existing
> column names are never reused with incompatible types), this is something
> I'd love to support.  Perhaps you can describe more what sorts of changes
> you are making, and if simple merging of the schemas would be sufficient.
> If so, we can open a JIRA, though I'm not sure when we'll have resources to
> dedicate to this.
>
> In the near term, I'd suggest writing converters for each version of the
> schema, that translate to some desired master schema.  You can then union
> all of these together and avoid the cost of batch conversion.  It seems
> like in most cases this should be pretty efficient, at least now that we
> have good pushdown past union operators :)
>
> Michael
>
> On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash <an...@andrewash.com> wrote:
>
>> Hi Cody,
>>
>> I wasn't aware there were different versions of the parquet format.
>> What's
>> the difference between "raw parquet" and the Hive-written parquet files?
>>
>> As for your migration question, the approaches I've often seen are
>> convert-on-read and convert-all-at-once.  Apache Cassandra for example
>> does
>> both -- when upgrading between Cassandra versions that change the on-disk
>> sstable format, it will do a convert-on-read as you access the sstables,
>> or
>> you can run the upgradesstables command to convert them all at once
>> post-upgrade.
>>
>> Andrew
>>
>> On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>> > Wondering if anyone has thoughts on a path forward for parquet schema
>> > migrations, especially for people (like us) that are using raw parquet
>> > files rather than Hive.
>> >
>> > So far we've gotten away with reading old files, converting, and
>> writing to
>> > new directories, but that obviously becomes problematic above a certain
>> > data size.
>> >
>>
>
>

Re: Parquet schema migrations

Posted by Michael Armbrust <mi...@databricks.com>.
Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing
column names are never reused with incompatible types), this is something
I'd love to support.  Perhaps you can say more about what sorts of changes
you are making, and whether simple merging of the schemas would be
sufficient.
If so, we can open a JIRA, though I'm not sure when we'll have resources to
dedicate to this.

In the near term, I'd suggest writing converters for each version of the
schema that translate to some desired master schema.  You can then union
all of these together and avoid the cost of batch conversion.  It seems
like in most cases this should be pretty efficient, at least now that we
have good pushdown past union operators :)

Michael

On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash <an...@andrewash.com> wrote:

> Hi Cody,
>
> I wasn't aware there were different versions of the parquet format.  What's
> the difference between "raw parquet" and the Hive-written parquet files?
>
> As for your migration question, the approaches I've often seen are
> convert-on-read and convert-all-at-once.  Apache Cassandra for example does
> both -- when upgrading between Cassandra versions that change the on-disk
> sstable format, it will do a convert-on-read as you access the sstables, or
> you can run the upgradesstables command to convert them all at once
> post-upgrade.
>
> Andrew
>
> On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger <co...@koeninger.org> wrote:
>
> > Wondering if anyone has thoughts on a path forward for parquet schema
> > migrations, especially for people (like us) that are using raw parquet
> > files rather than Hive.
> >
> > So far we've gotten away with reading old files, converting, and writing
> to
> > new directories, but that obviously becomes problematic above a certain
> > data size.
> >
>

Re: Parquet schema migrations

Posted by Andrew Ash <an...@andrewash.com>.
Hi Cody,

I wasn't aware there were different versions of the parquet format.  What's
the difference between "raw parquet" and the Hive-written parquet files?

As for your migration question, the approaches I've often seen are
convert-on-read and convert-all-at-once.  Apache Cassandra for example does
both -- when upgrading between Cassandra versions that change the on-disk
sstable format, it will do a convert-on-read as you access the sstables, or
you can run the upgradesstables command to convert them all at once
post-upgrade.

Andrew

On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Wondering if anyone has thoughts on a path forward for parquet schema
> migrations, especially for people (like us) that are using raw parquet
> files rather than Hive.
>
> So far we've gotten away with reading old files, converting, and writing to
> new directories, but that obviously becomes problematic above a certain
> data size.
>