Posted to dev@parquet.apache.org by Ryan Blue <bl...@cloudera.com> on 2015/05/18 22:43:15 UTC

High-level type evolution

I've been looking at schema evolution lately, and we don't currently 
support changing physical types when a logical type does not change. 
This could be a problem when two different systems have different, but 
valid, representations for a logical type.

Decimal, for example, can be stored as either a binary or a fixed-length 
byte array. But if the requested schema for a file asks for one (say, 
binary) and the underlying type is the other (fixed), then the check that 
verifies all columns can be satisfied fails, even though both the 
requested type and the actual type are valid.
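
To make the decimal case concrete, here is a minimal sketch of how one value can have two valid byte layouts. It assumes the unscaled value is stored as big-endian two's complement, with the fixed form sign-extended to its declared length. The class and method names are made up for illustration and are not part of parquet-mr.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not a parquet-mr class.
final class DecimalEncodings {

    // Variable-length form: the shortest two's-complement encoding
    // of the unscaled value.
    static byte[] toBinary(BigDecimal value) {
        return value.unscaledValue().toByteArray();
    }

    // Fixed-length form: the same bytes, sign-extended on the left
    // until they fill exactly `length` bytes.
    static byte[] toFixed(BigDecimal value, int length) {
        byte[] minimal = value.unscaledValue().toByteArray();
        if (minimal.length > length) {
            throw new IllegalArgumentException("value does not fit in " + length + " bytes");
        }
        byte[] fixed = new byte[length];
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(fixed, 0, length - minimal.length, pad);
        System.arraycopy(minimal, 0, fixed, length - minimal.length, minimal.length);
        return fixed;
    }

    // Both layouts decode with the same logic; only the scale is needed.
    static BigDecimal fromBytes(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("-123.45");
        byte[] binary = toBinary(price);
        byte[] fixed = toFixed(price, 8);
        System.out.println(binary.length + " vs " + fixed.length);                // 2 vs 8
        System.out.println(fromBytes(binary, 2).equals(fromBytes(fixed, 2)));     // true
    }
}
```

Since decoding is identical for both layouts, a reader that asked for binary could in principle be served from a fixed column without losing information.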

We can fix this by adding logic to the `checkContains` methods in the 
Type classes, plus support in the converters. But I'm wondering if we 
shouldn't take a closer look at projection and schema evolution in 
general at this point.
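
As a rough illustration of the kind of logic this would mean, here is a sketch of a relaxed compatibility check. It is not the real `checkContains` code: the enums, the table of allowed physical types, and the method name `canSatisfy` are all hypothetical, and the table lists only the two decimal representations mentioned above.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, independent of the parquet-mr Type classes.
final class ProjectionCheck {
    enum Physical { INT32, INT64, BINARY, FIXED }
    enum Logical { NONE, DECIMAL }

    // Physical types treated as interchangeable for a given logical type.
    // Assumption for illustration: only the binary/fixed pair for decimal.
    static final Map<Logical, Set<Physical>> INTERCHANGEABLE = new EnumMap<>(Logical.class);
    static {
        INTERCHANGEABLE.put(Logical.DECIMAL, EnumSet.of(Physical.BINARY, Physical.FIXED));
    }

    // True if a column stored as (filePhysical, fileLogical) can satisfy
    // a request for (requestedPhysical, requestedLogical).
    static boolean canSatisfy(Physical filePhysical, Logical fileLogical,
                              Physical requestedPhysical, Logical requestedLogical) {
        if (fileLogical != requestedLogical) {
            return false; // the logical type itself must not change
        }
        if (filePhysical == requestedPhysical) {
            return true;  // exact match, the only case accepted today
        }
        Set<Physical> allowed = INTERCHANGEABLE.get(requestedLogical);
        return allowed != null
            && allowed.contains(filePhysical)
            && allowed.contains(requestedPhysical);
    }
}
```

With this table, `canSatisfy(Physical.FIXED, Logical.DECIMAL, Physical.BINARY, Logical.DECIMAL)` is accepted, while the same physical mismatch with no logical type is still rejected.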

Are there other ways to solve this problem? Can we do projection 
differently, so we don't have to ignore the physical type of a requested 
column in some cases? What are the rules for valid projection?

Thanks!

rb


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: High-level type evolution

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
There should be a centralized place where type equivalence and conversion
are defined.
Then converters could reuse them and we would minimize the amount of work
required.
When projecting, Parquet deserializes the physical types it knows about and
the converter uses the proper type conversion.
This could be implemented as a set of reusable PrimitiveConverters that
know how to convert from a given physical type to a logical type. They can
be composed with the appropriate converter if there's a more specific type
for a particular framework.
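
A minimal sketch of that composition idea, under stated assumptions: `BytesConverter` below is a hypothetical stand-in for a primitive converter that receives raw bytes, not the actual parquet-mr `PrimitiveConverter` class, and `decimalConverter` is an invented name. The reusable part turns physical bytes into a logical `BigDecimal`; the framework supplies only the last step.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch, not the parquet-mr converter API.
final class ConverterSketch {

    // Stand-in for a converter fed with physical values by the reader.
    interface BytesConverter {
        void addBinary(byte[] value);
    }

    // Reusable physical-to-logical step (bytes -> BigDecimal), composed
    // with a framework-specific step (BigDecimal -> T) and a sink.
    static <T> BytesConverter decimalConverter(int scale,
                                               Function<BigDecimal, T> toFramework,
                                               Consumer<T> sink) {
        return bytes -> {
            BigDecimal logical = new BigDecimal(new BigInteger(bytes), scale);
            sink.accept(toFramework.apply(logical));
        };
    }
}
```

The same converter accepts bytes from a binary column or a fixed column, so a framework that wants, say, strings would pass `BigDecimal::toPlainString` as the second argument and never look at the physical type.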


