Posted to dev@parquet.apache.org by Julien Le Dem <ju...@dremio.com> on 2017/05/02 02:26:37 UTC

Re: Parquet Schema Evolution and Protobuf Compatibility

Hi Michael,
I'd recommend using ProtoParquetWriter to do this, since that is exactly its
purpose.
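A minimal, untested sketch of what that looks like (MyEvent stands in for your
generated protobuf message class and the output path is just an example; check
the constructors available in your parquet-protobuf version):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.proto.ProtoParquetWriter;

    public class WriteEvents {
      public static void main(String[] args) throws Exception {
        // MyEvent is a placeholder for a generated protobuf message class
        try (ProtoParquetWriter<MyEvent> writer = new ProtoParquetWriter<>(
                 new Path("/tmp/events.parquet"), MyEvent.class)) {
          // each write() turns one protobuf message into one Parquet record
          writer.write(
              MyEvent.newBuilder().setId(1L).setName("example").build());
        }
      }
    }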
If you want to better understand how one writes a Parquet file, you can look
at the simple ExampleParquetWriter, which is intended as an example and not
for actual production use:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/ExampleParquetWriter.java
The idea is you implement a WriteSupport (schema mapping and record
mapping) to write to an event-style API and avoid materializing objects in
between (if you're old like me, think SAX vs DOM).
That's what the ProtoParquetWriter does for proto objects. You are welcome
to improve it as you see fit. Take a look at the recent PRs in
parquet-protobuf:
https://github.com/apache/parquet-mr/pull/410
https://github.com/apache/parquet-mr/pull/411

I see 4 independent orgs showing interest in or contributing to this.

On Fri, Apr 28, 2017 at 6:48 AM, Michael Moss <mi...@gmail.com>
wrote:

> Julien,
>
> Thanks for the thoughtful response. Looks like we are talking about this
> ticket, which I am now tracking (and which also seems to have gotten a
> recent PR!) :) https://issues.apache.org/jira/browse/PARQUET-951
>
> In the event that not all valid protobuf (or Avro, for that matter) schema
> evolution rules transform seamlessly into valid Parquet schema evolution
> rules, I was looking into moving away from writing my Parquet files
> with ProtoParquetWriter and instead managing the transformation process
> between my protobufs and Parquet files myself. I'm having trouble finding
> documentation or best practices for doing that, except for a comment
> somewhere that Parquet files are mainly meant to be generated directly from
> its Proto and Avro ParquetWriters.
>
> Can you comment on this? Are there any examples of writing parquet files
> directly?
>
> Thanks again.
>
> On 2017-04-20 14:43 (-0400), Julien Le Dem <j....@dremio.com> wrote:
> > Hi Michael,
> > The default schema evolution in Parquet is to merge schemas by field
> > name.
> > Which means you can:
> >  - add a field with a name that is not used yet
> > But you cannot:
> >  - rename a field. It will be treated as removing the field and adding
> > a new one; the old name and the new name will become 2 different columns
> >  - change the type of a field.
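> > For example (a minimal, untested sketch using hand-written schemas and
> > the schema classes from org.apache.parquet.schema, just to illustrate
> > the merge-by-name behavior):
> >
> >     MessageType v1 = MessageTypeParser.parseMessageType(
> >         "message Event { optional int64 user_id; }");
> >     // same logical field, but renamed
> >     MessageType v2 = MessageTypeParser.parseMessageType(
> >         "message Event { optional int64 uid; }");
> >     // merging by name keeps both user_id and uid as separate columns
> >     MessageType merged = v1.union(v2);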
> >
> > In Protobuf and Thrift there is a field ID, and both support renaming a
> > field while keeping the field id. By default this is not supported in
> > Parquet: you can add fields but not really rename existing ones. You can
> > use Protobuf if you restrict yourself to not renaming fields.
> > We have, however, added an optional id field in the Parquet schema nodes
> > specifically for that purpose.
> > QinHui from Criteo is looking into the exact same thing related to
> > protobuf. (Check out the latest notes from the parquet sync on this
> > list.)
> > To support renaming of fields we would need to:
> >  - populate the id fields in the Parquet schema when converting the
> > Protobuf schema to Parquet by calling withId (
> > https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java
> > )
> >  - take the id into account when doing schema merging and the id is
> > available (possibly add a property to switch on the different behavior)
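> > For the first step, the change inside the converter would look roughly
> > like this (untested sketch; protoField is the protobuf FieldDescriptor
> > being converted and parquetType is the Parquet Type built for it):
> >
> >     // carry the protobuf field number over as the Parquet field id
> >     Type parquetTypeWithId = parquetType.withId(protoField.getNumber());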
> >
> > That is a feature that totally makes sense and just needs someone to
> > spend the time implementing it.
> > I think QinHui mentioned he is interested in doing so. He also talked
> > about dealing with the unknown fields coming from Protobuf (when you
> > don't have the latest proto, for example).
> > QinHui: am I correctly reflecting this? Did you create a JIRA for this?
> >
> > Mike: let me know if that helps
> >
> > Cheers
> > Julien
> >
> >
> >
> > On Wed, Apr 19, 2017 at 11:18 AM, Michael Moss <mi...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I'd be curious to get the community's perspective on using the Parquet
> > > format as the canonical source of truth for one's data. Are folks doing
> > > this in practice, or ETLing from their source of truth into Parquet for
> > > analytical use cases (storing more than once)?
> > >
> > > A reason I'm reluctant to store all my data once in Parquet is its lack
> > > of support for common schema evolution scenarios, which largely seems
> > > implementation specific (
> > > http://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format
> > > ).
> > >
> > > Two specific pain points I have are:
> > > Concerned about the expense of Spark schema merging operations (
> > > http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging)
> > > vs perhaps specifying a schema upfront (like Avro)
> > >
> > > We use protobufs, and I've found that ProtoReadSupport seems to choke on
> > > what would otherwise be valid protobuf schema evolution rules, like
> > > renaming fields for example. This leaves us in a situation where perhaps
> > > we can write our data with ProtoParquetWriter, but must read it back
> > > using regular Parquet support vs https://github.com/saurfang/sparksql-protobuf
> > > or ProtoParquetReader, which means we lose all the nice type-safe POJO
> > > features that working with Protobufs in Spark offers us.
> > >
> > > Appreciate any insights on the roadmap, or advice.
> > >
> > > Best,
> > >
> > > -Mike
> > >
> >
> >
> >
> > --
> > Julien
> >
>



-- 
Julien