Posted to dev@parquet.apache.org by Julien Le Dem <ju...@twitter.com.INVALID> on 2015/05/12 19:07:25 UTC

Parquet sync up

Happening now:
https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up?authuser=0&hceid=anVsaWVuQHR3aXR0ZXIuY29t.8ojja1ffv4jnptqalci3qebf8o

Re: Parquet sync up

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.

Notes:
Attendees and topics of interest:
- Julien: Twitter. 1.7.0 release, merging the ByteBuffer access branch
- Alex: Twitter.
- Daniel: Netflix. GSoC (ByteBuffer access), 1.7.0 release, vectorized read
  path, schema evolution
- Mickael: Criteo.
- Ryan: Cloudera. 1.7.0 release, SemVer versioning, reviewing pull
  requests, schema evolution
- Sanjeev: Twitter. Support for interactive queries.
- Sergio: Cloudera. Hearing about vectorization.
- Tianshuo: Twitter.

Agenda:
- Schema evolution
- 1.7.0 release
- merging ByteBuffer
- Vectorized exec update
- SemVer
- PR review
- interactive queries

Discuss:
- Schema evolution strategies:
  - Index-based access to the file: columns can only be added at the end.
    Fields cannot be deleted but can be renamed.
  - Name-based access: fields cannot be renamed, but columns can be deleted
    and added anywhere.
  - Have another identifier for the column to get the best of both worlds.
- 1.7 release: go through and make sure everything is renamed to
  org.apache. Ryan to do this soon.
- Merging ByteBuffer: merge the org.parquet rename into the branch, then
  merge the branch right after the 1.7 release.
- Vectorized exec update: Netflix is picking it up. Waiting on the rename
  and the ByteBuffer read path. The PARQUET-131 JIRA has a link to the
  GitHub repo. Update by Dong Chen. The goal is to integrate with Presto.
  The Drill team and Chang Lian from the Spark team should review as well.
- SemVer: have one version number for the format and one for the library.
  The library version increases whenever there is a breaking change in the
  API or the format. Starting in 2.0, the writer must be provided with the
  format version.
- Review pull requests:
  - Spark has a tool: https://spark-prs.appspot.com/open-prs#all
  - Go look at the PRs and ping relevant people.
- Interactive queries:
  - Documenting the capabilities of SQL engines and their level of
    integration with Parquet:
    - Presto
    - Drill
    - Impala: has done a lot of code-gen work to be fast. Lacks nested
      types (80% done, in Impala 2.4). Impala uses index access to columns
      (restricts schema evolution). New encodings will come after.
    - Spark SQL
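
For reference, the three schema evolution strategies from the discussion can
be illustrated with a minimal Python sketch. This is not Parquet's API: the
Column type, its field_id attribute and the resolve_* helpers are
hypothetical names made up for the example, and the "other identifier" is
assumed to be a stable integer assigned when the column is created.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor; field_id is an assumed stable identifier."""
    name: str
    field_id: int


def resolve_by_index(file_cols: List[Column], position: int) -> Optional[Column]:
    # Index-based: the reader asks for "the Nth column". Renames are harmless,
    # but deleting a column shifts the position of every column after it.
    return file_cols[position] if 0 <= position < len(file_cols) else None


def resolve_by_name(file_cols: List[Column], name: str) -> Optional[Column]:
    # Name-based: column order is irrelevant, but a renamed column is not found.
    return next((c for c in file_cols if c.name == name), None)


def resolve_by_id(file_cols: List[Column], field_id: int) -> Optional[Column]:
    # Identifier-based: survives renames, deletes and reordering.
    return next((c for c in file_cols if c.field_id == field_id), None)


# Layout the reader was written against.
old_file = [Column("user", 1), Column("ts", 2), Column("text", 3)]
# Newer file: "user" deleted, "ts" renamed to "timestamp", "lang" appended.
new_file = [Column("timestamp", 2), Column("text", 3), Column("lang", 4)]

# The reader wants the column it knew as index 1, name "ts", identifier 2.
by_index = resolve_by_index(new_file, 1)  # silently picks a different column
by_name = resolve_by_name(new_file, "ts")  # no match after the rename
by_id = resolve_by_id(new_file, 2)         # still finds the renamed column
```

In this sketch only the identifier lookup returns the intended column after
both a delete and a rename, which is the "best of both worlds" case above.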
