Posted to dev@parquet.apache.org by Julien Le Dem <ju...@twitter.com.INVALID> on 2015/05/12 19:07:25 UTC

Parquet sync up

Happening now:
https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up?authuser=0&hceid=anVsaWVuQHR3aXR0ZXIuY29t.8ojja1ffv4jnptqalci3qebf8o

Re: Parquet sync up

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
Notes:
Attendees and topics of interest:
- Julien: Twitter. 1.7.0 release, merging the ByteBuffer access branch.
- Alex: Twitter.
- Daniel: Netflix. GSoC (ByteBuffer access), 1.7.0 release, vectorized read
path, schema evolution.
- Mickael: Criteo.
- Ryan: Cloudera. 1.7.0 release, semantic versioning (SemVer), reviewing pull
requests, schema evolution.
- Sanjeev: Twitter. Support for interactive queries.
- Sergio: Cloudera. Hearing about vectorization.
- Tianshuo: Twitter.

Agenda:
 - Schema evolution
 - 1.7.0 release
 - merging ByteBuffer
 - Vectorized exec update
 - SemVer
 - PR review
 - interactive queries

Discuss:
 - Schema evolution strategies:
   - Index-based access to the file: columns can only be added at the end
and fields cannot be deleted, but fields can be renamed.
   - Name-based access: fields cannot be renamed, but columns can be deleted
and added anywhere.
   - Have another identifier for the column to get the best of both worlds.
 - 1.7 release: go through and make sure everything is renamed to
org.apache. Ryan to do this soon.
 - Merging ByteBuffer: merge the org.parquet rename into the branch, then
merge the branch right after the 1.7 release.
 - Vectorized exec update: Netflix is picking it up. Waiting on the rename and
the ByteBuffer read path. The PARQUET-131 JIRA has a link to the
GitHub repo. Update by Dong Chen. The goal is to integrate with Presto. The
Drill team and Chang Lian from the Spark team should review as well.
 - SemVer: have one version number for the format and another for
the library. The library version increases whenever there is a breaking
change in the API or the format. Starting in 2.0, the writer must be provided
with the format version.
 - Review Pull Requests:
   - Spark has a tool: https://spark-prs.appspot.com/open-prs#all
   - Go look at the PRs and ping the relevant people.
 - Interactive queries:
   - Document the capabilities of SQL engines and their level of integration
with Parquet:
   - Presto
   - Drill
   - Impala: has done a lot of code-gen work to be fast. Lacks nested
types (80% done, in Impala 2.4). Impala uses index access to columns
(which restricts schema evolution). New encodings will come after.
   - Spark SQL
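
To make the three schema evolution strategies from the notes concrete, here is
a minimal sketch in Python. It is not Parquet code: the Column class, the
field_id attribute, the resolve_* helpers, and the example column names are
all hypothetical, chosen only to show how each strategy maps a reader's schema
onto the columns stored in a file.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Column:
    name: str
    field_id: int  # hypothetical stable identifier, assigned once and never reused


Resolution = Dict[str, Optional[str]]  # reader column name -> file column name (or None)


def resolve_by_index(file_cols: List[Column], read_cols: List[Column]) -> Resolution:
    """Positional access: the i-th reader column reads the i-th file column."""
    return {
        r.name: (file_cols[i].name if i < len(file_cols) else None)
        for i, r in enumerate(read_cols)
    }


def resolve_by_name(file_cols: List[Column], read_cols: List[Column]) -> Resolution:
    """Name-based access: a reader column matches only a file column with the same name."""
    names = {c.name for c in file_cols}
    return {r.name: (r.name if r.name in names else None) for r in read_cols}


def resolve_by_id(file_cols: List[Column], read_cols: List[Column]) -> Resolution:
    """Identifier-based access: match on the stable id, ignoring name and position."""
    by_id = {c.field_id: c.name for c in file_cols}
    return {r.name: by_id.get(r.field_id) for r in read_cols}


# Schema the (hypothetical) file was written with.
FILE_COLS = [Column("user", 1), Column("ts", 2), Column("country", 3)]

# Reader schema after evolution: "user" renamed to "user_id", "ts" deleted,
# and a new column "device" added in the middle.
READ_COLS = [Column("user_id", 1), Column("device", 4), Column("country", 3)]

if __name__ == "__main__":
    print("index:", resolve_by_index(FILE_COLS, READ_COLS))
    print("name: ", resolve_by_name(FILE_COLS, READ_COLS))
    print("id:   ", resolve_by_id(FILE_COLS, READ_COLS))
```

In this toy example, index-based resolution survives the rename but silently
reads "ts" as "device", because a column was deleted and another added in the
middle. Name-based resolution handles the delete and the add but loses the
renamed column. Matching on a separate identifier handles all three changes.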

