Posted to dev@parquet.apache.org by Julien Le Dem <ju...@wework.com.INVALID> on 2018/10/09 17:13:20 UTC
parquet sync notes
Gabor (Cloudera): column index, benchmark, nested types (filter, indexes)
Anna (Cloudera): process, feature branches, etiquette of waiting for
someone? Blocked
Zoltan (Cloudera): Feature branches? When to review them?
Nandor (Cloudera): parquet file with multiple row groups, schema evolution
Zoltan (Cloudera): column index
Junjie (tencent): listening
Gidon (IBM): encryption next steps
Jim: bloom filter, Bit weaving
Xinli (Uber): encryption
Julien (WeWork): encryption
Bloom filter:
- PR for doc. Parquet-format feature branch.
  - To be reviewed by: Deepak, Jim, Ryan.
Encryption:
- Another encryption effort exists, Julien to send intros: Xinli,
Gidon, Zoltan
- New requirements, updated doc, implement code changes.
Process:
- Feature branches:
  - Julien to follow up with Ryan
  - Feature branches are considered like master:
    - Every change is reviewed individually through a PR
    - Every change has a jira
    - Only difference is that it’s ok to make incompatible changes
  - Squash merge vs merge commit:
    - Merge commit keeps the history but clutters
    - 3 options:
      - Merge commit:
        - Clutters history (not linear anymore)
        - But if each commit in the branch has a jira, seems fine
      - Squash:
        - Loses the detailed commits of the feature
        - Keeps history linear
      - Rebase feature branch:
        - Keeps history linear and keeps the detailed commits
        - But need to address conflicts for each commit in branch
        - Commits in branch are now disconnected from the PR (modified
          after the fact).
- When is it appropriate to wait:
  - Balance:
    - Making sure we don’t make incompatible changes to the format and
      we have final features
    - Making it easier for people to contribute.
  - Anna to start a conversation around our etiquette:
    - How long is it appropriate to wait on feedback
    - How to know who’s the best committer to drive a PR to conclusion
Filtering nested types support:
- We should store stats for nested types
Page Index benchmark:
- Nice results, comparing random to sorted files:
  - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkWithOrWithoutColumnIndex.json
  - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkPageSize.json
- Need to compare page size effect on compression and file size
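
For intuition on why sorted and random files behave so differently here, a toy sketch (not the benchmark code, and not the parquet-mr API): it models a column index as one (min, max) pair per page and counts how many pages a range filter must still read. The helper names, the page size of 100 and the filter range are all assumptions made for illustration.

```python
import random


def build_column_index(values, page_size):
    """Split values into pages and record (min, max) for each page."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]


def pages_to_read(index, lo, hi):
    """Count pages whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for pmin, pmax in index if pmax >= lo and pmin <= hi)


# Hypothetical data: the same 10,000 values, once sorted and once shuffled.
rng = random.Random(0)
values = list(range(10_000))
shuffled = values[:]
rng.shuffle(shuffled)

sorted_index = build_column_index(values, page_size=100)
random_index = build_column_index(shuffled, page_size=100)

# Filter: 4200 <= value <= 4250
sorted_hits = pages_to_read(sorted_index, 4_200, 4_250)
random_hits = pages_to_read(random_index, 4_200, 4_250)

# In the sorted file only the single page covering 4200..4299 overlaps the
# filter; in the shuffled file most page ranges are wide, so few are skipped.
print(sorted_hits, random_hits)
```

Larger pages mean fewer index entries but coarser skipping, which is one reason the page size comparison above matters.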
Appending to a parquet file:
- The type of a column chunk should be consistent with the schema in
the footer.
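
A minimal sketch of that consistency check, assuming a simplified model where the footer schema and each row group are plain dicts from column name to physical type. This is a hypothetical representation for illustration, not how parquet-mr exposes metadata.

```python
def check_chunk_types(footer_schema, row_groups):
    """Return (row group index, column, expected type, actual type) for each mismatch."""
    mismatches = []
    for rg_idx, chunks in enumerate(row_groups):
        for column, chunk_type in chunks.items():
            expected = footer_schema.get(column)  # None if column is not in the footer
            if chunk_type != expected:
                mismatches.append((rg_idx, column, expected, chunk_type))
    return mismatches


# Hypothetical file: the second (appended) row group wrote "id" as INT32.
footer_schema = {"id": "INT64", "name": "BYTE_ARRAY"}
row_groups = [
    {"id": "INT64", "name": "BYTE_ARRAY"},
    {"id": "INT32", "name": "BYTE_ARRAY"},
]

problems = check_chunk_types(footer_schema, row_groups)
print(problems)
```

An appender could run a check like this before writing the new footer and refuse to append when the list is non-empty.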
Re: parquet sync notes
Posted by Aniket Mokashi <an...@gmail.com>.
I would like to attend the next sync. Where do I find instructions to join
this meeting?
--
"...:::Aniket:::... Quetzalco@tl"