Posted to dev@parquet.apache.org by Julien Le Dem <ju...@wework.com.INVALID> on 2018/10/09 17:13:20 UTC

parquet sync notes

Gabor (Cloudera): column index, benchmark, nested types (filter, indexes)
Anna (Cloudera): process, feature branches, etiquette of waiting for
someone? Blocked
Zoltan (Cloudera): Feature branches? When to review them?
Nandor (Cloudera): parquet file with multiple row groups, schema evolution
Zoltan (Cloudera): column index
Junjie (Tencent): listening
Gidon (IBM): encryption next steps
Jim: bloom filter, Bit weaving
Xinli (Uber): encryption
Julien (WeWork): encryption

Bloom filter:

   - PR for doc. Parquet-format feature branch.
      - To be reviewed by: Deepak, Jim, Ryan.


Encryption:

   - Another encryption effort exists, Julien to send intros: Xinli,
   Gidon, Zoltan
   - New requirements, updated doc, implement code changes.


Process:

   - Feature branches:
      - Julien to follow up with Ryan
      - Feature branches are considered like master:
         - Every change is reviewed individually through a PR
         - Every change has a jira
         - Only difference is that it’s ok to make incompatible changes
      - Squash merge vs merge commit:
         - Merge commit keeps the history but clutters
         - 3 options:
            - Merge commit:
               - Clutters history (not linear anymore)
               - But if each commit in the branch has a jira, seems fine
            - Squash:
               - Loses the detailed commits of the feature
               - Keeps history linear
            - Rebase feature branch:
               - Keeps history linear and keeps the individual commits
               - But need to address conflicts for each commit in branch
               - Commits in branch are now disconnected from the PR
               (modified after the fact).
   - When is it appropriate to wait:
      - Balance:
         - Making sure we don’t make incompatible changes to the format
         and that features are final
         - Making it easier for people to contribute.
      - Anna to start a conversation around our etiquette:
         - How long is it appropriate to wait on feedback
         - How to know who’s the best committer to drive a PR to conclusion


Filtering nested types support:

   -  We should store stats for nested types


Page Index benchmark:

   - Nice results, comparing random to sorted files:
      - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkWithOrWithoutColumnIndex.json
      - https://jmh.morethan.io/?gist=2388d962d6380f74a78ad0d97b4353a2/benchmarkPageSize.json
   - Need to compare the effect of page size on compression and file size


Appending to a parquet file:

   -  The type of a column chunk should be consistent with the schema in
   the footer.
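
To make the appending rule above concrete, here is a minimal sketch of the
kind of check it implies. It does not use any Parquet library: ColumnChunk,
RowGroup and find_type_mismatches are hypothetical names invented for this
illustration, and the footer schema is modelled as a plain mapping from
column path to physical type name.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for the metadata of one column chunk.
    path: str            # column path, e.g. "id"
    physical_type: str   # e.g. "INT32", "INT64", "BYTE_ARRAY"


@dataclass
class RowGroup:
    # Hypothetical stand-in for one row group listed in the footer.
    columns: List[ColumnChunk]


def find_type_mismatches(schema: Dict[str, str],
                         row_groups: List[RowGroup]) -> List[str]:
    """Return one message per column chunk that disagrees with the schema."""
    problems = []
    for index, group in enumerate(row_groups):
        for chunk in group.columns:
            expected = schema.get(chunk.path)
            if expected is None:
                problems.append(
                    f"row group {index}: column {chunk.path} "
                    "is not in the footer schema")
            elif expected != chunk.physical_type:
                problems.append(
                    f"row group {index}: column {chunk.path} is "
                    f"{chunk.physical_type}, footer schema says {expected}")
    return problems


if __name__ == "__main__":
    footer_schema = {"id": "INT64", "name": "BYTE_ARRAY"}
    existing = RowGroup([ColumnChunk("id", "INT64"),
                         ColumnChunk("name", "BYTE_ARRAY")])
    # An appended row group that wrote "id" with a narrower type.
    appended = RowGroup([ColumnChunk("id", "INT32"),
                         ColumnChunk("name", "BYTE_ARRAY")])
    for message in find_type_mismatches(footer_schema, [existing, appended]):
        print(message)
```

A writer that appends row groups could run a check along these lines before
rewriting the footer and refuse the append when the list is not empty.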

Re: parquet sync notes

Posted by Aniket Mokashi <an...@gmail.com>.
I would like to attend the next sync. Where do I find instructions to join
this meeting?



-- 
"...:::Aniket:::... Quetzalco@tl"