Posted to dev@parquet.apache.org by Julien Le Dem <ju...@gmail.com> on 2017/08/02 16:01:59 UTC

Parquet sync starting now

on hangout:
https://hangouts.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04

Re: Parquet sync starting now

Posted by Jeff Knupp <je...@enigma.com>.
Thanks! Good to know :)

-Jeff

On Fri, Aug 4, 2017 at 9:50 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Jeff,
>
> they are open for anyone and everyone is appreciated! We use these syncs
> to exchange and discuss things about the Parquet project as well as the
> Parquet format. It is also a good point to start if you want to know
> what the current "hot topics" in Parquet are and how you could get
> involved.
>
> Uwe
>
> On Fri, Aug 4, 2017, at 03:48 PM, Jeff Knupp wrote:
> > Just out of curiosity, are these sync meetings restricted to committers
> > and
> > higher or can anyone listen in?
> >
> > Cheers,
> > Jeff Knupp
> >
> > On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > Hi Julien
> > > Do we have meeting minutes for the sync-up? I can't hear clearly on the
> > > hangout due to a VPN issue from home.
> > >
> > > 2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:
> > >
> > > > on hangout:
> > > > https://hangouts.google.com/hangouts/_/calendar/
> > > > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
>

Re: Parquet sync starting now

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Jeff,

they are open for anyone and everyone is appreciated! We use these syncs
to exchange and discuss things about the Parquet project as well as the
Parquet format. It is also a good point to start if you want to know
what the current "hot topics" in Parquet are and how you could get
involved.

Uwe

On Fri, Aug 4, 2017, at 03:48 PM, Jeff Knupp wrote:
> Just out of curiosity, are these sync meetings restricted to committers
> and
> higher or can anyone listen in?
> 
> Cheers,
> Jeff Knupp
> 
> On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈 <cj...@gmail.com> wrote:
> 
> > Hi Julien
> > Do we have meeting minutes for the sync-up? I can't hear clearly on the
> > hangout due to a VPN issue from home.
> >
> > 2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:
> >
> > > on hangout:
> > > https://hangouts.google.com/hangouts/_/calendar/
> > > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >

Re: Parquet sync starting now

Posted by Jeff Knupp <je...@enigma.com>.
Just out of curiosity, are these sync meetings restricted to committers and
higher or can anyone listen in?

Cheers,
Jeff Knupp

On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈 <cj...@gmail.com> wrote:

> Hi Julien
> Do we have meeting minutes for the sync-up? I can't hear clearly on the
> hangout due to a VPN issue from home.
>
> 2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:
>
> > on hangout:
> > https://hangouts.google.com/hangouts/_/calendar/
> > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> >
>
>
>
> --
> Thanks & Best Regards
>

Re: Parquet sync starting now

Posted by Wes McKinney <we...@gmail.com>.
I have not taken a look at the performance of different compression
algorithms yet. Are there any example datasets that anyone would like
to see statistics for? Otherwise I will generate some high and low
entropy datasets with dictionary encoding disabled (so that the
compression is handled more by the byte compressors than by
dictionaries).
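
A minimal sketch of the kind of comparison described above, in Python.
Everything here is an assumption made for illustration: the helper names
(make_datasets, compression_ratios) are hypothetical, and the standard
library codecs zlib, bz2 and lzma stand in for Brotli, LZ4 and Zstandard,
which need third-party bindings. It compresses raw bytes directly, so no
Parquet dictionary encoding is involved.

```python
import bz2
import lzma
import random
import zlib


def make_datasets(size=1 << 16, seed=0):
    """Build one high-entropy and one low-entropy byte buffer (hypothetical helper)."""
    rng = random.Random(seed)
    # High entropy: pseudo-random bytes, close to incompressible.
    high = bytes(rng.getrandbits(8) for _ in range(size))
    # Low entropy: a short pattern repeated until the buffer is full.
    low = (b"parquet" * (size // 7 + 1))[:size]
    return {"high_entropy": high, "low_entropy": low}


# Stand-in byte compressors from the standard library (an assumption; the
# codecs discussed in the sync are Brotli, LZ4 and Zstandard).
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(datasets):
    """Return {(dataset, codec): uncompressed_size / compressed_size}."""
    ratios = {}
    for name, data in datasets.items():
        for codec, compress in CODECS.items():
            ratios[(name, codec)] = len(data) / len(compress(data))
    return ratios


if __name__ == "__main__":
    for key, ratio in sorted(compression_ratios(make_datasets()).items()):
        print(key, round(ratio, 2))
```

A real benchmark would also time compression and decompression and would
use column data rather than synthetic buffers; this only shows how the
ratio differs between the two extremes.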

On Fri, Aug 11, 2017 at 8:27 PM, Julien Le Dem <ju...@gmail.com> wrote:
> Sorry for the delay. See notes below.
> I'm on vacation next week and Lars will send an invitation for the next sync
> on August 16th.
> Pooja will talk about her work on page indices.
> Here are the notes from last sync:
>
> Parquet Sync Aug 2 2017
>
>
> Anna (Cloudera):
>
> Deepak (Vertica): timestamp format
>
> Jim (Cloudera): Bloom filters
>
> Lars (Cloudera Impala): feedback on Brotli, Pooja’s file indexes
>
> Marcel: index page proposal
>
> Ryan (Netflix): Merge
>
> Zoltan (Cloudera Budapest)
>
> JunJie (Intel): Bloom Filter.
>
> Julien: Bloom Filters
>
>
> Bloom Filters:
>
>  - to be efficient, needs 1 byte per distinct value.
>
>    - useful if many MDVS that are bigger than 1 byte (example UUIDs)
>
>  - Benchmarking:
>
>    - difficulty enabling dictionary filtering in Hive and spark sql:
> https://issues.apache.org/jira/browse/PARQUET-1061
>
>       - Ryan to follow up on how to configure it
>
>  - hashing discussion:
>
>    - We will use a block-based hashing algorithm.
>
>    - false positive > 00.1%
>
>    - Definition of hash function:
>
>       - currently has only one (Murmur3).
>
>       - TODO: define metadata using union to allow for other hash functions
> in the future
>
>       - TODO: clarify what variation of Murmur3 we are using.
>
>
> Index pages:
>
>  - good IO savings by skipping pages.
>
>  - if columns
>
>  - added metadata for position of dictionary location.
>
>  - Next time presentation of the result.
>
>
> Timestamp Format:
>
>  - Ryan to update the PR with conclusion
>
>
> Feedback on Brotli:
>
>  - why not LZ4 or ZStandard?
>
>  - Wes to try them out and compare in C++
>
>  - Ryan to compare in Java with his datasets.
>
>  - For reference:
>
>    - comparison graphs, including brotli vs. zstd:
> https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/
>
>    -
> http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/compress/Lz4Codec.html
>
>
> PGP keys size:
>
>  - Use larger PGP key id to avoid collision:
>
>
> Github integration:
>
>  - Use new Apache - Github integration to allow admin rights on Github.
>
>  - Start a thread
>
> On Wed, Aug 2, 2017 at 4:28 PM, 俊杰陈 <cj...@gmail.com> wrote:
>
>> Hi Julien
>> Do we have meeting minutes for the sync-up? I can't hear clearly on the
>> hangout due to a VPN issue from home.
>>
>> 2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:
>>
>> > on hangout:
>> > https://hangouts.google.com/hangouts/_/calendar/
>> > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
>> >
>>
>>
>>
>> --
>> Thanks & Best Regards
>>

Re: Parquet sync starting now

Posted by Julien Le Dem <ju...@gmail.com>.
Sorry for the delay. See notes below.
I'm on vacation next week and Lars will send an invitation for the next sync
on August 16th.
Pooja will talk about her work on page indices.
Here are the notes from last sync:

Parquet Sync Aug 2 2017


Anna (Cloudera):

Deepak (Vertica): timestamp format

Jim (Cloudera): Bloom filters

Lars (Cloudera Impala): feedback on Brotli, Pooja’s file indexes

Marcel: index page proposal

Ryan (Netflix): Merge

Zoltan (Cloudera Budapest)

JunJie (Intel): Bloom Filter.

Julien: Bloom Filters


Bloom Filters:

 - to be efficient, needs 1 byte per distinct value.

   - useful if many MDVS that are bigger than 1 byte (example UUIDs)

 - Benchmarking:

   - difficulty enabling dictionary filtering in Hive and spark sql:
https://issues.apache.org/jira/browse/PARQUET-1061

      - Ryan to follow up on how to configure it

 - hashing discussion:

   - We will use a block-based hashing algorithm.

   - false positive > 00.1%

   - Definition of hash function:

      - currently has only one (Murmur3).

      - TODO: define metadata using union to allow for other hash functions
in the future

      - TODO: clarify what variation of Murmur3 we are using.
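
For context on the "1 byte per distinct value" point, here is a minimal
sketch of the textbook sizing formulas for a classic (non-blocked) Bloom
filter. It is an illustration only, not the design discussed in the sync:
the function names are hypothetical, and a block-based filter typically
has a somewhat higher false positive rate at the same size.

```python
import math

LN2_SQUARED = math.log(2) ** 2


def optimal_bits_per_value(false_positive_rate):
    """Bits per distinct value needed for a target false positive rate.

    Textbook result for a classic Bloom filter with the optimal number
    of hash functions: m/n = -ln(p) / (ln 2)^2.
    """
    return -math.log(false_positive_rate) / LN2_SQUARED


def false_positive_rate(bits_per_value):
    """False positive rate reached with the optimal number of hashes."""
    return math.exp(-bits_per_value * LN2_SQUARED)


def optimal_num_hashes(bits_per_value):
    """Optimal number of hash functions: k = (m/n) * ln 2, rounded."""
    return max(1, round(bits_per_value * math.log(2)))


if __name__ == "__main__":
    # 1 byte (8 bits) per distinct value.
    print(false_positive_rate(8))
    print(optimal_num_hashes(8))
    # Bits per value needed for a 0.1% false positive rate.
    print(optimal_bits_per_value(0.001))
```

Under these formulas, 8 bits per distinct value gives a false positive
rate of roughly 2%, and a 0.1% rate needs a little over 14 bits per value.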


Index pages:

 - good IO savings by skipping pages.

 - if columns

 - added metadata for position of dictionary location.

 - Next time presentation of the result.


Timestamp Format:

 - Ryan to update the PR with conclusion


Feedback on Brotli:

 - why not LZ4 or ZStandard?

 - Wes to try them out and compare in C++

 - Ryan to compare in Java with his datasets.

 - For reference:

   - comparison graphs, including brotli vs. zstd:
https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/

   -
http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/compress/Lz4Codec.html


PGP keys size:

 - Use larger PGP key id to avoid collision:


Github integration:

 - Use new Apache - Github integration to allow admin rights on Github.

 - Start a thread

On Wed, Aug 2, 2017 at 4:28 PM, 俊杰陈 <cj...@gmail.com> wrote:

> Hi Julien
> Do we have meeting minutes for the sync-up? I can't hear clearly on the
> hangout due to a VPN issue from home.
>
> 2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:
>
> > on hangout:
> > https://hangouts.google.com/hangouts/_/calendar/
> > anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
> >
>
>
>
> --
> Thanks & Best Regards
>

Re: Parquet sync starting now

Posted by 俊杰陈 <cj...@gmail.com>.
Hi Julien
Do we have meeting minutes for the sync-up? I can't hear clearly on the
hangout due to a VPN issue from home.

2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:

> on hangout:
> https://hangouts.google.com/hangouts/_/calendar/
> anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
>



-- 
Thanks & Best Regards