Posted to dev@parquet.apache.org by Julien Le Dem <ju...@gmail.com> on 2017/08/02 16:01:59 UTC
Parquet sync starting now
on hangout:
https://hangouts.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
Re: Parquet sync starting now
Posted by Jeff Knupp <je...@enigma.com>.
Thanks! Good to know :)
-Jeff
On Fri, Aug 4, 2017 at 9:50 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Jeff,
>
> they are open for anyone and everyone is appreciated! We use these syncs
> to exchange and discuss things about the Parquet project as well as the
> Parquet format. It is also a good point to start if you want to know
> what the current "hot topics" in Parquet are and how you could get
> involved.
>
> Uwe
Re: Parquet sync starting now
Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Jeff,
they are open for anyone and everyone is appreciated! We use these syncs
to exchange and discuss things about the Parquet project as well as the
Parquet format. It is also a good point to start if you want to know
what the current "hot topics" in Parquet are and how you could get
involved.
Uwe
On Fri, Aug 4, 2017, at 03:48 PM, Jeff Knupp wrote:
> Just out of curiosity, are these sync meetings restricted to committers and
> higher or can anyone listen in?
>
> Cheers,
> Jeff Knupp
Re: Parquet sync starting now
Posted by Jeff Knupp <je...@enigma.com>.
Just out of curiosity, are these sync meetings restricted to committers and
higher or can anyone listen in?
Cheers,
Jeff Knupp
On Wed, Aug 2, 2017 at 7:28 PM, 俊杰陈 <cj...@gmail.com> wrote:
> Hi Julien
> Do we have meeting minutes for the sync-up? I couldn't hear clearly on the
> hangout due to a VPN issue at home.
Re: Parquet sync starting now
Posted by Wes McKinney <we...@gmail.com>.
I have not taken a look at the performance of different compression
algorithms yet. Are there any example datasets that anyone would like
to see statistics for? Otherwise I will generate some high and low
entropy datasets with dictionary encoding disabled (so that the
compression is handled more by the byte compressors than by
dictionaries).
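For illustration only (this is not Wes's benchmark), the plan above could be sketched in Python roughly as follows. The standard-library codecs zlib, bz2 and lzma stand in for Brotli, LZ4 and Zstandard, which need third-party packages; the function names and the way the two datasets are generated are assumptions made for this example.

```python
import bz2
import lzma
import os
import zlib


def make_datasets(size: int = 1 << 20) -> dict:
    """Return one high-entropy and one low-entropy byte buffer of `size` bytes."""
    high = os.urandom(size)                  # random bytes, close to incompressible
    low = bytes(i % 4 for i in range(size))  # tiny alphabet with a short repeating pattern
    return {"high_entropy": high, "low_entropy": low}


def compression_ratios(data: bytes) -> dict:
    """Compressed size divided by original size, per stand-in codec."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    return {name: len(compress(data)) / len(data) for name, compress in codecs.items()}


if __name__ == "__main__":
    for label, data in make_datasets().items():
        print(label, compression_ratios(data))
```

Raw byte buffers are compressed directly here, which loosely mirrors the idea of disabling dictionary encoding so that the byte compressor does all the work.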
On Fri, Aug 11, 2017 at 8:27 PM, Julien Le Dem <ju...@gmail.com> wrote:
> Sorry for the delay. See notes below.
> I'm on vacation next week and Lars will send an invitation for the next sync
> on August 16th.
> Pooja will talk about her work on page indices.
> Here are the notes from last sync:
>
Re: Parquet sync starting now
Posted by Julien Le Dem <ju...@gmail.com>.
Sorry for the delay. See notes below.
I'm on vacation next week and Lars will send an invitation for the next sync
on August 16th.
Pooja will talk about her work on page indices.
Here are the notes from last sync:
Parquet Sync Aug 2 2017
Anna (Cloudera):
Deepak (Vertica): timestamp format
Jim (Cloudera): Bloom filters
Lars (Cloudera Impala): feedback on Brotli, Pooja’s file indexes
Marcel: index page proposal
Ryan (Netflix): Merge
Zoltan (Cloudera Budapest)
JunJie (Intel): Bloom Filter.
Julien: Bloom Filters
Bloom Filters:
- to be efficient, needs 1 byte per distinct value.
- useful if there are many MDVS that are bigger than 1 byte (for example, UUIDs)
- Benchmarking:
  - difficulty enabling dictionary filtering in Hive and Spark SQL:
    https://issues.apache.org/jira/browse/PARQUET-1061
  - Ryan to follow up on how to configure it
- hashing discussion:
  - We will use a block-based hashing algorithm.
  - false positive > 00.1%
- Definition of hash function:
  - currently has only one (Murmur3).
  - TODO: define metadata using a union to allow for other hash functions
    in the future
  - TODO: clarify which variant of Murmur3 we are using.
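A rough sketch of the sizing and block-based points above, not something presented at the sync. The two formulas are the standard ones for a classic Bloom filter. The class is a toy block-based filter in which each value touches a single 256-bit block; the block size, the number of bits set per value and the class and method names are assumptions, and SHA-256 is swapped in for Murmur3 only because it ships with Python.

```python
import hashlib
import math
from typing import Tuple


def bits_per_value(fpp: float) -> float:
    """Bits per distinct value for a classic Bloom filter with an optimal hash count."""
    return -math.log(fpp) / (math.log(2) ** 2)


def classic_fpp(bits_per_val: float, num_hashes: int) -> float:
    """Approximate false positive probability of a classic Bloom filter."""
    return (1.0 - math.exp(-num_hashes / bits_per_val)) ** num_hashes


class BlockBloomFilter:
    """Toy block-based filter: every value reads and writes exactly one block."""

    BLOCK_BITS = 256          # assumed block size
    BITS_SET_PER_VALUE = 8    # assumed number of bits set per value

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self.blocks = [0] * num_blocks  # each int serves as a 256-bit bitmap

    def _block_and_mask(self, value: bytes) -> Tuple[int, int]:
        # SHA-256 is a stand-in for Murmur3 (assumption for this sketch).
        digest = hashlib.sha256(value).digest()
        block = int.from_bytes(digest[:8], "little") % self.num_blocks
        mask = 0
        for i in range(self.BITS_SET_PER_VALUE):
            # One digest byte (0..255) selects one bit position inside the block.
            mask |= 1 << digest[8 + i]
        return block, mask

    def add(self, value: bytes) -> None:
        block, mask = self._block_and_mask(value)
        self.blocks[block] |= mask

    def might_contain(self, value: bytes) -> bool:
        block, mask = self._block_and_mask(value)
        return (self.blocks[block] & mask) == mask
```

Under the classic formula, 8 bits per distinct value with 6 hash functions gives a false positive probability of roughly 2%; a blocked layout generally trades a somewhat higher rate for touching only one block per lookup.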
Index pages:
- good IO savings by skipping pages.
- if columns
- added metadata for the dictionary location.
- Presentation of the results next time.
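To make the page-skipping point concrete, here is a minimal sketch that assumes each index entry stores a page offset plus the minimum and maximum value in that page. The notes do not describe the proposed layout, so the `PageIndexEntry` fields and the `pages_to_read` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageIndexEntry:
    """Hypothetical per-page summary; field names are illustrative only."""
    offset: int      # byte offset of the page in the file
    min_value: int   # smallest value stored in the page
    max_value: int   # largest value stored in the page


def pages_to_read(index: List[PageIndexEntry], lo: int, hi: int) -> List[int]:
    """Offsets of pages whose [min, max] range overlaps the query range [lo, hi]."""
    return [e.offset for e in index if e.max_value >= lo and e.min_value <= hi]
```

A reader that consults such an index first only issues IO for the returned offsets; every other page is skipped without being read or decompressed.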
Timestamp Format:
- Ryan to update the PR with conclusion
Feedback on Brotli:
- why not LZ4 or ZStandard?
- Wes to try it out and compare in C++
- Ryan to compare in Java with his datasets.
- For reference:
  - comparison graphs, including brotli vs. zstd:
    https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/
  - http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/compress/Lz4Codec.html
PGP keys size:
- Use a larger PGP key ID to avoid collisions.
GitHub integration:
- Use the new Apache-GitHub integration to allow admin rights on GitHub.
- Start a thread
On Wed, Aug 2, 2017 at 4:28 PM, 俊杰陈 <cj...@gmail.com> wrote:
> Hi Julien
> Do we have meeting minutes for the sync-up? I couldn't hear clearly on the
> hangout due to a VPN issue at home.
Re: Parquet sync starting now
Posted by 俊杰陈 <cj...@gmail.com>.
Hi Julien
Do we have meeting minutes for the sync-up? I couldn't hear clearly on the
hangout due to a VPN issue at home.
2017-08-03 0:01 GMT+08:00 Julien Le Dem <ju...@gmail.com>:
> on hangout:
> https://hangouts.google.com/hangouts/_/calendar/anVsaWVuLmxlZGVtQGdtYWlsLmNvbQ.k5oikh8rp3ho37qdca3o9jvh04
>
--
Thanks & Best Regards