Posted to dev@parquet.apache.org by Julien Le Dem <ju...@dremio.com> on 2016/11/10 16:51:56 UTC

parquet sync up today at 10PT (in 1 hour)

Reminder that the Parquet Sync up will be in 1h at 10am PT on hangout:
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up

-- 
Julien

Re: parquet sync up today at 10PT (in 1 hour)

Posted by Julien Le Dem <ju...@dremio.com>.
Thank you for correcting!

On Thursday, November 10, 2016, Ryan Blue <rb...@netflix.com.invalid> wrote:

> I have a slight correction for the Brotli encoding numbers. The 20% size
> decrease incurred a 2.5% increase in compression time (using brotli-5),
> while the 15% size decrease had a 12% encoding time *decrease* (using
> brotli-4). We've decided to use brotli-5 for tables that are read a lot,
> and brotli-4 for most other tables.
>
> On Thu, Nov 10, 2016 at 11:26 AM, Julien Le Dem <julien@dremio.com> wrote:
>
> >  Attendees/agenda:
> > Zoltan (Cloudera):
> >  - Parquet tools questions
> > Piyush (Twitter):
> >  - planning on encoding optimization
> > Uwe:
> >  - release parquet-cpp
> >  - license/notice questions
> > Wes (twosigma):
> >  - working on arrow
> >  - helping with the parquet-cpp release
> > Deepak (HP/Vertica):
> >  - read/write parquet-cpp
> >  - discuss. statistics PARQUET-686. timestamps/...
> > Ryan (Netflix):
> >  - 1.9.0 release out.
> >  - statistics
> > Julien (Dremio):
> >  - Parquet-Arrow integration
> >
> > Notes:
> > Parquet-tools:
> >  - when missing hadoop jars on the class path => bad error message
> >    - 1.6 used to bundle hadoop
> >    - 1.9 requires adding hadoop classpath
> >  - Ryan has a new CLI tool
> >
> > Parquet cpp release:
> >  - need to put mentions in NOTICE files
> >    - merge script came from the Spark project (Apache 2 License)
> >    - some code came from Impala (Apache 2 License)
> >  - Need to track the files imported from impala
> >    - Wes to document.
> >    - Zoltan to look into moving copyright to NOTICE
> >
> > Statistics:
> >  - Revisit signed/unsigned stats approach
> >  - instead add information on how the min/max was obtained. (Collation)
> >  - collation should follow a standard. We’re going to implement only a
> > subset.
> >  - JIRA PARQUET-686
> >
> > int96:
> >  - deprecate write of int96 (Ryan to look into it)
> >
> > New Encodings/compression:
> >  - brotli compression. => 20% decrease in size. 25% increase in encoding
> > time. other settings: 15%/12% (compared to gzip). Ryan to update the PR.
> >     - need cpp integration as well. Uwe
> >  - PARQUET-682: specify encoding per column. Piyush to update PR
> >
> >
> >
> > On Thu, Nov 10, 2016 at 10:00 AM, Julien Le Dem <julien@dremio.com> wrote:
> >
> > > starting now
> > > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> > >
> > > On Thu, Nov 10, 2016 at 8:51 AM, Julien Le Dem <julien@dremio.com> wrote:
> > >
> > >> Reminder that the Parquet Sync up will be in 1h at 10am PT on hangout:
> > >> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> > >>
> > >> --
> > >> Julien
> > >>
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
> >
> >
> > --
> > Julien
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Julien

Re: parquet sync up today at 10PT (in 1 hour)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I have a slight correction for the Brotli encoding numbers. The 20% size
decrease incurred a 2.5% increase in compression time (using brotli-5),
while the 15% size decrease had a 12% encoding time *decrease* (using
brotli-4). We've decided to use brotli-5 for tables that are read a lot,
and brotli-4 for most other tables.

On Thu, Nov 10, 2016 at 11:26 AM, Julien Le Dem <ju...@dremio.com> wrote:

>  Attendees/agenda:
> Zoltan (Cloudera):
>  - Parquet tools questions
> Piyush (Twitter):
>  - planning on encoding optimization
> Uwe:
>  - release parquet-cpp
>  - license/notice questions
> Wes (twosigma):
>  - working on arrow
>  - helping with the parquet-cpp release
> Deepak (HP/Vertica):
>  - read/write parquet-cpp
>  - discuss. statistics PARQUET-686. timestamps/...
> Ryan (Netflix):
>  - 1.9.0 release out.
>  - statistics
> Julien (Dremio):
>  - Parquet-Arrow integration
>
> Notes:
> Parquet-tools:
>  - when missing hadoop jars on the class path => bad error message
>    - 1.6 used to bundle hadoop
>    - 1.9 requires adding hadoop classpath
>  - Ryan has a new CLI tool
>
> Parquet cpp release:
>  - need to put mentions in NOTICE files
>    - merge script came from the Spark project (Apache 2 License)
>    - some code came from Impala (Apache 2 License)
>  - Need to track the files imported from impala
>    - Wes to document.
>    - Zoltan to look into moving copyright to NOTICE
>
> Statistics:
>  - Revisit signed/unsigned stats approach
>  - instead add information on how the min/max was obtained. (Collation)
>  - collation should follow a standard. We’re going to implement only a
> subset.
>  - JIRA PARQUET-686
>
> int96:
>  - deprecate write of int96 (Ryan to look into it)
>
> New Encodings/compression:
>  - brotli compression. => 20% decrease in size. 25% increase in encoding
> time. other settings: 15%/12% (compared to gzip). Ryan to update the PR.
>     - need cpp integration as well. Uwe
>  - PARQUET-682: specify encoding per column. Piyush to update PR
>
>
>
> On Thu, Nov 10, 2016 at 10:00 AM, Julien Le Dem <ju...@dremio.com> wrote:
>
> > starting now
> > https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >
> > On Thu, Nov 10, 2016 at 8:51 AM, Julien Le Dem <ju...@dremio.com>
> wrote:
> >
> >> Reminder that the Parquet Sync up will be in 1h at 10am PT on hangout:
> >> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
> >>
> >> --
> >> Julien
> >>
> >
> >
> >
> > --
> > Julien
> >
>
>
>
> --
> Julien
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: parquet sync up today at 10PT (in 1 hour)

Posted by Julien Le Dem <ju...@dremio.com>.
 Attendees/agenda:
Zoltan (Cloudera):
 - Parquet tools questions
Piyush (Twitter):
 - planning on encoding optimization
Uwe:
 - release parquet-cpp
 - license/notice questions
Wes (twosigma):
 - working on arrow
 - helping with the parquet-cpp release
Deepak (HP/Vertica):
 - read/write parquet-cpp
 - discuss. statistics PARQUET-686. timestamps/...
Ryan (Netflix):
 - 1.9.0 release out.
 - statistics
Julien (Dremio):
 - Parquet-Arrow integration

Notes:
Parquet-tools:
 - when missing hadoop jars on the class path => bad error message
   - 1.6 used to bundle hadoop
   - 1.9 requires adding hadoop classpath
 - Ryan has a new CLI tool

Parquet cpp release:
 - need to put mentions in NOTICE files
   - merge script came from the Spark project (Apache 2 License)
   - some code came from Impala (Apache 2 License)
 - Need to track the files imported from impala
   - Wes to document.
   - Zoltan to look into moving copyright to NOTICE

Statistics:
 - Revisit signed/unsigned stats approach
 - instead add information on how the min/max was obtained. (Collation)
 - collation should follow a standard. We’re going to implement only a
subset.
 - JIRA PARQUET-686
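
For anyone who missed the call, a rough illustration of why the signed/unsigned question matters for min/max. This is only a sketch, not code from parquet-mr or parquet-cpp: the helper names and the sample values are made up for illustration, and it assumes the stats are computed by comparing binary values byte by byte.

```python
# Hypothetical helpers (not part of any Parquet library): two ways of
# ordering the same binary values byte by byte.

def signed_key(value: bytes) -> list:
    # Treat each byte as a signed 8-bit integer, as Java's byte type does.
    return [b - 256 if b > 127 else b for b in value]

def unsigned_key(value: bytes) -> list:
    # Treat each byte as an unsigned integer in the range 0..255.
    return list(value)

# Sample values: "a", "é" (UTF-8 bytes 0xC3 0xA9), and "z".
values = [b"a", b"\xc3\xa9", b"z"]

signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)
unsigned_min = min(values, key=unsigned_key)
unsigned_max = max(values, key=unsigned_key)

print("signed order:  ", signed_min, signed_max)
print("unsigned order:", unsigned_min, unsigned_max)
```

The two orderings disagree on both the min and the max for the same three values, so a reader that filters on min/max needs to know which ordering the writer used. That is the information the collation note above is meant to capture.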

int96:
 - deprecate write of int96 (Ryan to look into it)

New Encodings/compression:
 - brotli compression. => 20% decrease in size. 25% increase in encoding
time. other settings: 15%/12% (compared to gzip). Ryan to update the PR.
    - need cpp integration as well. Uwe
 - PARQUET-682: specify encoding per column. Piyush to update PR



On Thu, Nov 10, 2016 at 10:00 AM, Julien Le Dem <ju...@dremio.com> wrote:

> starting now
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> On Thu, Nov 10, 2016 at 8:51 AM, Julien Le Dem <ju...@dremio.com> wrote:
>
>> Reminder that the Parquet Sync up will be in 1h at 10am PT on hangout:
>> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>>
>> --
>> Julien
>>
>
>
>
> --
> Julien
>



-- 
Julien

Re: parquet sync up today at 10PT (in 1 hour)

Posted by Julien Le Dem <ju...@dremio.com>.
starting now
https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up

On Thu, Nov 10, 2016 at 8:51 AM, Julien Le Dem <ju...@dremio.com> wrote:

> Reminder that the Parquet Sync up will be in 1h at 10am PT on hangout:
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>



-- 
Julien