Posted to dev@parquet.apache.org by Julien Le Dem <ju...@twitter.com.INVALID> on 2015/07/14 19:04:13 UTC

Next Parquet Sync Up

The next Parquet sync up will be held on Google Hangout on 7/21/2015 at
10 am PST
https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up

Re: Next Parquet Sync Up

Posted by Ryan Blue <bl...@cloudera.com>.

+1 Wednesday

On 07/22/2015 04:58 PM, Julien Le Dem wrote:
> +1 Wednesday
>
> On Wed, Jul 22, 2015 at 4:02 PM, Jason Altekruse <al...@gmail.com> wrote:
>
>> +1 for wednesday
>>
>> On Wed, Jul 22, 2015 at 3:47 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>
>>> +1 for Wed.
>>>
>>> On Wed, Jul 22, 2015 at 3:45 PM, Alex Levenson <alexlevenson@twitter.com.invalid> wrote:
>>>
>>>> +1 for Wednesday
>>>>
>>>> On Wed, Jul 22, 2015 at 3:44 PM, Julien Le Dem <julien@twitter.com.invalid> wrote:
>>>>
>>>>> Wednesday then?
>>>>> no more conflicts?
>>>>>
>>>>> On Tue, Jul 21, 2015 at 7:26 PM, Alex Levenson <alexlevenson@twitter.com.invalid> wrote:
>>>>>
>>>>>> Sorry to be difficult but, can I request any day other than Monday -- how
>>>>>> about Wednesday?
>>>>>>
>>>>>> On Tue, Jul 21, 2015 at 7:19 PM, Julien Le Dem <ju...@ledem.net> wrote:
>>>>>>
>>>>>>> There's no particular reason for Tuesdays.
>>>>>>> We could do the next one on a Monday.
>>>>>>> Anybody objects?
>>>>>>>
>>>>>>> Julien
>>>>>>>
>>>>>>>> On Jul 21, 2015, at 17:37, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Any chance we can have these on either a different day or time? The Drill
>>>>>>>> hangout is every Tuesday at 10am so I always have to pick one or the other.
>>>>>>>>
>>>>>>>> On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <nyigitbasi@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> An update to "actions", I will create a PR for the vectorized read instead
>>>>>>>>> of Zhenxiao.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nezih
>>>>>>>>>
>>>>>>>>> On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <julien@twitter.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Agenda
>>>>>>>>>> - Julien (Twitter):
>>>>>>>>>>    - interested in ByteBuffer status
>>>>>>>>>> - Ryan (by email): interested in ByteBuffer status. Did some work on
>>>>>>>>>>   bloom filters. PARQUET-251 and PARQUET-246: make sure 2.0 encodings
>>>>>>>>>>   and other new features are solid.
>>>>>>>>>> - Daniel, Nezih, Zhenxiao (Netflix):
>>>>>>>>>>    - update on vectorized read path for Presto (Dong Chen for Hive)
>>>>>>>>>>    - PARQUET-99: OOM on write
>>>>>>>>>> - Ippokratis: Impala team.
>>>>>>>>>> - Jason Altekruse (Drill/MapR):
>>>>>>>>>>    - update on Java direct memory representation (hadoop 2.0 ByteBuffer)
>>>>>>>>>>    - currently uses a fork of Parquet that uses the GSOC work.
>>>>>>>>>> - Tianshuo: 1.8.1 release.
>>>>>>>>>> - Sanjeev (Twitter):
>>>>>>>>>>    - want to hear updates about vectorized in Presto
>>>>>>>>>>
>>>>>>>>>> Actions:
>>>>>>>>>>    - Zhenxiao: update vectorization PR
>>>>>>>>>>    - Jason: update ByteBuffer PR
>>>>>>>>>>    - Jason: open JIRA for dictionary encoding fallback pointer
>>>>>>>>>>    - Daniel: opened a PR for PARQUET-99: up for review
>>>>>>>>>>
>>>>>>>>>> Notes:
>>>>>>>>>> - Vectorized read path for Presto (Dong Chen for Hive), PARQUET-131
>>>>>>>>>>    - batch read
>>>>>>>>>>    - lazy materialization
>>>>>>>>>>    - Netflix integrated with Presto, Dong Chen integrated with Hive
>>>>>>>>>>    - Nezih: micro/macro benchmark
>>>>>>>>>>       - micro: 2 read paths
>>>>>>>>>>          - only primitives, no converters (3x faster with vectorized)
>>>>>>>>>>          - complex with converters (no difference in performance)
>>>>>>>>>>       - macro (Presto):
>>>>>>>>>>          - complex types not better
>>>>>>>>>>          - 2x better for primitive types
>>>>>>>>>>    - Daniel: projection + predicate well optimized with Presto (lazy load,
>>>>>>>>>>      lazy materialization). Predicate push down and using the dictionary
>>>>>>>>>>      in predicate evaluation.
>>>>>>>>>>    - Ippokratis: fan out? => 100 values per collection, list/map
>>>>>>>>>>      materialization expensive
>>>>>>>>>>
>>>>>>>>>> - Dictionary encoding: because of the fallback mechanism, we don't know
>>>>>>>>>>   when the dictionary ends. => Jason to open a JIRA
>>>>>>>>>>
>>>>>>>>>> - PARQUET-99: OOM on write
>>>>>>>>>>    - all big rows (10MB per row): runs OOM before the first check
>>>>>>>>>>    - big variability in size: small initial rows throw off the estimate
>>>>>>>>>>      and the following big rows blow memory
>>>>>>>>>>    - add settings for checking at constant #rows.
>>>>>>>>>>    - we should experiment with simpler strategies
>>>>>>>>>>
>>>>>>>>>> - ByteBuffer status:
>>>>>>>>>>    - Jason needs to rebase the PR
>>>>>>>>>>    - PARQUET-77
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <julien@twitter.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> It's happening now:
>>>>>>>>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <julien@twitter.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The next Parquet sync up will be held on google hangout on 7/21/2015 at
>>>>>>>>>>>> 10 am PST
>>>>>>>>>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Alex Levenson
>>>>>> @THISWILLWORK
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Alex Levenson
>>>> @THISWILLWORK
>>>>
>>>
>>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.
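
The PARQUET-99 notes quoted above describe two ways a size check that is scheduled from an estimated row size can miss: every row is large, or small initial rows skew the estimate before large rows arrive. The sketch below is a toy simulation of that reasoning, not the parquet-mr writer code; the function name, the parameters (`limit`, `first_check`, `fixed_interval`) and the halving heuristic are all assumptions made for illustration. It compares an estimate-driven check schedule with the "check at constant #rows" setting proposed in the notes.

```python
from typing import Iterable, Optional


def peak_buffered_bytes(
    row_sizes: Iterable[int],
    limit: int,
    first_check: int,
    fixed_interval: Optional[int] = None,
) -> int:
    """Return the largest number of bytes buffered before a flush.

    Toy model (hypothetical, for illustration only):
    - rows are buffered until a size check sees `buffered >= limit`, then flushed
    - with `fixed_interval=None` the next check is scheduled from the average
      row size seen so far (halfway to the estimated limit)
    - with `fixed_interval=N` a check runs every N rows regardless of size
    """
    buffered = 0
    peak = 0
    rows_in_group = 0
    initial = first_check if fixed_interval is None else fixed_interval
    next_check = initial

    for size in row_sizes:
        buffered += size
        rows_in_group += 1
        peak = max(peak, buffered)

        if rows_in_group < next_check:
            continue  # no size check scheduled for this row

        if buffered >= limit:
            # Flush the row group and start over.
            buffered = 0
            rows_in_group = 0
            next_check = initial
        elif fixed_interval is not None:
            # Constant row-count strategy: look again after N more rows.
            next_check = rows_in_group + fixed_interval
        else:
            # Estimate strategy: guess how many average-sized rows still fit
            # and check again halfway there.
            avg = max(1.0, buffered / rows_in_group)
            remaining_rows = int((limit - buffered) / avg)
            next_check = rows_in_group + max(1, remaining_rows // 2)

    return peak
```

With 100 ten-byte rows followed by fifty 1 MB rows and a 10 MB limit, the estimate-driven schedule in this model pushes its next check far past the end of the input, so all fifty large rows accumulate; checking every 5 rows keeps the peak close to the limit. The cost of the fixed interval is more frequent size checks, which is the trade-off the notes suggest experimenting with.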

Re: Next Parquet Sync Up

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.

+1 Wednesday


Re: Next Parquet Sync Up

Posted by Jason Altekruse <al...@gmail.com>.

+1 for wednesday


Re: Next Parquet Sync Up

Posted by Jacques Nadeau <ja...@apache.org>.

+1 for Wed.


Re: Next Parquet Sync Up

Posted by Alex Levenson <al...@twitter.com.INVALID>.

+1 for Wednesday




-- 
Alex Levenson
@THISWILLWORK

Re: Next Parquet Sync Up

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.

Wednesday then?
no more conflicts?


Re: Next Parquet Sync Up

Posted by Alex Levenson <al...@twitter.com.INVALID>.

Sorry to be difficult but, can I request any day other than Monday -- how
about Wednesday?

On Tue, Jul 21, 2015 at 7:19 PM, Julien Le Dem <ju...@ledem.net> wrote:

> There's no particular reason for Tuesdays.
> We could do the next one on a Monday.
> Anybody objects?
>
> Julien
>
> > On Jul 21, 2015, at 17:37, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > Any chance we can have these on either a different day or time?  The Drill
> > hangout is every Tuesday at 10am so I always have to pick one or the other.
> >
> > On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <nyigitbasi@netflix.com.invalid> wrote:
> >
> >> An update to "actions", I will create a PR for the vectorized read instead
> >> of Zhenxiao.
> >>
> >> Thanks,
> >> Nezih
> >>
> >> On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <julien@twitter.com.invalid> wrote:
> >>
> >>> Agenda
> >>> - Julien (Twitter):
> >>>   - interested in ByteBuffer status
> >>> - Ryan (by email): interested in ByteBuffer status. did some work on
> >> bloom
> >>> filters.
> >>> PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new
> >> features
> >>> are solid.
> >>> - Daniel, Nezih, Zhengxiao (Netflix):
> >>>    - update on Vectorized read path for Presto (Dong Chen for Hive)
> >>>    - Parquet-99: OOM on write
> >>> - Ippokratis: Impala team.
> >>> - Jason Altekruse: (Drill/MapR)
> >>>   - update on Java direct memory representation (hadoop 2.0 ByteBuffer)
> >>>   - currently uses a fork of Parquet that uses the GSOC work.
> >>> - Tianshuo: 1.8.1 release.
> >>> - Sanjeev (Twitter):
> >>>  - want to hear updates about vectorized in Presto
> >>>
> >>> actions:
> >>>  - Zhengxiao: update vectorization PR
> >>>  - Jason: update ByteBuffer PR
> >>>  - Jason: open JIRA for dic encoding fallback pointer
> >>>  - Daniel: opened a PR for PARQUET-99: up for review
> >>>
> >>> Notes:
> >>> - Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
> >>>       - batch read
> >>>       - lazy materialization
> >>>       - Netflix integrated with Presto, Dong Chen integrated with Hive
> >>>       - Nezih: micro/macro benchmark
> >>>            - micro 2 read paths
> >>>                  - only primitives, no converters (3 x faster with
> >>> vectorized)
> >>>                  - complex with converters (no different performance)
> >>>            - macro Presto :
> >>>                  - complex types not better
> >>>                  - 2x better for primitive types
> >>>       - Daniel: projection + predicate well optimized with presto (lazy
> >>> load, lazy materialization). predicate push down and using dic in
> >> predicate
> >>> evaluation.
> >>>       - Ippokratis: fan out? => 100 values per collection, list/map
> >>> materialization expansive
> >>>
> >>> - Dictionary encoding: because of fallback mechanism. We don't know
> when
> >>> the dictionary ends. => Jason to open a JIRA
> >>>
> >>> - Parquet-99: OOM on write
> >>>   - all big rows: (10MB per row) runs OOM before we first check
> >>>   - big variability in size: small initial rows throw off estimate and
> >>> following big rows blow memory
> >>>   - add settings for checking at constant #rows.
> >>>   - we should experiment with simpler strategies
> >>>
> >>> - ByteBuffer status:
> >>>   - Jason need to rebase the PR
> >>>   - Parquet-77
> >>>
> >>>
> >>> On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com>
> >>> wrote:
> >>>
> >>>> It's happening now:
> >>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> >>>>
> >>>> On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com>
> >>>> wrote:
> >>>>
> >>>>> The next Parquet sync up will be held on google hangout on 7/21/2015
> >> at
> >>>>> 10 am PST
> >>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> >>
>



-- 
Alex Levenson
@THISWILLWORK

Re: Next Parquet Sync Up

Posted by Julien Le Dem <ju...@ledem.net>.

There's no particular reason for Tuesdays.
We could do the next one on a Monday.
Anybody objects?

Julien

> On Jul 21, 2015, at 17:37, Jacques Nadeau <ja...@apache.org> wrote:
> 
> Any chance we can have these on either a different day or time?  The Drill
> hangout is every Tuesday at 10am so I always have to pick one or the other.

Re: Next Parquet Sync Up

Posted by Jacques Nadeau <ja...@apache.org>.

Any chance we can have these on either a different day or time?  The Drill
hangout is every Tuesday at 10am so I always have to pick one or the other.

On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <
nyigitbasi@netflix.com.invalid> wrote:

> An update to "actions", I will create a PR for the vectorized read instead
> of Zhenxiao.
>
> Thanks,
> Nezih

Re: Next Parquet Sync Up

Posted by Nezih Yigitbasi <ny...@netflix.com.INVALID>.

An update to "actions": I will create a PR for the vectorized read instead
of Zhenxiao.

Thanks,
Nezih

On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <ju...@twitter.com.invalid>
wrote:

> Agenda
> - Julien (Twitter):
>    - interested in ByteBuffer status
> - Ryan (by email): interested in ByteBuffer status. Did some work on bloom
>   filters. PARQUET-251 and PARQUET-246: make sure 2.0 encodings and other new
>   features are solid.
> - Daniel, Nezih, Zhenxiao (Netflix):
>     - update on Vectorized read path for Presto (Dong Chen for Hive)
>     - PARQUET-99: OOM on write
> - Ippokratis: Impala team.
> - Jason Altekruse: (Drill/MapR)
>    - update on Java direct memory representation (Hadoop 2.0 ByteBuffer)
>    - currently uses a fork of Parquet that uses the GSoC work.
> - Tianshuo: 1.8.1 release.
> - Sanjeev (Twitter):
>   - want to hear updates about vectorized in Presto
>
> Actions:
>   - Zhenxiao: update vectorization PR
>   - Jason: update ByteBuffer PR
>   - Jason: open JIRA for dictionary encoding fallback pointer
>   - Daniel: opened a PR for PARQUET-99: up for review
>
> Notes:
> - Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
>        - batch read
>        - lazy materialization
>        - Netflix integrated with Presto, Dong Chen integrated with Hive
>        - Nezih: micro/macro benchmark
>             - micro 2 read paths
>                   - only primitives, no converters (3x faster with vectorized)
>                   - complex with converters (no performance difference)
>             - macro Presto :
>                   - complex types not better
>                   - 2x better for primitive types
>        - Daniel: projection + predicate well optimized with Presto (lazy
>          load, lazy materialization). Predicate push down and using the
>          dictionary in predicate evaluation.
>        - Ippokratis: fan out? => 100 values per collection, list/map
>          materialization expensive
>
> - Dictionary encoding: because of the fallback mechanism, we don't know when
>   the dictionary ends. => Jason to open a JIRA
>
> - PARQUET-99: OOM on write
>    - all big rows (10MB per row): we run out of memory before the first check
>    - big variability in size: small initial rows throw off the estimate and
>      the following big rows blow memory
>    - add a setting for checking at a constant number of rows
>    - we should experiment with simpler strategies
>
> - ByteBuffer status:
>    - Jason needs to rebase the PR
>    - PARQUET-77

Re: Next Parquet Sync Up

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.

Agenda
- Julien (Twitter):
   - interested in ByteBuffer status
- Ryan (by email): interested in ByteBuffer status. Did some work on bloom
  filters. PARQUET-251 and PARQUET-246: make sure 2.0 encodings and other new
  features are solid.
- Daniel, Nezih, Zhenxiao (Netflix):
    - update on Vectorized read path for Presto (Dong Chen for Hive)
    - PARQUET-99: OOM on write
- Ippokratis: Impala team.
- Jason Altekruse: (Drill/MapR)
   - update on Java direct memory representation (Hadoop 2.0 ByteBuffer)
   - currently uses a fork of Parquet that uses the GSoC work.
- Tianshuo: 1.8.1 release.
- Sanjeev (Twitter):
  - wants to hear updates about the vectorized read path in Presto

Actions:
  - Zhenxiao: update vectorization PR
  - Jason: update ByteBuffer PR
  - Jason: open JIRA for dictionary encoding fallback pointer
  - Daniel: opened a PR for PARQUET-99: up for review

Notes:
- Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
       - batch read
       - lazy materialization
       - Netflix integrated with Presto, Dong Chen integrated with Hive
       - Nezih: micro/macro benchmark
            - micro 2 read paths
                  - only primitives, no converters (3x faster with vectorized)
                  - complex with converters (no performance difference)
            - macro Presto :
                  - complex types not better
                  - 2x better for primitive types
       - Daniel: projection + predicate well optimized with Presto (lazy
         load, lazy materialization). Predicate push down and using the
         dictionary in predicate evaluation.
       - Ippokratis: fan out? => 100 values per collection, list/map
         materialization expensive
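
For readers who were not on the call, here is a rough illustration of using the dictionary in predicate evaluation together with lazy materialization. It is a toy Python sketch, not Presto or parquet-mr code: the function names, the in-memory column layout and the sample data are all made up for the example. The predicate is evaluated once per dictionary entry, rows are filtered by dictionary id, and the projected columns are decoded only for the rows that survive.

```python
def matching_rows(dictionary, ids, predicate):
    """Return the row indexes whose dictionary-encoded value satisfies predicate."""
    # Evaluate the predicate once per distinct dictionary entry ...
    entry_matches = [predicate(value) for value in dictionary]
    # ... so each row costs only a lookup on its dictionary id.
    return [row for row, entry_id in enumerate(ids) if entry_matches[entry_id]]


def materialize(columns, rows, projection):
    """Build records for the surviving rows only, and only for projected columns."""
    return [{name: columns[name][row] for name in projection} for row in rows]


# Hypothetical data: a dictionary-encoded "country" column plus two plain columns.
country_dictionary = ["US", "FR", "DE"]
country_ids = [0, 1, 0, 2, 1]
columns = {
    "user": ["ann", "bob", "cyd", "dee", "eve"],
    "clicks": [3, 7, 1, 4, 9],
}

rows = matching_rows(country_dictionary, country_ids, lambda c: c == "FR")
records = materialize(columns, rows, ["user"])
print(rows, records)
```

With three dictionary entries and five rows the predicate runs three times instead of five; the saving grows with the ratio of rows to distinct values.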

- Dictionary encoding: because of the fallback mechanism, we don't know when
  the dictionary ends. => Jason to open a JIRA

- PARQUET-99: OOM on write
   - all big rows (10MB per row): we run out of memory before the first check
   - big variability in size: small initial rows throw off the estimate and
     the following big rows blow memory
   - add a setting for checking at a constant number of rows
   - we should experiment with simpler strategies

- ByteBuffer status:
   - Jason needs to rebase the PR
   - PARQUET-77


On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com> wrote:

> It's happening now:
> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up

Re: Next Parquet Sync Up

Posted by Julien Le Dem <ju...@twitter.com.INVALID>.

It's happening now:
https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up

On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com> wrote:

> The next Parquet sync up will be held on google hangout on 7/21/2015 at 10
> am PST
> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>