Posted to dev@parquet.apache.org by Julien Le Dem <ju...@twitter.com.INVALID> on 2015/07/14 19:04:13 UTC
Next Parquet Sync Up
The next Parquet sync up will be held on Google Hangout on 7/21/2015 at 10 am PST
https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
Re: Next Parquet Sync Up
Posted by Ryan Blue <bl...@cloudera.com>.
+1 Wednesday
On 07/22/2015 04:58 PM, Julien Le Dem wrote:
> +1 Wednesday
>
> On Wed, Jul 22, 2015 at 4:02 PM, Jason Altekruse <al...@gmail.com>
> wrote:
>
>> +1 for wednesday
>>
>> On Wed, Jul 22, 2015 at 3:47 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>>> +1 for Wed.
>>>
>>> On Wed, Jul 22, 2015 at 3:45 PM, Alex Levenson <
>>> alexlevenson@twitter.com.invalid> wrote:
>>>
>>>> +1 for Wednesday
>>>>
>>>> On Wed, Jul 22, 2015 at 3:44 PM, Julien Le Dem <julien@twitter.com.invalid>
>>>> wrote:
>>>>
>>>>> Wednesday then?
>>>>> no more conflicts?
>>>>>
>>>>> On Tue, Jul 21, 2015 at 7:26 PM, Alex Levenson <
>>>>> alexlevenson@twitter.com.invalid> wrote:
>>>>>
>>>>>> Sorry to be difficult but, can I request any day other than Monday --
>>>>>> how about Wednesday?
>>>>>>
>>>>>> On Tue, Jul 21, 2015 at 7:19 PM, Julien Le Dem <ju...@ledem.net> wrote:
>>>>>>
>>>>>>> There's no particular reason for Tuesdays.
>>>>>>> We could do the next one on a Monday.
>>>>>>> Anybody objects?
>>>>>>>
>>>>>>> Julien
>>>>>>>
>>>>>>>> On Jul 21, 2015, at 17:37, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Any chance we can have these on either a different day or time? The Drill
>>>>>>>> hangout is every Tuesday at 10am so I always have to pick one or the other.
>>>>>>>>
>>>>>>>> On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <
>>>>>>>> nyigitbasi@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> An update to "actions", I will create a PR for the vectorized read
>>>>>>>>> instead of Zhenxiao.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nezih
>>>>>>>>>
>>>>>>>>> On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <julien@twitter.com.invalid>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Agenda
>>>>>>>>>> - Julien (Twitter):
>>>>>>>>>>   - interested in ByteBuffer status
>>>>>>>>>> - Ryan (by email): interested in ByteBuffer status. Did some work on
>>>>>>>>>>   bloom filters.
>>>>>>>>>>   PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new
>>>>>>>>>>   features are solid.
>>>>>>>>>> - Daniel, Nezih, Zhenxiao (Netflix):
>>>>>>>>>>   - update on vectorized read path for Presto (Dong Chen for Hive)
>>>>>>>>>>   - PARQUET-99: OOM on write
>>>>>>>>>> - Ippokratis: Impala team.
>>>>>>>>>> - Jason Altekruse (Drill/MapR):
>>>>>>>>>>   - update on Java direct memory representation (Hadoop 2.0 ByteBuffer)
>>>>>>>>>>   - currently uses a fork of Parquet that uses the GSOC work.
>>>>>>>>>> - Tianshuo: 1.8.1 release.
>>>>>>>>>> - Sanjeev (Twitter):
>>>>>>>>>>   - wants to hear updates about vectorized reads in Presto
>>>>>>>>>>
>>>>>>>>>> Actions:
>>>>>>>>>> - Zhenxiao: update vectorization PR
>>>>>>>>>> - Jason: update ByteBuffer PR
>>>>>>>>>> - Jason: open JIRA for dictionary encoding fallback pointer
>>>>>>>>>> - Daniel: opened a PR for PARQUET-99: up for review
>>>>>>>>>>
>>>>>>>>>> Notes:
>>>>>>>>>> - Vectorized read path for Presto (Dong Chen for Hive), PARQUET-131
>>>>>>>>>>   - batch read
>>>>>>>>>>   - lazy materialization
>>>>>>>>>>   - Netflix integrated with Presto, Dong Chen integrated with Hive
>>>>>>>>>>   - Nezih: micro/macro benchmark
>>>>>>>>>>     - micro: 2 read paths
>>>>>>>>>>       - only primitives, no converters (3x faster with vectorized)
>>>>>>>>>>       - complex with converters (no difference in performance)
>>>>>>>>>>     - macro (Presto):
>>>>>>>>>>       - complex types not better
>>>>>>>>>>       - 2x better for primitive types
>>>>>>>>>>   - Daniel: projection + predicate well optimized with Presto (lazy
>>>>>>>>>>     load, lazy materialization). Predicate push down and using the
>>>>>>>>>>     dictionary in predicate evaluation.
>>>>>>>>>>   - Ippokratis: fan out? => 100 values per collection, list/map
>>>>>>>>>>     materialization expensive
>>>>>>>>>>
>>>>>>>>>> - Dictionary encoding: because of the fallback mechanism, we don't know
>>>>>>>>>>   when the dictionary ends. => Jason to open a JIRA
>>>>>>>>>>
>>>>>>>>>> - PARQUET-99: OOM on write
>>>>>>>>>>   - all big rows (10MB per row): runs OOM before we first check
>>>>>>>>>>   - big variability in size: small initial rows throw off the estimate
>>>>>>>>>>     and the following big rows blow memory
>>>>>>>>>>   - add settings for checking at a constant #rows.
>>>>>>>>>>   - we should experiment with simpler strategies
>>>>>>>>>>
>>>>>>>>>> - ByteBuffer status:
>>>>>>>>>>   - Jason needs to rebase the PR
>>>>>>>>>>   - PARQUET-77
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <julien@twitter.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> It's happening now:
>>>>>>>>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <julien@twitter.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The next Parquet sync up will be held on google hangout on 7/21/2015 at
>>>>>>>>>>>> 10 am PST
>>>>>>>>>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Alex Levenson
>>>>>> @THISWILLWORK
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Alex Levenson
>>>> @THISWILLWORK
>>>>
>>>
>>
>
--
Ryan Blue
Software Engineer
Cloudera, Inc.
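The PARQUET-99 notes quoted above name two failure modes for a writer that only checks its buffered size at estimated row counts: rows that are all large exhaust memory before the first check, and small initial rows push the next check far out so later large rows are never measured in time. The following is a minimal simulation of that reasoning, not the parquet-mr implementation. The function names, the 128 MB threshold, the 100-row first check, the halfway-to-estimate rule, and the 10-row fixed interval are all assumptions chosen for illustration.

```python
KB = 1024
MB = 1024 * KB
THRESHOLD = 128 * MB  # assumed flush threshold, for illustration only


def peak_buffered_bytes(row_sizes, threshold, next_check):
    """Return the most bytes held in memory by a simulated buffering writer.

    The writer only compares its buffer against `threshold` at the row
    counts chosen by `next_check(rows_buffered, bytes_buffered)`.
    """
    rows = 0
    nbytes = 0
    peak = 0
    check_at = next_check(0, 0)
    for size in row_sizes:
        rows += 1
        nbytes += size
        peak = max(peak, nbytes)
        if rows >= check_at:
            if nbytes >= threshold:
                rows, nbytes = 0, 0  # flush the buffer
            check_at = next_check(rows, nbytes)
    return peak


def adaptive(threshold, first_check=100):
    """Hypothetical estimate-based schedule driven by the average row size."""
    def next_check(rows, nbytes):
        if rows == 0 or nbytes == 0:
            return rows + first_check
        rows_to_fill = threshold / (nbytes / rows)
        # Look again halfway to the row count predicted to fill the buffer.
        return max(rows + 1, int((rows + rows_to_fill) / 2))
    return next_check


def constant(every):
    """Hypothetical fixed schedule: check the buffer every `every` rows."""
    def next_check(rows, nbytes):
        return rows + every
    return next_check


# Case 1 from the notes: every row is big (10 MB).
big_rows = [10 * MB] * 150
# Case 2 from the notes: small initial rows followed by big rows.
mixed_rows = [1 * KB] * 100 + [10 * MB] * 200

for name, rows in (("all big", big_rows), ("small then big", mixed_rows)):
    est = peak_buffered_bytes(rows, THRESHOLD, adaptive(THRESHOLD))
    fixed = peak_buffered_bytes(rows, THRESHOLD, constant(10))
    print(f"{name}: adaptive peak {est // MB} MB, constant peak {fixed // MB} MB")
```

Under these assumed numbers, the estimate-based schedule buffers all 100 big rows before its first check in the first case and never re-checks in the second, while the fixed interval flushes once about 20 big rows have accumulated. A fixed interval bounds the worst case only if the interval times the largest row size fits in memory, which is why the interval would have to be a setting.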
Re: Next Parquet Sync Up
Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
+1 Wednesday
Re: Next Parquet Sync Up
Posted by Jason Altekruse <al...@gmail.com>.
+1 for wednesday
Re: Next Parquet Sync Up
Posted by Jacques Nadeau <ja...@apache.org>.
+1 for Wed.
Re: Next Parquet Sync Up
Posted by Alex Levenson <al...@twitter.com.INVALID>.
+1 for Wednesday
--
Alex Levenson
@THISWILLWORK
Re: Next Parquet Sync Up
Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
Wednesday then?
no more conflicts?
Re: Next Parquet Sync Up
Posted by Alex Levenson <al...@twitter.com.INVALID>.
Sorry to be difficult but, can I request any day other than Monday -- how
about Wednesday?
On Tue, Jul 21, 2015 at 7:19 PM, Julien Le Dem <ju...@ledem.net> wrote:
> There's no particular reason for Tuesdays.
> We could do the next one on a Monday.
> Anybody objects?
>
> Julien
>
> > On Jul 21, 2015, at 17:37, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > Any chance we can have these on either a different day or time? The
> Drill
> > hangout is every Tuesday at 10am so I always have to pick one or the
> other.
> >
> > On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <
> > nyigitbasi@netflix.com.invalid> wrote:
> >
> >> An update to "actions", I will create a PR for the vectorized read
> instead
> >> of Zhenxiao.
> >>
> >> Thanks,
> >> Nezih
> >>
> >> On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem
> <julien@twitter.com.invalid
> >> wrote:
> >>
> >>> Agenda
> >>> - Julien (Twitter):
> >>> - interested in ByteBuffer status
> >>> - Ryan (by email): interested in ByteBuffer status. did some work on
> >> bloom
> >>> filters.
> >>> PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new
> >> features
> >>> are solid.
> >>> - Daniel, Nezih, Zhengxiao (Netflix):
> >>> - update on Vectorized read path for Presto (Dong Chen for Hive)
> >>> - Parquet-99: OOM on write
> >>> - Ippokratis: Impala team.
> >>> - Jason Altekruse: (Drill/MapR)
> >>> - update on Java direct memory representation (hadoop 2.0 ByteBuffer)
> >>> - currently uses a fork of Parquet that uses the GSOC work.
> >>> - Tianshuo: 1.8.1 release.
> >>> - Sanjeev (Twitter):
> >>> - want to hear updates about vectorized in Presto
> >>>
> >>> actions:
> >>> - Zhengxiao: update vectorization PR
> >>> - Jason: update ByteBuffer PR
> >>> - Jason: open JIRA for dic encoding fallback pointer
> >>> - Daniel: opened a PR for PARQUET-99: up for review
> >>>
> >>> Notes:
> >>> - Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
> >>> - batch read
> >>> - lazy materialization
> >>> - Netflix integrated with Presto, Dong Chen integrated with Hive
> >>> - Nezih: micro/macro benchmark
> >>> - micro 2 read paths
> >>> - only primitives, no converters (3 x faster with
> >>> vectorized)
> >>> - complex with converters (no different performance)
> >>> - macro Presto :
> >>> - complex types not better
> >>> - 2x better for primitive types
> >>> - Daniel: projection + predicate well optimized with presto (lazy
> >>> load, lazy materialization). predicate push down and using dic in predicate
> >>> evaluation.
> >>> - Ippokratis: fan out? => 100 values per collection, list/map
> >>> materialization expansive
> >>>
> >>> - Dictionary encoding: because of fallback mechanism. We don't know when
> >>> the dictionary ends. => Jason to open a JIRA
> >>>
> >>> - Parquet-99: OOM on write
> >>> - all big rows: (10MB per row) runs OOM before we first check
> >>> - big variability in size: small initial rows throw off estimate and
> >>> following big rows blow memory
> >>> - add settings for checking at constant #rows.
> >>> - we should experiment with simpler strategies
> >>>
> >>> - ByteBuffer status:
> >>> - Jason need to rebase the PR
> >>> - Parquet-77
> >>>
> >>>
> >>> On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com>
> >>> wrote:
> >>>
> >>>> It's happening now:
> >>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> >>>>
> >>>> On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com>
> >>>> wrote:
> >>>>
> >>>>> The next Parquet sync up will be held on google hangout on 7/21/2015 at
> >>>>> 10 am PST
> >>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> >>
>
--
Alex Levenson
@THISWILLWORK
Re: Next Parquet Sync Up
Posted by Julien Le Dem <ju...@ledem.net>.
There's no particular reason for Tuesdays.
We could do the next one on a Monday.
Any objections?
Julien
> On Jul 21, 2015, at 17:37, Jacques Nadeau <ja...@apache.org> wrote:
>
> Any chance we can have these on either a different day or time? The Drill
> hangout is every Tuesday at 10am so I always have to pick one or the other.
>
> On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <
> nyigitbasi@netflix.com.invalid> wrote:
>
>> An update to "actions", I will create a PR for the vectorized read instead
>> of Zhenxiao.
>>
>> Thanks,
>> Nezih
>>
>> On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <julien@twitter.com.invalid
>> wrote:
>>
>>> Agenda
>>> - Julien (Twitter):
>>> - interested in ByteBuffer status
>>> - Ryan (by email): interested in ByteBuffer status. did some work on bloom
>>> filters.
>>> PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new features
>>> are solid.
>>> - Daniel, Nezih, Zhengxiao (Netflix):
>>> - update on Vectorized read path for Presto (Dong Chen for Hive)
>>> - Parquet-99: OOM on write
>>> - Ippokratis: Impala team.
>>> - Jason Altekruse: (Drill/MapR)
>>> - update on Java direct memory representation (hadoop 2.0 ByteBuffer)
>>> - currently uses a fork of Parquet that uses the GSOC work.
>>> - Tianshuo: 1.8.1 release.
>>> - Sanjeev (Twitter):
>>> - want to hear updates about vectorized in Presto
>>>
>>> actions:
>>> - Zhengxiao: update vectorization PR
>>> - Jason: update ByteBuffer PR
>>> - Jason: open JIRA for dic encoding fallback pointer
>>> - Daniel: opened a PR for PARQUET-99: up for review
>>>
>>> Notes:
>>> - Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
>>> - batch read
>>> - lazy materialization
>>> - Netflix integrated with Presto, Dong Chen integrated with Hive
>>> - Nezih: micro/macro benchmark
>>> - micro 2 read paths
>>> - only primitives, no converters (3 x faster with
>>> vectorized)
>>> - complex with converters (no different performance)
>>> - macro Presto :
>>> - complex types not better
>>> - 2x better for primitive types
>>> - Daniel: projection + predicate well optimized with presto (lazy
>>> load, lazy materialization). predicate push down and using dic in predicate
>>> evaluation.
>>> - Ippokratis: fan out? => 100 values per collection, list/map
>>> materialization expansive
>>>
>>> - Dictionary encoding: because of fallback mechanism. We don't know when
>>> the dictionary ends. => Jason to open a JIRA
>>>
>>> - Parquet-99: OOM on write
>>> - all big rows: (10MB per row) runs OOM before we first check
>>> - big variability in size: small initial rows throw off estimate and
>>> following big rows blow memory
>>> - add settings for checking at constant #rows.
>>> - we should experiment with simpler strategies
>>>
>>> - ByteBuffer status:
>>> - Jason need to rebase the PR
>>> - Parquet-77
>>>
>>>
>>> On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com>
>>> wrote:
>>>
>>>> It's happening now:
>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>>>
>>>> On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com>
>>>> wrote:
>>>>
>>>>> The next Parquet sync up will be held on google hangout on 7/21/2015 at
>>>>> 10 am PST
>>>>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>
Re: Next Parquet Sync Up
Posted by Jacques Nadeau <ja...@apache.org>.
Any chance we can have these on either a different day or time? The Drill
hangout is every Tuesday at 10am so I always have to pick one or the other.
On Tue, Jul 21, 2015 at 10:56 AM, Nezih Yigitbasi <
nyigitbasi@netflix.com.invalid> wrote:
> An update to "actions", I will create a PR for the vectorized read instead
> of Zhenxiao.
>
> Thanks,
> Nezih
>
> On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <julien@twitter.com.invalid>
> wrote:
>
> > Agenda
> > - Julien (Twitter):
> > - interested in ByteBuffer status
> > - Ryan (by email): interested in ByteBuffer status. did some work on bloom
> > filters.
> > PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new features
> > are solid.
> > - Daniel, Nezih, Zhengxiao (Netflix):
> > - update on Vectorized read path for Presto (Dong Chen for Hive)
> > - Parquet-99: OOM on write
> > - Ippokratis: Impala team.
> > - Jason Altekruse: (Drill/MapR)
> > - update on Java direct memory representation (hadoop 2.0 ByteBuffer)
> > - currently uses a fork of Parquet that uses the GSOC work.
> > - Tianshuo: 1.8.1 release.
> > - Sanjeev (Twitter):
> > - want to hear updates about vectorized in Presto
> >
> > actions:
> > - Zhengxiao: update vectorization PR
> > - Jason: update ByteBuffer PR
> > - Jason: open JIRA for dic encoding fallback pointer
> > - Daniel: opened a PR for PARQUET-99: up for review
> >
> > Notes:
> > - Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
> > - batch read
> > - lazy materialization
> > - Netflix integrated with Presto, Dong Chen integrated with Hive
> > - Nezih: micro/macro benchmark
> > - micro 2 read paths
> > - only primitives, no converters (3 x faster with
> > vectorized)
> > - complex with converters (no different performance)
> > - macro Presto :
> > - complex types not better
> > - 2x better for primitive types
> > - Daniel: projection + predicate well optimized with presto (lazy
> > load, lazy materialization). predicate push down and using dic in predicate
> > evaluation.
> > - Ippokratis: fan out? => 100 values per collection, list/map
> > materialization expansive
> >
> > - Dictionary encoding: because of fallback mechanism. We don't know when
> > the dictionary ends. => Jason to open a JIRA
> >
> > - Parquet-99: OOM on write
> > - all big rows: (10MB per row) runs OOM before we first check
> > - big variability in size: small initial rows throw off estimate and
> > following big rows blow memory
> > - add settings for checking at constant #rows.
> > - we should experiment with simpler strategies
> >
> > - ByteBuffer status:
> > - Jason need to rebase the PR
> > - Parquet-77
> >
> >
> > On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com>
> > wrote:
> >
> > > It's happening now:
> > > https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> > >
> > > On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com>
> > > wrote:
> > >
> > >> The next Parquet sync up will be held on google hangout on 7/21/2015 at
> > >> 10 am PST
> > >> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> > >>
> > >
> > >
> >
>
Re: Next Parquet Sync Up
Posted by Nezih Yigitbasi <ny...@netflix.com.INVALID>.
An update to "actions", I will create a PR for the vectorized read instead
of Zhenxiao.
Thanks,
Nezih
On Tue, Jul 21, 2015 at 10:51 AM, Julien Le Dem <ju...@twitter.com.invalid>
wrote:
> Agenda
> - Julien (Twitter):
> - interested in ByteBuffer status
> - Ryan (by email): interested in ByteBuffer status. did some work on bloom
> filters.
> PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new features
> are solid.
> - Daniel, Nezih, Zhengxiao (Netflix):
> - update on Vectorized read path for Presto (Dong Chen for Hive)
> - Parquet-99: OOM on write
> - Ippokratis: Impala team.
> - Jason Altekruse: (Drill/MapR)
> - update on Java direct memory representation (hadoop 2.0 ByteBuffer)
> - currently uses a fork of Parquet that uses the GSOC work.
> - Tianshuo: 1.8.1 release.
> - Sanjeev (Twitter):
> - want to hear updates about vectorized in Presto
>
> actions:
> - Zhengxiao: update vectorization PR
> - Jason: update ByteBuffer PR
> - Jason: open JIRA for dic encoding fallback pointer
> - Daniel: opened a PR for PARQUET-99: up for review
>
> Notes:
> - Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
> - batch read
> - lazy materialization
> - Netflix integrated with Presto, Dong Chen integrated with Hive
> - Nezih: micro/macro benchmark
> - micro 2 read paths
> - only primitives, no converters (3 x faster with
> vectorized)
> - complex with converters (no different performance)
> - macro Presto :
> - complex types not better
> - 2x better for primitive types
> - Daniel: projection + predicate well optimized with presto (lazy
> load, lazy materialization). predicate push down and using dic in predicate
> evaluation.
> - Ippokratis: fan out? => 100 values per collection, list/map
> materialization expansive
>
> - Dictionary encoding: because of fallback mechanism. We don't know when
> the dictionary ends. => Jason to open a JIRA
>
> - Parquet-99: OOM on write
> - all big rows: (10MB per row) runs OOM before we first check
> - big variability in size: small initial rows throw off estimate and
> following big rows blow memory
> - add settings for checking at constant #rows.
> - we should experiment with simpler strategies
>
> - ByteBuffer status:
> - Jason need to rebase the PR
> - Parquet-77
>
>
> On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com>
> wrote:
>
> > It's happening now:
> > https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> >
> > On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com>
> > wrote:
> >
> >> The next Parquet sync up will be held on google hangout on 7/21/2015 at
> >> 10 am PST
> >> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
> >>
> >
> >
>
Re: Next Parquet Sync Up
Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
Agenda
- Julien (Twitter):
- interested in ByteBuffer status
- Ryan (by email): interested in ByteBuffer status. did some work on bloom
filters.
PARQUET-251 and PARQUET-246 make sure 2.0 encodings and other new features
are solid.
- Daniel, Nezih, Zhenxiao (Netflix):
- update on Vectorized read path for Presto (Dong Chen for Hive)
- Parquet-99: OOM on write
- Ippokratis: Impala team.
- Jason Altekruse: (Drill/MapR)
- update on Java direct memory representation (hadoop 2.0 ByteBuffer)
- currently uses a fork of Parquet that uses the GSoC work.
- Tianshuo: 1.8.1 release.
- Sanjeev (Twitter):
- want to hear updates about vectorized in Presto
actions:
- Zhenxiao: update vectorization PR
- Jason: update ByteBuffer PR
- Jason: open JIRA for dictionary encoding fallback pointer
- Daniel: opened a PR for PARQUET-99: up for review
Notes:
- Vectorized read path for Presto (Dong Chen for Hive) PARQUET-131
- batch read
- lazy materialization
- Netflix integrated with Presto, Dong Chen integrated with Hive
- Nezih: micro/macro benchmark
- micro: 2 read paths
- only primitives, no converters (3x faster with vectorized)
- complex with converters (no difference in performance)
- macro (Presto):
- complex types not better
- 2x better for primitive types
- Daniel: projection + predicate well optimized with Presto (lazy
load, lazy materialization). Predicate push down and using the dictionary in
predicate evaluation.
- Ippokratis: fan out? => 100 values per collection, list/map
materialization expensive
- Dictionary encoding: because of the fallback mechanism, we don't know when
the dictionary ends. => Jason to open a JIRA
- Parquet-99: OOM on write
- all big rows: (10MB per row) runs OOM before we first check
- big variability in size: small initial rows throw off estimate and
following big rows blow memory
- add settings for checking at constant #rows.
- we should experiment with simpler strategies
- ByteBuffer status:
- Jason needs to rebase the PR
- Parquet-77
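
To illustrate the "big variability in size" case from the Parquet-99 notes, here
is a rough simulation sketch in Python. It is not the parquet-mr code: the
function names, the 100-row first check, the halfway-to-estimate rule and all
of the sizes are assumptions made up for illustration. It compares scheduling
the next size check from the average row size seen so far against checking
every constant number of rows.

```python
def peak_buffered_bytes(row_sizes, threshold, next_check, first_check=100):
    """Simulate a writer that flushes once buffered bytes reach `threshold`,
    but only inspects the buffer at scheduled row counts.
    Returns the largest number of bytes held in memory at any point."""
    buffered = rows = peak = 0
    check_at = first_check
    for size in row_sizes:
        buffered += size
        rows += 1
        peak = max(peak, buffered)
        if rows < check_at:
            continue  # not a scheduled check, keep buffering blindly
        if buffered >= threshold:
            buffered = rows = 0  # flush the row group and start over
            check_at = first_check
        else:
            check_at = next_check(rows, buffered, threshold)
    return peak


def estimated_check(rows, buffered, threshold):
    """Hypothetical estimate-based schedule: check again halfway to the row
    count at which the average row size so far would fill the threshold."""
    avg_row_size = buffered / rows
    return (rows + int(threshold / avg_row_size)) // 2


def constant_check(interval):
    """Hypothetical constant schedule: check again every `interval` rows."""
    return lambda rows, buffered, threshold: rows + interval


THRESHOLD = 1_000_000  # made-up flush threshold in bytes
# 100 small rows (10 bytes) followed by 1000 big rows (10 KB each).
rows = [10] * 100 + [10_000] * 1000

peak_estimated = peak_buffered_bytes(rows, THRESHOLD, estimated_check)
peak_constant = peak_buffered_bytes(rows, THRESHOLD, constant_check(10))
print("estimated schedule peak:", peak_estimated)
print("constant schedule peak: ", peak_constant)
```

With these made-up numbers the small initial rows push the estimated next check
far beyond the end of the input, so the big rows are buffered without any
check and the peak ends up around ten times the threshold. The constant
schedule overshoots by at most one interval of rows, at the cost of checking
more often.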
On Tue, Jul 21, 2015 at 10:05 AM, Julien Le Dem <ju...@twitter.com> wrote:
> It's happening now:
> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>
> On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com>
> wrote:
>
>> The next Parquet sync up will be held on google hangout on 7/21/2015 at
>> 10 am PST
>> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>>
>
>
Re: Next Parquet Sync Up
Posted by Julien Le Dem <ju...@twitter.com.INVALID>.
It's happening now:
https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
On Tue, Jul 14, 2015 at 10:04 AM, Julien Le Dem <ju...@twitter.com> wrote:
> The next Parquet sync up will be held on google hangout on 7/21/2015 at 10
> am PST
> https://plus.google.com/hangouts/_/twitter.com/parquet-sync-up
>