Posted to dev@arrow.apache.org by Igor Calabria <ig...@gmail.com> on 2020/03/03 16:25:45 UTC

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Hi Micah, I actually got involved with another personal project and had to
postpone my contribution to arrow a bit. The good news is that I'm almost
done with it, so I could help you with the read side very soon. Any ideas
how we could coordinate this?

On Wed, Feb 26, 2020 at 9:06 PM, Wes McKinney <we...@gmail.com>
wrote:

> hi Micah -- great news on the level generation PR. I'll try to carve
> out some time for reviewing over the coming week.
>
> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Hi Igor,
> > I was wondering if you have made any progress on this?
> >
> > I posted a new PR [1] which I believe handles the difficult algorithmic
> > part of writing.  There will be some follow-ups, but I think this PR might
> > take a while to review, so I was thinking of starting to take a look at the
> > read side if you haven't started yet, and circle back to the final
> > integration for the write side once the PR is checked in.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/pull/6490
> >
> > On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <ig...@gmail.com>
> > wrote:
> >
> > > Hi, I would love to help with this issue. I'm aware that this is a huge
> > > task for a first contribution to arrow, but I feel that I could help with
> > > the read path.
> > > Reading parquet seems like an extremely complex task, since both hive[0]
> > > and spark[1] tried to implement a "vectorized" version and both stopped
> > > short of supporting complex types.
> > > I wanted to at least give it a try and find out where the challenge lies.
> > >
> > > Since you guys are much more familiar with the current code base, I could
> > > use some starting tips so I don't fall into common pitfalls and whatnot.
> > >
> > > [0] https://issues.apache.org/jira/browse/HIVE-18576
> > > [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > >
> > > On 2020/02/03 06:01:25, Micah Kornfield <e....@gmail.com> wrote:
> > > > Just to give an update.  I've been a little bit delayed, but my progress
> > > > is as follows:
> > > > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
> > > > 2.  Have another PR open that allows a configuration option in C++ to
> > > > determine which algorithm version to use for reading/writing: the
> > > > existing version and the new version supporting complex nested arrays.
> > > > I think a large amount of code will be reused/delegated to, but I will
> > > > err on the side of not touching the existing code/algorithms, so that
> > > > any errors in the implementation or performance regressions can
> > > > hopefully be mitigated at runtime.  I expect in later releases (once
> > > > the code has "baked") it will become a no-op.
> > > > 3.  Started coding the write path.
> > > >
> > > > Which leaves:
> > > > 1.  Finishing the write path (I estimate 2-3 weeks) to be code complete.
> > > > 2.  Implementing the read path.
> > > >
> > > > Again, I'm happy to collaborate if people have bandwidth and want to
> > > > contribute.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Wes,
> > > > > I'm still interested in doing the work, but I don't want to hold
> > > > > anybody up if they have bandwidth.
> > > > >
> > > > > In order to actually make progress on this, my plan will be to:
> > > > > 1.  Help with the current Java review backlog through early next week
> > > > > or so (this has been taking the majority of my time allocated for
> > > > > Arrow contributions for the last 6 months or so).
> > > > > 2.  Shift all my attention to trying to get this done (this means no
> > > > > reviews other than closing out existing ones that I've started until
> > > > > it is done).  Hopefully, other Java committers can help shrink the
> > > > > backlog further (Jacques, thanks for your recent efforts here).
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> > > > >
> > > > >> hi folks,
> > > > >>
> > > > >> I think we have reached a point where the incomplete C++ Parquet
> > > > >> nested data assembly/disassembly is harming the value of several
> > > > >> other parts of the project, for example the Datasets API. As another
> > > > >> example, it's possible to ingest nested data from JSON but not write
> > > > >> it to Parquet in general.
> > > > >>
> > > > >> Implementing the nested data read and write path completely is a
> > > > >> difficult project requiring at least several weeks of dedicated work,
> > > > >> so it's not so surprising that it hasn't been accomplished yet. I
> > > > >> know that several people have expressed interest in working on it,
> > > > >> but I would like to see if anyone would be able to volunteer a
> > > > >> commitment of time and guess at a rough timeline for when this work
> > > > >> could be done. It seems to me that if this slips beyond 2020 it will
> > > > >> significantly diminish the value being created by other parts of the
> > > > >> project.
> > > > >>
> > > > >> Since I'm pretty familiar with all the Parquet code I'm one candidate
> > > > >> person to take on this project (and I can dedicate the time, but it
> > > > >> would come at the expense of other projects where I can also be
> > > > >> useful). But Micah and others expressed interest in working on it, so
> > > > >> I wanted to have a discussion about it to see what others think.
> > > > >>
> > > > >> Thanks
> > > > >> Wes
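The "difficult algorithmic part of writing" discussed above is Dremel-style definition/repetition level generation. As a toy illustration (not the actual PR code, which handles arbitrary nesting of lists, structs, and maps), here is what level generation means for the simplest case, a single optional list<int64> column under the standard three-level Parquet LIST encoding:

```python
def levels_for_list_column(rows):
    """Compute Parquet definition/repetition levels for one optional
    list<optional int64> column using the standard three-level LIST
    encoding: def 0 = null list, 1 = empty list, 2 = null element,
    3 = present element; rep 0 starts a new row, rep 1 continues a list."""
    defs, reps = [], []
    for row in rows:
        if row is None:
            defs.append(0)
            reps.append(0)
        elif len(row) == 0:
            defs.append(1)
            reps.append(0)
        else:
            for i, value in enumerate(row):
                reps.append(0 if i == 0 else 1)
                defs.append(2 if value is None else 3)
    return defs, reps

defs, reps = levels_for_list_column([[1, 2], [], None, [3]])
# defs == [3, 3, 1, 0, 3]; reps == [0, 1, 0, 0, 0]
```

The hard part in the real implementation is doing this generically for any combination of nested types, and efficiently over Arrow's validity bitmaps and offsets rather than row by row.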

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Micah Kornfield <em...@gmail.com>.
Hi Igor,
It looks like a good start.  Thank you! I left feedback on the PR.  I
should have some time over the next few weeks or so to handle the
combinations of lists/structs that aren't currently supported in the core
reader.  As I get further into the code I'll chime in here if there are
other areas to collaborate on.

Thanks,
Micah

On Sun, Mar 29, 2020 at 4:27 PM Igor Calabria <ig...@gmail.com>
wrote:

> Hi Micah,
>
> Finally got around to doing some work on the reader's side. Like you
> suggested, I started with https://issues.apache.org/jira/browse/ARROW-7960.
>
> I've never programmed C++ professionally, so I opened the PR
> https://github.com/apache/arrow/pull/6758 as soon as possible to collect
> feedback while I continue the work. Next step is actually fixing the
> reader, since adding an explicit translation to maps broke it. It should be
> simple enough to just reuse StructReader.
>
> On Sun, Mar 15, 2020 at 1:17 AM, Micah Kornfield <
> emkornfield@gmail.com> wrote:
>
>> >
>> > I'd be OK with having the flag so long as the new code is the default.
>> > Otherwise we'll never find out about the corner cases.
>>
>> Completely agree, my PR makes the new code the default.
>>
>> On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney <we...@gmail.com> wrote:
>>
>> > On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <emkornfield@gmail.com
>> >
>> > wrote:
>> > >
>> > > Maarten, I don't expect regressions for flat cases (I'm going to try to
>> > > run a benchmark comparison tonight).
>> > >
>> > > In terms of the flag, I'm more concerned about some corner case I didn't
>> > > think of in testing, or a workload that for some reason is better with the
>> > > prior code. If either of these arise, I would like to give users a way to
>> > > mitigate issues so a patch release isn't an imperative.
>> >
>> > I'd be OK with having the flag so long as the new code is the default.
>> > Otherwise we'll never find out about the corner cases.
>> >
>> > > I'm happy to clean up the existing code as soon as the next release
>> > > happens, though.
>> > >
>> > > Thoughts?
>> > >
>> > > -Micah
>> > >
>> > > On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <we...@gmail.com>
>> > wrote:
>> > >
>> > > > Maarten -- AFAIK Micah's work only affects nested / non-flat column
>> > > > paths, so flat data should not be impacted. Since we have a partial
>> > > > implementation of writes for nested data (lists-of-lists and
>> > > > structs-of-structs, but no mix of the two) that was the performance
>> > > > difference I was referencing.
>> > > >
>> > > > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <
>> > maartenb@xs4all.nl>
>> > > > wrote:
>> > > > >
>> > > > > Hi Micah,
>> > > > >
>> > > > > How does the performance change for “flat” schemas?
>> > > > > (particularly in the case of a large number of columns)
>> > > > >
>> > > > > Thanks,
>> > > > > Maarten
>> > > > >
>> > > > >
>> > > > >
>> > > > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <
>> > emkornfield@gmail.com>
>> > > > wrote:
>> > > > > >
>> > > > > > Another status update.  I've integrated the level generation code
>> > > > > > with the parquet writing code [1].
>> > > > > >
>> > > > > > After that PR is merged I'll add bindings in Python to control
>> > > > > > versions of the level generation algorithm, and plan on moving on
>> > > > > > to the read side.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Micah
>> > > > > >
>> > > > > > [1] https://github.com/apache/arrow/pull/6586
>> > > > > >
>> > > > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <
>> > emkornfield@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Hi Igor,
>> > > > > >> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
>> > > > > >> might be a good task to pick up. I think it should be a relatively
>> > > > > >> small amount of code, so it is probably a good contribution to the
>> > > > > >> project.  Once that is wrapped up we can see where we both are.
>> > > > > >>
>> > > > > >> Cheers,
>> > > > > >> Micah

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Igor Calabria <ig...@gmail.com>.
Hi Micah,

Finally got around to doing some work on the reader's side. Like you
suggested, I started with https://issues.apache.org/jira/browse/ARROW-7960.
I've never programmed C++ professionally, so I opened the PR
https://github.com/apache/arrow/pull/6758 as soon as possible to collect
feedback while I continue the work. Next step is actually fixing the
reader, since adding an explicit translation to maps broke it. It should be
simple enough to just reuse StructReader.


Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Micah Kornfield <em...@gmail.com>.
>
> I'd be OK with having the flag so long as the new code is the default.
> Otherwise we'll never find out about the corner cases.

Completely agree, my PR makes the new code the default.

On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney <we...@gmail.com> wrote:

> On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Maarten, I don't expect regressions for flat cases (I'm going to try to
> run
> > benchmarks comparison tonight).
> >
> > In terms of the flag, I'm more concerned about some corner case I didn't
> > think of in testing or a workload that for some reason is better with the
> > prior code. If either of these arise I would like to give users a way to
> > mitigate issues so a patch release isn't an imperative.
>
> I'd be OK with having the flag so long as the new code is the default.
> Otherwise we'll never find out about the corner cases.
>
> > I'm happy to cleanup the existing code as soon as the next release
> happens
> > though.
> >
> > Thoughts?
> >
> > -Micah
> >
> > On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > Maarten -- AFAIK Micah's work only affects nested / non-flat column
> > > paths, so flat data should not be impacted. Since we have a partial
> > > implementation of writes for nested data (lists-of-lists and
> > > structs-of-structs, but no mix of the two) that was the performance
> > > difference I was referencing.
> > >
> > > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <
> maartenb@xs4all.nl>
> > > wrote:
> > > >
> > > > Hi Micah,
> > > >
> > > > How does the performance change for “flat” schemas?
> > > > (particularly in the case of a large number of columns)
> > > >
> > > > Thanks,
> > > > Maarten
> > > >
> > > >
> > > >
> > > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <
> emkornfield@gmail.com>
> > > wrote:
> > > > >
> > > > > Another status update.  I've integrated the level generation code
> with
> > > the
> > > > > parquet writing code [1].
> > > > >
> > > > > After that PR is merged I'll add bindings in Python to control
> > > versions of
> > > > > the level generation algorithm and plan on moving on to the read
> side.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/6586
> > > > >
> > > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <
> emkornfield@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi Igor,
> > > > >> If you have the time
> https://issues.apache.org/jira/browse/ARROW-7960
> > > might
> > > > >> be a good task to pick up for this. I think it should be a
> relatively
> > > small
> > > > >> amount of code, so it is probably a good contribution to the
> > > project.  Once
> > > > >> that is wrapped up we can see where we both are.
> > > > >>
> > > > >> Cheers,
> > > > >> Micah
> > > > >>
> > > > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <
> igor.calabria@gmail.com
> > > >
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Micah, I actually got involved with another personal project
> and
> > > had
> > > > >>> to postpone my contribution to arrow a bit. The good news is
> that I'm
> > > > >>> almost done with it, so I could help you with the read side very
> > > soon. Any
> > > > >>> ideas how we could coordinate this?
> > > > >>>
> > > > >>> On Wed, Feb 26, 2020 at 21:06, Wes McKinney <
> > > wesmckinn@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> hi Micah -- great news on the level generation PR. I'll try to
> carve
> > > > >>>> out some time for reviewing over the coming week.
> > > > >>>>
> > > > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <
> > > emkornfield@gmail.com>
> > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>> Hi Igor,
> > > > >>>>> I was wondering if you have made any progress on this?
> > > > >>>>>
> > > > >>>>> I posted a new PR [1] which I believe handles the difficult
> > > algorithmic
> > > > >>>>> part of writing.  There will be some follow-ups but I think
> this PR
> > > > >>>> might
> > > > >>>>> take a while to review, so I was thinking of starting to take a
> > > look
> > > > >>>> at the
> > > > >>>>> read side if you haven't started yet, and circle back to the
> final
> > > > >>>>> integration for the write side once the PR is checked in.
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Micah
> > > > >>>>>
> > > > >>>>> [1] https://github.com/apache/arrow/pull/6490
> > > > >>>>>
> > > > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <
> > > igor.calabria@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi, I would love to help with this issue. I'm aware that this
> is a
> > > > >>>> huge
> > > > >>>>>> task for a first contribution to arrow, but I feel that I
> could
> > > help
> > > > >>>> with
> > > > >>>>>> the read path.
> > > > >>>>>> Reading parquet seems like an extremely complex task since both
> > > > >>>> hive[0] and
> > > > >>>>>> spark[1] tried to implement a "vectorized" version and they
> all
> > > > >>>> stopped
> > > > >>>>>> short of supporting complex types.
> > > > >>>>>> I wanted to at least give it a try and find out where the
> > > challenge
> > > > >>>> lies.
> > > > >>>>>>
> > > > >>>>>> Since you guys are much more familiar with the current code
> base,
> > > I
> > > > >>>> could
> > > > >>>>>> use some starting tips so I don't fall into common pitfalls and
> > > > >>>> whatnot.
> > > > >>>>>>
> > > > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
> > > > >>>>>> [1]
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>
> > >
> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > > > >>>>>>
> > > > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e....@gmail.com>
> wrote:
> > > > >>>>>>> Just to give an update.  I've been a little bit delayed, but
> my
> > > > >>>> progress
> > > > >>>>>> is>
> > > > >>>>>>> as follows:>
> > > > >>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end
> tests.>
> > > > >>>>>>> 2.  Have another PR open that allows a configuration option
> in
> > > C++
> > > > >>>> to>
> > > > >>>>>>> determine which algorithm version to use for
> reading/writing, the
> > > > >>>>>> existing>
> > > > >>>>>>> version and the new version supporting complex nested
> arrays.  I
> > > > >>>> think a>
> > > > >>>>>>> large amount of code will be reused/delegated to but I will
> err
> > > on
> > > > >>>> the
> > > > >>>>>> side>
> > > > >>>>>>> of not touching the existing code/algorithms so that any
> errors
> > > in
> > > > >>>> the>
> > > > >>>>>>> implementation  or performance regressions can hopefully be
> > > > >>>> mitigated at>
> > > > >>>>>>> runtime.  I expect in later releases (once the code has
> "baked")
> > > > >>>> it will>
> > > > >>>>>>> become a no-op.>
> > > > >>>>>>> 3.  Started coding the write path.>
> > > > >>>>>>>
> > > > >>>>>>> Which leaves:>
> > > > >>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks) to be
> code
> > > > >>>> complete>
> > > > >>>>>>> 2.  Implementing the read path.>
> > > > >>>>>>>
> > > > >>>>>>> Again, I'm happy to collaborate if people have bandwidth and
> want
> > > > >>>> to>
> > > > >>>>>>> contribute.>
> > > > >>>>>>>
> > > > >>>>>>> Thanks,>
> > > > >>>>>>> Micah>
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <
> em...@gmail.com
> > > >>
> > > > >>>>>>> wrote:>
> > > > >>>>>>>
> > > > >>>>>>>> Hi Wes,>
> > > > >>>>>>>> I'm still interested in doing the work.  But don't want to hold
> > > > >>>> anybody up
> > > > >>>>>> if>
> > > > >>>>>>>> they have bandwidth.>
> > > > >>>>>>>>>
> > > > >>>>>>>> In order to actually make progress on this, my plan will be
> to:>
> > > > >>>>>>>> 1.  Help with the current Java review backlog through early
> next
> > > > >>>> week
> > > > >>>>>> or>
> > > > >>>>>>>> so (this has been taking the majority of my time allocated
> for
> > > > >>>> Arrow>
> > > > >>>>>>>> contributions for the last 6 months or so).>
> > > > >>>>>>>> 2.  Shift all my attention to trying to get this done (this
> > > > >>>> means no>
> > > > >>>>>>>> reviews other than closing out existing ones that I've
> started
> > > > >>>> until it
> > > > >>>>>> is>
> > > > >>>>>>>> done).  Hopefully, other Java committers can help shrink the
> > > > >>>> backlog>
> > > > >>>>>>>> further (Jacques, thanks for your recent efforts here).>
> > > > >>>>>>>>>
> > > > >>>>>>>> Thanks,>
> > > > >>>>>>>> Micah>
> > > > >>>>>>>>>
> > > > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <
> we...@gmail.com>
> > > > >>>> wrote:>
> > > > >>>>>>>>>
> > > > >>>>>>>>> hi folks,>
> > > > >>>>>>>>>>
> > > > >>>>>>>>> I think we have reached a point where the incomplete C++
> > > > >>>> Parquet>
> > > > >>>>>>>>> nested data assembly/disassembly is harming the value of
> > > > >>>> several>
> > > > >>>>>>>>> other parts of the project, for example the Datasets API.
> As
> > > > >>>> another>
> > > > >>>>>>>>> example, it's possible to ingest nested data from JSON but
> not
> > > > >>>> write>
> > > > >>>>>>>>> it to Parquet in general.>
> > > > >>>>>>>>>>
> > > > >>>>>>>>> Implementing the nested data read and write path
> completely is
> > > > >>>> a>
> > > > >>>>>>>>> difficult project requiring at least several weeks of
> dedicated
> > > > >>>> work,>
> > > > >>>>>>>>> so it's not so surprising that it hasn't been accomplished
> yet.
> > > > >>>> I
> > > > >>>>>> know>
> > > > >>>>>>>>> that several people have expressed interest in working on
> it,
> > > > >>>> but I>
> > > > >>>>>>>>> would like to see if anyone would be able to volunteer a
> > > > >>>> commitment
> > > > >>>>>> of>
> > > > >>>>>>>>> time and guess on a rough timeline when this work could be
> > > > >>>> done. It>
> > > > >>>>>>>>> seems to me if this slips beyond 2020 it will significantly
> > > > >>>> diminish
> > > > >>>>>> the>
> > > > >>>>>>>>> value being created by other parts of the project.>
> > > > >>>>>>>>>>
> > > > >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm one
> > > > >>>> candidate>
> > > > >>>>>>>>> person to take on this project (and I can dedicate the
> time,
> > > > >>>> but it>
> > > > >>>>>>>>> would come at the expense of other projects where I can
> also
> > > be>
> > > > >>>>>>>>> useful). But Micah and others expressed interest in
> working on
> > > > >>>> it, so>
> > > > >>>>>>>>> I wanted to have a discussion about it to see what others
> > > > >>>> think.>
> > > > >>>>>>>>>>
> > > > >>>>>>>>> Thanks>
> > > > >>>>>>>>> Wes>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > >
> > >
>

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Wes McKinney <we...@gmail.com>.
On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Maarten, I don't expect regressions for flat cases (I'm going to try to run
> a benchmark comparison tonight).
>
> In terms of the flag, I'm more concerned about some corner case I didn't
> think of in testing or a workload that for some reason is better with the
> prior code. If either of these arise I would like to give users a way to
> mitigate issues so a patch release isn't an imperative.

I'd be OK with having the flag so long as the new code is the default.
Otherwise we'll never find out about the corner cases.

> I'm happy to clean up the existing code as soon as the next release happens
> though.
>
> Thoughts?
>
> -Micah
>
> On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <we...@gmail.com> wrote:
>
> > Maarten -- AFAIK Micah's work only affects nested / non-flat column
> > paths, so flat data should not be impacted. Since we have a partial
> > implementation of writes for nested data (lists-of-lists and
> > structs-of-structs, but no mix of the two) that was the performance
> > difference I was referencing.
> >
> > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <ma...@xs4all.nl>
> > wrote:
> > >
> > > Hi Micah,
> > >
> > > How does the performance change for “flat” schemas?
> > > (particularly in the case of a large number of columns)
> > >
> > > Thanks,
> > > Maarten
> > >
> > >
> > >
> > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <em...@gmail.com>
> > wrote:
> > > >
> > > > Another status update.  I've integrated the level generation code with
> > the
> > > > parquet writing code [1].
> > > >
> > > > After that PR is merged I'll add bindings in Python to control
> > versions of
> > > > the level generation algorithm and plan on moving on to the read side.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/arrow/pull/6586
> > > >
> > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <em...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Igor,
> > > >> If you have the time https://issues.apache.org/jira/browse/ARROW-7960
> > might
> > > >> be a good task to pick up for this. I think it should be a relatively
> > small
> > > >> amount of code, so it is probably a good contribution to the
> > project.  Once
> > > >> that is wrapped up we can see where we both are.
> > > >>
> > > >> Cheers,
> > > >> Micah
> > > >>
> > > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calabria@gmail.com
> > >
> > > >> wrote:
> > > >>
> > > >>> Hi Micah, I actually got involved with another personal project and
> > had
> > > >>> to postpone my contribution to arrow a bit. The good news is that I'm
> > > >>> almost done with it, so I could help you with the read side very
> > soon. Any
> > > >>> ideas how we could coordinate this?
> > > >>>
> > > >>> On Wed, Feb 26, 2020 at 21:06, Wes McKinney <
> > wesmckinn@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> hi Micah -- great news on the level generation PR. I'll try to carve
> > > >>>> out some time for reviewing over the coming week.
> > > >>>>
> > > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <
> > emkornfield@gmail.com>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> Hi Igor,
> > > >>>>> I was wondering if you have made any progress on this?
> > > >>>>>
> > > >>>>> I posted a new PR [1] which I believe handles the difficult
> > algorithmic
> > > >>>>> part of writing.  There will be some follow-ups but I think this PR
> > > >>>> might
> > > >>>>> take a while to review, so I was thinking of starting to take a
> > look
> > > >>>> at the
> > > >>>>> read side if you haven't started yet, and circle back to the final
> > > >>>>> integration for the write side once the PR is checked in.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Micah
> > > >>>>>
> > > >>>>> [1] https://github.com/apache/arrow/pull/6490
> > > >>>>>
> > > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <
> > igor.calabria@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi, I would love to help with this issue. I'm aware that this is a
> > > >>>> huge
> > > >>>>>> task for a first contribution to arrow, but I feel that I could
> > help
> > > >>>> with
> > > >>>>>> the read path.
> > > >>>>>> Reading parquet seems like an extremely complex task since both
> > > >>>> hive[0] and
> > > >>>>>> spark[1] tried to implement a "vectorized" version and they all
> > > >>>> stopped
> > > >>>>>> short of supporting complex types.
> > > >>>>>> I wanted to at least give it a try and find out where the
> > challenge
> > > >>>> lies.
> > > >>>>>>
> > > >>>>>> Since you guys are much more familiar with the current code base,
> > I
> > > >>>> could
> > > >>>>>> use some starting tips so I don't fall into common pitfalls and
> > > >>>> whatnot.
> > > >>>>>>
> > > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
> > > >>>>>> [1]
> > > >>>>>>
> > > >>>>>>
> > > >>>>
> > https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > > >>>>>>
> > > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e....@gmail.com> wrote:
> > > >>>>>>> Just to give an update.  I've been a little bit delayed, but my
> > > >>>> progress
> > > >>>>>> is>
> > > >>>>>>> as follows:>
> > > >>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end tests.>
> > > >>>>>>> 2.  Have another PR open that allows a configuration option in
> > C++
> > > >>>> to>
> > > >>>>>>> determine which algorithm version to use for reading/writing, the
> > > >>>>>> existing>
> > > >>>>>>> version and the new version supporting complex nested arrays.  I
> > > >>>> think a>
> > > >>>>>>> large amount of code will be reused/delegated to but I will err
> > on
> > > >>>> the
> > > >>>>>> side>
> > > >>>>>>> of not touching the existing code/algorithms so that any errors
> > in
> > > >>>> the>
> > > >>>>>>> implementation  or performance regressions can hopefully be
> > > >>>> mitigated at>
> > > >>>>>>> runtime.  I expect in later releases (once the code has "baked")
> > > >>>> it will>
> > > >>>>>>> become a no-op.>
> > > >>>>>>> 3.  Started coding the write path.>
> > > >>>>>>>
> > > >>>>>>> Which leaves:>
> > > >>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks) to be code
> > > >>>> complete>
> > > >>>>>>> 2.  Implementing the read path.>
> > > >>>>>>>
> > > >>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
> > > >>>> to>
> > > >>>>>>> contribute.>
> > > >>>>>>>
> > > >>>>>>> Thanks,>
> > > >>>>>>> Micah>
> > > >>>>>>>
> > > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com
> > >>
> > > >>>>>>> wrote:>
> > > >>>>>>>
> > > >>>>>>>> Hi Wes,>
> > > >>>>>>>> I'm still interested in doing the work.  But don't want to hold
> > > >>>> anybody up
> > > >>>>>> if>
> > > >>>>>>>> they have bandwidth.>
> > > >>>>>>>>>
> > > >>>>>>>> In order to actually make progress on this, my plan will be to:>
> > > >>>>>>>> 1.  Help with the current Java review backlog through early next
> > > >>>> week
> > > >>>>>> or>
> > > >>>>>>>> so (this has been taking the majority of my time allocated for
> > > >>>> Arrow>
> > > >>>>>>>> contributions for the last 6 months or so).>
> > > >>>>>>>> 2.  Shift all my attention to trying to get this done (this
> > > >>>> means no>
> > > >>>>>>>> reviews other than closing out existing ones that I've started
> > > >>>> until it
> > > >>>>>> is>
> > > >>>>>>>> done).  Hopefully, other Java committers can help shrink the
> > > >>>> backlog>
> > > >>>>>>>> further (Jacques, thanks for your recent efforts here).>
> > > >>>>>>>>>
> > > >>>>>>>> Thanks,>
> > > >>>>>>>> Micah>
> > > >>>>>>>>>
> > > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com>
> > > >>>> wrote:>
> > > >>>>>>>>>
> > > >>>>>>>>> hi folks,>
> > > >>>>>>>>>>
> > > >>>>>>>>> I think we have reached a point where the incomplete C++
> > > >>>> Parquet>
> > > >>>>>>>>> nested data assembly/disassembly is harming the value of
> > > >>>> several>
> > > >>>>>>>>> other parts of the project, for example the Datasets API. As
> > > >>>> another>
> > > >>>>>>>>> example, it's possible to ingest nested data from JSON but not
> > > >>>> write>
> > > >>>>>>>>> it to Parquet in general.>
> > > >>>>>>>>>>
> > > >>>>>>>>> Implementing the nested data read and write path completely is
> > > >>>> a>
> > > >>>>>>>>> difficult project requiring at least several weeks of dedicated
> > > >>>> work,>
> > > >>>>>>>>> so it's not so surprising that it hasn't been accomplished yet.
> > > >>>> I
> > > >>>>>> know>
> > > >>>>>>>>> that several people have expressed interest in working on it,
> > > >>>> but I>
> > > >>>>>>>>> would like to see if anyone would be able to volunteer a
> > > >>>> commitment
> > > >>>>>> of>
> > > >>>>>>>>> time and guess on a rough timeline when this work could be
> > > >>>> done. It>
> > > >>>>>>>>> seems to me if this slips beyond 2020 it will significantly
> > > >>>> diminish
> > > >>>>>> the>
> > > >>>>>>>>> value being created by other parts of the project.>
> > > >>>>>>>>>>
> > > >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm one
> > > >>>> candidate>
> > > >>>>>>>>> person to take on this project (and I can dedicate the time,
> > > >>>> but it>
> > > >>>>>>>>> would come at the expense of other projects where I can also
> > be>
> > > >>>>>>>>> useful). But Micah and others expressed interest in working on
> > > >>>> it, so>
> > > >>>>>>>>> I wanted to have a discussion about it to see what others
> > > >>>> think.>
> > > >>>>>>>>>>
> > > >>>>>>>>> Thanks>
> > > >>>>>>>>> Wes>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>>
> > >
> >

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Micah Kornfield <em...@gmail.com>.
Maarten, I don't expect regressions for flat cases (I'm going to try to run
a benchmark comparison tonight).

In terms of the flag, I'm more concerned about some corner case I didn't
think of in testing or a workload that for some reason is better with the
prior code. If either of these arise I would like to give users a way to
mitigate issues so a patch release isn't an imperative.

I'm happy to clean up the existing code as soon as the next release happens
though.

Thoughts?

-Micah
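
For readers new to the thread, the "difficult algorithmic part" mentioned above is Dremel-style definition/repetition level generation. A toy Python sketch for a single optional `list<int64 required>` column (illustrative only, not the actual C++ implementation):

```python
def list_column_levels(rows):
    """Dremel encoding for one optional list<int64 required> column.

    Emits one (def_level, rep_level, value) triple per leaf slot:
      def 0 -> the list itself is null
      def 1 -> the list is present but empty
      def 2 -> a real element is present
      rep 0 -> first value of a new row; rep 1 -> continuation of a list
    """
    out = []
    for row in rows:
        if row is None:
            out.append((0, 0, None))
        elif not row:
            out.append((1, 0, None))
        else:
            for i, value in enumerate(row):
                out.append((2, 0 if i == 0 else 1, value))
    return out

# [[1, 2], None, [], [3]] encodes to five (def, rep, value) triples.
print(list_column_levels([[1, 2], None, [], [3]]))
```

The hard part in the real code is doing this generically for arbitrary nesting (lists of structs of lists, ...) and doing it vectorized rather than row by row.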

On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <we...@gmail.com> wrote:

> Maarten -- AFAIK Micah's work only affects nested / non-flat column
> paths, so flat data should not be impacted. Since we have a partial
> implementation of writes for nested data (lists-of-lists and
> structs-of-structs, but no mix of the two) that was the performance
> difference I was referencing.
>
> On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <ma...@xs4all.nl>
> wrote:
> >
> > Hi Micah,
> >
> > How does the performance change for “flat” schemas?
> > (particularly in the case of a large number of columns)
> >
> > Thanks,
> > Maarten
> >
> >
> >
> > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> > >
> > > Another status update.  I've integrated the level generation code with
> the
> > > parquet writing code [1].
> > >
> > > After that PR is merged I'll add bindings in Python to control
> versions of
> > > the level generation algorithm and plan on moving on to the read side.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/apache/arrow/pull/6586
> > >
> > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > >
> > >> Hi Igor,
> > >> If you have the time https://issues.apache.org/jira/browse/ARROW-7960
> might
> > >> be a good task to pick up for this. I think it should be a relatively
> small
> > >> amount of code, so it is probably a good contribution to the
> project.  Once
> > >> that is wrapped up we can see where we both are.
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calabria@gmail.com
> >
> > >> wrote:
> > >>
> > >>> Hi Micah, I actually got involved with another personal project and
> had
> > >>> to postpone my contribution to arrow a bit. The good news is that I'm
> > >>> almost done with it, so I could help you with the read side very
> soon. Any
> > >>> ideas how we could coordinate this?
> > >>>
> > >>> On Wed, Feb 26, 2020 at 21:06, Wes McKinney <
> wesmckinn@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> hi Micah -- great news on the level generation PR. I'll try to carve
> > >>>> out some time for reviewing over the coming week.
> > >>>>
> > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <
> emkornfield@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hi Igor,
> > >>>>> I was wondering if you have made any progress on this?
> > >>>>>
> > >>>>> I posted a new PR [1] which I believe handles the difficult
> algorithmic
> > >>>>> part of writing.  There will be some follow-ups but I think this PR
> > >>>> might
> > >>>>> take a while to review, so I was thinking of starting to take a
> look
> > >>>> at the
> > >>>>> read side if you haven't started yet, and circle back to the final
> > >>>>> integration for the write side once the PR is checked in.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Micah
> > >>>>>
> > >>>>> [1] https://github.com/apache/arrow/pull/6490
> > >>>>>
> > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <
> igor.calabria@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi, I would love to help with this issue. I'm aware that this is a
> > >>>> huge
> > >>>>>> task for a first contribution to arrow, but I feel that I could
> help
> > >>>> with
> > >>>>>> the read path.
> > >>>>>> Reading parquet seems like an extremely complex task since both
> > >>>> hive[0] and
> > >>>>>> spark[1] tried to implement a "vectorized" version and they all
> > >>>> stopped
> > >>>>>> short of supporting complex types.
> > >>>>>> I wanted to at least give it a try and find out where the
> challenge
> > >>>> lies.
> > >>>>>>
> > >>>>>> Since you guys are much more familiar with the current code base,
> I
> > >>>> could
> > >>>>>> use some starting tips so I don't fall into common pitfalls and
> > >>>> whatnot.
> > >>>>>>
> > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
> > >>>>>> [1]
> > >>>>>>
> > >>>>>>
> > >>>>
> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > >>>>>>
> > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e....@gmail.com> wrote:
> > >>>>>>> Just to give an update.  I've been a little bit delayed, but my
> > >>>> progress
> > >>>>>> is>
> > >>>>>>> as follows:>
> > >>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end tests.>
> > >>>>>>> 2.  Have another PR open that allows a configuration option in
> C++
> > >>>> to>
> > >>>>>>> determine which algorithm version to use for reading/writing, the
> > >>>>>> existing>
> > >>>>>>> version and the new version supporting complex nested arrays.  I
> > >>>> think a>
> > >>>>>>> large amount of code will be reused/delegated to but I will err
> on
> > >>>> the
> > >>>>>> side>
> > >>>>>>> of not touching the existing code/algorithms so that any errors
> in
> > >>>> the>
> > >>>>>>> implementation  or performance regressions can hopefully be
> > >>>> mitigated at>
> > >>>>>>> runtime.  I expect in later releases (once the code has "baked")
> > >>>> it will>
> > >>>>>>> become a no-op.>
> > >>>>>>> 3.  Started coding the write path.>
> > >>>>>>>
> > >>>>>>> Which leaves:>
> > >>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks) to be code
> > >>>> complete>
> > >>>>>>> 2.  Implementing the read path.>
> > >>>>>>>
> > >>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
> > >>>> to>
> > >>>>>>> contribute.>
> > >>>>>>>
> > >>>>>>> Thanks,>
> > >>>>>>> Micah>
> > >>>>>>>
> > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com
> >>
> > >>>>>>> wrote:>
> > >>>>>>>
> > >>>>>>>> Hi Wes,>
> > >>>>>>>> I'm still interested in doing the work.  But don't want to hold
> > >>>> anybody up
> > >>>>>> if>
> > >>>>>>>> they have bandwidth.>
> > >>>>>>>>>
> > >>>>>>>> In order to actually make progress on this, my plan will be to:>
> > >>>>>>>> 1.  Help with the current Java review backlog through early next
> > >>>> week
> > >>>>>> or>
> > >>>>>>>> so (this has been taking the majority of my time allocated for
> > >>>> Arrow>
> > >>>>>>>> contributions for the last 6 months or so).>
> > >>>>>>>> 2.  Shift all my attention to trying to get this done (this
> > >>>> means no>
> > >>>>>>>> reviews other than closing out existing ones that I've started
> > >>>> until it
> > >>>>>> is>
> > >>>>>>>> done).  Hopefully, other Java committers can help shrink the
> > >>>> backlog>
> > >>>>>>>> further (Jacques, thanks for your recent efforts here).>
> > >>>>>>>>>
> > >>>>>>>> Thanks,>
> > >>>>>>>> Micah>
> > >>>>>>>>>
> > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com>
> > >>>> wrote:>
> > >>>>>>>>>
> > >>>>>>>>> hi folks,>
> > >>>>>>>>>>
> > >>>>>>>>> I think we have reached a point where the incomplete C++
> > >>>> Parquet>
> > >>>>>>>>> nested data assembly/disassembly is harming the value of
> > >>>> several>
> > >>>>>>>>> other parts of the project, for example the Datasets API. As
> > >>>> another>
> > >>>>>>>>> example, it's possible to ingest nested data from JSON but not
> > >>>> write>
> > >>>>>>>>> it to Parquet in general.>
> > >>>>>>>>>>
> > >>>>>>>>> Implementing the nested data read and write path completely is
> > >>>> a>
> > >>>>>>>>> difficult project requiring at least several weeks of dedicated
> > >>>> work,>
> > >>>>>>>>> so it's not so surprising that it hasn't been accomplished yet.
> > >>>> I
> > >>>>>> know>
> > >>>>>>>>> that several people have expressed interest in working on it,
> > >>>> but I>
> > >>>>>>>>> would like to see if anyone would be able to volunteer a
> > >>>> commitment
> > >>>>>> of>
> > >>>>>>>>> time and guess on a rough timeline when this work could be
> > >>>> done. It>
> > >>>>>>>>> seems to me if this slips beyond 2020 it will significantly
> > >>>> diminish
> > >>>>>> the>
> > >>>>>>>>> value being created by other parts of the project.>
> > >>>>>>>>>>
> > >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm one
> > >>>> candidate>
> > >>>>>>>>> person to take on this project (and I can dedicate the time,
> > >>>> but it>
> > >>>>>>>>> would come at the expense of other projects where I can also
> be>
> > >>>>>>>>> useful). But Micah and others expressed interest in working on
> > >>>> it, so>
> > >>>>>>>>> I wanted to have a discussion about it to see what others
> > >>>> think.>
> > >>>>>>>>>>
> > >>>>>>>>> Thanks>
> > >>>>>>>>> Wes>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>
> >
>

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Wes McKinney <we...@gmail.com>.
Maarten -- AFAIK Micah's work only affects nested / non-flat column
paths, so flat data should not be impacted. Since we have a partial
implementation of writes for nested data (lists-of-lists and
structs-of-structs, but no mix of the two), that was the performance
difference I was referencing.

On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>
> Hi Micah,
>
> How does the performance change for “flat” schemas?
> (particularly in the case of a large number of columns)
>
> Thanks,
> Maarten
>
>
>
> > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <em...@gmail.com> wrote:
> >
> > Another status update.  I've integrated the level generation code with the
> > parquet writing code [1].
> >
> > After that PR is merged I'll add bindings in Python to control versions of
> > the level generation algorithm and plan on moving on to the read side.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/pull/6586
> >
> > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Igor,
> >> If you have the time https://issues.apache.org/jira/browse/ARROW-7960 might
> >> be a good task to pick up for this. I think it should be a relatively small
> >> amount of code, so it is probably a good contribution to the project.  Once
> >> that is wrapped up we can see where we both are.
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <ig...@gmail.com>
> >> wrote:
> >>
> >>> Hi Micah, I actually got involved with another personal project and had
> >>> to postpone my contribution to arrow a bit. The good news is that I'm
> >>> almost done with it, so I could help you with the read side very soon. Any
> >>> ideas how we could coordinate this?
> >>>
> >>> On Wed, Feb 26, 2020 at 21:06, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>
> >>>> hi Micah -- great news on the level generation PR. I'll try to carve
> >>>> out some time for reviewing over the coming week.
> >>>>
> >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <em...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hi Igor,
> >>>>> I was wondering if you have made any progress on this?
> >>>>>
> >>>>> I posted a new PR [1] which I believe handles the difficult algorithmic
> >>>>> part of writing.  There will be some follow-ups but I think this PR
> >>>> might
> >>>>> take a while to review, so I was thinking of starting to take a look
> >>>> at the
> >>>>> read side if you haven't started yet, and circle back to the final
> >>>>> integration for the write side once the PR is checked in.
> >>>>>
> >>>>> Thanks,
> >>>>> Micah
> >>>>>
> >>>>> [1] https://github.com/apache/arrow/pull/6490
> >>>>>
> >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <ig...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi, I would love to help with this issue. I'm aware that this is a
> >>>> huge
> >>>>>> task for a first contribution to arrow, but I feel that I could help
> >>>> with
> >>>>>> the read path.
> >>>>>> Reading parquet seems like an extremely complex task since both
> >>>> hive[0] and
> >>>>>> spark[1] tried to implement a "vectorized" version and they all
> >>>> stopped
> >>>>>> short of supporting complex types.
> >>>>>> I wanted to at least give it a try and find out where the challenge
> >>>> lies.
> >>>>>>
> >>>>>> Since you guys are much more familiar with the current code base, I
> >>>> could
> >>>>>> use some starting tips so I don't fall in common pitfalls and
> >>>> whatnot.
> >>>>>>
> >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
> >>>>>> [1]
> >>>>>>
> >>>>>>
> >>>> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> >>>>>>
> >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e....@gmail.com> wrote:
> >>>>>>> Just to give an update.  I've been a little bit delayed, but my
> >>>> progress
> >>>>>> is>
> >>>>>>> as follows:>
> >>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end tests.>
> >>>>>>> 2.  Have another PR open that allows a configuration option in C++
> >>>> to>
> >>>>>>> determine which algorithm version to use for reading/writing, the
> >>>>>> existing>
> >>>>>>> version and the new version supporting complex nested arrays.  I
> >>>> think a>
> >>>>>>> large amount of code will be reused/delegated to, but I will err on
> >>>> the
> >>>>>> side>
> >>>>>>> of not touching the existing code/algorithms so that any errors in
> >>>> the>
> >>>>>>> implementation  or performance regressions can hopefully be
> >>>> mitigated at>
> >>>>>>> runtime.  I expect that in later releases (once the code has "baked")
> >>>> this option will>
> >>>>>>> become a no-op.>
> >>>>>>> 3.  Started coding the write path.>
> >>>>>>>
> >>>>>>> Which leaves:>
> >>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks) to be code
> >>>> complete>
> >>>>>>> 2.  Implementing the read path.>
> >>>>>>>
> >>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
> >>>> to>
> >>>>>>> contribute.>
> >>>>>>>
> >>>>>>> Thanks,>
> >>>>>>> Micah>
> >>>>>>>
> >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com>>
> >>>>>>> wrote:>
> >>>>>>>
> >>>>>>>> Hi Wes,>
> >>>>>>>> I'm still interested in doing the work.  But I don't want to hold
> >>>> anybody up
> >>>>>> if>
> >>>>>>>> they have bandwidth.>
> >>>>>>>>>
> >>>>>>>> In order to actually make progress on this, my plan will be to:>
> >>>>>>>> 1.  Help with the current Java review backlog through early next
> >>>> week
> >>>>>> or>
> >>>>>>>> so (this has been taking the majority of my time allocated for
> >>>> Arrow>
> >>>>>>>> contributions for the last 6 months or so).>
> >>>>>>>> 2.  Shift all my attention to trying to get this done (this
> >>>> means no>
> >>>>>>>> reviews other than closing out existing ones that I've started
> >>>> until it
> >>>>>> is>
> >>>>>>>> done).  Hopefully, other Java committers can help shrink the
> >>>> backlog>
> >>>>>>>> further (Jacques, thanks for your recent efforts here).>
> >>>>>>>>>
> >>>>>>>> Thanks,>
> >>>>>>>> Micah>
> >>>>>>>>>
> >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com>
> >>>> wrote:>
> >>>>>>>>>
> >>>>>>>>> hi folks,>
> >>>>>>>>>>
> >>>>>>>>> I think we have reached a point where the incomplete C++
> >>>> Parquet>
> >>>>>>>>> nested data assembly/disassembly is harming the value of
> >>>> several>
> >>>>>>>>> other parts of the project, for example the Datasets API. As
> >>>> another>
> >>>>>>>>> example, it's possible to ingest nested data from JSON but not
> >>>> write>
> >>>>>>>>> it to Parquet in general.>
> >>>>>>>>>>
> >>>>>>>>> Implementing the nested data read and write path completely is
> >>>> a>
> >>>>>>>>> difficult project requiring at least several weeks of dedicated
> >>>> work,>
> >>>>>>>>> so it's not so surprising that it hasn't been accomplished yet.
> >>>> I
> >>>>>> know>
> >>>>>>>>> that several people have expressed interest in working on it,
> >>>> but I>
> >>>>>>>>> would like to see if anyone would be able to volunteer a
> >>>> commitment
> >>>>>> of>
> >>>>>>>>> time and guess on a rough timeline when this work could be
> >>>> done. It>
> >>>>>>>>> seems to me that if this slips beyond 2020 it will significantly
> >>>> diminish
> >>>>>> the>
> >>>>>>>>> value being created by other parts of the project.>
> >>>>>>>>>>
> >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm one
> >>>> candidate>
> >>>>>>>>> person to take on this project (and I can dedicate the time,
> >>>> but it>
> >>>>>>>>> would come at the expense of other projects where I can also be>
> >>>>>>>>> useful). But Micah and others expressed interest in working on
> >>>> it, so>
> >>>>>>>>> I wanted to have a discussion about it to see what others
> >>>> think.>
> >>>>>>>>>>
> >>>>>>>>> Thanks>
> >>>>>>>>> Wes>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
>

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Maarten Ballintijn <ma...@xs4all.nl>.
Hi Micah,

How does the performance change for “flat” schemas?
(particularly in the case of a large number of columns)

Thanks,
Maarten
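Maarten's question touches on the cheap case: a flat schema has no repeated fields, so Dremel-style repetition levels vanish entirely (max repetition level 0) and the definition level collapses to the validity bit. A toy sketch of that degenerate case (illustrative only, not Arrow's actual implementation — the function name is made up):

```python
def flat_def_levels(column):
    # Flat nullable column: with no repeated ancestors, the max
    # repetition level is 0, so no repetition levels are written at
    # all.  The definition level is just the validity bit:
    # 0 = null, 1 = value present.
    return [0 if v is None else 1 for v in column]

print(flat_def_levels([7, None, 9]))  # [1, 0, 1]
```

Since this per-column work is a single pass over the validity information, the overhead for wide flat tables should be dominated by the number of values, not by any nesting logic.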



> On Mar 11, 2020, at 11:53 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
> Another status update.  I've integrated the level generation code with the
> parquet writing code [1].
> 
> After that PR is merged I'll add bindings in Python to control versions of
> the level generation algorithm and plan on moving on to the read side.
> 
> Thanks,
> Micah
> 
> [1] https://github.com/apache/arrow/pull/6586


Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Wes McKinney <we...@gmail.com>.
hi Micah,

Great to hear about the progress; I'll help with code review.

FWIW, if the new code passes the existing unit tests I would be in
favor of deleting the old code so that we're fully invested in making
the new code suitably fast. Jump in with two feet, so to speak.

Thanks
Wes

On Wed, Mar 11, 2020 at 10:53 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Another status update.  I've integrated the level generation code with the
> parquet writing code [1].
>
> After that PR is merged I'll add bindings in Python to control versions of
> the level generation algorithm and plan on moving on to the read side.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/6586

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Micah Kornfield <em...@gmail.com>.
Another status update.  I've integrated the level generation code with the
parquet writing code [1].

After that PR is merged I'll add bindings in Python to control versions of
the level generation algorithm and plan on moving on to the read side.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/6586
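For readers following the thread, the "level generation" being integrated is the Dremel-style computation of definition and repetition levels from a nested Arrow array. A toy Python sketch for a single nullable list<nullable int> column (illustrative only; the C++ code in the PR handles arbitrary nesting against Arrow buffers, and this function name is made up):

```python
def compute_levels(rows):
    # Dremel levels for an optional list<optional int> column:
    # max definition level 3 (list defined = 1, list non-empty = 2
    # via the repeated node, element non-null = 3);
    # max repetition level 1 (one repeated ancestor).
    def_levels, rep_levels, values = [], [], []
    for row in rows:
        if row is None:              # the list itself is null
            def_levels.append(0); rep_levels.append(0)
        elif len(row) == 0:          # list present but empty
            def_levels.append(1); rep_levels.append(0)
        else:
            for i, v in enumerate(row):
                # rep level 0 starts a new row; 1 continues the list
                rep_levels.append(0 if i == 0 else 1)
                if v is None:        # element slot exists, value null
                    def_levels.append(2)
                else:                # fully defined leaf value
                    def_levels.append(3)
                    values.append(v)
    return def_levels, rep_levels, values

print(compute_levels([[1, None], None, [], [2]]))
# ([3, 2, 0, 1, 3], [0, 1, 0, 0, 0], [1, 2])
```

The write path generates these level arrays from Arrow offsets and validity bitmaps; the read path has to invert the transformation, which is why both sides of the thread treat it as the hard algorithmic core.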

On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Igor,
> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960 might
> be a good task to pick up for this. I think it should be a relatively small
> amount of code, so it is probably a good contribution to the project.  Once
> that is wrapped up we can see where we both are.
>
> Cheers,
> Micah

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Posted by Micah Kornfield <em...@gmail.com>.
Hi Igor,
If you have the time, https://issues.apache.org/jira/browse/ARROW-7960 might
be a good task to pick up for this. I think it should be a relatively small
amount of code, so it is probably a good contribution to the project.  Once
that is wrapped up we can see where we both are.

Cheers,
Micah

On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <ig...@gmail.com>
wrote:

> Hi Micah, I actually got involved with another personal project and had to
> postpone my contribution to arrow a bit. The good news is that I'm almost
> done with it, so I could help you with the read side very soon. Any ideas
> how we could coordinate this?