Posted to dev@parquet.apache.org by Wes McKinney <we...@gmail.com> on 2018/07/28 23:44:50 UTC

[DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

hi folks,

We've been struggling for quite some time with the development
workflow between the Arrow and Parquet C++ (and Python) codebases.

To explain the root issues:

* parquet-cpp depends on "platform code" in Apache Arrow; this
includes file interfaces, memory management, miscellaneous algorithms
(e.g. dictionary encoding), etc. Note that before this "platform"
dependency was introduced, there was significant duplicated code
between these codebases and incompatible abstract interfaces for
things like files

* we maintain Arrow conversion code in parquet-cpp for converting
between the Arrow columnar memory format and Parquet

* we maintain Python bindings for parquet-cpp + Arrow interop in
Apache Arrow. This introduces a circular dependency into our CI.

* Substantial portions of our CMake build system and related tooling
are duplicated between the Arrow and Parquet repos

* API changes cause awkward release coordination issues between Arrow
and Parquet
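The repository-level cycle these bullets describe can be made concrete with a toy sketch (the edge structure is simplified for illustration and is not a literal model of either build system): the individual libraries form a clean DAG, but because the Python bindings live in apache/arrow while depending on parquet-cpp, the graph between the two *repositories* is circular.

```python
# Toy model of the build dependencies described above. Repo and library
# names are real; the edge sets are a simplification for illustration.
deps = {
    "arrow-cpp": [],
    "parquet-cpp": ["arrow-cpp"],             # Parquet uses Arrow "platform" code
    "pyarrow": ["arrow-cpp", "parquet-cpp"],  # Python bindings live in apache/arrow
}

# At the *repository* level, apache/arrow CI must build parquet-cpp (for
# pyarrow), and apache/parquet-cpp CI must build arrow-cpp:
repo_deps = {
    "apache/arrow": ["apache/parquet-cpp"],
    "apache/parquet-cpp": ["apache/arrow"],
}

def has_cycle(graph):
    """Detect a cycle via DFS with white/grey/black node coloring."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for dep in graph[node]:
            if color[dep] == GREY or (color[dep] == WHITE and visit(dep)):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)

assert not has_cycle(deps)      # the library graph itself is acyclic
assert has_cycle(repo_deps)     # the CI graph between the repos is not
```

The library graph would be fine split across repos; it is only the placement of pyarrow that closes the loop, which is why the bindings and the CI duplication keep coming up below.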

I believe the best way to remedy the situation is to adopt a
"Community over Code" approach and find a way for the Parquet and
Arrow C++ development communities to operate out of the same code
repository, i.e. the apache/arrow git repository.

This would bring major benefits:

* Shared CMake build infrastructure, developer tools, and CI
infrastructure (Parquet is already being built as a dependency in
Arrow's CI systems)

* Shared packaging and release management infrastructure

* Reduce / eliminate problems due to API changes (where we currently
introduce breakage into our CI workflow when there is a breaking /
incompatible change)

* Arrow releases would include a coordinated snapshot of the Parquet
implementation as it stands

Continuing with the status quo has become unsatisfactory to me and as
a result I've become less motivated to work on the parquet-cpp
codebase.

The only Parquet C++ committer who is not an Arrow committer is Deepak
Majeti. I think the issue of commit privileges could be resolved
without too much difficulty or time.

I also think that, if deemed truly necessary, the Apache Parquet
community could create release scripts to cut a minimal versioned
Apache Parquet C++ release.

I know that some people are wary of monorepos and megaprojects, but as
an example, TensorFlow is at least 10 times as large a project in
terms of LOC and number of different platform components, and it
seems to be getting along just fine. I think we should be able to work
together as a community to function just as well.

Interested in the opinions of others, and any other ideas for
practical solutions to the above problems.

Thanks,
Wes

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Antoine,

Thanks for chiming in.

On Mon, Jul 30, 2018 at 4:50 AM, Antoine Pitrou <an...@python.org> wrote:
>
> Hi Wes,
>
> Le 29/07/2018 à 01:44, Wes McKinney a écrit :
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C++ development communities to operate out of the same code
>> repository, i.e. the apache/arrow git repository.
>
> I think this is reasonable.  I think the only reasonably solution would
> be to migrate the Python Parquet bindings to the parquet-cpp repository,
> to avoid the circular dependency (then parquet-cpp would depend on
> arrow but not the other way round).  But I agree the monorepo approach
> would probably produce the least development friction.

Moving the Python Parquet bindings would increase our problems because
of dependency-hell / ABI issues in the shared libraries shipped with
Python releases. Right now things operate fairly smoothly because we
ship libparquet.so bundled with pyarrow, though unfortunately this
libparquet.so is based on an unreleased version of parquet-cpp (due to
bug fixes, ABI / API fixes).

To maintain the Python bindings in a separate codebase would mean
dealing with release coordination issues both at the C++ level
and the Python level.

>
> From a community standpoint, I think it all depends whether it's ok to
> subsume parquet-cpp development under the Arrow umbrella.  Perhaps the
> Apache foundation has to give their approval?  I don't know how project
> governance works.

Julian has just replied re: this. From the ASF point of view, the git
repositories are merely an implementation detail en route to signed
releases created by the project PMCs. As far as who has permission to
merge patches, I would be comfortable (with the support of the Arrow
PMC) giving commit rights immediately to all Parquet committers who
participate actively in parquet-cpp.

>
>> The only Parquet C++ committer who is not an Arrow committer is Deepak
>> Majeti. I think the issue of commit privileges could be resolved
>> without too much difficulty or time.
>
> That's an important data point, thanks.
>
> Regards
>
> Antoine.

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Antoine Pitrou <an...@python.org>.
Le 30/07/2018 à 10:50, Antoine Pitrou a écrit :
> 
> Hi Wes,
> 
> Le 29/07/2018 à 01:44, Wes McKinney a écrit :
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C++ development communities to operate out of the same code
>> repository, i.e. the apache/arrow git repository.
> 
> I think this is reasonable.  I think the only reasonably solution would
> be to migrate the Python Parquet bindings to the parquet-cpp repository,

Sorry, I mistyped.  I meant to say "the only other reasonable solution...".

By the way, one concern with the monorepo approach: it would slightly
increase Arrow CI times (which are already too large).

Regards

Antoine.

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Antoine Pitrou <an...@python.org>.
Hi Wes,

Le 29/07/2018 à 01:44, Wes McKinney a écrit :
> I believe the best way to remedy the situation is to adopt a
> "Community over Code" approach and find a way for the Parquet and
> Arrow C++ development communities to operate out of the same code
> repository, i.e. the apache/arrow git repository.

I think this is reasonable.  I think the only reasonably solution would
be to migrate the Python Parquet bindings to the parquet-cpp repository,
to avoid the circular dependency (then parquet-cpp would depend on
arrow but not the other way round).  But I agree the monorepo approach
would probably produce the least development friction.

From a community standpoint, I think it all depends whether it's ok to
subsume parquet-cpp development under the Arrow umbrella.  Perhaps the
Apache foundation has to give their approval?  I don't know how project
governance works.

> The only Parquet C++ committer who is not an Arrow committer is Deepak
> Majeti. I think the issue of commit privileges could be resolved
> without too much difficulty or time.

That's an important data point, thanks.

Regards

Antoine.

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jul 30, 2018 at 8:50 PM, Ted Dunning <te...@gmail.com> wrote:
> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:
>
>>
>> > The community will be less willing to accept large
>> > changes that require multiple rounds of patches for stability and API
>> > convergence. Our contributions to Libhdfs++ in the HDFS community took a
>> > significantly long time for the very same reason.
>>
>> Please don't use bad experiences from another open source community as
>> leverage in this discussion. I'm sorry that things didn't go the way
>> you wanted in Apache Hadoop but this is a distinct community which
>> happens to operate under a similar open governance model.
>
>
> There are some more radical and community building options as well. Take
> the subversion project as a precedent. With subversion, any Apache
> committer can request and receive a commit bit on some large fraction of
> subversion.
>
> So why not take this a bit further and give every parquet committer a
> commit bit in Arrow? Or even make them be first class committers in Arrow?
> Possibly even make it policy that every Parquet committer who asks will be
> given committer status in Arrow.
>
> That relieves a lot of the social anxiety here. Parquet committers can't be
> worried at that point whether their patches will get merged; they can just
> merge them.  Arrow shouldn't worry much about inviting in the Parquet
> committers. After all, Arrow already depends a lot on parquet so why not
> invite them in?

hi Ted,

I for one am with you on this idea, and don't see it as all that
radical. The Arrow and Parquet communities are working toward the same
goals: open standards for storage and in-memory analytics. This is
part of why there is so much overlap already amongst the committers
and PMC members.

We are stronger working together than fragmented.

- Wes

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Julian Hyde <jh...@apache.org>.
A controlled fork doesn’t sound like a terrible option. Copy the code from parquet into arrow, and for a limited period of time it would be the primary. When that period is over, the code in parquet becomes the primary.

During the period in which arrow has the primary, the parquet release manager will have to synchronize parquet’s copy of the code (probably by patches) before making releases.

Julian


> On Jul 31, 2018, at 11:29 AM, Wes McKinney <we...@gmail.com> wrote:
> 
>> If you still strongly feel that the only way forward is to clone the parquet-cpp repo and part ways, I will withdraw my concern. Having two parquet-cpp repos is in no way a better approach.
> 
> Yes, indeed. In my view, the next best option after a monorepo is to
> fork. That would obviously be a bad outcome for the community.
> 
> It doesn't look like I will be able to convince you that a monorepo is
> a good idea; what I would ask instead is that you be willing to give
> it a shot, and if it turns out in the way you're describing (which I
> don't think it will) then I suggest that we fork at that point.
> 
> - Wes
> 
> On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <ma...@gmail.com> wrote:
>> Wes,
>> 
>> Unfortunately, I cannot show you any practical fact-based problems of a
>> non-existent Arrow-Parquet mono-repo.
>> Bringing in related Apache community experiences is more meaningful than
>> how mono-repos work at Google and other big organizations.
>> We solely depend on volunteers and cannot hire full-time developers.
>> You are very well aware of how difficult it has been to find more
>> contributors and maintainers for Arrow. parquet-cpp already has a low
>> contribution rate to its core components.
>> 
>> We should ensure that new volunteers who want to contribute
>> bug-fixes/features spend the least amount of time figuring out
>> the project repo. We can never come up with an automated build system that
>> caters to every possible environment.
>> My only concern is whether the mono-repo will make it harder for new developers
>> to work on parquet-cpp core just due to the additional code, build and test
>> dependencies.
>> I am not saying that the Arrow community/committers will be less
>> co-operative.
>> I just don't think the mono-repo structure model will be sustainable in an
>> open source community unless there are long-term vested interests. We can't
>> predict that.
>> 
>> The current circular dependency problems between Arrow and Parquet are a
>> major problem for the community, and addressing them is important.
>> 
>> The current Arrow adaptor code for parquet should live in the arrow repo.
>> That will remove a majority of the dependency issues.
>> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> was in the arrow repo.  This will be similar to the ORC adaptor.
>> 
>> The platform API code is pretty stable at this point. Minor changes in the
>> future to this code should not be the main reason to combine the arrow
>> parquet repos.
>> 
>> "I question whether it's worth the community's time long term to wear
>> ourselves out defining custom 'ports' / virtual interfaces in each library
>> to plug components together rather than utilizing common platform APIs."
>> 
>> My answer to your question below would be "Yes". Modularity/separation is
>> very important in an open source community where priorities of contributors
>> are often short term.
>> The retention is low and therefore the acquisition costs should be low as
>> well. This is the community-over-code approach, in my view. Minor code
>> duplication is not a deal breaker.
>> ORC, Parquet, Arrow, etc. are all different components in the big data
>> space serving their own functions.
>> 
>> If you still strongly feel that the only way forward is to clone the
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> parquet-cpp repos is in no way a better approach.
>> 
>> 
>> 
>> 
>> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com> wrote:
>> 
>>> @Antoine
>>> 
>>>> By the way, one concern with the monorepo approach: it would slightly
>>> increase Arrow CI times (which are already too large).
>>> 
>>> A typical CI run in Arrow is taking about 45 minutes:
>>> https://travis-ci.org/apache/arrow/builds/410119750
>>> 
>>> A Parquet run takes about 28 minutes:
>>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>> 
>>> Inevitably we will need to create some kind of bot to run certain
>>> builds on-demand based on commit / PR metadata or on request.
>>> 
>>> The slowest build in Arrow (the Arrow C++/Python one) could be
>>> made substantially shorter by moving some of the slower parts (like
>>> the Python ASV benchmarks) from being tested every-commit to nightly
>>> or on demand. Using ASAN instead of valgrind in Travis would also
>>> improve build times (valgrind build could be moved to a nightly
>>> exhaustive test run)
>>> 
>>> - Wes
>>> 
>>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>> I would like to point out that arrow's use of orc is a great example of
>>> how it would be possible to manage parquet-cpp as a separate codebase. That
>>> gives me hope that the projects could be managed separately some day.
>>>> 
>>>> Well, I don't know that ORC is the best example. The ORC C++ codebase
>>>> features several areas of duplicated logic which could be replaced by
>>>> components from the Arrow platform for better platform-wide
>>>> interoperability:
>>>> 
>>>> 
>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>>> 
>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>>> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>>> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>>> 
>>>> ORC's use of symbols from Protocol Buffers was actually a cause of
>>>> bugs that we had to fix in Arrow's build system to prevent them from
>>>> leaking to third party linkers when statically linked (ORC is only
>>>> available for static linking at the moment AFAIK).
>>>> 
>>>> I question whether it's worth the community's time long term to wear
>>>> ourselves out defining custom "ports" / virtual interfaces in each
>>>> library to plug components together rather than utilizing common
>>>> platform APIs.
>>>> 
>>>> - Wes
>>>> 
>>>> On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com>
>>> wrote:
>>>>> Your point about the constraints of the ASF release process is well
>>>>> taken, and as a developer who's trying to work in the current
>>> environment I
>>>>> would be much happier if the codebases were merged. The main issues I
>>> worry
>>>>> about when you put codebases like these together are:
>>>>> 
>>>>> 1. The delineation of APIs becomes blurred and the code becomes too
>>> coupled
>>>>> 2. Release of artifacts that are lower in the dependency tree are
>>> delayed
>>>>> by artifacts higher in the dependency tree
>>>>> 
>>>>> If the project/release management is structured well and someone keeps
>>> an
>>>>> eye on the coupling, then I don't have any concerns.
>>>>> 
>>>>> I would like to point out that arrow's use of orc is a great example of
>>> how
>>>>> it would be possible to manage parquet-cpp as a separate codebase. That
>>>>> gives me hope that the projects could be managed separately some day.
>>>>> 
>>>>> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> hi Josh,
>>>>>> 
>>>>>>> I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>>>>> them together seems like the wrong choice.
>>>>>> 
>>>>>> Apache is "Community over Code"; right now it's the same people
>>>>>> building these projects -- my argument (which I think you agree with?)
>>>>>> is that we should work more closely together until the community grows
>>>>>> large enough to support larger-scope process than we have now. As
>>>>>> you've seen, our process isn't serving developers of these projects.
>>>>>> 
>>>>>>> I also think build tooling should be pulled into its own codebase.
>>>>>> 
>>>>>> I don't see how this can possibly be practical taking into
>>>>>> consideration the constraints imposed by the combination of the GitHub
>>>>>> platform and the ASF release process. I'm all for being idealistic,
>>>>>> but right now we need to be practical. Unless we can devise a
>>>>>> practical procedure that can accommodate at least 1 patch per day
>>>>>> which may touch both code and build system simultaneously without
>>>>>> being a hindrance to contributor or maintainer, I don't see how we can
>>>>>> move forward.
>>>>>> 
>>>>>>> That being said, I think it makes sense to merge the codebases in the
>>>>>> short term with the express purpose of separating them in the near
>>> term.
>>>>>> 
>>>>>> I would agree but only if separation can be demonstrated to be
>>>>>> practical and result in net improvements in productivity and community
>>>>>> growth. I think experience has clearly demonstrated that the current
>>>>>> separation is impractical, and is causing problems.
>>>>>> 
>>>>>> Per Julian's and Ted's comments, I think we need to consider
>>>>>> development process and ASF releases separately. My argument is as
>>>>>> follows:
>>>>>> 
>>>>>> * Monorepo for development (for practicality)
>>>>>> * Releases structured according to the desires of the PMCs
>>>>>> 
>>>>>> - Wes
>>>>>> 
>>>>>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com
>>>> 
>>>>>> wrote:
>>>>>>> I recently worked on an issue that had to be implemented in
>>> parquet-cpp
>>>>>>> (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>>>>>>> ARROW-2586). I found the circular dependencies confusing and hard to
>>> work
>>>>>>> with. For example, I still have a PR open in parquet-cpp (created on
>>> May
>>>>>>> 10) because of a PR that it depended on in arrow that was recently
>>>>>> merged.
>>>>>>> I couldn't even address any CI issues in the PR because the change in
>>>>>> arrow
>>>>>>> was not yet in master. In a separate PR, I changed the
>>>>>> run_clang_format.py
>>>>>>> script in the arrow project only to find out later that there was an
>>>>>> exact
>>>>>>> copy of it in parquet-cpp.
>>>>>>> 
>>>>>>> However, I don't think merging the codebases makes sense in the long
>>>>>> term.
>>>>>>> I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>>>>> them
>>>>>>> together seems like the wrong choice. There will be other formats
>>> that
>>>>>>> arrow needs to support that will be kept separate (e.g. - Orc), so I
>>>>>> don't
>>>>>>> see why parquet should be special. I also think build tooling should
>>> be
>>>>>>> pulled into its own codebase. GNU has had a long history of
>>> developing
>>>>>> open
>>>>>>> source C/C++ projects that way and made projects like
>>>>>>> autoconf/automake/make to support them. I don't think CI is a good
>>>>>>> counter-example since there have been lots of successful open source
>>>>>>> projects that have used nightly build systems that pinned versions of
>>>>>>> dependent software.
>>>>>>> 
>>>>>>> That being said, I think it makes sense to merge the codebases in the
>>>>>> short
>>>>>>> term with the express purpose of separating them in the near term.
>>> My
>>>>>>> reasoning is as follows. By putting the codebases together, you can
>>> more
>>>>>>> easily delineate the boundaries between the API's with a single PR.
>>>>>> Second,
>>>>>>> it will force the build tooling to converge instead of diverge,
>>> which has
>>>>>>> already happened. Once the boundaries and tooling have been sorted
>>> out,
>>>>>> it
>>>>>>> should be easy to separate them back into their own codebases.
>>>>>>> 
>>>>>>> If the codebases are merged, I would ask that the C++ codebases for
>>> arrow
>>>>>>> be separated from other languages. Looking at it from the
>>> perspective of
>>>>>> a
>>>>>>> parquet-cpp library user, having a dependency on Java is a large tax
>>> to
>>>>>> pay
>>>>>>> if you don't need it. For example, there were 25 JIRA's in the 0.10.0
>>>>>>> release of arrow, many of which were holding up the release. I hope
>>> that
>>>>>>> seems like a reasonable compromise, and I think it will help reduce
>>> the
>>>>>>> complexity of the build/release tooling.
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> The community will be less willing to accept large
>>>>>>>>>> changes that require multiple rounds of patches for stability
>>> and
>>>>>> API
>>>>>>>>>> convergence. Our contributions to Libhdfs++ in the HDFS
>>> community
>>>>>> took
>>>>>>>> a
>>>>>>>>>> significantly long time for the very same reason.
>>>>>>>>> 
>>>>>>>>> Please don't use bad experiences from another open source
>>> community as
>>>>>>>>> leverage in this discussion. I'm sorry that things didn't go the
>>> way
>>>>>>>>> you wanted in Apache Hadoop but this is a distinct community which
>>>>>>>>> happens to operate under a similar open governance model.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> There are some more radical and community building options as well.
>>> Take
>>>>>>>> the subversion project as a precedent. With subversion, any Apache
>>>>>>>> committer can request and receive a commit bit on some large
>>> fraction of
>>>>>>>> subversion.
>>>>>>>> 
>>>>>>>> So why not take this a bit further and give every parquet committer
>>> a
>>>>>>>> commit bit in Arrow? Or even make them be first class committers in
>>>>>> Arrow?
>>>>>>>> Possibly even make it policy that every Parquet committer who asks
>>> will
>>>>>> be
>>>>>>>> given committer status in Arrow.
>>>>>>>> 
>>>>>>>> That relieves a lot of the social anxiety here. Parquet committers
>>>>>> can't be
>>>>>>>> worried at that point whether their patches will get merged; they
>>> can
>>>>>> just
>>>>>>>> merge them.  Arrow shouldn't worry much about inviting in the
>>> Parquet
>>>>>>>> committers. After all, Arrow already depends a lot on parquet so
>>> why not
>>>>>>>> invite them in?
>>>>>>>> 
>>>>>> 
>>> 
>> 
>> 
>> --
>> regards,
>> Deepak Majeti


>>>>>> arrow
>>>>>>> was not yet in master. In a separate PR, I changed the
>>>>>> run_clang_format.py
>>>>>>> script in the arrow project only to find out later that there was an
>>>>>> exact
>>>>>>> copy of it in parquet-cpp.
>>>>>>> 
>>>>>>> However, I don't think merging the codebases makes sense in the long
>>>>>> term.
>>>>>>> I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>>>>> them
>>>>>>> together seems like the wrong choice. There will be other formats
>>> that
>>>>>>> arrow needs to support that will be kept separate (e.g. ORC), so I
>>>>>> don't
>>>>>>> see why parquet should be special. I also think build tooling should
>>> be
>>>>>>> pulled into its own codebase. GNU has had a long history of
>>> developing
>>>>>> open
>>>>>>> source C/C++ projects that way and made projects like
>>>>>>> autoconf/automake/make to support them. I don't think CI is a good
>>>>>>> counter-example since there have been lots of successful open source
>>>>>>> projects that have used nightly build systems that pinned versions of
>>>>>>> dependent software.
>>>>>>> 
>>>>>>> That being said, I think it makes sense to merge the codebases in the
>>>>>> short
>>>>>>> term with the express purpose of separating them in the near term.
>>> My
>>>>>>> reasoning is as follows. By putting the codebases together, you can
>>> more
>>>>>>> easily delineate the boundaries between the API's with a single PR.
>>>>>> Second,
>>>>>>> it will force the build tooling to converge instead of diverge,
>>> which has
>>>>>>> already happened. Once the boundaries and tooling have been sorted
>>> out,
>>>>>> it
>>>>>>> should be easy to separate them back into their own codebases.
>>>>>>> 
>>>>>>> If the codebases are merged, I would ask that the C++ codebases for
>>> arrow
>>>>>>> be separated from other languages. Looking at it from the
>>> perspective of
>>>>>> a
>>>>>>> parquet-cpp library user, having a dependency on Java is a large tax
>>> to
>>>>>> pay
>>>>>>> if you don't need it. For example, there were 25 JIRA's in the 0.10.0
>>>>>>> release of arrow, many of which were holding up the release. I hope
>>> that
>>>>>>> seems like a reasonable compromise, and I think it will help reduce
>>> the
>>>>>>> complexity of the build/release tooling.
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> The community will be less willing to accept large
>>>>>>>>>> changes that require multiple rounds of patches for stability
>>> and
>>>>>> API
>>>>>>>>>> convergence. Our contributions to Libhdfs++ in the HDFS
>>> community
>>>>>> took
>>>>>>>> a
>>>>>>>>>> significantly long time for the very same reason.
>>>>>>>>> 
>>>>>>>>> Please don't use bad experiences from another open source
>>> community as
>>>>>>>>> leverage in this discussion. I'm sorry that things didn't go the
>>> way
>>>>>>>>> you wanted in Apache Hadoop but this is a distinct community which
>>>>>>>>> happens to operate under a similar open governance model.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> There are some more radical and community-building options as well.
>>> Take
>>>>>>>> the subversion project as a precedent. With subversion, any Apache
>>>>>>>> committer can request and receive a commit bit on some large
>>> fraction of
>>>>>>>> subversion.
>>>>>>>> 
>>>>>>>> So why not take this a bit further and give every parquet committer
>>> a
>>>>>>>> commit bit in Arrow? Or even make them be first class committers in
>>>>>> Arrow?
>>>>>>>> Possibly even make it policy that every Parquet committer who asks
>>> will
>>>>>> be
>>>>>>>> given committer status in Arrow.
>>>>>>>> 
>>>>>>>> That relieves a lot of the social anxiety here. Parquet committers
>>>>>> can't be
>>>>>>>> worried at that point whether their patches will get merged; they
>>> can
>>>>>> just
>>>>>>>> merge them.  Arrow shouldn't worry much about inviting in the
>>> Parquet
>>>>>>>> committers. After all, Arrow already depends a lot on parquet so
>>> why not
>>>>>>>> invite them in?
>>>>>>>> 
>>>>>> 
>>> 
>> 
>> 
>> --
>> regards,
>> Deepak Majeti


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
A couple more points to make re: Uwe's comments:

> An important point that we should keep in mind (and why I was a bit concerned in the previous times this discussion was raised) is that we have to be careful to not pull everything that touches Arrow into the Arrow repository.

An important distinction here is between community and development
process, which is why I focused on the "Community over Code" rationale
in my original
e-mail. I think we should make decisions that optimize for the
community's health and productivity over satisfying arbitrary
constraints.

Our community's health should be measured by our ability to author and
merge changes as painlessly as possible and to be able to consistently
deliver high quality software releases. IMHO we have fallen short of
both of these goals.

> Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term.

A monorepo does not imply that components cannot have separate release
cycles. As an example, in the JS community a tool called Lerna was
developed precisely for this reason:

https://lernajs.io/

"Splitting up large codebases into separate independently versioned
packages is extremely useful for code sharing. However, making changes
across many repositories is messy and difficult to track, and testing
across repositories gets complicated really fast.

To solve these (and many other) problems, some projects will organize
their codebases into multi-package repositories. Projects like Babel,
React, Angular, Ember, Meteor, Jest, and many others develop all of
their packages within a single repository."

I think that we have arrived at exactly the same point as these other
projects. We may need to develop some of our own tooling to assist
with managing our monorepo (where we are already shipping multiple
components).
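To make this concrete, here is a minimal sketch of how per-component release cycles could work out of a single repository. The tag and path names are purely illustrative, not an actual Arrow or Parquet convention:

```shell
# Sketch: independent release cycles inside one repository via
# per-component tags (tag names and paths below are illustrative).
git tag apache-arrow-cpp-0.11.0        # snapshot for an Arrow C++ release
git tag apache-parquet-cpp-1.6.0       # Parquet C++ released from the same tree

# A source tarball for one component is just an archive of its subtree:
git archive apache-parquet-cpp-1.6.0 cpp/src/parquet \
  | gzip > parquet-cpp-1.6.0.tar.gz
```

Lerna automates essentially this bookkeeping (independent versioning, tagging, publishing) for JS packages; a C++ monorepo would need similar, if simpler, tooling.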

- Wes

On Sun, Aug 19, 2018 at 1:30 PM, Wes McKinney <we...@gmail.com> wrote:
> hi Uwe,
>
> I agree with your points. Currently we have 3 software artifacts:
>
> 1. Arrow C++ libraries
> 2. Parquet C++ libraries with Arrow columnar integration
> 3. C++ interop layer for Python + Cython bindings
>
> Changes in #1 prompt an awkward workflow involving multiple PRs; as a
> result of this we just recently jumped 8 months from the pinned
> version of Arrow in parquet-cpp. This obviously is an antipattern. If
> we had a much larger group of core developers, this might be more
> maintainable
>
> Of course changes in #2 also impact #3; a lot of our bug reports and
> feature requests are coming inbound because of #3, and we have
> struggled to be able to respond to the needs of users (and other
> developers like Robert Gruener who are trying to use this software in
> a large data warehouse)
>
> There is also the release coordination issue where having users
> simultaneously using a released version of both projects hasn't really
> happened, so we effectively already have been treating Parquet like a
> vendored component in our packaging process.
>
> Realistically I think once #2 has become more functionally complete
> and, as a result, a more slowly moving piece of software, we can
> contemplate splitting out all or parts of its development process back
> into another repository. I think we have a ton of work to do yet on
> Parquet core, particularly optimizing for high latency storage (HDFS,
> S3, GCP, etc.), and it wouldn't really make sense to do such platform
> level work anywhere but #1
>
> - Wes
>
> On Sun, Aug 19, 2018 at 8:37 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> Back from vacation, I also want to finally raise my voice.
>>
>> With the current state of the Parquet<->Arrow development, I see a benefit in merging the code base for now, but not necessarily forever.
>>
>> Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter is built and that uses some of the more standard-library features of Arrow. It is the go-to place where also the same toolchain and CI setup is used. Here we also directly apply all improvements that we make in Arrow itself. These are the points that make it special in comparison to other tools providing Arrow adapters like Turbodbc.
>>
>> Thus, I think that the current move to merge the code bases is ok for me. I must say that I'm not 100% certain that this is the best move but currently I lack better alternatives. As previously mentioned, we should take extra care that we can still do separate releases and also provide a path for a future where we split parquet-cpp into its own project/repository again.
>>
>> An important point that we should keep in mind (and why I was a bit concerned in the previous times this discussion was raised) is that we have to be careful to not pull everything that touches Arrow into the Arrow repository. Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term. I expect that there will be many more projects that will use Arrow's I/O libraries as well as emit Arrow structures. These libraries should also be usable in Python/C++/Ruby/R/… These libraries are then hopefully not all developed by the same core group of Arrow/Parquet developers we have currently. For this to function really well, we will need a more stable API in Arrow as well as a good set of build tooling that other libraries can build upon when using Arrow functionality. In addition to being stable, the API must also provide a good UX in the abstraction layers in which the Arrow functions are provided, so that high-performance applications are not high-maintenance due to frequent API changes in Arrow. That said, this is currently a wish for the future. We are currently building and iterating heavily on these APIs to form a good basis for future developments. Thus the repo merge will hopefully improve development speed so that we have to spend less time on toolchain maintenance and can focus on the user-facing APIs.
>>
>> Uwe
>>
>> On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
>>> Thanks Ryan, will do. The people I'd still like to hear from are:
>>>
>>> * Phillip Cloud
>>> * Uwe Korn
>>>
>>> As ASF contributors we are responsible both to be pragmatic and to
>>> act in the best interests of the community's health and productivity.
>>>
>>>
>>>
>>> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> > I don't have an opinion here, but could someone send a summary of what is
>>> > decided to the dev list once there is consensus? This is a long thread for
>>> > parts of the project I don't work on, so I haven't followed it very closely.
>>> >
>>> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
>>> >
>>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>>> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
>>> >> Can we enforce that parquet-cpp changes will not be committed without a
>>> >> corresponding Parquet JIRA?
>>> >>
>>> >> I think we would use the following policy:
>>> >>
>>> >> * use PARQUET-XXX for issues relating to Parquet core
>>> >> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
>>> >> core (e.g. changes that are in parquet/arrow right now)
>>> >>
>>> >> We've already been dealing with annoyances relating to issues
>>> >> straddling the two projects (debugging an issue on Arrow side to find
>>> >> that it has to be fixed on Parquet side); this would make things
>>> >> simpler for us
>>> >>
>>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>>> >> simplify forking later (if needed) and be able to maintain the commit
>>> >> history. I don't know if it's possible to squash parquet-cpp commits and
>>> >> arrow commits separately before merging.
>>> >>
>>> >> This seems rather onerous for both contributors and maintainers and
>>> >> not in line with the goal of improving productivity. In the event that
>>> >> we fork I see it as a traumatic event for the community. If it does
>>> >> happen, then we can write a script (using git filter-branch and other
>>> >> such tools) to extract commits related to the forked code.
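For what it's worth, the extraction Wes describes is a well-trodden git operation. A rough sketch follows; the subtree path and branch name are illustrative (and newer tools such as git filter-repo can do the same job):

```shell
# Hypothetical: split the Parquet subtree, with its history, into a
# standalone repository if a fork ever became necessary.
git clone https://github.com/apache/arrow.git parquet-cpp-fork
cd parquet-cpp-fork

# Rewrite history so the repo root becomes the former Parquet directory,
# keeping only the commits that touched it (the path is illustrative):
git filter-branch --prune-empty \
    --subdirectory-filter cpp/src/parquet master
```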
>>> >>
>>> >> - Wes
>>> >>
>>> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
>>> >> wrote:
>>> >> > I have a few more logistical questions to add.
>>> >> >
>>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>>> >> > Arrow changes. Will we establish some guidelines for filing Parquet
>>> >> JIRAs?
>>> >> > Can we enforce that parquet-cpp changes will not be committed without a
>>> >> > corresponding Parquet JIRA?
>>> >> >
>>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>>> >> > simplify forking later (if needed) and be able to maintain the commit
>>> >> > history. I don't know if it's possible to squash parquet-cpp commits and
>>> >> > arrow commits separately before merging.
>>> >> >
>>> >> >
>>> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>>> >> >
>>> >> >> Do other people have opinions? I would like to undertake this work in
>>> >> >> the near future (the next 8-10 weeks); I would be OK with taking
>>> >> >> responsibility for the primary codebase surgery.
>>> >> >>
>>> >> >> Some logistical questions:
>>> >> >>
>>> >> >> * We have a handful of pull requests in flight in parquet-cpp that
>>> >> >> would need to be resolved / merged
>>> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
>>> >> >> releases cut out of the new structure
>>> >> >> * Management of shared commit rights (I can discuss with the Arrow
>>> >> >> PMC; I believe that approving any committer who has actively
>>> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
>>> >> >> comments)
>>> >> >>
>>> >> >> If working more closely together proves to not be working out after
>>> >> >> some period of time, I will be fully supportive of a fork or something
>>> >> >> like it
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Wes
>>> >> >>
>>> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
>>> >> wrote:
>>> >> >> > Thanks Tim.
>>> >> >> >
>>> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
>>> >> >> > platform code intending to improve the performance of bit-packing in
>>> >> >> > Parquet writes, and we resulted with 2 interdependent PRs
>>> >> >> >
>>> >> >> > * https://github.com/apache/parquet-cpp/pull/483
>>> >> >> > * https://github.com/apache/arrow/pull/2355
>>> >> >> >
>>> >> >> > Changes that impact the Python interface to Parquet are even more
>>> >> >> complex.
>>> >> >> >
>>> >> >> > Adding options to Arrow's CMake build system to only build
>>> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
>>> >> >> > not be difficult, and amount to writing "make parquet".
>>> >> >> >
>>> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
>>> >> to
>>> >> >> > build and install the Parquet core libraries and their dependencies
>>> >> >> > would be:
>>> >> >> >
>>> >> >> > ninja parquet && ninja install
>>> >> >> >
>>> >> >> > - Wes
>>> >> >> >
>>> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>>> >> >> > <ta...@cloudera.com.invalid> wrote:
>>> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
>>> >> >> >> successful, but I thought I'd give my two cents.
>>> >> >> >>
>>> >> >> >> For me, the thing that makes the biggest difference in contributing
>>> >> to a
>>> >> >> >> new codebase is the number of steps in the workflow for writing,
>>> >> >> testing,
>>> >> >> >> posting and iterating on a commit and also the number of
>>> >> opportunities
>>> >> >> for
>>> >> >> >> missteps. The size of the repo and build/test times matter but are
>>> >> >> >> secondary so long as the workflow is simple and reliable.
>>> >> >> >>
>>> >> >> >> I don't really know what the current state of things is, but it
>>> >> sounds
>>> >> >> like
>>> >> >> >> it's not as simple as check out -> build -> test if you're doing a
>>> >> >> >> cross-repo change. Circular dependencies are a real headache.
>>> >> >> >>
>>> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>>> >> >> wrote:
>>> >> >> >>
>>> >> >> >>> hi,
>>> >> >> >>>
>>> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
>>> >> >> majeti.deepak@gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>> > I think the circular dependency can be broken if we build a new
>>> >> >> library
>>> >> >> >>> for
>>> >> >> >>> > the platform code. This will also make it easy for other projects
>>> >> >> such as
>>> >> >> >>> > ORC to use it.
>>> >> >> >>> > I also remember your proposal a while ago of having a separate
>>> >> >> project
>>> >> >> >>> for
>>> >> >> >>> > the platform code.  That project can live in the arrow repo.
>>> >> >> However, one
>>> >> >> >>> > has to clone the entire apache arrow repo but can just build the
>>> >> >> platform
>>> >> >> >>> > code. This will be temporary until we can find a new home for it.
>>> >> >> >>> >
>>> >> >> >>> > The dependency will look like:
>>> >> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>>> >> >> >>> > libplatform(platform api)
>>> >> >> >>> >
>>> >> >> >>> > CI workflow will clone the arrow project twice, once for the
>>> >> platform
>>> >> >> >>> > library and once for the arrow-core/bindings library.
>>> >> >> >>>
>>> >> >> >>> This seems like an interesting proposal; the best place to work
>>> >> toward
>>> >> >> >>> this goal (if it is even possible; the build system interactions and
>>> >> >> >>> ASF release management are the hard problems) is to have all of the
>>> >> >> >>> code in a single repository. ORC could already be using Arrow if it
>>> >> >> >>> wanted, but the ORC contributors aren't active in Arrow.
>>> >> >> >>>
>>> >> >> >>> >
>>> >> >> >>> > There is no doubt that the collaborations between the Arrow and
>>> >> >> Parquet
>>> >> >> >>> > communities so far have been very successful.
>>> >> >> >>> > The reason to maintain this relationship moving forward is to
>>> >> >> continue to
>>> >> >> >>> > reap the mutual benefits.
>>> >> >> >>> > We should continue to take advantage of sharing code as well.
>>> >> >> However, I
>>> >> >> >>> > don't see any code sharing opportunities between arrow-core and
>>> >> the
>>> >> >> >>> > parquet-core. Both have different functions.
>>> >> >> >>>
>>> >> >>> I think you mean the Arrow columnar format. The Arrow columnar format
>>> >> >>> is only one part of a project that has become quite large already
>>> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>> >> >> >>>
>>> >> >> >>> >
>>> >> >>> > We are at a point where the parquet-cpp public API is pretty stable. We
>>> >> >>> > already passed that difficult stage. My take on arrow and parquet is to
>>> >> >>> > keep them nimble since we can.
>>> >> >> >>>
>>> >> >>> I believe that parquet-core still has much progress ahead of it. We
>>> >> >> >>> have done little work in asynchronous IO and concurrency which would
>>> >> >> >>> yield both improved read and write throughput. This aligns well with
>>> >> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
>>> >> >> >>> believe that more development will happen on parquet-core once the
>>> >> >> >>> development process issues are resolved by having a single codebase,
>>> >> >> >>> single build system, and a single CI framework.
>>> >> >> >>>
>>> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
>>> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>>> >> >> >>> goal I think we should still be open to making significant changes
>>> >> in
>>> >> >> >>> the interest of long term progress.
>>> >> >> >>>
>>> >> >>> Having now worked on these projects for more than 2 and a half years
>>> >> >>> and been the most frequent contributor to both codebases, I'm sadly
>>> >> >>> far past the "breaking point" and not willing to continue contributing
>>> >> >>> in a significant way to parquet-cpp if the projects remain structured
>>> >> >>> as they are now. It's hampering progress and not serving the
>>> >> >>> community.
>>> >> >> >>>
>>> >> >> >>> - Wes
>>> >> >> >>>
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
>>> >> >
>>> >> >> >>> wrote:
>>> >> >> >>> >
>>> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
>>> >> >> arrow
>>> >> >> >>> >> repo. That will remove a majority of the dependency issues.
>>> >> Joshua's
>>> >> >> >>> work
>>> >> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>>> >> >> the
>>> >> >> >>> arrow
>>> >> >> >>> >> repo.  This will be similar to the ORC adaptor.
>>> >> >> >>> >>
>>> >> >> >>> >> This has been suggested before, but I don't see how it would
>>> >> >> alleviate
>>> >> >> >>> >> any issues because of the significant dependencies on other
>>> >> parts of
>>> >> >> >>> >> the Arrow codebase. What you are proposing is:
>>> >> >> >>> >>
>>> >> >> >>> >> - (Arrow) arrow platform
>>> >> >> >>> >> - (Parquet) parquet core
>>> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>>> >> >> >>> >> - (Arrow) Python bindings
>>> >> >> >>> >>
>>> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>>> >> >> >>> >> built before invoking the Parquet core part of the build system.
>>> >> You
>>> >> >> >>> >> would need to pass dependent targets across different CMake build
>>> >> >> >>> >> systems; I don't know if it's possible (I spent some time looking
>>> >> >> into
>>> >> >> >>> >> it earlier this year). This is what I meant by the lack of a
>>> >> >> "concrete
>>> >> >> >>> >> and actionable plan". The only thing that would really work
>>> >> would be
>>> >> >> >>> >> for the Parquet core to be "included" in the Arrow build system
>>> >> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
>>> >> builds
>>> >> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>>> >> >> build
>>> >> >> >>> >> system because it's only depended upon by the Python bindings.
>>> >> >> >>> >>
>>> >> >> >>> >> And even if a solution could be devised, it would not wholly
>>> >> resolve
>>> >> >> >>> >> the CI workflow issues.
>>> >> >> >>> >>
>>> >> >> >>> >> You could make Parquet completely independent of the Arrow
>>> >> codebase,
>>> >> >> >>> >> but at that point there is little reason to maintain a
>>> >> relationship
>>> >> >> >>> >> between the projects or their communities. We have spent a great
>>> >> >> deal
>>> >> >> >>> >> of effort refactoring the two projects to enable as much code
>>> >> >> sharing
>>> >> >> >>> >> as there is now.
>>> >> >> >>> >>
>>> >> >> >>> >> - Wes
>>> >> >> >>> >>
>>> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
>>> >> wesmckinn@gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>>> >> clone
>>> >> >> the
>>> >> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
>>> >> Having
>>> >> >> two
>>> >> >> >>> >> parquet-cpp repos is no way a better approach.
>>> >> >> >>> >> >
>>> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
>>> >> is
>>> >> >> to
>>> >> >> >>> >> > fork. That would obviously be a bad outcome for the community.
>>> >> >> >>> >> >
>>> >> >> >>> >> > It doesn't look like I will be able to convince you that a
>>> >> >> monorepo is
>>> >> >> >>> >> > a good idea; what I would ask instead is that you be willing to
>>> >> >> give
>>> >> >> >>> >> > it a shot, and if it turns out in the way you're describing
>>> >> >> (which I
>>> >> >> >>> >> > don't think it will) then I suggest that we fork at that point.
>>> >> >> >>> >> >
>>> >> >> >>> >> > - Wes
>>> >> >> >>> >> >
>>> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>>> >> >> >>> majeti.deepak@gmail.com>
>>> >> >> >>> >> wrote:
>>> >> >> >>> >> >> Wes,
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>>> >> >> problems
>>> >> >> >>> of a
>>> >> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
>>> >> >>> >> >> Bringing in related Apache community experiences is more meaningful
>>> >> >>> >> >> than how mono-repos work at Google and other big organizations.
>>> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
>>> >> >> developers.
>>> >> >> >>> >> >> You are very well aware of how difficult it has been to find
>>> >> more
>>> >> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
>>> >> has
>>> >> >> a low
>>> >> >> >>> >> >> contribution rate to its core components.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> We should target to ensure that new volunteers who want to
>>> >> >> contribute
>>> >> >> >>> >> >> bug-fixes/features should spend the least amount of time in
>>> >> >> figuring
>>> >> >> >>> out
>>> >> >> >>> >> >> the project repo. We can never come up with an automated build
>>> >> >> system
>>> >> >> >>> >> that
>>> >> >> >>> >> >> caters to every possible environment.
>>> >> >> >>> >> >> My only concern is if the mono-repo will make it harder for
>>> >> new
>>> >> >> >>> >> developers
>>> >> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
>>> >> >> build
>>> >> >> >>> and
>>> >> >> >>> >> test
>>> >> >> >>> >> >> dependencies.
>>> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
>>> >> less
>>> >> >> >>> >> >> co-operative.
>>> >> >> >>> >> >> I just don't think the mono-repo structure model will be
>>> >> >> sustainable
>>> >> >> >>> in
>>> >> >> >>> >> an
>>> >> >> >>> >> >> open source community unless there are long-term vested
>>> >> >> interests. We
>>> >> >> >>> >> can't
>>> >> >> >>> >> >> predict that.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> The current circular dependency problems between Arrow and
>>> >> >> Parquet
>>> >> >> >>> is a
>>> >> >> >>> >> >> major problem for the community and it is important.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
>>> >> >> arrow
>>> >> >> >>> >> repo.
>>> >> >> >>> >> >> That will remove a majority of the dependency issues.
>>> >> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
>>> >> that
>>> >> >> >>> adapter
>>> >> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
>>> >> adaptor.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
>>> >> >> changes
>>> >> >> >>> in
>>> >> >> >>> >> the
>>> >> >> >>> >> >> future to this code should not be the main reason to combine
>>> >> the
>>> >> >> >>> arrow
>>> >> >> >>> >> >> parquet repos.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> "
>>> >> >> >>> >> >> *I question whether it's worth the community's time long term
>>> >> to
>>> >> >> >>> wear*
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
>>> >> >> >>> >> eachlibrary
>>> >> >> >>> >> >> to plug components together rather than utilizing
>>> >> commonplatform
>>> >> >> >>> APIs.*"
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> My answer to your question below would be "Yes".
>>> >> >> >>> Modularity/separation
>>> >> >> >>> >> is
>>> >> >> >>> >> >> very important in an open source community where priorities of
>>> >> >> >>> >> contributors
>>> >> >> >>> >> >> are often short term.
>>> >> >> >>> >> >> The retention is low and therefore the acquisition costs
>>> >> should
>>> >> >> be
>>> >> >> >>> low
>>> >> >> >>> >> as
>>> >> >> >>> >> >> well. This is the community over code approach according to
>>> >> me.
>>> >> >> Minor
>>> >> >> >>> >> code
>>> >> >> >>> >> >> duplication is not a deal breaker.
>>> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
>>> >> big
>>> >> >> >>> data
>>> >> >> >>> >> >> space serving their own functions.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>>> >> clone
>>> >> >> the
>>> >> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>>> >> >> Having
>>> >> >> >>> two
>>> >> >> >>> >> >> parquet-cpp repos is no way a better approach.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>>> >> >> wesmckinn@gmail.com>
>>> >> >> >>> >> wrote:
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>> @Antoine
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
>>> >> would
>>> >> >> >>> slightly
>>> >> >> >>> >> >>> increase Arrow CI times (which are already too large).
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>>> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>>> >> >> >>> >> >>>
>>> >> >>> >> >>> A Parquet run takes about 28 minutes:
>>> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
>>> >> >> certain
>>> >> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>>> >> >> could be
>>> >> >> >>> >> >>> made substantially shorter by moving some of the slower parts
>>> >> >> (like
>>> >> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>>> >> >> nightly
>>> >> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>>> >> >> also
>>> >> >> >>> >> >>> improve build times (valgrind build could be moved to a
>>> >> nightly
>>> >> >> >>> >> >>> exhaustive test run)
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> - Wes
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>>> >> >> wesmckinn@gmail.com
>>> >> >> >>> >
>>> >> >> >>> >> >>> wrote:
>>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>>> >> great
>>> >> >> >>> >> example of
>>> >> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>>> >> >> >>> codebase.
>>> >> >> >>> >> That
>>> >> >> >>> >> >>> gives me hope that the projects could be managed separately
>>> >> some
>>> >> >> >>> day.
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
>>> >> C++
>>> >> >> >>> codebase
>>> >> >> >>> >> >>> > features several areas of duplicated logic which could be
>>> >> >> >>> replaced by
>>> >> >> >>> >> >>> > components from the Arrow platform for better platform-wide
>>> >> >> >>> >> >>> > interoperability:
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>>
>>> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
>>> >> >> >>> orc/OrcFile.hh#L37
>>> >> >> >>> >> >>> >
>>> >> >> >>> >>
>>> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>>
>>> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
>>> >> >> >>> orc/MemoryPool.hh
>>> >> >> >>> >> >>> >
>>> >> >> >>> >>
>>> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/
>>> >> >> >>> OutputStream.hh
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>>> >> >> cause of
>>> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>>> >> >> them
>>> >> >> >>> from
>>> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
>>> >> is
>>> >> >> only
>>> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > I question whether it's worth the community's time long
>>> >> term
>>> >> >> to
>>> >> >> >>> wear
>>> >> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
>>> >> in
>>> >> >> each
>>> >> >> >>> >> >>> > library to plug components together rather than utilizing
>>> >> >> common
>>> >> >> >>> >> >>> > platform APIs.
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > - Wes
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>>> >> >> >>> >> joshuastorck@gmail.com>
>>> >> >> >>> >> >>> wrote:
>>> >> >> >>> >> >>> >> You're point about the constraints of the ASF release
>>> >> >> process are
>>> >> >> >>> >> well
>>> >> >> >>> >> >>> >> taken and as a developer who's trying to work in the
>>> >> current
>>> >> >> >>> >> >>> environment I
>>> >> >> >>> >> >>> >> would be much happier if the codebases were merged. The
>>> >> main
>>> >> >> >>> issues
>>> >> >> >>> >> I
>>> >> >> >>> >> >>> worry
>>> >> >> >>> >> >>> >> about when you put codebases like these together are:
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
>>> >> >> becomes
>>> >> >> >>> too
>>> >> >> >>> >> >>> coupled
>>> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>>> >> >> tree are
>>> >> >> >>> >> >>> delayed
>>> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> If the project/release management is structured well and
>>> >> >> someone
>>> >> >> >>> >> keeps
>>> >> >> >>> >> >>> an
>>> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>>> >> great
>>> >> >> >>> >> example of
>>> >> >> >>> >> >>> how
>>> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>>> >> >> >>> codebase.
>>> >> >> >>> >> That
>>> >> >> >>> >> >>> >> gives me hope that the projects could be managed
>>> >> separately
>>> >> >> some
>>> >> >> >>> >> day.
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>>> >> >> >>> wesmckinn@gmail.com>
>>> >> >> >>> >> >>> wrote:
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >>> hi Josh,
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>>> >> >> arrow
>>> >> >> >>> and
>>> >> >> >>> >> >>> tying
>>> >> >> >>> >> >>> >>> them together seems like the wrong choice.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>>> >> >> people
>>> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
>>> >> >> agree
>>> >> >> >>> >> with?)
>>> >> >> >>> >> >>> >>> is that we should work more closely together until the
>>> >> >> community
>>> >> >> >>> >> grows
>>> >> >> >>> >> >>> >>> large enough to support larger-scope process than we have
>>> >> >> now.
>>> >> >> >>> As
>>> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
>>> >> these
>>> >> >> >>> >> projects.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
>>> >> own
>>> >> >> >>> >> codebase.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
>>> >> into
>>> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination
>>> >> of
>>> >> >> the
>>> >> >> >>> >> GitHub
>>> >> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>>> >> >> >>> idealistic,
>>> >> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
>>> >> devise
>>> >> >> a
>>> >> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>>> >> >> per
>>> >> >> >>> day
>>> >> >> >>> >> >>> >>> which may touch both code and build system simultaneously
>>> >> >> >>> without
>>> >> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
>>> >> see
>>> >> >> how
>>> >> >> >>> we
>>> >> >> >>> >> can
>>> >> >> >>> >> >>> >>> move forward.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>>> >> >> codebases
>>> >> >> >>> >> in the
>>> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
>>> >> >> the
>>> >> >> >>> near
>>> >> >> >>> >> >>> term.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
>>> >> to
>>> >> >> be
>>> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
>>> >> and
>>> >> >> >>> >> community
>>> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
>>> >> the
>>> >> >> >>> >> current
>>> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
>>> >> consider
>>> >> >> >>> >> >>> >>> development process and ASF releases separately. My
>>> >> >> argument is
>>> >> >> >>> as
>>> >> >> >>> >> >>> >>> follows:
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
>>> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
>>> >> PMCs
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> - Wes
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>>> >> >> >>> >> joshuastorck@gmail.com
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> >>> wrote:
>>> >> >> >>> >> >>> >>> > I recently worked on an issue that had to be
>>> >> implemented
>>> >> >> in
>>> >> >> >>> >> >>> parquet-cpp
>>> >> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>>> >> >> >>> >> (ARROW-2585,
>>> >> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
>>> >> confusing
>>> >> >> and
>>> >> >> >>> >> hard to
>>> >> >> >>> >> >>> work
>>> >> >> >>> >> >>> >>> > with. For example, I still have a PR open in
>>> >> parquet-cpp
>>> >> >> >>> >> (created on
>>> >> >> >>> >> >>> May
>>> >> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
>>> >> was
>>> >> >> >>> >> recently
>>> >> >> >>> >> >>> >>> merged.
>>> >> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>>> >> >> the
>>> >> >> >>> >> change in
>>> >> >> >>> >> >>> >>> arrow
>>> >> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>>> >> >> >>> >> >>> >>> run_clang_format.py
>>> >> >> >>> >> >>> >>> > script in the arrow project only to find out later that
>>> >> >> there
>>> >> >> >>> >> was an
>>> >> >> >>> >> >>> >>> exact
>>> >> >> >>> >> >>> >>> > copy of it in parquet-cpp.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
>>> >> sense
>>> >> >> in
>>> >> >> >>> the
>>> >> >> >>> >> long
>>> >> >> >>> >> >>> >>> term.
>>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>>> >> >> arrow
>>> >> >> >>> and
>>> >> >> >>> >> >>> tying
>>> >> >> >>> >> >>> >>> them
>>> >> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
>>> >> other
>>> >> >> >>> formats
>>> >> >> >>> >> >>> that
>>> >> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
>>> >> (e.g. -
>>> >> >> >>> Orc),
>>> >> >> >>> >> so I
>>> >> >> >>> >> >>> >>> don't
>>> >> >> >>> >> >>> >>> > see why parquet should be special. I also think build
>>> >> >> tooling
>>> >> >> >>> >> should
>>> >> >> >>> >> >>> be
>>> >> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
>>> >> history
>>> >> >> of
>>> >> >> >>> >> >>> developing
>>> >> >> >>> >> >>> >>> open
>>> >> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>>> >> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
>>> >> CI
>>> >> >> is a
>>> >> >> >>> >> good
>>> >> >> >>> >> >>> >>> > counter-example since there have been lots of
>>> >> successful
>>> >> >> open
>>> >> >> >>> >> source
>>> >> >> >>> >> >>> >>> > projects that have used nightly build systems that
>>> >> pinned
>>> >> >> >>> >> versions of
>>> >> >> >>> >> >>> >>> > dependent software.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>>> >> >> codebases
>>> >> >> >>> >> in the
>>> >> >> >>> >> >>> >>> short
>>> >> >> >>> >> >>> >>> > term with the express purpose of separating them in the
>>> >> >> near
>>> >> >> >>> >> term.
>>> >> >> >>> >> >>> My
>>> >> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>>> >> >> together,
>>> >> >> >>> you
>>> >> >> >>> >> can
>>> >> >> >>> >> >>> more
>>> >> >> >>> >> >>> >>> > easily delineate the boundaries between the API's with
>>> >> a
>>> >> >> >>> single
>>> >> >> >>> >> PR.
>>> >> >> >>> >> >>> >>> Second,
>>> >> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
>>> >> >> >>> diverge,
>>> >> >> >>> >> >>> which has
>>> >> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>>> >> >> been
>>> >> >> >>> >> sorted
>>> >> >> >>> >> >>> out,
>>> >> >> >>> >> >>> >>> it
>>> >> >> >>> >> >>> >>> > should be easy to separate them back into their own
>>> >> >> codebases.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>>> >> >> >>> codebases
>>> >> >> >>> >> for
>>> >> >> >>> >> >>> arrow
>>> >> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
>>> >> the
>>> >> >> >>> >> >>> perspective of
>>> >> >> >>> >> >>> >>> a
>>> >> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
>>> >> is a
>>> >> >> >>> large
>>> >> >> >>> >> tax
>>> >> >> >>> >> >>> to
>>> >> >> >>> >> >>> >>> pay
>>> >> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
>>> >> >> in the
>>> >> >> >>> >> 0.10.0
>>> >> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
>>> >> >> release. I
>>> >> >> >>> >> hope
>>> >> >> >>> >> >>> that
>>> >> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
>>> >> >> help
>>> >> >> >>> >> reduce
>>> >> >> >>> >> >>> the
>>> >> >> >>> >> >>> >>> > complexity of the build/release tooling.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>>> >> >> >>> >> ted.dunning@gmail.com>
>>> >> >> >>> >> >>> >>> wrote:
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>>> >> >> >>> >> wesmckinn@gmail.com>
>>> >> >> >>> >> >>> >>> wrote:
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> >
>>> >> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
>>> >> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
>>> >> for
>>> >> >> >>> >> stability
>>> >> >> >>> >> >>> and
>>> >> >> >>> >> >>> >>> API
>>> >> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
>>> >> >> HDFS
>>> >> >> >>> >> >>> community
>>> >> >> >>> >> >>> >>> took
>>> >> >> >>> >> >>> >>> >> a
>>> >> >> >>> >> >>> >>> >> > > significantly long time for the very same reason.
>>> >> >> >>> >> >>> >>> >> >
>>> >> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
>>> >> >> source
>>> >> >> >>> >> >>> community as
>>> >> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
>>> >> >> didn't
>>> >> >> >>> go
>>> >> >> >>> >> the
>>> >> >> >>> >> >>> way
>>> >> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>>> >> >> >>> community
>>> >> >> >>> >> which
>>> >> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
>>> >> >> model.
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> There are some more radical and community building
>>> >> >> options as
>>> >> >> >>> >> well.
>>> >> >> >>> >> >>> Take
>>> >> >> >>> >> >>> >>> >> the subversion project as a precedent. With
>>> >> subversion,
>>> >> >> any
>>> >> >> >>> >> Apache
>>> >> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
>>> >> >> large
>>> >> >> >>> >> >>> fraction of
>>> >> >> >>> >> >>> >>> >> subversion.
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> So why not take this a bit further and give every
>>> >> parquet
>>> >> >> >>> >> committer
>>> >> >> >>> >> >>> a
>>> >> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>>> >> >> >>> >> committers in
>>> >> >> >>> >> >>> >>> Arrow?
>>> >> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
>>> >> >> committer who
>>> >> >> >>> >> asks
>>> >> >> >>> >> >>> will
>>> >> >> >>> >> >>> >>> be
>>> >> >> >>> >> >>> >>> >> given committer status in Arrow.
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
>>> >> Parquet
>>> >> >> >>> >> committers
>>> >> >> >>> >> >>> >>> can't be
>>> >> >> >>> >> >>> >>> >> worried at that point whether their patches will get
>>> >> >> merged;
>>> >> >> >>> >> they
>>> >> >> >>> >> >>> can
>>> >> >> >>> >> >>> >>> just
>>> >> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
>>> >> >> in the
>>> >> >> >>> >> >>> Parquet
>>> >> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>>> >> >> >>> parquet so
>>> >> >> >>> >> >>> why not
>>> >> >> >>> >> >>> >>> >> invite them in?
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> --
>>> >> >> >>> >> >> regards,
>>> >> >> >>> >> >> Deepak Majeti
>>> >> >> >>> >>
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> > --
>>> >> >> >>> > regards,
>>> >> >> >>> > Deepak Majeti
>>> >> >> >>>
>>> >> >>
>>> >> >
>>> >> >
>>> >> > --
>>> >> > regards,
>>> >> > Deepak Majeti
>>> >>
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
A couple more points to make re: Uwe's comments:

> An important point that we should keep in mind (and why I was a bit concerned the previous times this discussion was raised) is that we have to be careful to not pull everything that touches Arrow into the Arrow repository.

An important distinction here is community and development process,
why I focused on the "Community over Code" rationale in my original
e-mail. I think we should make decisions that optimize for the
community's health and productivity over satisfying arbitrary
constraints.

Our community's health should be measured by our ability to author and
merge changes as painlessly as possible and to be able to consistently
deliver high quality software releases. IMHO we have fallen short of
both of these goals.

> Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term.

A monorepo does not imply that components cannot have separate release
cycles. As an example, in the JavaScript community the tool Lerna was
developed precisely to address this problem:

https://lernajs.io/

"Splitting up large codebases into separate independently versioned
packages is extremely useful for code sharing. However, making changes
across many repositories is messy and difficult to track, and testing
across repositories gets complicated really fast.

To solve these (and many other) problems, some projects will organize
their codebases into multi-package repositories. Projects like Babel,
React, Angular, Ember, Meteor, Jest, and many others develop all of
their packages within a single repository."

I think that we have arrived at exactly the same point as these other
projects. We may need to develop some of our own tooling to assist
with managing our monorepo (where we are already shipping multiple
components).
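One concrete piece of such tooling (sketched here with hypothetical
component paths and job names, not the actual Arrow layout) is a
change-detection script that maps the files touched by a commit to the
subset of CI jobs that actually need to run:

```shell
# Hedged sketch: map changed files to CI jobs so unaffected builds can
# be skipped. Paths and job names below are illustrative assumptions.
changed="cpp/src/parquet/reader.cc python/pyarrow/table.py"
jobs=""
for f in $changed; do
  case "$f" in
    cpp/src/parquet/*) jobs="$jobs parquet-cpp" ;;
    cpp/*)             jobs="$jobs arrow-cpp" ;;
    python/*)          jobs="$jobs python" ;;
  esac
done
# Deduplicate and print the jobs that need to run for this change set.
echo "$jobs" | tr ' ' '\n' | sort -u | sed '/^$/d' | tee /tmp/ci_jobs.txt
```

A CI bot could feed this the output of `git diff --name-only` and
trigger only the listed jobs, which is one way to keep monorepo CI
times in check.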

- Wes

On Sun, Aug 19, 2018 at 1:30 PM, Wes McKinney <we...@gmail.com> wrote:
> hi Uwe,
>
> I agree with your points. Currently we have 3 software artifacts:
>
> 1. Arrow C++ libraries
> 2. Parquet C++ libraries with Arrow columnar integration
> 3. C++ interop layer for Python + Cython bindings
>
> Changes in #1 prompt an awkward workflow involving multiple PRs; as a
> result of this we just recently jumped the pinned version of Arrow in
> parquet-cpp forward by 8 months. This obviously is an antipattern. If
> we had a much larger group of core developers, this might be more
> maintainable.
>
> Of course changes in #2 also impact #3; a lot of our bug reports and
> feature requests are coming inbound because of #3, and we have
> struggled to be able to respond to the needs of users (and other
> developers like Robert Gruener who are trying to use this software in
> a large data warehouse)
>
> There is also the release coordination issue where having users
> simultaneously using a released version of both projects hasn't really
> happened, so we effectively already have been treating Parquet like a
> vendored component in our packaging process.
>
> Realistically, I think once #2 has become more functionally complete
> and, as a result, a more slowly moving piece of software, we can
> contemplate splitting out all or parts of its development process back
> into another repository. I think we have a ton of work to do yet on
> Parquet core, particularly optimizing for high-latency storage (HDFS,
> S3, GCP, etc.), and it wouldn't really make sense to do such
> platform-level work anywhere but in #1.
>
> - Wes
>
> On Sun, Aug 19, 2018 at 8:37 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> Back from vacation, I also want to finally raise my voice.
>>
>> With the current state of the Parquet<->Arrow development, I see a benefit in merging the code bases for now, but not necessarily forever.
>>
>> Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter is built and which uses some of the more standard-library-like features of Arrow. It is also the go-to place where the same toolchain and CI setup is used. Here we also directly apply all improvements that we make in Arrow itself. These are the points that make it special in comparison to other tools providing Arrow adapters, like Turbodbc.
>>
>> Thus, I think that the current move to merge the code bases is ok for me. I must say that I'm not 100% certain that this is the best move but currently I lack better alternatives. As previously mentioned, we should take extra care that we can still do separate releases and also provide a path for a future where we split parquet-cpp into its own project/repository again.
>>
>> An important point that we should keep in mind (and why I was a bit concerned the previous times this discussion was raised) is that we have to be careful to not pull everything that touches Arrow into the Arrow repository. Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term. I expect that there will be many more projects that will use Arrow's I/O libraries as well as emit Arrow structures. These libraries should be also usable in Python/C++/Ruby/R/… These libraries are then hopefully not all developed by the same core group of Arrow/Parquet developers we have currently. For this to function really well, we will need a more stable API in Arrow as well as a good set of build tooling that other libraries can build upon when using Arrow functionality. In addition to being stable, the API must also provide a good UX in the abstraction layers the Arrow functions are provided in, so that high-performance applications are not high-maintenance due to frequent API changes in Arrow. That said, this is currently a wish for the future. We are currently building and iterating heavily on these APIs to form a good basis for future developments. Thus the repo merge will hopefully improve the development speed so that we have to spend less time on toolchain maintenance and can focus on the user-facing APIs.
>>
>> Uwe
>>
>> On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
>>> Thanks Ryan, will do. The people I'd still like to hear from are:
>>>
>>> * Phillip Cloud
>>> * Uwe Korn
>>>
>>> As ASF contributors we are responsible to both be pragmatic as well as
>>> act in the best interests of the community's health and productivity.
>>>
>>>
>>>
>>> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> > I don't have an opinion here, but could someone send a summary of what is
>>> > decided to the dev list once there is consensus? This is a long thread for
>>> > parts of the project I don't work on, so I haven't followed it very closely.
>>> >
>>> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
>>> >
>>> >> > It will be difficult to track parquet-cpp changes if they get mixed
>>> >> > with Arrow changes. Will we establish some guidelines for filing
>>> >> > Parquet JIRAs? Can we enforce that parquet-cpp changes will not be
>>> >> > committed without a corresponding Parquet JIRA?
>>> >>
>>> >> I think we would use the following policy:
>>> >>
>>> >> * use PARQUET-XXX for issues relating to Parquet core
>>> >> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
>>> >> core (e.g. changes that are in parquet/arrow right now)
>>> >>
>>> >> We've already been dealing with annoyances relating to issues
>>> >> straddling the two projects (debugging an issue on Arrow side to find
>>> >> that it has to be fixed on Parquet side); this would make things
>>> >> simpler for us.
>>> >>
>>> >> > I would also like to keep changes to parquet-cpp on a separate commit
>>> >> > to simplify forking later (if needed) and be able to maintain the
>>> >> > commit history. I don't know if it's possible to squash parquet-cpp
>>> >> > commits and arrow commits separately before merging.
>>> >>
>>> >> This seems rather onerous for both contributors and maintainers and
>>> >> not in line with the goal of improving productivity. In the event that
>>> >> we fork I see it as a traumatic event for the community. If it does
>>> >> happen, then we can write a script (using git filter-branch and other
>>> >> such tools) to extract commits related to the forked code.
>>> >>
>>> >> - Wes
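For illustration, the kind of history-extraction script mentioned above
might look like the following. This is a hedged sketch run against a
throwaway repository; the `src/parquet` layout and the commit messages
are hypothetical, not the actual Arrow tree:

```shell
# Sketch: if the community ever forks parquet-cpp back out, rewrite
# history so only commits touching the Parquet sources survive, with
# that directory promoted to the new repository root.
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d)
cd "$repo" && git init -q .
git config user.email dev@example.org
git config user.name dev
mkdir -p src/parquet src/arrow
echo reader > src/parquet/reader.cc
git add . && git commit -qm "PARQUET-1: add reader"
echo array > src/arrow/array.cc
git add . && git commit -qm "ARROW-1: add array"
# Keep only the history that touches src/parquet.
git filter-branch -f --prune-empty --subdirectory-filter src/parquet HEAD
git log --format=%s > /tmp/parquet_history.txt
cat /tmp/parquet_history.txt   # should print only: PARQUET-1: add reader
```

The same idea scales to a real monorepo: the Arrow-only commits fall
away and `src/parquet` becomes the root of the extracted repository,
so separate per-project commits would not need to be enforced up front.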
>>> >>
>>> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com> wrote:
>>> >> > I have a few more logistical questions to add.
>>> >> >
>>> >> > It will be difficult to track parquet-cpp changes if they get mixed
>>> >> > with Arrow changes. Will we establish some guidelines for filing
>>> >> > Parquet JIRAs? Can we enforce that parquet-cpp changes will not be
>>> >> > committed without a corresponding Parquet JIRA?
>>> >> >
>>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>>> >> > simplify forking later (if needed) and be able to maintain the commit
>>> >> > history. I don't know if it's possible to squash parquet-cpp commits and
>>> >> > arrow commits separately before merging.
>>> >> >
>>> >> >
>>> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>>> >> >
>>> >> >> Do other people have opinions? I would like to undertake this work in
>>> >> >> the near future (the next 8-10 weeks); I would be OK with taking
>>> >> >> responsibility for the primary codebase surgery.
>>> >> >>
>>> >> >> Some logistical questions:
>>> >> >>
>>> >> >> * We have a handful of pull requests in flight in parquet-cpp that
>>> >> >> would need to be resolved / merged
>>> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
>>> >> >> releases cut out of the new structure
>>> >> >> * Management of shared commit rights (I can discuss with the Arrow
>>> >> >> PMC; I believe that approving any committer who has actively
>>> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
>>> >> >> comments)
>>> >> >>
>>> >> >> If working more closely together proves not to work out after some
>>> >> >> period of time, I will be fully supportive of a fork or something
>>> >> >> like it.
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Wes
>>> >> >>
>>> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
>>> >> >> > Thanks Tim.
>>> >> >> >
>>> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
>>> >> >> > platform code intending to improve the performance of bit-packing in
>>> >> >> > Parquet writes, and we ended up with 2 interdependent PRs:
>>> >> >> >
>>> >> >> > * https://github.com/apache/parquet-cpp/pull/483
>>> >> >> > * https://github.com/apache/arrow/pull/2355
>>> >> >> >
>>> >> >> > Changes that impact the Python interface to Parquet are even more
>>> >> >> > complex.
>>> >> >> >
>>> >> >> > Adding options to Arrow's CMake build system to only build
>>> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
>>> >> >> > not be difficult, and would amount to writing "make parquet".
>>> >> >> >
>>> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
>>> >> >> > build and install the Parquet core libraries and their dependencies
>>> >> >> > would be:
>>> >> >> >
>>> >> >> > ninja parquet && ninja install
>>> >> >> >
>>> >> >> > - Wes
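As a sketch of what the "make parquet" idea could look like in CMake
terms (the `parquet_shared`/`parquet_static` target names are
assumptions for illustration, not the actual build files), a meta-target
can group the Parquet libraries so `ninja parquet` builds only that
subset of the monorepo:

```shell
# Hedged sketch: write a CMake fragment defining a `parquet` meta-target
# that depends on the (hypothetical) Parquet library targets, so that
# `ninja parquet` builds them and, transitively, the platform code only.
mkdir -p /tmp/parquet-target-sketch
cat > /tmp/parquet-target-sketch/parquet_targets.cmake <<'EOF'
# Meta-target: `ninja parquet` builds the Parquet core libraries and
# whatever they depend on -- nothing else in the monorepo.
add_custom_target(parquet)
add_dependencies(parquet parquet_shared parquet_static)
EOF
cat /tmp/parquet-target-sketch/parquet_targets.cmake
```

Included from the top-level CMakeLists.txt, such a fragment is the
mechanism behind the `ninja parquet && ninja install` workflow above.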
>>> >> >> >
>>> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>>> >> >> > <ta...@cloudera.com.invalid> wrote:
>>> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
>>> >> >> >> successful, but I thought I'd give my two cents.
>>> >> >> >>
>>> >> >> >> For me, the thing that makes the biggest difference in contributing
>>> >> >> >> to a new codebase is the number of steps in the workflow for
>>> >> >> >> writing, testing, posting and iterating on a commit, and also the
>>> >> >> >> number of opportunities for missteps. The size of the repo and
>>> >> >> >> build/test times matter but are secondary so long as the workflow is
>>> >> >> >> simple and reliable.
>>> >> >> >>
>>> >> >> >> I don't really know what the current state of things is, but it
>>> >> >> >> sounds like it's not as simple as check out -> build -> test if
>>> >> >> >> you're doing a cross-repo change. Circular dependencies are a real
>>> >> >> >> headache.
>>> >> >> >>
>>> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >>> hi,
>>> >> >> >>>
>>> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.deepak@gmail.com> wrote:
>>> >> >> >>> > I think the circular dependency can be broken if we build a new
>>> >> >> >>> > library for the platform code. This will also make it easy for
>>> >> >> >>> > other projects such as ORC to use it.
>>> >> >> >>> > I also remember your proposal a while ago of having a separate
>>> >> >> >>> > project for the platform code. That project can live in the arrow
>>> >> >> >>> > repo. However, one has to clone the entire apache arrow repo but
>>> >> >> >>> > can just build the platform code. This will be temporary until we
>>> >> >> >>> > can find a new home for it.
>>> >> >> >>> >
>>> >> >> >>> > The dependency will look like:
>>> >> >> >>> > libarrow (arrow core / bindings) <- libparquet (parquet core) <-
>>> >> >> >>> > libplatform (platform api)
>>> >> >> >>> >
>>> >> >> >>> > CI workflow will clone the arrow project twice, once for the
>>> >> >> >>> > platform library and once for the arrow-core/bindings library.
>>> >> >> >>>
>>> >> >> >>> This seems like an interesting proposal; the best place to work
>>> >> >> >>> toward this goal (if it is even possible; the build system
>>> >> >> >>> interactions and ASF release management are the hard problems) is
>>> >> >> >>> to have all of the code in a single repository. ORC could already
>>> >> >> >>> be using Arrow if it wanted, but the ORC contributors aren't active
>>> >> >> >>> in Arrow.
>>> >> >> >>>
>>> >> >> >>> >
>>> >> >> >>> > There is no doubt that the collaborations between the Arrow and
>>> >> >> >>> > Parquet communities so far have been very successful.
>>> >> >> >>> > The reason to maintain this relationship moving forward is to
>>> >> >> >>> > continue to reap the mutual benefits.
>>> >> >> >>> > We should continue to take advantage of sharing code as well.
>>> >> >> >>> > However, I don't see any code sharing opportunities between
>>> >> >> >>> > arrow-core and parquet-core. Both have different functions.
>>> >> >> >>>
>>> >> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
>>> >> >> >>> format is only one part of a project that has become quite large
>>> >> >> >>> already
>>> >> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>> >> >> >>>
>>> >> >> >>> >
>>> >> >> >>> > We are at a point where the parquet-cpp public API is pretty
>>> >> >> >>> > stable. We already passed that difficult stage. My take on arrow
>>> >> >> >>> > and parquet is to keep them nimble since we can.
>>> >> >> >>>
>>> >> >> >>> I believe that parquet-core still has significant progress ahead
>>> >> >> >>> of it. We have done little work in asynchronous IO and concurrency,
>>> >> >> >>> which would yield both improved read and write throughput. This
>>> >> >> >>> aligns well with other concurrency and async-IO work planned in the
>>> >> >> >>> Arrow platform. I believe that more development will happen on
>>> >> >> >>> parquet-core once the development process issues are resolved by
>>> >> >> >>> having a single codebase, single build system, and a single CI
>>> >> >> >>> framework.
>>> >> >> >>>
>>> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
>>> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>>> >> >> >>> goal, I think we should still be open to making significant changes
>>> >> >> >>> in the interest of long-term progress.
>>> >> >> >>>
>>> >> >> >>> Having now worked on these projects for more than two and a half
>>> >> >> >>> years as the most frequent contributor to both codebases, I'm
>>> >> >> >>> sadly far past the "breaking point" and not willing to continue
>>> >> >> >>> contributing in a significant way to parquet-cpp if the projects
>>> >> >> >>> remain structured as they are now. It's hampering progress and not
>>> >> >> >>> serving the community.
>>> >> >> >>>
>>> >> >> >>> - Wes
>>> >> >> >>>
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
>>> >> >
>>> >> >> >>> wrote:
>>> >> >> >>> >
>>> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
>>> >> >> arrow
>>> >> >> >>> >> repo. That will remove a majority of the dependency issues.
>>> >> Joshua's
>>> >> >> >>> work
>>> >> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>>> >> >> the
>>> >> >> >>> arrow
>>> >> >> >>> >> repo.  This will be similar to the ORC adaptor.
>>> >> >> >>> >>
>>> >> >> >>> >> This has been suggested before, but I don't see how it would
>>> >> >> alleviate
>>> >> >> >>> >> any issues because of the significant dependencies on other
>>> >> parts of
>>> >> >> >>> >> the Arrow codebase. What you are proposing is:
>>> >> >> >>> >>
>>> >> >> >>> >> - (Arrow) arrow platform
>>> >> >> >>> >> - (Parquet) parquet core
>>> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>>> >> >> >>> >> - (Arrow) Python bindings
>>> >> >> >>> >>
>>> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>>> >> >> >>> >> built before invoking the Parquet core part of the build system.
>>> >> You
>>> >> >> >>> >> would need to pass dependent targets across different CMake build
>>> >> >> >>> >> systems; I don't know if it's possible (I spent some time looking
>>> >> >> into
>>> >> >> >>> >> it earlier this year). This is what I meant by the lack of a
>>> >> >> "concrete
>>> >> >> >>> >> and actionable plan". The only thing that would really work
>>> >> would be
>>> >> >> >>> >> for the Parquet core to be "included" in the Arrow build system
>>> >> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
>>> >> builds
>>> >> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>>> >> >> build
>>> >> >> >>> >> system because it's only depended upon by the Python bindings.
>>> >> >> >>> >>
>>> >> >> >>> >> And even if a solution could be devised, it would not wholly
>>> >> resolve
>>> >> >> >>> >> the CI workflow issues.
>>> >> >> >>> >>
>>> >> >> >>> >> You could make Parquet completely independent of the Arrow
>>> >> codebase,
>>> >> >> >>> >> but at that point there is little reason to maintain a
>>> >> relationship
>>> >> >> >>> >> between the projects or their communities. We have spent a great
>>> >> >> deal
>>> >> >> >>> >> of effort refactoring the two projects to enable as much code
>>> >> >> sharing
>>> >> >> >>> >> as there is now.
>>> >> >> >>> >>
>>> >> >> >>> >> - Wes
>>> >> >> >>> >>
>>> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
>>> >> wesmckinn@gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>>> >> clone
>>> >> >> the
>>> >> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
>>> >> Having
>>> >> >> two
>>> >> >> >>> >> parquet-cpp repos is in no way a better approach.
>>> >> >> >>> >> >
>>> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
>>> >> is
>>> >> >> to
>>> >> >> >>> >> > fork. That would obviously be a bad outcome for the community.
>>> >> >> >>> >> >
>>> >> >> >>> >> > It doesn't look like I will be able to convince you that a
>>> >> >> monorepo is
>>> >> >> >>> >> > a good idea; what I would ask instead is that you be willing to
>>> >> >> give
>>> >> >> >>> >> > it a shot, and if it turns out in the way you're describing
>>> >> >> (which I
>>> >> >> >>> >> > don't think it will) then I suggest that we fork at that point.
>>> >> >> >>> >> >
>>> >> >> >>> >> > - Wes
>>> >> >> >>> >> >
>>> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>>> >> >> >>> majeti.deepak@gmail.com>
>>> >> >> >>> >> wrote:
>>> >> >> >>> >> >> Wes,
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> Unfortunately, I cannot show you practical, fact-based problems
>>> >> >> >>> >> >> with an Arrow-Parquet mono-repo that does not yet exist.
>>> >> >> >>> >> >> Bringing in related Apache community experiences is more
>>> >> >> >>> >> >> meaningful than how mono-repos work at Google and other big
>>> >> >> >>> >> >> organizations.
>>> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
>>> >> >> developers.
>>> >> >> >>> >> >> You are very well aware of how difficult it has been to find
>>> >> more
>>> >> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
>>> >> has
>>> >> >> a low
>>> >> >> >>> >> >> contribution rate to its core components.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> We should aim to ensure that new volunteers who want to
>>> >> >> >>> >> >> contribute bug fixes/features spend the least amount of time
>>> >> >> >>> >> >> figuring out the project repo. We can never come up with an
>>> >> >> >>> >> >> automated build
>>> >> >> system
>>> >> >> >>> >> that
>>> >> >> >>> >> >> caters to every possible environment.
>>> >> >> >>> >> >> My only concern is whether the mono-repo will make it harder for
>>> >> new
>>> >> >> >>> >> developers
>>> >> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
>>> >> >> build
>>> >> >> >>> and
>>> >> >> >>> >> test
>>> >> >> >>> >> >> dependencies.
>>> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
>>> >> less
>>> >> >> >>> >> >> co-operative.
>>> >> >> >>> >> >> I just don't think the mono-repo structure model will be
>>> >> >> sustainable
>>> >> >> >>> in
>>> >> >> >>> >> an
>>> >> >> >>> >> >> open source community unless there are long-term vested
>>> >> >> interests. We
>>> >> >> >>> >> can't
>>> >> >> >>> >> >> predict that.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> The current circular dependency problems between Arrow and
>>> >> >> >>> >> >> Parquet are a major problem for the community.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
>>> >> >> arrow
>>> >> >> >>> >> repo.
>>> >> >> >>> >> >> That will remove a majority of the dependency issues.
>>> >> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
>>> >> that
>>> >> >> >>> adapter
>>> >> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
>>> >> adaptor.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
>>> >> >> changes
>>> >> >> >>> in
>>> >> >> >>> >> the
>>> >> >> >>> >> >> future to this code should not be the main reason to combine
>>> >> the
>>> >> >> >>> arrow
>>> >> >> >>> >> >> parquet repos.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> "
>>> >> >> >>> >> >> *I question whether it's worth the community's time long term
>>> >> to
>>> >> >> >>> wear*
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
>>> >> >> >>> >> eachlibrary
>>> >> >> >>> >> >> to plug components together rather than utilizing
>>> >> commonplatform
>>> >> >> >>> APIs.*"
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> My answer to your question below would be "Yes".
>>> >> >> >>> Modularity/separation
>>> >> >> >>> >> is
>>> >> >> >>> >> >> very important in an open source community where priorities of
>>> >> >> >>> >> contributors
>>> >> >> >>> >> >> are often short term.
>>> >> >> >>> >> >> The retention is low and therefore the acquisition costs
>>> >> should
>>> >> >> be
>>> >> >> >>> low
>>> >> >> >>> >> as
>>> >> >> >>> >> >> well. This is the "community over code" approach, in my view.
>>> >> >> >>> >> >> Minor code duplication is not a deal breaker.
>>> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
>>> >> big
>>> >> >> >>> data
>>> >> >> >>> >> >> space serving their own functions.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>>> >> clone
>>> >> >> the
>>> >> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>>> >> >> Having
>>> >> >> >>> two
>>> >> >> >>> >> >> parquet-cpp repos is in no way a better approach.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>>> >> >> wesmckinn@gmail.com>
>>> >> >> >>> >> wrote:
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>> @Antoine
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
>>> >> would
>>> >> >> >>> slightly
>>> >> >> >>> >> >>> increase Arrow CI times (which are already too large).
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>>> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> A Parquet run takes about 28 minutes:
>>> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
>>> >> >> certain
>>> >> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
>>> >> >> >>> >> >>> made substantially shorter by moving some of the slower parts
>>> >> >> (like
>>> >> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>>> >> >> nightly
>>> >> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>>> >> >> also
>>> >> >> >>> >> >>> improve build times (valgrind build could be moved to a
>>> >> nightly
>>> >> >> >>> >> >>> exhaustive test run)
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> - Wes
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>>> >> >> wesmckinn@gmail.com
>>> >> >> >>> >
>>> >> >> >>> >> >>> wrote:
>>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>>> >> great
>>> >> >> >>> >> example of
>>> >> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>>> >> >> >>> codebase.
>>> >> >> >>> >> That
>>> >> >> >>> >> >>> gives me hope that the projects could be managed separately
>>> >> some
>>> >> >> >>> day.
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
>>> >> C++
>>> >> >> >>> codebase
>>> >> >> >>> >> >>> > features several areas of duplicated logic which could be
>>> >> >> >>> replaced by
>>> >> >> >>> >> >>> > components from the Arrow platform for better platform-wide
>>> >> >> >>> >> >>> > interoperability:
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>>> >> >> cause of
>>> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>>> >> >> them
>>> >> >> >>> from
>>> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
>>> >> is
>>> >> >> only
>>> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > I question whether it's worth the community's time long
>>> >> term
>>> >> >> to
>>> >> >> >>> wear
>>> >> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
>>> >> in
>>> >> >> each
>>> >> >> >>> >> >>> > library to plug components together rather than utilizing
>>> >> >> common
>>> >> >> >>> >> >>> > platform APIs.
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > - Wes
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>>> >> >> >>> >> joshuastorck@gmail.com>
>>> >> >> >>> >> >>> wrote:
>>> >> >> >>> >> >>> >> Your point about the constraints of the ASF release
>>> >> >> >>> >> >>> >> process is well taken, and as a developer trying to work in the
>>> >> current
>>> >> >> >>> >> >>> environment I
>>> >> >> >>> >> >>> >> would be much happier if the codebases were merged. The
>>> >> main
>>> >> >> >>> issues
>>> >> >> >>> >> I
>>> >> >> >>> >> >>> worry
>>> >> >> >>> >> >>> >> about when you put codebases like these together are:
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code
>>> >> >> >>> >> >>> >> becomes too coupled
>>> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>>> >> >> tree are
>>> >> >> >>> >> >>> delayed
>>> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> If the project/release management is structured well and
>>> >> >> someone
>>> >> >> >>> >> keeps
>>> >> >> >>> >> >>> an
>>> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>>> >> great
>>> >> >> >>> >> example of
>>> >> >> >>> >> >>> how
>>> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>>> >> >> >>> codebase.
>>> >> >> >>> >> That
>>> >> >> >>> >> >>> >> gives me hope that the projects could be managed
>>> >> separately
>>> >> >> some
>>> >> >> >>> >> day.
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>>> >> >> >>> wesmckinn@gmail.com>
>>> >> >> >>> >> >>> wrote:
>>> >> >> >>> >> >>> >>
>>> >> >> >>> >> >>> >>> hi Josh,
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>>> >> >> arrow
>>> >> >> >>> and
>>> >> >> >>> >> >>> tying
>>> >> >> >>> >> >>> >>> them together seems like the wrong choice.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>>> >> >> people
>>> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
>>> >> >> agree
>>> >> >> >>> >> with?)
>>> >> >> >>> >> >>> >>> is that we should work more closely together until the
>>> >> >> community
>>> >> >> >>> >> grows
>>> >> >> >>> >> >>> >>> large enough to support larger-scope process than we have
>>> >> >> now.
>>> >> >> >>> As
>>> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
>>> >> these
>>> >> >> >>> >> projects.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
>>> >> own
>>> >> >> >>> >> codebase.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
>>> >> into
>>> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination
>>> >> of
>>> >> >> the
>>> >> >> >>> >> GitHub
>>> >> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>>> >> >> >>> idealistic,
>>> >> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
>>> >> devise
>>> >> >> a
>>> >> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>>> >> >> per
>>> >> >> >>> day
>>> >> >> >>> >> >>> >>> which may touch both code and build system simultaneously
>>> >> >> >>> without
>>> >> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
>>> >> see
>>> >> >> how
>>> >> >> >>> we
>>> >> >> >>> >> can
>>> >> >> >>> >> >>> >>> move forward.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>>> >> >> codebases
>>> >> >> >>> >> in the
>>> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
>>> >> >> the
>>> >> >> >>> near
>>> >> >> >>> >> >>> term.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
>>> >> to
>>> >> >> be
>>> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
>>> >> and
>>> >> >> >>> >> community
>>> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
>>> >> the
>>> >> >> >>> >> current
>>> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
>>> >> consider
>>> >> >> >>> >> >>> >>> development process and ASF releases separately. My
>>> >> >> argument is
>>> >> >> >>> as
>>> >> >> >>> >> >>> >>> follows:
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
>>> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
>>> >> PMCs
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> - Wes
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>>> >> >> >>> >> joshuastorck@gmail.com
>>> >> >> >>> >> >>> >
>>> >> >> >>> >> >>> >>> wrote:
>>> >> >> >>> >> >>> >>> > I recently worked on an issue that had to be
>>> >> implemented
>>> >> >> in
>>> >> >> >>> >> >>> parquet-cpp
>>> >> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>>> >> >> >>> >> (ARROW-2585,
>>> >> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
>>> >> confusing
>>> >> >> and
>>> >> >> >>> >> hard to
>>> >> >> >>> >> >>> work
>>> >> >> >>> >> >>> >>> > with. For example, I still have a PR open in
>>> >> parquet-cpp
>>> >> >> >>> >> (created on
>>> >> >> >>> >> >>> May
>>> >> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
>>> >> was
>>> >> >> >>> >> recently
>>> >> >> >>> >> >>> >>> merged.
>>> >> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>>> >> >> the
>>> >> >> >>> >> change in
>>> >> >> >>> >> >>> >>> arrow
>>> >> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>>> >> >> >>> >> >>> >>> run_clang_format.py
>>> >> >> >>> >> >>> >>> > script in the arrow project only to find out later that
>>> >> >> there
>>> >> >> >>> >> was an
>>> >> >> >>> >> >>> >>> exact
>>> >> >> >>> >> >>> >>> > copy of it in parquet-cpp.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
>>> >> sense
>>> >> >> in
>>> >> >> >>> the
>>> >> >> >>> >> long
>>> >> >> >>> >> >>> >>> term.
>>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>>> >> >> arrow
>>> >> >> >>> and
>>> >> >> >>> >> >>> tying
>>> >> >> >>> >> >>> >>> them
>>> >> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
>>> >> other
>>> >> >> >>> formats
>>> >> >> >>> >> >>> that
>>> >> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
>>> >> (e.g. -
>>> >> >> >>> Orc),
>>> >> >> >>> >> so I
>>> >> >> >>> >> >>> >>> don't
>>> >> >> >>> >> >>> >>> > see why parquet should be special. I also think build
>>> >> >> tooling
>>> >> >> >>> >> should
>>> >> >> >>> >> >>> be
>>> >> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
>>> >> history
>>> >> >> of
>>> >> >> >>> >> >>> developing
>>> >> >> >>> >> >>> >>> open
>>> >> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>>> >> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
>>> >> CI
>>> >> >> is a
>>> >> >> >>> >> good
>>> >> >> >>> >> >>> >>> > counter-example since there have been lots of
>>> >> successful
>>> >> >> open
>>> >> >> >>> >> source
>>> >> >> >>> >> >>> >>> > projects that have used nightly build systems that
>>> >> pinned
>>> >> >> >>> >> versions of
>>> >> >> >>> >> >>> >>> > dependent software.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>>> >> >> codebases
>>> >> >> >>> >> in the
>>> >> >> >>> >> >>> >>> short
>>> >> >> >>> >> >>> >>> > term with the express purpose of separating them in the
>>> >> >> near
>>> >> >> >>> >> term.
>>> >> >> >>> >> >>> My
>>> >> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>>> >> >> together,
>>> >> >> >>> you
>>> >> >> >>> >> can
>>> >> >> >>> >> >>> more
>>> >> >> >>> >> >>> >>> > easily delineate the boundaries between the APIs with a
>>> >> >> >>> >> >>> >>> > single PR.
>>> >> >> >>> >> >>> >>> Second,
>>> >> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
>>> >> >> >>> diverge,
>>> >> >> >>> >> >>> which has
>>> >> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>>> >> >> been
>>> >> >> >>> >> sorted
>>> >> >> >>> >> >>> out,
>>> >> >> >>> >> >>> >>> it
>>> >> >> >>> >> >>> >>> > should be easy to separate them back into their own
>>> >> >> codebases.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>>> >> >> >>> codebases
>>> >> >> >>> >> for
>>> >> >> >>> >> >>> arrow
>>> >> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
>>> >> the
>>> >> >> >>> >> >>> perspective of
>>> >> >> >>> >> >>> >>> a
>>> >> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
>>> >> is a
>>> >> >> >>> large
>>> >> >> >>> >> tax
>>> >> >> >>> >> >>> to
>>> >> >> >>> >> >>> >>> pay
>>> >> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRAs
>>> >> >> in the
>>> >> >> >>> >> 0.10.0
>>> >> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
>>> >> >> release. I
>>> >> >> >>> >> hope
>>> >> >> >>> >> >>> that
>>> >> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
>>> >> >> help
>>> >> >> >>> >> reduce
>>> >> >> >>> >> >>> the
>>> >> >> >>> >> >>> >>> > complexity of the build/release tooling.
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>>> >> >> >>> >> ted.dunning@gmail.com>
>>> >> >> >>> >> >>> >>> wrote:
>>> >> >> >>> >> >>> >>> >
>>> >> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>>> >> >> >>> >> wesmckinn@gmail.com>
>>> >> >> >>> >> >>> >>> wrote:
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> >
>>> >> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
>>> >> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
>>> >> for
>>> >> >> >>> >> stability
>>> >> >> >>> >> >>> and
>>> >> >> >>> >> >>> >>> API
>>> >> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
>>> >> >> HDFS
>>> >> >> >>> >> >>> community
>>> >> >> >>> >> >>> >>> took
>>> >> >> >>> >> >>> >>> >> a
>>> >> >> >>> >> >>> >>> >> > > significantly long time for the very same reason.
>>> >> >> >>> >> >>> >>> >> >
>>> >> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
>>> >> >> source
>>> >> >> >>> >> >>> community as
>>> >> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
>>> >> >> didn't
>>> >> >> >>> go
>>> >> >> >>> >> the
>>> >> >> >>> >> >>> way
>>> >> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>>> >> >> >>> community
>>> >> >> >>> >> which
>>> >> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
>>> >> >> model.
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> There are some more radical and community-building
>>> >> >> options as
>>> >> >> >>> >> well.
>>> >> >> >>> >> >>> Take
>>> >> >> >>> >> >>> >>> >> the subversion project as a precedent. With
>>> >> subversion,
>>> >> >> any
>>> >> >> >>> >> Apache
>>> >> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
>>> >> >> large
>>> >> >> >>> >> >>> fraction of
>>> >> >> >>> >> >>> >>> >> subversion.
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> So why not take this a bit further and give every
>>> >> parquet
>>> >> >> >>> >> committer
>>> >> >> >>> >> >>> a
>>> >> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>>> >> >> >>> >> committers in
>>> >> >> >>> >> >>> >>> Arrow?
>>> >> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
>>> >> >> committer who
>>> >> >> >>> >> asks
>>> >> >> >>> >> >>> will
>>> >> >> >>> >> >>> >>> be
>>> >> >> >>> >> >>> >>> >> given committer status in Arrow.
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
>>> >> Parquet
>>> >> >> >>> >> committers
>>> >> >> >>> >> >>> >>> can't be
>>> >> >> >>> >> >>> >>> >> worried at that point whether their patches will get
>>> >> >> merged;
>>> >> >> >>> >> they
>>> >> >> >>> >> >>> can
>>> >> >> >>> >> >>> >>> just
>>> >> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
>>> >> >> in the
>>> >> >> >>> >> >>> Parquet
>>> >> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>>> >> >> >>> parquet so
>>> >> >> >>> >> >>> why not
>>> >> >> >>> >> >>> >>> >> invite them in?
>>> >> >> >>> >> >>> >>> >>
>>> >> >> >>> >> >>> >>>
>>> >> >> >>> >> >>>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> --
>>> >> >> >>> >> >> regards,
>>> >> >> >>> >> >> Deepak Majeti
>>> >> >> >>> >>
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> > --
>>> >> >> >>> > regards,
>>> >> >> >>> > Deepak Majeti
>>> >> >> >>>
>>> >> >>
>>> >> >
>>> >> >
>>> >> > --
>>> >> > regards,
>>> >> > Deepak Majeti
>>> >>
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Uwe,

I agree with your points. Currently we have 3 software artifacts:

1. Arrow C++ libraries
2. Parquet C++ libraries with Arrow columnar integration
3. C++ interop layer for Python + Cython bindings

Changes in #1 prompt an awkward workflow involving multiple PRs; as a
result, we just recently jumped the pinned version of Arrow in
parquet-cpp forward by 8 months. This obviously is an antipattern. If
we had a much larger group of core developers, this might be more
maintainable.
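To make the pinning concrete, here is a rough sketch of what that kind of ExternalProject-based pin looks like. This is illustrative only: the target name, tag, and options are hypothetical, not the actual parquet-cpp build files.

```cmake
# Illustrative sketch only -- not the real parquet-cpp CMake code.
# parquet-cpp builds Arrow via ExternalProject against a pinned revision.
include(ExternalProject)

# The pinned Arrow snapshot; bumping it across 8 months of changes is
# what forces the multi-PR workflow described above.
set(ARROW_GIT_TAG "apache-arrow-0.10.0")  # hypothetical tag

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY "https://github.com/apache/arrow.git"
  GIT_TAG ${ARROW_GIT_TAG}
  SOURCE_SUBDIR cpp
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
             -DARROW_BUILD_TESTS=OFF)
```

Note that because Arrow is built as an external project, its CMake targets are invisible to the outer build system, which is exactly the "passing dependent targets across different CMake build systems" problem discussed earlier in the thread.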

Of course changes in #2 also impact #3; a lot of our bug reports and
feature requests are coming inbound because of #3, and we have
struggled to be able to respond to the needs of users (and other
developers like Robert Gruener who are trying to use this software in
a large data warehouse).

There is also the release coordination issue: users simultaneously
using released versions of both projects hasn't really happened, so
we have effectively already been treating Parquet like a vendored
component in our packaging process.

Realistically, I think once #2 has become more functionally complete
and, as a result, a more slowly moving piece of software, we can
contemplate splitting all or parts of its development process back
out into another repository. I think we have a ton of work to do yet
on Parquet core, particularly optimizing for high-latency storage
(HDFS, S3, GCP, etc.), and it wouldn't really make sense to do such
platform-level work anywhere but #1.

- Wes

On Sun, Aug 19, 2018 at 8:37 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Back from vacation, I also want to finally raise my voice.
>
> With the current state of the Parquet<->Arrow development, I see a benefit in merging the code bases for now, but not necessarily forever.
>
> Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter is built, and it uses some of the more standard-library-like features of Arrow. It is the go-to place where the same toolchain and CI setup are used, and where we directly apply all improvements that we make in Arrow itself. These are the points that make it special in comparison to other tools providing Arrow adapters, like Turbodbc.
>
> Thus, I think that the current move to merge the code bases is ok for me. I must say that I'm not 100% certain that this is the best move but currently I lack better alternatives. As previously mentioned, we should take extra care that we can still do separate releases and also provide a path for a future where we split parquet-cpp into its own project/repository again.
>
> An important point that we should keep in mind (and why I was a bit concerned the previous times this discussion was raised) is that we have to be careful not to pull everything that touches Arrow into the Arrow repository. Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term. I expect that there will be many more projects that will use Arrow's I/O libraries as well as emit Arrow structures. These libraries should also be usable in Python/C++/Ruby/R/… These libraries are then hopefully not all developed by the same core group of Arrow/Parquet developers we have currently. For this to function really well, we will need a more stable API in Arrow as well as a good set of build tooling that other libraries can build upon when using Arrow functionality. In addition to being stable, the API must also provide a good UX in the abstraction layers in which the Arrow functions are provided, so that high-performance applications are not high-maintenance due to frequent API changes in Arrow. That said, this is currently a wish for the future. We are currently building and iterating heavily on these APIs to form a good basis for future developments. Thus the repo merge will hopefully improve development speed so that we have to spend less time on toolchain maintenance and can focus on the user-facing APIs.
>
> Uwe
>
> On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
>> Thanks Ryan, will do. The people I'd still like to hear from are:
>>
>> * Phillip Cloud
>> * Uwe Korn
>>
>> As ASF contributors, we are responsible both for being pragmatic and
>> for acting in the best interests of the community's health and
>> productivity.
>>
>>
>>
>> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>> > I don't have an opinion here, but could someone send a summary of what is
>> > decided to the dev list once there is consensus? This is a long thread for
>> > parts of the project I don't work on, so I haven't followed it very closely.
>> >
>> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
>> >> Can we enforce that parquet-cpp changes will not be committed without a
>> >> corresponding Parquet JIRA?
>> >>
>> >> I think we would use the following policy:
>> >>
>> >> * use PARQUET-XXX for issues relating to Parquet core
>> >> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
>> >> core (e.g. changes that are in parquet/arrow right now)
>> >>
>> >> We've already been dealing with annoyances relating to issues
>> >> straddling the two projects (debugging an issue on Arrow side to find
>> >> that it has to be fixed on Parquet side); this would make things
>> >> simpler for us.
>> >>
>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>> >> simplify forking later (if needed) and be able to maintain the commit
>> >> history.  I don't know if it's possible to squash parquet-cpp commits and
>> >> arrow commits separately before merging.
>> >>
>> >> This seems rather onerous for both contributors and maintainers and
>> >> not in line with the goal of improving productivity. In the event that
>> >> we fork I see it as a traumatic event for the community. If it does
>> >> happen, then we can write a script (using git filter-branch and other
>> >> such tools) to extract commits related to the forked code.
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
>> >> wrote:
>> >> > I have a few more logistical questions to add.
>> >> >
>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>> >> > Arrow changes. Will we establish some guidelines for filing Parquet
>> >> JIRAs?
>> >> > Can we enforce that parquet-cpp changes will not be committed without a
>> >> > corresponding Parquet JIRA?
>> >> >
>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>> >> > simplify forking later (if needed) and be able to maintain the commit
>> >> > history.  I don't know if it's possible to squash parquet-cpp commits and
>> >> > arrow commits separately before merging.
>> >> >
>> >> >
>> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>> >> >
>> >> >> Do other people have opinions? I would like to undertake this work in
>> >> >> the near future (the next 8-10 weeks); I would be OK with taking
>> >> >> responsibility for the primary codebase surgery.
>> >> >>
>> >> >> Some logistical questions:
>> >> >>
>> >> >> * We have a handful of pull requests in flight in parquet-cpp that
>> >> >> would need to be resolved / merged
>> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
>> >> >> releases cut out of the new structure
>> >> >> * Management of shared commit rights (I can discuss with the Arrow
>> >> >> PMC; I believe that approving any committer who has actively
>> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
>> >> >> comments)
>> >> >>
>> >> >> If working more closely together does not work out after
>> >> >> some period of time, I will be fully supportive of a fork or something
>> >> >> like it.
>> >> >>
>> >> >> Thanks,
>> >> >> Wes
>> >> >>
>> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >> > Thanks Tim.
>> >> >> >
>> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
>> >> >> > platform code intending to improve the performance of bit-packing in
>> >> >> > Parquet writes, and we ended up with 2 interdependent PRs
>> >> >> >
>> >> >> > * https://github.com/apache/parquet-cpp/pull/483
>> >> >> > * https://github.com/apache/arrow/pull/2355
>> >> >> >
>> >> >> > Changes that impact the Python interface to Parquet are even more
>> >> >> complex.
>> >> >> >
>> >> >> > Adding options to Arrow's CMake build system to only build
>> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
>> >> >> > not be difficult, and amount to writing "make parquet".
>> >> >> >
>> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
>> >> to
>> >> >> > build and install the Parquet core libraries and their dependencies
>> >> >> > would be:
>> >> >> >
>> >> >> > ninja parquet && ninja install
>> >> >> >
>> >> >> > - Wes
>> >> >> >
>> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>> >> >> > <ta...@cloudera.com.invalid> wrote:
>> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
>> >> >> >> successful, but I thought I'd give my two cents.
>> >> >> >>
>> >> >> >> For me, the thing that makes the biggest difference in contributing
>> >> to a
>> >> >> >> new codebase is the number of steps in the workflow for writing,
>> >> >> testing,
>> >> >> >> posting and iterating on a commit and also the number of
>> >> opportunities
>> >> >> for
>> >> >> >> missteps. The size of the repo and build/test times matter but are
>> >> >> >> secondary so long as the workflow is simple and reliable.
>> >> >> >>
>> >> >> >> I don't really know what the current state of things is, but it
>> >> sounds
>> >> >> like
>> >> >> >> it's not as simple as check out -> build -> test if you're doing a
>> >> >> >> cross-repo change. Circular dependencies are a real headache.
>> >> >> >>
>> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>> >> >> wrote:
>> >> >> >>
>> >> >> >>> hi,
>> >> >> >>>
>> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
>> >> >> majeti.deepak@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> > I think the circular dependency can be broken if we build a new
>> >> >> library
>> >> >> >>> for
>> >> >> >>> > the platform code. This will also make it easy for other projects
>> >> >> such as
>> >> >> >>> > ORC to use it.
>> >> >> >>> > I also remember your proposal a while ago of having a separate
>> >> >> project
>> >> >> >>> for
>> >> >> >>> > the platform code.  That project can live in the arrow repo.
>> >> >> However, one
>> >> >> >>> > has to clone the entire apache arrow repo but can just build the
>> >> >> platform
>> >> >> >>> > code. This will be temporary until we can find a new home for it.
>> >> >> >>> >
>> >> >> >>> > The dependency will look like:
>> >> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> >> >> >>> > libplatform(platform api)
>> >> >> >>> >
>> >> >> >>> > CI workflow will clone the arrow project twice, once for the
>> >> platform
>> >> >> >>> > library and once for the arrow-core/bindings library.
>> >> >> >>>
>> >> >> >>> This seems like an interesting proposal; the best place to work
>> >> toward
>> >> >> >>> this goal (if it is even possible; the build system interactions and
>> >> >> >>> ASF release management are the hard problems) is to have all of the
>> >> >> >>> code in a single repository. ORC could already be using Arrow if it
>> >> >> >>> wanted, but the ORC contributors aren't active in Arrow.
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> > There is no doubt that the collaborations between the Arrow and
>> >> >> Parquet
>> >> >> >>> > communities so far have been very successful.
>> >> >> >>> > The reason to maintain this relationship moving forward is to
>> >> >> continue to
>> >> >> >>> > reap the mutual benefits.
>> >> >> >>> > We should continue to take advantage of sharing code as well.
>> >> >> However, I
>> >> >> >>> > don't see any code sharing opportunities between arrow-core and
>> >> the
>> >> >> >>> > parquet-core. Both have different functions.
>> >> >> >>>
>> >> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
>> >> format
>> >> >> >>> is only one part of a project that has become quite large already
>> >> >> >>> (
>> >> >> >>> https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> > We are at a point where the parquet-cpp public API is pretty
>> >> stable.
>> >> >> We
>> >> >> >>> > already passed that difficult stage. My take at arrow and parquet
>> >> is
>> >> >> to
>> >> >> >>> > keep them nimble since we can.
>> >> >> >>>
>> >> >> >>> I believe that parquet-core still has significant progress ahead of it. We
>> >> >> >>> have done little work in asynchronous IO and concurrency which would
>> >> >> >>> yield both improved read and write throughput. This aligns well with
>> >> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
>> >> >> >>> believe that more development will happen on parquet-core once the
>> >> >> >>> development process issues are resolved by having a single codebase,
>> >> >> >>> single build system, and a single CI framework.
>> >> >> >>>
>> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
>> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>> >> >> >>> goal I think we should still be open to making significant changes
>> >> in
>> >> >> >>> the interest of long term progress.
>> >> >> >>>
>> >> >> >>> Having now worked on these projects for more than 2 and a half years
>> >> >> >>> and been the most frequent contributor to both codebases, I'm sadly far
>> >> >> >>> past the "breaking point" and not willing to continue contributing
>> >> in
>> >> >> >>> a significant way to parquet-cpp if the projects remained structured
>> >> >> >>> as they are now. It's hampering progress and not serving the
>> >> >> >>> community.
>> >> >> >>>
>> >> >> >>> - Wes
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
>> >> >
>> >> >> >>> wrote:
>> >> >> >>> >
>> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
>> >> >> arrow
>> >> >> >>> >> repo. That will remove a majority of the dependency issues.
>> >> Joshua's
>> >> >> >>> work
>> >> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>> >> >> the
>> >> >> >>> arrow
>> >> >> >>> >> repo.  This will be similar to the ORC adaptor.
>> >> >> >>> >>
>> >> >> >>> >> This has been suggested before, but I don't see how it would
>> >> >> alleviate
>> >> >> >>> >> any issues because of the significant dependencies on other
>> >> parts of
>> >> >> >>> >> the Arrow codebase. What you are proposing is:
>> >> >> >>> >>
>> >> >> >>> >> - (Arrow) arrow platform
>> >> >> >>> >> - (Parquet) parquet core
>> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> >> >>> >> - (Arrow) Python bindings
>> >> >> >>> >>
>> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> >> >>> >> built before invoking the Parquet core part of the build system.
>> >> You
>> >> >> >>> >> would need to pass dependent targets across different CMake build
>> >> >> >>> >> systems; I don't know if it's possible (I spent some time looking
>> >> >> into
>> >> >> >>> >> it earlier this year). This is what I meant by the lack of a
>> >> >> "concrete
>> >> >> >>> >> and actionable plan". The only thing that would really work
>> >> would be
>> >> >> >>> >> for the Parquet core to be "included" in the Arrow build system
>> >> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
>> >> builds
>> >> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>> >> >> build
>> >> >> >>> >> system because it's only depended upon by the Python bindings.
>> >> >> >>> >>
>> >> >> >>> >> And even if a solution could be devised, it would not wholly
>> >> resolve
>> >> >> >>> >> the CI workflow issues.
>> >> >> >>> >>
>> >> >> >>> >> You could make Parquet completely independent of the Arrow
>> >> codebase,
>> >> >> >>> >> but at that point there is little reason to maintain a
>> >> relationship
>> >> >> >>> >> between the projects or their communities. We have spent a great
>> >> >> deal
>> >> >> >>> >> of effort refactoring the two projects to enable as much code
>> >> >> sharing
>> >> >> >>> >> as there is now.
>> >> >> >>> >>
>> >> >> >>> >> - Wes
>> >> >> >>> >>
>> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>> >> clone
>> >> >> the
>> >> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> >> Having
>> >> >> two
>> >> >> >>> >> parquet-cpp repos is no way a better approach.
>> >> >> >>> >> >
>> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
>> >> is
>> >> >> to
>> >> >> >>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >> >>> >> >
>> >> >> >>> >> > It doesn't look like I will be able to convince you that a
>> >> >> monorepo is
>> >> >> >>> >> > a good idea; what I would ask instead is that you be willing to
>> >> >> give
>> >> >> >>> >> > it a shot, and if it turns out in the way you're describing
>> >> >> (which I
>> >> >> >>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >> >>> >> >
>> >> >> >>> >> > - Wes
>> >> >> >>> >> >
>> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> >> >> >>> majeti.deepak@gmail.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >> >> Wes,
>> >> >> >>> >> >>
>> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>> >> >> problems
>> >> >> >>> of a
>> >> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >> >>> >> >> Bringing in related Apache community experiences are more
>> >> >> meaningful
>> >> >> >>> >> than
>> >> >> >>> >> >> how mono-repos work at Google and other big organizations.
>> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
>> >> >> developers.
>> >> >> >>> >> >> You are very well aware of how difficult it has been to find
>> >> more
>> >> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
>> >> has
>> >> >> a low
>> >> >> >>> >> >> contribution rate to its core components.
>> >> >> >>> >> >>
>> >> >> >>> >> >> We should target to ensure that new volunteers who want to
>> >> >> contribute
>> >> >> >>> >> >> bug-fixes/features should spend the least amount of time in
>> >> >> figuring
>> >> >> >>> out
>> >> >> >>> >> >> the project repo. We can never come up with an automated build
>> >> >> system
>> >> >> >>> >> that
>> >> >> >>> >> >> caters to every possible environment.
>> >> >> >>> >> >> My only concern is if the mono-repo will make it harder for
>> >> new
>> >> >> >>> >> developers
>> >> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
>> >> >> build
>> >> >> >>> and
>> >> >> >>> >> test
>> >> >> >>> >> >> dependencies.
>> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
>> >> less
>> >> >> >>> >> >> co-operative.
>> >> >> >>> >> >> I just don't think the mono-repo structure model will be
>> >> >> sustainable
>> >> >> >>> in
>> >> >> >>> >> an
>> >> >> >>> >> >> open source community unless there are long-term vested
>> >> >> interests. We
>> >> >> >>> >> can't
>> >> >> >>> >> >> predict that.
>> >> >> >>> >> >>
>> >> >> >>> >> >> The current circular dependency problems between Arrow and
>> >> >> Parquet
>> >> >> >>> is a
>> >> >> >>> >> >> major problem for the community and it is important.
>> >> >> >>> >> >>
>> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
>> >> >> arrow
>> >> >> >>> >> repo.
>> >> >> >>> >> >> That will remove a majority of the dependency issues.
>> >> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
>> >> that
>> >> >> >>> adapter
>> >> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
>> >> adaptor.
>> >> >> >>> >> >>
>> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
>> >> >> changes
>> >> >> >>> in
>> >> >> >>> >> the
>> >> >> >>> >> >> future to this code should not be the main reason to combine
>> >> the
>> >> >> >>> arrow
>> >> >> >>> >> >> parquet repos.
>> >> >> >>> >> >>
>> >> >> >>> >> >> "
>> >> >> >>> >> >> *I question whether it's worth the community's time long term
>> >> to
>> >> >> >>> wear*
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
>> >> >> >>> >> eachlibrary
>> >> >> >>> >> >> to plug components together rather than utilizing
>> >> commonplatform
>> >> >> >>> APIs.*"
>> >> >> >>> >> >>
>> >> >> >>> >> >> My answer to your question below would be "Yes".
>> >> >> >>> Modularity/separation
>> >> >> >>> >> is
>> >> >> >>> >> >> very important in an open source community where priorities of
>> >> >> >>> >> contributors
>> >> >> >>> >> >> are often short term.
>> >> >> >>> >> >> The retention is low and therefore the acquisition costs
>> >> should
>> >> >> be
>> >> >> >>> low
>> >> >> >>> >> as
>> >> >> >>> >> >> well. This is the community over code approach according to
>> >> me.
>> >> >> Minor
>> >> >> >>> >> code
>> >> >> >>> >> >> duplication is not a deal breaker.
>> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
>> >> big
>> >> >> >>> data
>> >> >> >>> >> >> space serving their own functions.
>> >> >> >>> >> >>
>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>> >> clone
>> >> >> the
>> >> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> >> >> Having
>> >> >> >>> two
>> >> >> >>> >> >> parquet-cpp repos is no way a better approach.
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>> >> >> wesmckinn@gmail.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >> >>
>> >> >> >>> >> >>> @Antoine
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
>> >> would
>> >> >> >>> slightly
>> >> >> >>> >> >>> increase Arrow CI times (which are already too large).
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> Parquet run takes about 28 minutes:
>> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
>> >> >> certain
>> >> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>> >> >> could be
>> >> >> >>> >> >>> made substantially shorter by moving some of the slower parts
>> >> >> (like
>> >> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>> >> >> nightly
>> >> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>> >> >> also
>> >> >> >>> >> >>> improve build times (valgrind build could be moved to a
>> >> nightly
>> >> >> >>> >> >>> exhaustive test run)
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> - Wes
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>> >> >> wesmckinn@gmail.com
>> >> >> >>> >
>> >> >> >>> >> >>> wrote:
>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> >> great
>> >> >> >>> >> example of
>> >> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> >> >> >>> codebase.
>> >> >> >>> >> That
>> >> >> >>> >> >>> gives me hope that the projects could be managed separately
>> >> some
>> >> >> >>> day.
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
>> >> C++
>> >> >> >>> codebase
>> >> >> >>> >> >>> > features several areas of duplicated logic which could be
>> >> >> >>> replaced by
>> >> >> >>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >> >>> >> >>> > interoperability:
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>>
>> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >> >>> >> >>> >
>> >> >> >>> >>
>> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>>
>> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >> >>> >> >>> >
>> >> >> >>> >>
>> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >> >>> >> >>> >
>> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>> >> >> cause of
>> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>> >> >> them
>> >> >> >>> from
>> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
>> >> is
>> >> >> only
>> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > I question whether it's worth the community's time long
>> >> term
>> >> >> to
>> >> >> >>> wear
>> >> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
>> >> in
>> >> >> each
>> >> >> >>> >> >>> > library to plug components together rather than utilizing
>> >> >> common
>> >> >> >>> >> >>> > platform APIs.
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > - Wes
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> >> >>> >> joshuastorck@gmail.com>
>> >> >> >>> >> >>> wrote:
>> >> >> >>> >> >>> >> You're point about the constraints of the ASF release
>> >> >> process are
>> >> >> >>> >> well
>> >> >> >>> >> >>> >> taken and as a developer who's trying to work in the
>> >> current
>> >> >> >>> >> >>> environment I
>> >> >> >>> >> >>> >> would be much happier if the codebases were merged. The
>> >> main
>> >> >> >>> issues
>> >> >> >>> >> I
>> >> >> >>> >> >>> worry
>> >> >> >>> >> >>> >> about when you put codebases like these together are:
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
>> >> >> becomes
>> >> >> >>> too
>> >> >> >>> >> >>> coupled
>> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>> >> >> tree are
>> >> >> >>> >> >>> delayed
>> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> If the project/release management is structured well and
>> >> >> someone
>> >> >> >>> >> keeps
>> >> >> >>> >> >>> an
>> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> >> great
>> >> >> >>> >> example of
>> >> >> >>> >> >>> how
>> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> >> >> >>> codebase.
>> >> >> >>> >> That
>> >> >> >>> >> >>> >> gives me hope that the projects could be managed
>> >> separately
>> >> >> some
>> >> >> >>> >> day.
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> >> >> >>> wesmckinn@gmail.com>
>> >> >> >>> >> >>> wrote:
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >>> hi Josh,
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> >> >> arrow
>> >> >> >>> and
>> >> >> >>> >> >>> tying
>> >> >> >>> >> >>> >>> them together seems like the wrong choice.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>> >> >> people
>> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
>> >> >> agree
>> >> >> >>> >> with?)
>> >> >> >>> >> >>> >>> is that we should work more closely together until the
>> >> >> community
>> >> >> >>> >> grows
>> >> >> >>> >> >>> >>> large enough to support larger-scope process than we have
>> >> >> now.
>> >> >> >>> As
>> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
>> >> these
>> >> >> >>> >> projects.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
>> >> own
>> >> >> >>> >> codebase.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
>> >> into
>> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination
>> >> of
>> >> >> the
>> >> >> >>> >> GitHub
>> >> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>> >> >> >>> idealistic,
>> >> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
>> >> devise
>> >> >> a
>> >> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>> >> >> per
>> >> >> >>> day
>> >> >> >>> >> >>> >>> which may touch both code and build system simultaneously
>> >> >> >>> without
>> >> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
>> >> see
>> >> >> how
>> >> >> >>> we
>> >> >> >>> >> can
>> >> >> >>> >> >>> >>> move forward.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> >> codebases
>> >> >> >>> >> in the
>> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
>> >> >> the
>> >> >> >>> near
>> >> >> >>> >> >>> term.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
>> >> to
>> >> >> be
>> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
>> >> and
>> >> >> >>> >> community
>> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
>> >> the
>> >> >> >>> >> current
>> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
>> >> consider
>> >> >> >>> >> >>> >>> development process and ASF releases separately. My
>> >> >> argument is
>> >> >> >>> as
>> >> >> >>> >> >>> >>> follows:
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
>> >> PMCs
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> - Wes
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >> >> >>> >> joshuastorck@gmail.com
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> >>> wrote:
>> >> >> >>> >> >>> >>> > I recently worked on an issue that had to be
>> >> implemented
>> >> >> in
>> >> >> >>> >> >>> parquet-cpp
>> >> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >> >> >>> >> (ARROW-2585,
>> >> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
>> >> confusing
>> >> >> and
>> >> >> >>> >> hard to
>> >> >> >>> >> >>> work
>> >> >> >>> >> >>> >>> > with. For example, I still have a PR open in
>> >> parquet-cpp
>> >> >> >>> >> (created on
>> >> >> >>> >> >>> May
>> >> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
>> >> was
>> >> >> >>> >> recently
>> >> >> >>> >> >>> >>> merged.
>> >> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>> >> >> the
>> >> >> >>> >> change in
>> >> >> >>> >> >>> >>> arrow
>> >> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >> >> >>> >> >>> >>> run_clang_format.py
>> >> >> >>> >> >>> >>> > script in the arrow project only to find out later that
>> >> >> there
>> >> >> >>> >> was an
>> >> >> >>> >> >>> >>> exact
>> >> >> >>> >> >>> >>> > copy of it in parquet-cpp.
>> >> >> >>> >> >>> >>> >
>> >> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
>> >> sense
>> >> >> in
>> >> >> >>> the
>> >> >> >>> >> long
>> >> >> >>> >> >>> >>> term.
>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> >> >> arrow
>> >> >> >>> and
>> >> >> >>> >> >>> tying
>> >> >> >>> >> >>> >>> them
>> >> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
>> >> other
>> >> >> >>> formats
>> >> >> >>> >> >>> that
>> >> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
>> >> (e.g. -
>> >> >> >>> Orc),
>> >> >> >>> >> so I
>> >> >> >>> >> >>> >>> don't
>> >> >> >>> >> >>> >>> > see why parquet should be special. I also think build
>> >> >> tooling
>> >> >> >>> >> should
>> >> >> >>> >> >>> be
>> >> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
>> >> history
>> >> >> of
>> >> >> >>> >> >>> developing
>> >> >> >>> >> >>> >>> open
>> >> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
>> >> CI
>> >> >> is a
>> >> >> >>> >> good
>> >> >> >>> >> >>> >>> > counter-example since there have been lots of
>> >> successful
>> >> >> open
>> >> >> >>> >> source
>> >> >> >>> >> >>> >>> > projects that have used nightly build systems that
>> >> pinned
>> >> >> >>> >> versions of
>> >> >> >>> >> >>> >>> > dependent software.
>> >> >> >>> >> >>> >>> >
>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> >> codebases
>> >> >> >>> >> in the
>> >> >> >>> >> >>> >>> short
>> >> >> >>> >> >>> >>> > term with the express purpose of separating them in the
>> >> >> near
>> >> >> >>> >> term.
>> >> >> >>> >> >>> My
>> >> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>> >> >> together,
>> >> >> >>> you
>> >> >> >>> >> can
>> >> >> >>> >> >>> more
>> >> >> >>> >> >>> >>> > easily delineate the boundaries between the API's with
>> >> a
>> >> >> >>> single
>> >> >> >>> >> PR.
>> >> >> >>> >> >>> >>> Second,
>> >> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
>> >> >> >>> diverge,
>> >> >> >>> >> >>> which has
>> >> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>> >> >> been
>> >> >> >>> >> sorted
>> >> >> >>> >> >>> out,
>> >> >> >>> >> >>> >>> it
>> >> >> >>> >> >>> >>> > should be easy to separate them back into their own
>> >> >> codebases.
>> >> >> >>> >> >>> >>> >
>> >> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> >> >> >>> codebases
>> >> >> >>> >> for
> arrow be separated from other languages. Looking at it from the
> perspective of a parquet-cpp library user, having a dependency on Java
> is a large tax to pay if you don't need it. For example, there were 25
> JIRAs in the 0.10.0 release of arrow, many of which were holding up the
> release. I hope that seems like a reasonable compromise, and I think it
> will help reduce the complexity of the build/release tooling.
>
> On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>>>> The community will be less willing to accept large changes that
>>>> require multiple rounds of patches for stability and API
>>>> convergence. Our contributions to Libhdfs++ in the HDFS community
>>>> took a significantly long time for the very same reason.
>>>
>>> Please don't use bad experiences from another open source community
>>> as leverage in this discussion. I'm sorry that things didn't go the
>>> way you wanted in Apache Hadoop but this is a distinct community
>>> which happens to operate under a similar open governance model.
>>
>> There are some more radical and community-building options as well.
>> Take the subversion project as a precedent. With subversion, any
>> Apache committer can request and receive a commit bit on some large
>> fraction of subversion.
>>
>> So why not take this a bit further and give every parquet committer a
>> commit bit in Arrow? Or even make them be first-class committers in
>> Arrow? Possibly even make it policy that every Parquet committer who
>> asks will be given committer status in Arrow.
>>
>> That relieves a lot of the social anxiety here. Parquet committers
>> can't be worried at that point whether their patches will get merged;
>> they can just merge them. Arrow shouldn't worry much about inviting
>> in the Parquet committers. After all, Arrow already depends a lot on
>> parquet so why not invite them in?
>
> --
> regards,
> Deepak Majeti
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Uwe,

I agree with your points. Currently we have 3 software artifacts:

1. Arrow C++ libraries
2. Parquet C++ libraries with Arrow columnar integration
3. C++ interop layer for Python + Cython bindings

Changes in #1 prompt an awkward workflow involving multiple PRs; as a
result of this, we just recently jumped 8 months from the pinned
version of Arrow in parquet-cpp. This is obviously an antipattern. If
we had a much larger group of core developers, this might be more
maintainable.

Of course, changes in #2 also impact #3; a lot of our bug reports and
feature requests are coming inbound because of #3, and we have
struggled to respond to the needs of users (and other developers like
Robert Gruener who are trying to use this software in a large data
warehouse).

There is also the release coordination issue where having users
simultaneously using a released version of both projects hasn't really
happened, so we effectively already have been treating Parquet like a
vendored component in our packaging process.

Realistically, I think that once #2 has become more functionally
complete, and as a result a more slowly moving piece of software, we
can contemplate splitting out all or parts of its development process
back into another repository. I think we have a ton of work to do yet
on Parquet core, particularly optimizing for high-latency storage
(HDFS, S3, GCP, etc.), and it wouldn't really make sense to do such
platform-level work anywhere but #1.

- Wes

On Sun, Aug 19, 2018 at 8:37 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Back from vacation, I also want to finally raise my voice.
>
> With the current state of the Parquet<->Arrow development, I see a benefit in merging the code base for now, but not necessarily forever.
>
> Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter is built and that uses some of the more standard-library-like features of Arrow. It is the go-to place where the same toolchain and CI setup is used, and where we directly apply all improvements that we make in Arrow itself. These are the points that make it special in comparison to other tools providing Arrow adapters, such as Turbodbc.
>
> Thus, I think that the current move to merge the code bases is ok for me. I must say that I'm not 100% certain that this is the best move but currently I lack better alternatives. As previously mentioned, we should take extra care that we can still do separate releases and also provide a path for a future where we split parquet-cpp into its own project/repository again.
>
> An important point that we should keep in mind (and why I was a bit concerned the previous times this discussion was raised) is that we have to be careful not to pull everything that touches Arrow into the Arrow repository. Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term. I expect that there will be many more projects that will use Arrow's I/O libraries as well as emit Arrow structures. These libraries should also be usable in Python/C++/Ruby/R/… and will then hopefully not all be developed by the same core group of Arrow/Parquet developers we have currently. For this to function really well, we will need a more stable API in Arrow as well as a good set of build tooling that other libraries can build upon when using Arrow functionality. In addition to being stable, the API must also provide a good UX in its abstraction layers so that high-performance applications are not high-maintenance due to frequent API changes in Arrow. That said, this is currently a wish for the future. We are currently building and iterating heavily on these APIs to form a good basis for future developments. Thus the repo merge will hopefully improve development speed so that we spend less time on toolchain maintenance and can focus on the user-facing APIs.
>
> Uwe
>
> On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
>> Thanks Ryan, will do. The people I'd still like to hear from are:
>>
>> * Phillip Cloud
>> * Uwe Korn
>>
>> As ASF contributors, we have a responsibility both to be pragmatic and
>> to act in the best interests of the community's health and productivity.
>>
>>
>>
>> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>> > I don't have an opinion here, but could someone send a summary of what is
>> > decided to the dev list once there is consensus? This is a long thread for
>> > parts of the project I don't work on, so I haven't followed it very closely.
>> >
>> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
>> >> Can we enforce that parquet-cpp changes will not be committed without a
>> >> corresponding Parquet JIRA?
>> >>
>> >> I think we would use the following policy:
>> >>
>> >> * use PARQUET-XXX for issues relating to Parquet core
>> >> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
>> >> core (e.g. changes that are in parquet/arrow right now)
>> >>
>> >> We've already been dealing with annoyances relating to issues
>> >> straddling the two projects (debugging an issue on Arrow side to find
>> >> that it has to be fixed on Parquet side); this would make things
>> >> simpler for us
>> >>
>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>> >> simplify forking later (if needed) and be able to maintain the commit
>> >> history.  I don't know if it's possible to squash parquet-cpp commits and
>> >> arrow commits separately before merging.
>> >>
>> >> This seems rather onerous for both contributors and maintainers and
>> >> not in line with the goal of improving productivity. In the event that
>> >> we fork I see it as a traumatic event for the community. If it does
>> >> happen, then we can write a script (using git filter-branch and other
>> >> such tools) to extract commits related to the forked code.
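A minimal sketch of what such an extraction script could look like, using git filter-branch on a throwaway demo repository (the `demo` path, directory names, and commit messages below are purely illustrative; git-filter-repo would be the safer modern equivalent):

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1  # silence newer git's warning banner

# Build a toy repo containing two components, then split out "parquet/".
rm -rf demo
git init -q demo
git -C demo config user.email dev@example.org
git -C demo config user.name dev
mkdir -p demo/arrow demo/parquet
echo a > demo/arrow/a.cc
git -C demo add -A && git -C demo commit -qm "ARROW-1: arrow change"
echo p > demo/parquet/p.cc
git -C demo add -A && git -C demo commit -qm "PARQUET-1: parquet change"

# Rewrite history so only content under parquet/ remains, re-rooted at the
# top level; commits that no longer touch anything are pruned.
git -C demo filter-branch -f --prune-empty --subdirectory-filter parquet HEAD
git -C demo log --oneline
```

The point is only that commit history for forked code is mechanically recoverable after the fact, so keeping parquet-cpp commits separate up front is not required.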
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
>> >> wrote:
>> >> > I have a few more logistical questions to add.
>> >> >
>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>> >> > Arrow changes. Will we establish some guidelines for filing Parquet
>> >> JIRAs?
>> >> > Can we enforce that parquet-cpp changes will not be committed without a
>> >> > corresponding Parquet JIRA?
>> >> >
>> >> > I would also like to keep changes to parquet-cpp on a separate commit to
>> >> > simplify forking later (if needed) and be able to maintain the commit
>> >> > history.  I don't know if it's possible to squash parquet-cpp commits and
>> >> > arrow commits separately before merging.
>> >> >
>> >> >
>> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>> >> >
>> >> >> Do other people have opinions? I would like to undertake this work in
>> >> >> the near future (the next 8-10 weeks); I would be OK with taking
>> >> >> responsibility for the primary codebase surgery.
>> >> >>
>> >> >> Some logistical questions:
>> >> >>
>> >> >> * We have a handful of pull requests in flight in parquet-cpp that
>> >> >> would need to be resolved / merged
>> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
>> >> >> releases cut out of the new structure
>> >> >> * Management of shared commit rights (I can discuss with the Arrow
>> >> >> PMC; I believe that approving any committer who has actively
>> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
>> >> >> comments)
>> >> >>
>> >> >> If working more closely together proves not to work out after
>> >> >> some period of time, I will be fully supportive of a fork or something
>> >> >> like it.
>> >> >>
>> >> >> Thanks,
>> >> >> Wes
>> >> >>
>> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >> > Thanks Tim.
>> >> >> >
>> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
>> >> >> > platform code intending to improve the performance of bit-packing in
>> >> >> > Parquet writes, and we ended up with 2 interdependent PRs:
>> >> >> >
>> >> >> > * https://github.com/apache/parquet-cpp/pull/483
>> >> >> > * https://github.com/apache/arrow/pull/2355
>> >> >> >
>> >> >> > Changes that impact the Python interface to Parquet are even more
>> >> >> complex.
>> >> >> >
>> >> >> > Adding options to Arrow's CMake build system to only build
>> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
>> >> >> > not be difficult, and would amount to writing "make parquet".
>> >> >> >
>> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
>> >> to
>> >> >> > build and install the Parquet core libraries and their dependencies
>> >> >> > would be:
>> >> >> >
>> >> >> > ninja parquet && ninja install
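A sketch of what such a build option might look like in a top-level CMakeLists.txt; the `ARROW_PARQUET` option name and the `parquet_shared`/`parquet_static` target names are assumptions for illustration, not the actual Arrow build system:

```cmake
# Hypothetical sketch: gate the Parquet sources behind an option and provide
# an umbrella target so that "ninja parquet" builds only the Parquet
# libraries plus whatever Arrow platform targets they depend on.
option(ARROW_PARQUET "Build the Parquet libraries" OFF)

if(ARROW_PARQUET)
  add_subdirectory(src/parquet)
  # parquet_shared/parquet_static are assumed to be defined in src/parquet;
  # ninja builds their transitive dependencies (Arrow platform code) first.
  add_custom_target(parquet DEPENDS parquet_shared parquet_static)
endif()
```

Because ninja resolves transitive dependencies, the custom target is enough to get the "build only Parquet and what it needs" behavior described above.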
>> >> >> >
>> >> >> > - Wes
>> >> >> >
>> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>> >> >> > <ta...@cloudera.com.invalid> wrote:
>> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
>> >> >> >> successful, but I thought I'd give my two cents.
>> >> >> >>
>> >> >> >> For me, the thing that makes the biggest difference in contributing
>> >> to a
>> >> >> >> new codebase is the number of steps in the workflow for writing,
>> >> >> testing,
>> >> >> >> posting and iterating on a commit and also the number of
>> >> opportunities
>> >> >> for
>> >> >> >> missteps. The size of the repo and build/test times matter but are
>> >> >> >> secondary so long as the workflow is simple and reliable.
>> >> >> >>
>> >> >> >> I don't really know what the current state of things is, but it
>> >> sounds
>> >> >> like
>> >> >> >> it's not as simple as check out -> build -> test if you're doing a
>> >> >> >> cross-repo change. Circular dependencies are a real headache.
>> >> >> >>
>> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>> >> >> wrote:
>> >> >> >>
>> >> >> >>> hi,
>> >> >> >>>
>> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
>> >> >> majeti.deepak@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> > I think the circular dependency can be broken if we build a new
>> >> >> library
>> >> >> >>> for
>> >> >> >>> > the platform code. This will also make it easy for other projects
>> >> >> such as
>> >> >> >>> > ORC to use it.
>> >> >> >>> > I also remember your proposal a while ago of having a separate
>> >> >> project
>> >> >> >>> for
>> >> >> >>> > the platform code.  That project can live in the arrow repo.
>> >> >> However, one
>> >> >> >>> > has to clone the entire apache arrow repo but can just build the
>> >> >> platform
>> >> >> >>> > code. This will be temporary until we can find a new home for it.
>> >> >> >>> >
>> >> >> >>> > The dependency will look like:
>> >> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> >> >> >>> > libplatform(platform api)
>> >> >> >>> >
>> >> >> >>> > CI workflow will clone the arrow project twice, once for the
>> >> platform
>> >> >> >>> > library and once for the arrow-core/bindings library.
>> >> >> >>>
>> >> >> >>> This seems like an interesting proposal; the best place to work
>> >> toward
>> >> >> >>> this goal (if it is even possible; the build system interactions and
>> >> >> >>> ASF release management are the hard problems) is to have all of the
>> >> >> >>> code in a single repository. ORC could already be using Arrow if it
>> >> >> >>> wanted, but the ORC contributors aren't active in Arrow.
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> > There is no doubt that the collaborations between the Arrow and
>> >> >> Parquet
>> >> >> >>> > communities so far have been very successful.
>> >> >> >>> > The reason to maintain this relationship moving forward is to
>> >> >> continue to
>> >> >> >>> > reap the mutual benefits.
>> >> >> >>> > We should continue to take advantage of sharing code as well.
>> >> >> However, I
>> >> >> >>> > don't see any code sharing opportunities between arrow-core and
>> >> the
>> >> >> >>> > parquet-core. Both have different functions.
>> >> >> >>>
>> >> >> >>> I think you mean the Arrow columnar format. The Arrow columnar format
>> >> >> >>> is only one part of a project that has become quite large already
>> >> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> > We are at a point where the parquet-cpp public API is pretty stable.
>> >> >> >>> > We already passed that difficult stage. My take on arrow and parquet
>> >> >> >>> > is to keep them nimble since we can.
>> >> >> >>>
>> >> >> >>> I believe that parquet-core still has plenty of progress ahead of it. We
>> >> >> >>> have done little work in asynchronous IO and concurrency which would
>> >> >> >>> yield both improved read and write throughput. This aligns well with
>> >> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
>> >> >> >>> believe that more development will happen on parquet-core once the
>> >> >> >>> development process issues are resolved by having a single codebase,
>> >> >> >>> single build system, and a single CI framework.
>> >> >> >>>
>> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
>> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>> >> >> >>> goal I think we should still be open to making significant changes
>> >> in
>> >> >> >>> the interest of long term progress.
>> >> >> >>>
>> >> >> >>> Having now worked on these projects for more than 2 and a half years
>> >> >> >>> as the most frequent contributor to both codebases, I'm sadly far
>> >> >> >>> past the "breaking point" and not willing to continue contributing in
>> >> >> >>> a significant way to parquet-cpp if the projects remain structured
>> >> >> >>> as they are now. It's hampering progress and not serving the
>> >> >> >>> community.
>> >> >> >>>
>> >> >> >>> - Wes
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
>> >> >
>> >> >> >>> wrote:
>> >> >> >>> >
>> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
>> >> >> arrow
>> >> >> >>> >> repo. That will remove a majority of the dependency issues.
>> >> Joshua's
>> >> >> >>> work
>> >> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>> >> >> the
>> >> >> >>> arrow
>> >> >> >>> >> repo.  This will be similar to the ORC adaptor.
>> >> >> >>> >>
>> >> >> >>> >> This has been suggested before, but I don't see how it would
>> >> >> alleviate
>> >> >> >>> >> any issues because of the significant dependencies on other
>> >> parts of
>> >> >> >>> >> the Arrow codebase. What you are proposing is:
>> >> >> >>> >>
>> >> >> >>> >> - (Arrow) arrow platform
>> >> >> >>> >> - (Parquet) parquet core
>> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> >> >>> >> - (Arrow) Python bindings
>> >> >> >>> >>
>> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> >> >>> >> built before invoking the Parquet core part of the build system.
>> >> You
>> >> >> >>> >> would need to pass dependent targets across different CMake build
>> >> >> >>> >> systems; I don't know if it's possible (I spent some time looking
>> >> >> into
>> >> >> >>> >> it earlier this year). This is what I meant by the lack of a
>> >> >> "concrete
>> >> >> >>> >> and actionable plan". The only thing that would really work
>> >> would be
>> >> >> >>> >> for the Parquet core to be "included" in the Arrow build system
>> >> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
>> >> builds
>> >> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>> >> >> build
>> >> >> >>> >> system because it's only depended upon by the Python bindings.
>> >> >> >>> >>
>> >> >> >>> >> And even if a solution could be devised, it would not wholly
>> >> resolve
>> >> >> >>> >> the CI workflow issues.
>> >> >> >>> >>
>> >> >> >>> >> You could make Parquet completely independent of the Arrow
>> >> codebase,
>> >> >> >>> >> but at that point there is little reason to maintain a
>> >> relationship
>> >> >> >>> >> between the projects or their communities. We have spent a great
>> >> >> deal
>> >> >> >>> >> of effort refactoring the two projects to enable as much code
>> >> >> sharing
>> >> >> >>> >> as there is now.
>> >> >> >>> >>
>> >> >> >>> >> - Wes
>> >> >> >>> >>
>> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>> >> clone
>> >> >> the
>> >> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> >> Having
>> >> >> two
>> >> >> >>> >> parquet-cpp repos is no way a better approach.
>> >> >> >>> >> >
>> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
>> >> is
>> >> >> to
>> >> >> >>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >> >>> >> >
>> >> >> >>> >> > It doesn't look like I will be able to convince you that a
>> >> >> monorepo is
>> >> >> >>> >> > a good idea; what I would ask instead is that you be willing to
>> >> >> give
>> >> >> >>> >> > it a shot, and if it turns out in the way you're describing
>> >> >> (which I
>> >> >> >>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >> >>> >> >
>> >> >> >>> >> > - Wes
>> >> >> >>> >> >
>> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> >> >> >>> majeti.deepak@gmail.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >> >> Wes,
>> >> >> >>> >> >>
>> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>> >> >> problems
>> >> >> >>> of a
>> >> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >> >>> >> >> Bringing in related Apache community experiences are more
>> >> >> meaningful
>> >> >> >>> >> than
>> >> >> >>> >> >> how mono-repos work at Google and other big organizations.
>> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
>> >> >> developers.
>> >> >> >>> >> >> You are very well aware of how difficult it has been to find
>> >> more
>> >> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
>> >> has
>> >> >> a low
>> >> >> >>> >> >> contribution rate to its core components.
>> >> >> >>> >> >>
>> >> >> >>> >> >> We should target to ensure that new volunteers who want to
>> >> >> contribute
>> >> >> >>> >> >> bug-fixes/features should spend the least amount of time in
>> >> >> figuring
>> >> >> >>> out
>> >> >> >>> >> >> the project repo. We can never come up with an automated build
>> >> >> system
>> >> >> >>> >> that
>> >> >> >>> >> >> caters to every possible environment.
>> >> >> >>> >> >> My only concern is if the mono-repo will make it harder for
>> >> new
>> >> >> >>> >> developers
>> >> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
>> >> >> build
>> >> >> >>> and
>> >> >> >>> >> test
>> >> >> >>> >> >> dependencies.
>> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
>> >> less
>> >> >> >>> >> >> co-operative.
>> >> >> >>> >> >> I just don't think the mono-repo structure model will be
>> >> >> sustainable
>> >> >> >>> in
>> >> >> >>> >> an
>> >> >> >>> >> >> open source community unless there are long-term vested
>> >> >> interests. We
>> >> >> >>> >> can't
>> >> >> >>> >> >> predict that.
>> >> >> >>> >> >>
>> >> >> >>> >> >> The current circular dependency problems between Arrow and
>> >> >> Parquet
>> >> >> >>> is a
>> >> >> >>> >> >> major problem for the community and it is important.
>> >> >> >>> >> >>
>> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
>> >> >> arrow
>> >> >> >>> >> repo.
>> >> >> >>> >> >> That will remove a majority of the dependency issues.
>> >> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
>> >> that
>> >> >> >>> adapter
>> >> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
>> >> adaptor.
>> >> >> >>> >> >>
>> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
>> >> >> changes
>> >> >> >>> in
>> >> >> >>> >> the
>> >> >> >>> >> >> future to this code should not be the main reason to combine
>> >> the
>> >> >> >>> arrow
>> >> >> >>> >> >> parquet repos.
>> >> >> >>> >> >>
>> >> >> >>> >> >> "*I question whether it's worth the community's time long term to
>> >> >> >>> >> >> wear ourselves out defining custom "ports" / virtual interfaces in
>> >> >> >>> >> >> each library to plug components together rather than utilizing
>> >> >> >>> >> >> common platform APIs.*"
>> >> >> >>> >> >>
>> >> >> >>> >> >> My answer to your question below would be "Yes".
>> >> >> >>> Modularity/separation
>> >> >> >>> >> is
>> >> >> >>> >> >> very important in an open source community where priorities of
>> >> >> >>> >> contributors
>> >> >> >>> >> >> are often short term.
>> >> >> >>> >> >> The retention is low and therefore the acquisition costs
>> >> should
>> >> >> be
>> >> >> >>> low
>> >> >> >>> >> as
>> >> >> >>> >> >> well. This is the community over code approach according to
>> >> me.
>> >> >> Minor
>> >> >> >>> >> code
>> >> >> >>> >> >> duplication is not a deal breaker.
>> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
>> >> big
>> >> >> >>> data
>> >> >> >>> >> >> space serving their own functions.
>> >> >> >>> >> >>
>> >> >> >>> >> >> If you still strongly feel that the only way forward is to
>> >> clone
>> >> >> the
>> >> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> >> >> Having
>> >> >> >>> two
>> >> >> >>> >> >> parquet-cpp repos is no way a better approach.
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>> >> >> wesmckinn@gmail.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >> >>
>> >> >> >>> >> >>> @Antoine
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
>> >> would
>> >> >> >>> slightly
>> >> >> >>> >> >>> increase Arrow CI times (which are already too large).
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> A Parquet run takes about 28 minutes:
>> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
>> >> >> certain
>> >> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>> >> >> could be
>> >> >> >>> >> >>> made substantially shorter by moving some of the slower parts
>> >> >> (like
>> >> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>> >> >> nightly
>> >> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>> >> >> also
>> >> >> >>> >> >>> improve build times (valgrind build could be moved to a
>> >> nightly
>> >> >> >>> >> >>> exhaustive test run)
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> - Wes
>> >> >> >>> >> >>>
>> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>> >> >> wesmckinn@gmail.com
>> >> >> >>> >
>> >> >> >>> >> >>> wrote:
>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> >> great
>> >> >> >>> >> example of
>> >> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> >> >> >>> codebase.
>> >> >> >>> >> That
>> >> >> >>> >> >>> gives me hope that the projects could be managed separately
>> >> some
>> >> >> >>> day.
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
>> >> C++
>> >> >> >>> codebase
>> >> >> >>> >> >>> > features several areas of duplicated logic which could be
>> >> >> >>> replaced by
>> >> >> >>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >> >>> >> >>> > interoperability:
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>> >> >> cause of
>> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>> >> >> them
>> >> >> >>> from
>> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
>> >> is
>> >> >> only
>> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > I question whether it's worth the community's time long term to
>> >> >> >>> >> >>> > wear ourselves out defining custom "ports" / virtual interfaces
>> >> >> >>> >> >>> > in each library to plug components together rather than utilizing
>> >> >> >>> >> >>> > common platform APIs.
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > - Wes
>> >> >> >>> >> >>> >
>> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> >> >>> >> joshuastorck@gmail.com>
>> >> >> >>> >> >>> wrote:
>> >> >> >>> >> >>> >> Your point about the constraints of the ASF release process is
>> >> >> >>> >> >>> >> well taken, and as a developer who's trying to work in the current
>> >> >> >>> >> >>> >> environment I would be much happier if the codebases were merged.
>> >> >> >>> >> >>> >> The main issues I worry about when you put codebases like these
>> >> >> >>> >> >>> >> together are:
>> >> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
>> >> >> becomes
>> >> >> >>> too
>> >> >> >>> >> >>> coupled
>> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>> >> >> tree are
>> >> >> >>> >> >>> delayed
>> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> If the project/release management is structured well and
>> >> >> someone
>> >> >> >>> >> keeps
>> >> >> >>> >> >>> an
>> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> >> great
>> >> >> >>> >> example of
>> >> >> >>> >> >>> how
>> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> >> >> >>> codebase.
>> >> >> >>> >> That
>> >> >> >>> >> >>> >> gives me hope that the projects could be managed
>> >> separately
>> >> >> some
>> >> >> >>> >> day.
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> >> >> >>> wesmckinn@gmail.com>
>> >> >> >>> >> >>> wrote:
>> >> >> >>> >> >>> >>
>> >> >> >>> >> >>> >>> hi Josh,
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> >> >> >>> >> >>> >>> > tying them together seems like the wrong choice.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>> >> >> people
>> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
>> >> >> agree
>> >> >> >>> >> with?)
>> >> >> >>> >> >>> >>> is that we should work more closely together until the
>> >> >> community
>> >> >> >>> >> grows
>> >> >> >>> >> >>> >>> large enough to support larger-scope process than we have
>> >> >> now.
>> >> >> >>> As
>> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
>> >> these
>> >> >> >>> >> projects.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
>> >> own
>> >> >> >>> >> codebase.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking into
>> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination of the
>> >> >> >>> >> >>> >>> GitHub platform and the ASF release process. I'm all for being
>> >> >> >>> >> >>> >>> idealistic, but right now we need to be practical. Unless we can
>> >> >> >>> >> >>> >>> devise a practical procedure that can accommodate at least 1 patch
>> >> >> >>> >> >>> >>> per day which may touch both code and build system simultaneously
>> >> >> >>> >> >>> >>> without being a hindrance to contributor or maintainer, I don't
>> >> >> >>> >> >>> >>> see how we can move forward.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> >> codebases
>> >> >> >>> >> in the
>> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
>> >> >> the
>> >> >> >>> near
>> >> >> >>> >> >>> term.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
>> >> to
>> >> >> be
>> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
>> >> and
>> >> >> >>> >> community
>> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
>> >> the
>> >> >> >>> >> current
>> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
>> >> consider
>> >> >> >>> >> >>> >>> development process and ASF releases separately. My
>> >> >> argument is
>> >> >> >>> as
>> >> >> >>> >> >>> >>> follows:
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
>> >> PMCs
>> >> >> >>> >> >>> >>>
>> >> >> >>> >> >>> >>> - Wes
>> >> >> >>> >> >>> >>>
>>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com> wrote:
>>>
>>>> I recently worked on an issue that had to be implemented in
>>>> parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow
>>>> (ARROW-2585, ARROW-2586). I found the circular dependencies confusing
>>>> and hard to work with. For example, I still have a PR open in
>>>> parquet-cpp (created on May 10) because of a PR that it depended on
>>>> in arrow that was recently merged. I couldn't even address any CI
>>>> issues in the PR because the change in arrow was not yet in master.
>>>> In a separate PR, I changed the run_clang_format.py script in the
>>>> arrow project only to find out later that there was an exact copy of
>>>> it in parquet-cpp.
>>>>
>>>> However, I don't think merging the codebases makes sense in the long
>>>> term. I can imagine use cases for parquet that don't involve arrow
>>>> and tying them together seems like the wrong choice. There will be
>>>> other formats that arrow needs to support that will be kept separate
>>>> (e.g. - Orc), so I don't see why parquet should be special. I also
>>>> think build tooling should be pulled into its own codebase. GNU has
>>>> had a long history of developing open source C/C++ projects that way
>>>> and made projects like autoconf/automake/make to support them. I
>>>> don't think CI is a good counter-example since there have been lots
>>>> of successful open source projects that have used nightly build
>>>> systems that pinned versions of dependent software.
>>>>
>>>> That being said, I think it makes sense to merge the codebases in the
>>>> short term with the express purpose of separating them in the near
>>>> term. My reasoning is as follows. By putting the codebases together,
>>>> you can more easily delineate the boundaries between the API's with a
>>>> single PR. Second, it will force the build tooling to converge
>>>> instead of diverge, which has already happened. Once the boundaries
>>>> and tooling have been sorted out, it should be easy to separate them
>>>> back into their own codebases.
>>>>
>>>> If the codebases are merged, I would ask that the C++ codebases for
>>>> arrow be separated from other languages. Looking at it from the
>>>> perspective of a parquet-cpp library user, having a dependency on
>>>> Java is a large tax to pay if you don't need it. For example, there
>>>> were 25 JIRA's in the 0.10.0 release of arrow, many of which were
>>>> holding up the release. I hope that seems like a reasonable
>>>> compromise, and I think it will help reduce the complexity of the
>>>> build/release tooling.
>>>>
>>>> On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>
>>>>> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>
>>>>> > > The community will be less willing to accept large changes that
>>>>> > > require multiple rounds of patches for stability and API
>>>>> > > convergence. Our contributions to Libhdfs++ in the HDFS
>>>>> > > community took a significantly long time for the very same
>>>>> > > reason.
>>>>> >
>>>>> > Please don't use bad experiences from another open source
>>>>> > community as leverage in this discussion. I'm sorry that things
>>>>> > didn't go the way you wanted in Apache Hadoop but this is a
>>>>> > distinct community which happens to operate under a similar open
>>>>> > governance model.
>>>>>
>>>>> There are some more radical and community building options as well.
>>>>> Take the subversion project as a precedent. With subversion, any
>>>>> Apache committer can request and receive a commit bit on some large
>>>>> fraction of subversion.
>>>>>
>>>>> So why not take this a bit further and give every parquet committer
>>>>> a commit bit in Arrow? Or even make them be first class committers
>>>>> in Arrow? Possibly even make it policy that every Parquet committer
>>>>> who asks will be given committer status in Arrow.
>>>>>
>>>>> That relieves a lot of the social anxiety here. Parquet committers
>>>>> can't be worried at that point whether their patches will get
>>>>> merged; they can just merge them.  Arrow shouldn't worry much about
>>>>> inviting in the Parquet committers. After all, Arrow already depends
>>>>> a lot on parquet so why not invite them in?
>>>>
>>>
>> --
>> regards,
>> Deepak Majeti
>
> --
> Ryan Blue
> Software Engineer
> Netflix


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Back from vacation, I also want to finally raise my voice.

With the current state of the Parquet<->Arrow development, I see a benefit in merging the code base for now, but not necessarily forever.

Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter is built and which uses some of the more standard-library-like features of Arrow. It is also the go-to place where the same toolchain and CI setup as Arrow's is used, and where we directly apply all improvements that we make in Arrow itself. These are the points that make it special in comparison to other tools that provide Arrow adapters, such as Turbodbc.

Thus, I think that the current move to merge the code bases is ok for me. I must say that I'm not 100% certain that this is the best move but currently I lack better alternatives. As previously mentioned, we should take extra care that we can still do separate releases and also provide a path for a future where we split parquet-cpp into its own project/repository again.

An important point that we should keep in mind (and why I was a bit concerned the previous times this discussion was raised) is that we have to be careful not to pull everything that touches Arrow into the Arrow repository. Having separate repositories for projects, each with its own release cycle, is for me still the aim for the long term. I expect that there will be many more projects that will use Arrow's I/O libraries as well as emit Arrow structures. These libraries should be usable in Python/C++/Ruby/R/… and are then hopefully not all developed by the same core group of Arrow/Parquet developers we have currently.

For this to function really well, we will need a more stable API in Arrow as well as a good set of build tooling that other libraries can build upon when using Arrow functionality. In addition to being stable, the API must also provide a good UX in the abstraction layers in which the Arrow functions are provided, so that high-performance applications are not high-maintenance due to frequent API changes in Arrow. That said, this is currently a wish for the future. We are at the moment building and iterating heavily on these APIs to form a good basis for future developments. Thus the repo merge will hopefully improve development speed, so that we can spend less time on toolchain maintenance and focus on the user-facing APIs.

Uwe

On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
> Thanks Ryan, will do. The people I'd still like to hear from are:
> 
> * Phillip Cloud
> * Uwe Korn
> 
> As ASF contributors we are responsible both for being pragmatic and for
> acting in the best interests of the community's health and productivity.
> 
> 
> 
> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > I don't have an opinion here, but could someone send a summary of what is
> > decided to the dev list once there is consensus? This is a long thread for
> > parts of the project I don't work on, so I haven't followed it very closely.
> >
> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> >> Can we enforce that parquet-cpp changes will not be committed without a
> >> corresponding Parquet JIRA?
> >>
> >> I think we would use the following policy:
> >>
> >> * use PARQUET-XXX for issues relating to Parquet core
> >> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> >> core (e.g. changes that are in parquet/arrow right now)
> >>
> >> We've already been dealing with annoyances relating to issues
> >> straddling the two projects (debugging an issue on Arrow side to find
> >> that it has to be fixed on Parquet side); this would make things
> >> simpler for us
> >>
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> simplify forking later (if needed) and be able to maintain the commit
> >> history.  I don't know if it's possible to squash parquet-cpp commits and
> >> arrow commits separately before merging.
> >>
> >> This seems rather onerous for both contributors and maintainers and
> >> not in line with the goal of improving productivity. In the event that
> >> we fork I see it as a traumatic event for the community. If it does
> >> happen, then we can write a script (using git filter-branch and other
> >> such tools) to extract commits related to the forked code.
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
> >> wrote:
> >> > I have a few more logistical questions to add.
> >> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> > Arrow changes. Will we establish some guidelines for filing Parquet
> >> JIRAs?
> >> > Can we enforce that parquet-cpp changes will not be committed without a
> >> > corresponding Parquet JIRA?
> >> >
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> > simplify forking later (if needed) and be able to maintain the commit
> >> > history.  I don't know if it's possible to squash parquet-cpp commits and
> >> > arrow commits separately before merging.
> >> >
> >> >
> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
> >> >
> >> >> Do other people have opinions? I would like to undertake this work in
> >> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> >> responsibility for the primary codebase surgery.
> >> >>
> >> >> Some logistical questions:
> >> >>
> >> >> * We have a handful of pull requests in flight in parquet-cpp that
> >> >> would need to be resolved / merged
> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >> >> releases cut out of the new structure
> >> >> * Management of shared commit rights (I can discuss with the Arrow
> >> >> PMC; I believe that approving any committer who has actively
> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
> >> >> comments)
> >> >>
> >> >> If working more closely together proves to not be working out after
> >> >> some period of time, I will be fully supportive of a fork or something
> >> >> like it
> >> >>
> >> >> Thanks,
> >> >> Wes
> >> >>
> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >> > Thanks Tim.
> >> >> >
> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> >> > platform code intending to improve the performance of bit-packing in
> >> >> > Parquet writes, and we ended up with 2 interdependent PRs
> >> >> >
> >> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> >> > * https://github.com/apache/arrow/pull/2355
> >> >> >
> >> >> > Changes that impact the Python interface to Parquet are even more
> >> >> complex.
> >> >> >
> >> >> > Adding options to Arrow's CMake build system to only build
> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> >> > not be difficult, and amount to writing "make parquet".
> >> >> >
> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
> >> to
> >> >> > build and install the Parquet core libraries and their dependencies
> >> >> > would be:
> >> >> >
> >> >> > ninja parquet && ninja install
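[Editor's note: the aggregate target described above could look roughly like the following in the merged CMake build. The `parquet_shared`/`parquet_static` target names are assumptions for illustration, not the actual ones.]

```cmake
# Hypothetical sketch: an umbrella target so that `ninja parquet` builds
# only the Parquet core libraries, plus whatever Arrow library targets
# they link against (ninja resolves those dependencies transitively).
add_custom_target(parquet)
# Assumed names for the Parquet library targets in a merged tree:
add_dependencies(parquet parquet_shared parquet_static)
```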
> >> >> >
> >> >> > - Wes
> >> >> >
> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> >> > <ta...@cloudera.com.invalid> wrote:
> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> >> successful, but I thought I'd give my two cents.
> >> >> >>
> >> >> >> For me, the thing that makes the biggest difference in contributing
> >> to a
> >> >> >> new codebase is the number of steps in the workflow for writing,
> >> >> testing,
> >> >> >> posting and iterating on a commit and also the number of
> >> opportunities
> >> >> for
> >> >> >> missteps. The size of the repo and build/test times matter but are
> >> >> >> secondary so long as the workflow is simple and reliable.
> >> >> >>
> >> >> >> I don't really know what the current state of things is, but it
> >> sounds
> >> >> like
> >> >> >> it's not as simple as check out -> build -> test if you're doing a
> >> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >> >>
> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:
> >> >> >>
> >> >> >>> hi,
> >> >> >>>
> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.deepak@gmail.com> wrote:
> >> >> >>> > I think the circular dependency can be broken if we build a new
> >> >> >>> > library for the platform code. This will also make it easy for
> >> >> >>> > other projects such as ORC to use it.
> >> >> >>> > I also remember your proposal a while ago of having a separate
> >> >> >>> > project for the platform code.  That project can live in the arrow
> >> >> >>> > repo. However, one has to clone the entire apache arrow repo but
> >> >> >>> > can just build the platform code. This will be temporary until we
> >> >> >>> > can find a new home for it.
> >> >> >>> >
> >> >> >>> > The dependency will look like:
> >> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >> >>> > libplatform(platform api)
> >> >> >>> >
> >> >> >>> > CI workflow will clone the arrow project twice, once for the
> >> >> >>> > platform library and once for the arrow-core/bindings library.
> >> >> >>>
> >> >> >>> This seems like an interesting proposal; the best place to work
> >> >> >>> toward this goal (if it is even possible; the build system
> >> >> >>> interactions and ASF release management are the hard problems) is to
> >> >> >>> have all of the code in a single repository. ORC could already be
> >> >> >>> using Arrow if it wanted, but the ORC contributors aren't active in
> >> >> >>> Arrow.
> >> >> >>>
> >> >> >>> > There is no doubt that the collaborations between the Arrow and
> >> >> >>> > Parquet communities so far have been very successful.
> >> >> >>> > The reason to maintain this relationship moving forward is to
> >> >> >>> > continue to reap the mutual benefits.
> >> >> >>> > We should continue to take advantage of sharing code as well.
> >> >> >>> > However, I don't see any code sharing opportunities between
> >> >> >>> > arrow-core and the parquet-core. Both have different functions.
> >> >> >>>
> >> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
> >> >> >>> format is only one part of a project that has become quite large
> >> >> >>> already
> >> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >> >> >>>
> >> >> >>> > We are at a point where the parquet-cpp public API is pretty
> >> >> >>> > stable. We already passed that difficult stage. My take at arrow
> >> >> >>> > and parquet is to keep them nimble since we can.
> >> >> >>>
> >> >> >>> I believe that parquet-core has progress to make yet ahead of it. We
> >> >> >>> have done little work in asynchronous IO and concurrency which would
> >> >> >>> yield both improved read and write throughput. This aligns well with
> >> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >> >>> believe that more development will happen on parquet-core once the
> >> >> >>> development process issues are resolved by having a single codebase,
> >> >> >>> single build system, and a single CI framework.
> >> >> >>>
> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >> >>> goal I think we should still be open to making significant changes
> >> >> >>> in the interest of long term progress.
> >> >> >>>
> >> >> >>> Having now worked on these projects for more than 2 and a half years
> >> >> >>> and been the most frequent contributor to both codebases, I'm sadly
> >> >> >>> far past the "breaking point" and not willing to continue
> >> >> >>> contributing in a significant way to parquet-cpp if the projects
> >> >> >>> remained structured as they are now. It's hampering progress and not
> >> >> >>> serving the community.
> >> >> >>>
> >> >> >>> - Wes
> >> >> >>>
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> >> >> >>> >
> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
> >> >> >>> >> > arrow repo. That will remove a majority of the dependency
> >> >> >>> >> > issues. Joshua's work would not have been blocked in parquet-cpp
> >> >> >>> >> > if that adapter was in the arrow repo.  This will be similar to
> >> >> >>> >> > the ORC adaptor.
> >> >> >>> >>
> >> >> >>> >> This has been suggested before, but I don't see how it would
> >> >> >>> >> alleviate any issues because of the significant dependencies on
> >> >> >>> >> other parts of the Arrow codebase. What you are proposing is:
> >> >> >>> >>
> >> >> >>> >> - (Arrow) arrow platform
> >> >> >>> >> - (Parquet) parquet core
> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >> >>> >> - (Arrow) Python bindings
> >> >> >>> >>
> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >> >>> >> built before invoking the Parquet core part of the build system.
> >> >> >>> >> You would need to pass dependent targets across different CMake
> >> >> >>> >> build systems; I don't know if it's possible (I spent some time
> >> >> >>> >> looking into it earlier this year). This is what I meant by the
> >> >> >>> >> lack of a "concrete and actionable plan". The only thing that
> >> >> >>> >> would really work would be for the Parquet core to be "included"
> >> >> >>> >> in the Arrow build system somehow rather than using
> >> >> >>> >> ExternalProject. Currently Parquet builds Arrow using
> >> >> >>> >> ExternalProject, and Parquet is unknown to the Arrow build system
> >> >> >>> >> because it's only depended upon by the Python bindings.
> >> >> >>> >>
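[Editor's note: to make the ExternalProject point concrete, this is roughly the pattern such a build uses, sketched from memory; the repository URL, tag, and target names are illustrative rather than copied from the actual parquet-cpp build files.]

```cmake
# Hypothetical sketch of building Arrow via ExternalProject: the inner
# Arrow build runs as an opaque step, so none of Arrow's real CMake
# targets are visible to the outer (Parquet) build system.
include(ExternalProject)
ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  GIT_TAG apache-arrow-0.10.0
  SOURCE_SUBDIR cpp
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
  INSTALL_DIR ${CMAKE_BINARY_DIR}/arrow_ep-install)
# Only the installed artifacts can be consumed, as an IMPORTED library;
# dependent targets cannot be passed across the two CMake builds.
ExternalProject_Get_Property(arrow_ep install_dir)
add_library(arrow_static STATIC IMPORTED)
set_target_properties(arrow_static PROPERTIES
  IMPORTED_LOCATION ${install_dir}/lib/libarrow.a)
add_dependencies(arrow_static arrow_ep)
```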
> >> >> >>> >> And even if a solution could be devised, it would not wholly
> >> >> >>> >> resolve the CI workflow issues.
> >> >> >>> >>
> >> >> >>> >> You could make Parquet completely independent of the Arrow
> >> >> >>> >> codebase, but at that point there is little reason to maintain a
> >> >> >>> >> relationship between the projects or their communities. We have
> >> >> >>> >> spent a great deal of effort refactoring the two projects to
> >> >> >>> >> enable as much code sharing as there is now.
> >> >> >>> >>
> >> >> >>> >> - Wes
> >> >> >>> >>
> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> >> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> >> >>> >> >> clone the parquet-cpp repo and part ways, I will withdraw my
> >> >> >>> >> >> concern. Having two parquet-cpp repos is no way a better
> >> >> >>> >> >> approach.
> >> >> >>> >> >
> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
> >> >> >>> >> > is to fork. That would obviously be a bad outcome for the
> >> >> >>> >> > community.
> >> >> >>> >> >
> >> >> >>> >> > It doesn't look like I will be able to convince you that a
> >> >> >>> >> > monorepo is a good idea; what I would ask instead is that you be
> >> >> >>> >> > willing to give it a shot, and if it turns out in the way you're
> >> >> >>> >> > describing (which I don't think it will) then I suggest that we
> >> >> >>> >> > fork at that point.
> >> >> >>> >> >
> >> >> >>> >> > - Wes
> >> >> >>> >> >
> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.deepak@gmail.com> wrote:
> >> >> >>> >> >> Wes,
> >> >> >>> >> >>
> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> >> >> >>> >> >> problems of a non-existent Arrow-Parquet mono-repo.
> >> >> >>> >> >> Bringing in related Apache community experiences is more
> >> >> >>> >> >> meaningful than how mono-repos work at Google and other big
> >> >> >>> >> >> organizations.
> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
> >> >> >>> >> >> developers.
> >> >> >>> >> >> You are very well aware of how difficult it has been to find
> >> >> >>> >> >> more contributors and maintainers for Arrow. parquet-cpp
> >> >> >>> >> >> already has a low contribution rate to its core components.
> >> >> >>> >> >>
> >> >> >>> >> >> We should target to ensure that new volunteers who want to
> >> >> >>> >> >> contribute bug-fixes/features should spend the least amount of
> >> >> >>> >> >> time in figuring out the project repo. We can never come up
> >> >> >>> >> >> with an automated build system that caters to every possible
> >> >> >>> >> >> environment.
> >> >> >>> >> >> My only concern is if the mono-repo will make it harder for
> >> >> >>> >> >> new developers to work on parquet-cpp core just due to the
> >> >> >>> >> >> additional code, build and test dependencies.
> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
> >> >> >>> >> >> less co-operative.
> >> >> >>> >> >> I just don't think the mono-repo structure model will be
> >> >> >>> >> >> sustainable in an open source community unless there are
> >> >> >>> >> >> long-term vested interests. We can't predict that.
> >> >> >>> >> >>
> >> >> >>> >> >> The current circular dependency problems between Arrow and
> >> >> >>> >> >> Parquet are a major problem for the community and it is
> >> >> >>> >> >> important.
> >> >> >>> >> >>
> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
> >> >> >>> >> >> arrow repo. That will remove a majority of the dependency
> >> >> >>> >> >> issues. Joshua's work would not have been blocked in
> >> >> >>> >> >> parquet-cpp if that adapter was in the arrow repo.  This will
> >> >> >>> >> >> be similar to the ORC adaptor.
> >> >> >>> >> >>
> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
> >> >> >>> >> >> changes in the future to this code should not be the main
> >> >> >>> >> >> reason to combine the arrow and parquet repos.
> >> >> >>> >> >>
> >> >> >>> >> >> "I question whether it's worth the community's time long term
> >> >> >>> >> >> to wear ourselves out defining custom 'ports' / virtual
> >> >> >>> >> >> interfaces in each library to plug components together rather
> >> >> >>> >> >> than utilizing common platform APIs."
> >> >> >>> >> >>
> >> >> >>> >> >> My answer to your question below would be "Yes".
> >> >> >>> >> >> Modularity/separation is very important in an open source
> >> >> >>> >> >> community where priorities of contributors are often short
> >> >> >>> >> >> term.
> >> >> >>> >> >> The retention is low and therefore the acquisition costs
> >> >> >>> >> >> should be low as well. This is the community over code
> >> >> >>> >> >> approach according to me. Minor code duplication is not a deal
> >> >> >>> >> >> breaker.
> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
> >> >> >>> >> >> big data space serving their own functions.
> >> >> >>> >> >>
> >> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> >> >>> >> >> clone the parquet-cpp repo and part ways, I will withdraw my
> >> >> >>> >> >> concern. Having two parquet-cpp repos is no way a better
> >> >> >>> >> >> approach.
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >>> @Antoine
> >> >> >>> >> >>>
> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
> >> >> >>> >> >>> > would slightly increase Arrow CI times (which are already
> >> >> >>> >> >>> > too large).
> >> >> >>> >> >>>
> >> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >> >>> >> >>>
> >> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >> >>> >> >>>
> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
> >> >> >>> >> >>> certain builds on-demand based on commit / PR metadata or on
> >> >> >>> >> >>> request.
> >> >> >>> >> >>>
> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could
> >> >> >>> >> >>> be made substantially shorter by moving some of the slower
> >> >> >>> >> >>> parts (like the Python ASV benchmarks) from being tested
> >> >> >>> >> >>> every-commit to nightly or on demand. Using ASAN instead of
> >> >> >>> >> >>> valgrind in Travis would also improve build times (the
> >> >> >>> >> >>> valgrind build could be moved to a nightly exhaustive test
> >> >> >>> >> >>> run)
> >> >> >>> >> >>>
> >> >> >>> >> >>> - Wes
> >> >> >>> >> >>>
> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> >> >>> >> >>> >> great example of how it would be possible to manage
> >> >> >>> >> >>> >> parquet-cpp as a separate codebase. That gives me hope
> >> >> >>> >> >>> >> that the projects could be managed separately some day.
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
> >> >> >>> >> >>> > C++ codebase features several areas of duplicated logic
> >> >> >>> >> >>> > which could be replaced by components from the Arrow
> >> >> >>> >> >>> > platform for better platform-wide interoperability:
> >> >> >>> >> >>> >
> >> >> >>> >> >>> >
> >> >> >>> >> >>>
> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
> >> >> >>> orc/OrcFile.hh#L37
> >> >> >>> >> >>> >
> >> >> >>> >>
> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >> >>> >> >>> >
> >> >> >>> >> >>>
> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
> >> >> >>> orc/MemoryPool.hh
> >> >> >>> >> >>> >
> >> >> >>> >>
> >> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >> >>> >> >>> >
> >> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/
> >> >> >>> OutputStream.hh
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
> >> >> cause of
> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
> >> >> them
> >> >> >>> from
> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
> >> is
> >> >> only
> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > I question whether it's worth the community's time long
> >> term
> >> >> to
> >> >> >>> wear
> >> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
> >> in
> >> >> each
> >> >> >>> >> >>> > library to plug components together rather than utilizing
> >> >> common
> >> >> >>> >> >>> > platform APIs.
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > - Wes
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> >> >> >>> >> joshuastorck@gmail.com>
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >> Your point about the constraints of the ASF
> >> >> release process is
> >> >> >>> >> well
> >> >> >>> >> >>> >> taken and as a developer who's trying to work in the
> >> current
> >> >> >>> >> >>> environment I
> >> >> >>> >> >>> >> would be much happier if the codebases were merged. The
> >> main
> >> >> >>> issues
> >> >> >>> >> I
> >> >> >>> >> >>> worry
> >> >> >>> >> >>> >> about when you put codebases like these together are:
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code
> >> >> becomes
> >> >> >>> too
> >> >> >>> >> >>> coupled
> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
> >> >> tree are
> >> >> >>> >> >>> delayed
> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> If the project/release management is structured well and
> >> >> someone
> >> >> >>> >> keeps
> >> >> >>> >> >>> an
> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> great
> >> >> >>> >> example of
> >> >> >>> >> >>> how
> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
> >> >> >>> codebase.
> >> >> >>> >> That
> >> >> >>> >> >>> >> gives me hope that the projects could be managed
> >> separately
> >> >> some
> >> >> >>> >> day.
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
> >> >> >>> wesmckinn@gmail.com>
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >>> hi Josh,
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >> arrow
> >> >> >>> and
> >> >> >>> >> >>> tying
> >> >> >>> >> >>> >>> them together seems like the wrong choice.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
> >> >> people
> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
> >> >> agree
> >> >> >>> >> with?)
> >> >> >>> >> >>> >>> is that we should work more closely together until the
> >> >> community
> >> >> >>> >> grows
> >> >> >>> >> >>> large enough to support a larger-scope process than we have
> >> >> now.
> >> >> >>> As
> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
> >> these
> >> >> >>> >> projects.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
> >> own
> >> >> >>> >> codebase.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
> >> into
> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination
> >> of
> >> >> the
> >> >> >>> >> GitHub
> >> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
> >> >> >>> idealistic,
> >> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
> >> devise
> >> >> a
> >> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
> >> >> per
> >> >> >>> day
> >> >> >>> >> >>> >>> which may touch both code and build system simultaneously
> >> >> >>> without
> >> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
> >> see
> >> >> how
> >> >> >>> we
> >> >> >>> >> can
> >> >> >>> >> >>> >>> move forward.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >> codebases
> >> >> >>> >> in the
> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
> >> >> the
> >> >> >>> near
> >> >> >>> >> >>> term.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
> >> to
> >> >> be
> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
> >> and
> >> >> >>> >> community
> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
> >> the
> >> >> >>> >> current
> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
> >> consider
> >> >> >>> >> >>> >>> development process and ASF releases separately. My
> >> >> argument is
> >> >> >>> as
> >> >> >>> >> >>> >>> follows:
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
> >> PMCs
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> - Wes
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> >> >> >>> >> joshuastorck@gmail.com
> >> >> >>> >> >>> >
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> > I recently worked on an issue that had to be
> >> implemented
> >> >> in
> >> >> >>> >> >>> parquet-cpp
> >> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> >> >> >>> >> (ARROW-2585,
> >> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
> >> confusing
> >> >> and
> >> >> >>> >> hard to
> >> >> >>> >> >>> work
> >> >> >>> >> >>> >>> > with. For example, I still have a PR open in
> >> parquet-cpp
> >> >> >>> >> (created on
> >> >> >>> >> >>> May
> >> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
> >> was
> >> >> >>> >> recently
> >> >> >>> >> >>> >>> merged.
> >> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
> >> >> the
> >> >> >>> >> change in
> >> >> >>> >> >>> >>> arrow
> >> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
> >> >> >>> >> >>> >>> run_clang_format.py
> >> >> >>> >> >>> >>> > script in the arrow project only to find out later that
> >> >> there
> >> >> >>> >> was an
> >> >> >>> >> >>> >>> exact
> >> >> >>> >> >>> >>> > copy of it in parquet-cpp.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
> >> sense
> >> >> in
> >> >> >>> the
> >> >> >>> >> long
> >> >> >>> >> >>> >>> term.
> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >> arrow
> >> >> >>> and
> >> >> >>> >> >>> tying
> >> >> >>> >> >>> >>> them
> >> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
> >> other
> >> >> >>> formats
> >> >> >>> >> >>> that
> >> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
> >> (e.g.
> >> >> >>> ORC),
> >> >> >>> >> so I
> >> >> >>> >> >>> >>> don't
> >> >> >>> >> >>> >>> > see why parquet should be special. I also think build
> >> >> tooling
> >> >> >>> >> should
> >> >> >>> >> >>> be
> >> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
> >> history
> >> >> of
> >> >> >>> >> >>> developing
> >> >> >>> >> >>> >>> open
> >> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
> >> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
> >> CI
> >> >> is a
> >> >> >>> >> good
> >> >> >>> >> >>> >>> > counter-example since there have been lots of
> >> successful
> >> >> open
> >> >> >>> >> source
> >> >> >>> >> >>> >>> > projects that have used nightly build systems that
> >> pinned
> >> >> >>> >> versions of
> >> >> >>> >> >>> >>> > dependent software.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >> codebases
> >> >> >>> >> in the
> >> >> >>> >> >>> >>> short
> >> >> >>> >> >>> >>> > term with the express purpose of separating them in the
> >> >> near
> >> >> >>> >> term.
> >> >> >>> >> >>> My
> >> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
> >> >> together,
> >> >> >>> you
> >> >> >>> >> can
> >> >> >>> >> >>> more
> >> >> >>> >> >>> >>> > easily delineate the boundaries between the APIs with
> >> a
> >> >> >>> single
> >> >> >>> >> PR.
> >> >> >>> >> >>> >>> Second,
> >> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
> >> >> >>> diverge,
> >> >> >>> >> >>> which has
> >> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
> >> >> been
> >> >> >>> >> sorted
> >> >> >>> >> >>> out,
> >> >> >>> >> >>> >>> it
> >> >> >>> >> >>> >>> > should be easy to separate them back into their own
> >> >> codebases.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >> >> >>> codebases
> >> >> >>> >> for
> >> >> >>> >> >>> arrow
> >> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
> >> the
> >> >> >>> >> >>> perspective of
> >> >> >>> >> >>> >>> a
> >> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
> >> is a
> >> >> >>> large
> >> >> >>> >> tax
> >> >> >>> >> >>> to
> >> >> >>> >> >>> >>> pay
> >> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
> >> >> in the
> >> >> >>> >> 0.10.0
> >> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
> >> >> release. I
> >> >> >>> >> hope
> >> >> >>> >> >>> that
> >> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
> >> >> help
> >> >> >>> >> reduce
> >> >> >>> >> >>> the
> >> >> >>> >> >>> >>> > complexity of the build/release tooling.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> >> >> >>> >> ted.dunning@gmail.com>
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> >> >> >>> >> wesmckinn@gmail.com>
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> >
> >> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
> >> for
> >> >> >>> >> stability
> >> >> >>> >> >>> and
> >> >> >>> >> >>> >>> API
> >> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
> >> >> HDFS
> >> >> >>> >> >>> community
> >> >> >>> >> >>> >>> took
> >> >> >>> >> >>> >>> >> a
> >> >> >>> >> >>> >>> >> > > significantly long time for the very same reason.
> >> >> >>> >> >>> >>> >> >
> >> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
> >> >> source
> >> >> >>> >> >>> community as
> >> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
> >> >> didn't
> >> >> >>> go
> >> >> >>> >> the
> >> >> >>> >> >>> way
> >> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
> >> >> >>> community
> >> >> >>> >> which
> >> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
> >> >> model.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> There are some more radical and community-building
> >> >> options as
> >> >> >>> >> well.
> >> >> >>> >> >>> Take
> >> >> >>> >> >>> >>> >> the subversion project as a precedent. With
> >> subversion,
> >> >> any
> >> >> >>> >> Apache
> >> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
> >> >> large
> >> >> >>> >> >>> fraction of
> >> >> >>> >> >>> >>> >> subversion.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> So why not take this a bit further and give every
> >> parquet
> >> >> >>> >> committer
> >> >> >>> >> >>> a
> >> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
> >> >> >>> >> committers in
> >> >> >>> >> >>> >>> Arrow?
> >> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
> >> >> committer who
> >> >> >>> >> asks
> >> >> >>> >> >>> will
> >> >> >>> >> >>> >>> be
> >> >> >>> >> >>> >>> >> given committer status in Arrow.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
> >> Parquet
> >> >> >>> >> committers
> >> >> >>> >> >>> >>> can't be
> >> >> >>> >> >>> >>> >> worried at that point whether their patches will get
> >> >> merged;
> >> >> >>> >> they
> >> >> >>> >> >>> can
> >> >> >>> >> >>> >>> just
> >> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
> >> >> in the
> >> >> >>> >> >>> Parquet
> >> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
> >> >> >>> parquet so
> >> >> >>> >> >>> why not
> >> >> >>> >> >>> >>> >> invite them in?
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> --
> >> >> >>> >> >> regards,
> >> >> >>> >> >> Deepak Majeti
> >> >> >>> >>
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > --
> >> >> >>> > regards,
> >> >> >>> > Deepak Majeti
> >> >> >>>
> >> >>
> >> >
> >> >
> >> > --
> >> > regards,
> >> > Deepak Majeti
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Back from vacation, I finally want to add my voice as well.

With the current state of the Parquet<->Arrow development, I see a benefit in merging the code bases for now, but not necessarily forever.

Parquet C++ is the main codebase of an artefact for which an Arrow C++ adapter is built and which uses some of the more standard-library-like features of Arrow. It is also the go-to place where the same toolchain and CI setup is used, and where we directly apply all improvements that we make in Arrow itself. These points make it special in comparison to other tools that provide Arrow adapters, such as Turbodbc.

Thus, I think that the current move to merge the code bases is ok for me. I must say that I'm not 100% certain that this is the best move but currently I lack better alternatives. As previously mentioned, we should take extra care that we can still do separate releases and also provide a path for a future where we split parquet-cpp into its own project/repository again.

An important point that we should keep in mind (and why I was a bit concerned the previous times this discussion was raised) is that we have to be careful not to pull everything that touches Arrow into the Arrow repository. Having separate repositories for projects, each with its own release cycle, is for me still the long-term aim. I expect that there will be many more projects that use Arrow's I/O libraries as well as emit Arrow structures. These libraries should also be usable in Python/C++/Ruby/R/… and are then hopefully not all developed by the same core group of Arrow/Parquet developers we have currently. For this to function really well, we will need a more stable API in Arrow as well as a good set of build tooling that other libraries can build upon when using Arrow functionality. In addition to being stable, the API must also provide a good UX in the abstraction layers through which the Arrow functions are provided, so that high-performance applications are not high-maintenance due to frequent API changes in Arrow. That said, this is currently a wish for the future. We are currently building and iterating heavily on these APIs to form a good basis for future developments. Thus the repo merge will hopefully improve development speed so that we spend less time on toolchain maintenance and can focus on the user-facing APIs.

Uwe

On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
> Thanks Ryan, will do. The people I'd still like to hear from are:
> 
> * Phillip Cloud
> * Uwe Korn
> 
> As ASF contributors we are responsible both for being pragmatic and for
> acting in the best interests of the community's health and productivity.
> 
> 
> 
> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > I don't have an opinion here, but could someone send a summary of what is
> > decided to the dev list once there is consensus? This is a long thread for
> > parts of the project I don't work on, so I haven't followed it very closely.
> >
> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> >> Can we enforce that parquet-cpp changes will not be committed without a
> >> corresponding Parquet JIRA?
> >>
> >> I think we would use the following policy:
> >>
> >> * use PARQUET-XXX for issues relating to Parquet core
> >> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> >> core (e.g. changes that are in parquet/arrow right now)
> >>
> >> We've already been dealing with annoyances relating to issues
> >> straddling the two projects (debugging an issue on Arrow side to find
> >> that it has to be fixed on Parquet side); this would make things
> >> simpler for us
> >>
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> simplify forking later (if needed) and be able to maintain the commit
> >> history.  I don't know if it's possible to squash parquet-cpp commits and
> >> arrow commits separately before merging.
> >>
> >> This seems rather onerous for both contributors and maintainers and
> >> not in line with the goal of improving productivity. In the event that
> >> we fork I see it as a traumatic event for the community. If it does
> >> happen, then we can write a script (using git filter-branch and other
> >> such tools) to extract commits related to the forked code.
> >>
> >> - Wes
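The `git filter-branch` extraction Wes mentions can be sketched end-to-end in a self-contained way; all paths, commit messages, and file names below are illustrative, not the actual Arrow/Parquet history:

```shell
# Sketch: build a tiny "monorepo", then rewrite its history so only the
# commits touching cpp/src/parquet survive, with that directory promoted
# to the repository root (this is the fork-extraction scenario discussed).
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q mono && cd mono
git config user.email dev@example.org
git config user.name dev
mkdir -p cpp/src/parquet cpp/src/arrow
echo core > cpp/src/parquet/reader.cc
git add . && git commit -qm "PARQUET-1: add reader"
echo platform > cpp/src/arrow/io.cc
git add . && git commit -qm "ARROW-1: add io"
# --subdirectory-filter keeps only history touching the given directory;
# --prune-empty drops the Arrow-only commit entirely.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch -f --prune-empty --subdirectory-filter cpp/src/parquet HEAD
git log --oneline   # only the PARQUET-1 commit remains
```

After the rewrite, `reader.cc` sits at the repository root and the Arrow-side commit is gone from the history.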
> >>
> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
> >> wrote:
> >> > I have a few more logistical questions to add.
> >> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> > Arrow changes. Will we establish some guidelines for filing Parquet
> >> JIRAs?
> >> > Can we enforce that parquet-cpp changes will not be committed without a
> >> > corresponding Parquet JIRA?
> >> >
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> > simplify forking later (if needed) and be able to maintain the commit
> >> > history.  I don't know if it's possible to squash parquet-cpp commits and
> >> > arrow commits separately before merging.
> >> >
> >> >
> >> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
> >> >
> >> >> Do other people have opinions? I would like to undertake this work in
> >> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> >> responsibility for the primary codebase surgery.
> >> >>
> >> >> Some logistical questions:
> >> >>
> >> >> * We have a handful of pull requests in flight in parquet-cpp that
> >> >> would need to be resolved / merged
> >> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >> >> releases cut out of the new structure
> >> >> * Management of shared commit rights (I can discuss with the Arrow
> >> >> PMC; I believe that approving any committer who has actively
> >> >> maintained parquet-cpp should be a reasonable approach per Ted's
> >> >> comments)
> >> >>
> >> >> If working more closely together proves to not be working out after
> >> >> some period of time, I will be fully supportive of a fork or something
> >> >> like it
> >> >>
> >> >> Thanks,
> >> >> Wes
> >> >>
> >> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >> > Thanks Tim.
> >> >> >
> >> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> >> > platform code intending to improve the performance of bit-packing in
> >> >> > Parquet writes, and we ended up with 2 interdependent PRs
> >> >> >
> >> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> >> > * https://github.com/apache/arrow/pull/2355
> >> >> >
> >> >> > Changes that impact the Python interface to Parquet are even more
> >> >> complex.
> >> >> >
> >> >> > Adding options to Arrow's CMake build system to only build
> >> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> >> > not be difficult, and would amount to writing "make parquet".
> >> >> >
> >> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
> >> to
> >> >> > build and install the Parquet core libraries and their dependencies
> >> >> > would be:
> >> >> >
> >> >> > ninja parquet && ninja install
> >> >> >
> >> >> > - Wes
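An aggregate target of the kind Wes describes could look like the fragment below; the target names are assumptions for illustration, not Arrow's actual CMake targets:

```cmake
# Hypothetical aggregate target in the monorepo's CMakeLists.txt so that
# `ninja parquet` builds only the Parquet core libraries. Because the
# parquet libraries link against the Arrow platform library, CMake
# builds libarrow first automatically.
add_custom_target(parquet)
add_dependencies(parquet parquet_shared parquet_static)
```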
> >> >> >
> >> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> >> > <ta...@cloudera.com.invalid> wrote:
> >> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> >> successful, but I thought I'd give my two cents.
> >> >> >>
> >> >> >> For me, the thing that makes the biggest difference in contributing
> >> to a
> >> >> >> new codebase is the number of steps in the workflow for writing,
> >> >> testing,
> >> >> >> posting and iterating on a commit and also the number of
> >> opportunities
> >> >> for
> >> >> >> missteps. The size of the repo and build/test times matter but are
> >> >> >> secondary so long as the workflow is simple and reliable.
> >> >> >>
> >> >> >> I don't really know what the current state of things is, but it
> >> sounds
> >> >> like
> >> >> >> it's not as simple as check out -> build -> test if you're doing a
> >> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >> >>
> >> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
> >> >> wrote:
> >> >> >>
> >> >> >>> hi,
> >> >> >>>
> >> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> >> >> majeti.deepak@gmail.com>
> >> >> >>> wrote:
> >> >> >>> > I think the circular dependency can be broken if we build a new
> >> >> library
> >> >> >>> for
> >> >> >>> > the platform code. This will also make it easy for other projects
> >> >> such as
> >> >> >>> > ORC to use it.
> >> >> >>> > I also remember your proposal a while ago of having a separate
> >> >> project
> >> >> >>> for
> >> >> >>> > the platform code.  That project can live in the arrow repo.
> >> >> However, one
> >> >> >>> > has to clone the entire apache arrow repo but can just build the
> >> >> platform
> >> >> >>> > code. This will be temporary until we can find a new home for it.
> >> >> >>> >
> >> >> >>> > The dependency will look like:
> >> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >> >>> > libplatform(platform api)
> >> >> >>> >
> >> >> >>> > CI workflow will clone the arrow project twice, once for the
> >> platform
> >> >> >>> > library and once for the arrow-core/bindings library.
> >> >> >>>
> >> >> >>> This seems like an interesting proposal; the best place to work
> >> toward
> >> >> >>> this goal (if it is even possible; the build system interactions and
> >> >> >>> ASF release management are the hard problems) is to have all of the
> >> >> >>> code in a single repository. ORC could already be using Arrow if it
> >> >> >>> wanted, but the ORC contributors aren't active in Arrow.
> >> >> >>>
> >> >> >>> >
> >> >> >>> > There is no doubt that the collaborations between the Arrow and
> >> >> Parquet
> >> >> >>> > communities so far have been very successful.
> >> >> >>> > The reason to maintain this relationship moving forward is to
> >> >> continue to
> >> >> >>> > reap the mutual benefits.
> >> >> >>> > We should continue to take advantage of sharing code as well.
> >> >> However, I
> >> >> >>> > don't see any code sharing opportunities between arrow-core and
> >> the
> >> >> >>> > parquet-core. Both have different functions.
> >> >> >>>
> >> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
> >> format
> >> >> >>> is only one part of a project that has become quite large already
> >> >> >>> (
> >> >> https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-
> >> >> >>> platform-for-inmemory-data-105427919).
> >> >> >>>
> >> >> >>> >
> >> >> >>> > We are at a point where the parquet-cpp public API is pretty
> >> stable.
> >> >> We
> >> >> >>> > already passed that difficult stage. My take on arrow and parquet
> >> is
> >> >> to
> >> >> >>> > keep them nimble since we can.
> >> >> >>>
> >> >> >>> I believe that parquet-core still has progress to make ahead of it. We
> >> >> >>> have done little work in asynchronous IO and concurrency which would
> >> >> >>> yield both improved read and write throughput. This aligns well with
> >> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >> >>> believe that more development will happen on parquet-core once the
> >> >> >>> development process issues are resolved by having a single codebase,
> >> >> >>> single build system, and a single CI framework.
> >> >> >>>
> >> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >> >>> goal I think we should still be open to making significant changes
> >> in
> >> >> >>> the interest of long term progress.
> >> >> >>>
> >> >> >>> Having now worked on these projects for more than 2 and a half years
> >> >> >>> and been the most frequent contributor to both codebases, I'm sadly far
> >> >> >>> past the "breaking point" and not willing to continue contributing
> >> in
> >> >> >>> a significant way to parquet-cpp if the projects remained structured
> >> >> >>> as they are now. It's hampering progress and not serving the
> >> >> >>> community.
> >> >> >>>
> >> >> >>> - Wes
> >> >> >>>
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
> >> >
> >> >> >>> wrote:
> >> >> >>> >
> >> >> >>> >> > The current Arrow adaptor code for parquet should live in the
> >> >> arrow
> >> >> >>> >> repo. That will remove a majority of the dependency issues.
> >> Joshua's
> >> >> >>> work
> >> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
> >> >> the
> >> >> >>> arrow
> >> >> >>> >> repo.  This will be similar to the ORC adaptor.
> >> >> >>> >>
> >> >> >>> >> This has been suggested before, but I don't see how it would
> >> >> alleviate
> >> >> >>> >> any issues because of the significant dependencies on other
> >> parts of
> >> >> >>> >> the Arrow codebase. What you are proposing is:
> >> >> >>> >>
> >> >> >>> >> - (Arrow) arrow platform
> >> >> >>> >> - (Parquet) parquet core
> >> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >> >>> >> - (Arrow) Python bindings
> >> >> >>> >>
> >> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >> >>> >> built before invoking the Parquet core part of the build system.
> >> You
> >> >> >>> >> would need to pass dependent targets across different CMake build
> >> >> >>> >> systems; I don't know if it's possible (I spent some time looking
> >> >> into
> >> >> >>> >> it earlier this year). This is what I meant by the lack of a
> >> >> "concrete
> >> >> >>> >> and actionable plan". The only thing that would really work
> >> would be
> >> >> >>> >> for the Parquet core to be "included" in the Arrow build system
> >> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
> >> builds
> >> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
> >> >> build
> >> >> >>> >> system because it's only depended upon by the Python bindings.
> >> >> >>> >>
> >> >> >>> >> And even if a solution could be devised, it would not wholly
> >> resolve
> >> >> >>> >> the CI workflow issues.
> >> >> >>> >>
> >> >> >>> >> You could make Parquet completely independent of the Arrow
> >> codebase,
> >> >> >>> >> but at that point there is little reason to maintain a
> >> relationship
> >> >> >>> >> between the projects or their communities. We have spent a great
> >> >> deal
> >> >> >>> >> of effort refactoring the two projects to enable as much code
> >> >> sharing
> >> >> >>> >> as there is now.
> >> >> >>> >>
> >> >> >>> >> - Wes
> >> >> >>> >>
> >> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
> >> wesmckinn@gmail.com>
> >> >> >>> wrote:
> >> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> clone
> >> >> the
> >> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
> >> Having
> >> >> two
> >> >> >>> >> parquet-cpp repos is no way a better approach.
> >> >> >>> >> >
> >> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
> >> is
> >> >> to
> >> >> >>> >> > fork. That would obviously be a bad outcome for the community.
> >> >> >>> >> >
> >> >> >>> >> > It doesn't look like I will be able to convince you that a
> >> >> monorepo is
> >> >> >>> >> > a good idea; what I would ask instead is that you be willing to
> >> >> give
> >> >> >>> >> > it a shot, and if it turns out in the way you're describing
> >> >> (which I
> >> >> >>> >> > don't think it will) then I suggest that we fork at that point.
> >> >> >>> >> >
> >> >> >>> >> > - Wes
> >> >> >>> >> >
> >> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> >> >> >>> majeti.deepak@gmail.com>
> >> >> >>> >> wrote:
> >> >> >>> >> >> Wes,
> >> >> >>> >> >>
> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> >> >> problems
> >> >> >>> of a
> >> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
> >> >> >>> >> >> Bringing in related Apache community experiences is more
> >> >> meaningful
> >> >> >>> >> than
> >> >> >>> >> >> how mono-repos work at Google and other big organizations.
> >> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
> >> >> developers.
> >> >> >>> >> >> You are very well aware of how difficult it has been to find
> >> more
> >> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
> >> has
> >> >> a low
> >> >> >>> >> >> contribution rate to its core components.
> >> >> >>> >> >>
> >> >> >>> >> >> We should aim to ensure that new volunteers who want to
> >> >> contribute
> >> >> >>> >> >> bug-fixes/features spend the least amount of time
> >> >> figuring
> >> >> >>> out
> >> >> >>> >> >> the project repo. We can never come up with an automated build
> >> >> system
> >> >> >>> >> that
> >> >> >>> >> >> caters to every possible environment.
> >> >> >>> >> >> My only concern is if the mono-repo will make it harder for
> >> new
> >> >> >>> >> developers
> >> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
> >> >> build
> >> >> >>> and
> >> >> >>> >> test
> >> >> >>> >> >> dependencies.
> >> >> >>> >> >> I am not saying that the Arrow community/committers will be
> >> less
> >> >> >>> >> >> co-operative.
> >> >> >>> >> >> I just don't think the mono-repo structure model will be
> >> >> sustainable
> >> >> >>> in
> >> >> >>> >> an
> >> >> >>> >> >> open source community unless there are long-term vested
> >> >> interests. We
> >> >> >>> >> can't
> >> >> >>> >> >> predict that.
> >> >> >>> >> >>
> >> >> >>> >> >> The current circular dependency between Arrow and Parquet is a
> >> >> >>> >> >> major problem for the community, and it is important to address.
> >> >> >>> >> >>
> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
> >> >> arrow
> >> >> >>> >> repo.
> >> >> >>> >> >> That will remove a majority of the dependency issues.
> >> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
> >> that
> >> >> >>> adapter
> >> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
> >> adaptor.
> >> >> >>> >> >>
> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor
> >> >> changes
> >> >> >>> in
> >> >> >>> >> the
> >> >> >>> >> >> future to this code should not be the main reason to combine
> >> the
> >> >> >>> arrow
> >> >> >>> >> >> and parquet repos.
> >> >> >>> >> >>
> >> >> >>> >> >> "*I question whether it's worth the community's time long term to
> >> >> >>> >> >> wear ourselves out defining custom "ports" / virtual interfaces in
> >> >> >>> >> >> each library to plug components together rather than utilizing
> >> >> >>> >> >> common platform APIs.*"
> >> >> >>> >> >>
> >> >> >>> >> >> My answer to your question below would be "Yes".
> >> >> >>> Modularity/separation
> >> >> >>> >> is
> >> >> >>> >> >> very important in an open source community where priorities of
> >> >> >>> >> contributors
> >> >> >>> >> >> are often short term.
> >> >> >>> >> >> The retention is low and therefore the acquisition costs
> >> should
> >> >> be
> >> >> >>> low
> >> >> >>> >> as
> >> >> >>> >> >> well. This is, to my mind, the community-over-code approach.
> >> >> Minor
> >> >> >>> >> code
> >> >> >>> >> >> duplication is not a deal breaker.
> >> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
> >> big
> >> >> >>> data
> >> >> >>> >> >> space serving their own functions.
> >> >> >>> >> >>
> >> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> clone
> >> >> the
> >> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
> >> >> Having
> >> >> >>> two
> >> >> >>> >> >> parquet-cpp repos is in no way a better approach.
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
> >> >> wesmckinn@gmail.com>
> >> >> >>> >> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >>> @Antoine
> >> >> >>> >> >>>
> >> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
> >> would
> >> >> >>> slightly
> >> >> >>> >> >>> increase Arrow CI times (which are already too large).
> >> >> >>> >> >>>
> >> >> >>> >> >>> A typical CI run in Arrow takes about 45 minutes:
> >> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >> >>> >> >>>
> >> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >> >>> >> >>>
> >> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
> >> >> certain
> >> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
> >> >> >>> >> >>>
> >> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
> >> >> could be
> >> >> >>> >> >>> made substantially shorter by moving some of the slower parts
> >> >> (like
> >> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
> >> >> nightly
> >> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
> >> >> also
> >> >> >>> >> >>> improve build times (valgrind build could be moved to a
> >> nightly
> >> >> >>> >> >>> exhaustive test run)
> >> >> >>> >> >>>
> >> >> >>> >> >>> - Wes
> >> >> >>> >> >>>
> >> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
> >> >> wesmckinn@gmail.com
> >> >> >>> >
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> great
> >> >> >>> >> example of
> >> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
> >> >> >>> codebase.
> >> >> >>> >> That
> >> >> >>> >> >>> gives me hope that the projects could be managed separately
> >> some
> >> >> >>> day.
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
> >> C++
> >> >> >>> codebase
> >> >> >>> >> >>> > features several areas of duplicated logic which could be
> >> >> >>> replaced by
> >> >> >>> >> >>> > components from the Arrow platform for better platform-wide
> >> >> >>> >> >>> > interoperability:
> >> >> >>> >> >>> >
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
> >> >> cause of
> >> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
> >> >> them
> >> >> >>> from
> >> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
> >> is
> >> >> only
> >> >> >>> >> >>> > available for static linking at the moment AFAIK).
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > I question whether it's worth the community's time long
> >> term
> >> >> to
> >> >> >>> wear
> >> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
> >> in
> >> >> each
> >> >> >>> >> >>> > library to plug components together rather than utilizing
> >> >> common
> >> >> >>> >> >>> > platform APIs.
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > - Wes
> >> >> >>> >> >>> >
> >> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> >> >> >>> >> joshuastorck@gmail.com>
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >> Your point about the constraints of the ASF release
> >> >> process is
> >> >> >>> >> well
> >> >> >>> >> >>> >> taken and as a developer who's trying to work in the
> >> current
> >> >> >>> >> >>> environment I
> >> >> >>> >> >>> >> would be much happier if the codebases were merged. The
> >> main
> >> >> >>> issues
> >> >> >>> >> I
> >> >> >>> >> >>> worry
> >> >> >>> >> >>> >> about when you put codebases like these together are:
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
> >> >> becomes
> >> >> >>> too
> >> >> >>> >> >>> coupled
> >> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
> >> >> tree are
> >> >> >>> >> >>> delayed
> >> >> >>> >> >>> >> by artifacts higher in the dependency tree
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> If the project/release management is structured well and
> >> >> someone
> >> >> >>> >> keeps
> >> >> >>> >> >>> an
> >> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> great
> >> >> >>> >> example of
> >> >> >>> >> >>> how
> >> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
> >> >> >>> codebase.
> >> >> >>> >> That
> >> >> >>> >> >>> >> gives me hope that the projects could be managed
> >> separately
> >> >> some
> >> >> >>> >> day.
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
> >> >> >>> wesmckinn@gmail.com>
> >> >> >>> >> >>> wrote:
> >> >> >>> >> >>> >>
> >> >> >>> >> >>> >>> hi Josh,
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >> arrow
> >> >> >>> and
> >> >> >>> >> >>> tying
> >> >> >>> >> >>> >>> them together seems like the wrong choice.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
> >> >> people
> >> >> >>> >> >>> >>> building these projects -- my argument (which I think you
> >> >> agree
> >> >> >>> >> with?)
> >> >> >>> >> >>> >>> is that we should work more closely together until the
> >> >> community
> >> >> >>> >> grows
> >> >> >>> >> >>> >>> large enough to support larger-scope process than we have
> >> >> now.
> >> >> >>> As
> >> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
> >> these
> >> >> >>> >> projects.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
> >> own
> >> >> >>> >> codebase.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
> >> into
> >> >> >>> >> >>> >>> consideration the constraints imposed by the combination
> >> of
> >> >> the
> >> >> >>> >> GitHub
> >> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
> >> >> >>> idealistic,
> >> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
> >> devise
> >> >> a
> >> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
> >> >> per
> >> >> >>> day
> >> >> >>> >> >>> >>> which may touch both code and build system simultaneously
> >> >> >>> without
> >> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
> >> see
> >> >> how
> >> >> >>> we
> >> >> >>> >> can
> >> >> >>> >> >>> >>> move forward.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >> codebases
> >> >> >>> >> in the
> >> >> >>> >> >>> >>> short term with the express purpose of separating them in
> >> >> the
> >> >> >>> near
> >> >> >>> >> >>> term.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
> >> to
> >> >> be
> >> >> >>> >> >>> >>> practical and result in net improvements in productivity
> >> and
> >> >> >>> >> community
> >> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
> >> the
> >> >> >>> >> current
> >> >> >>> >> >>> >>> separation is impractical, and is causing problems.
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
> >> consider
> >> >> >>> >> >>> >>> development process and ASF releases separately. My
> >> >> argument is
> >> >> >>> as
> >> >> >>> >> >>> >>> follows:
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >> >>> >> >>> >>> * Releases structured according to the desires of the
> >> PMCs
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> - Wes
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> >> >> >>> >> joshuastorck@gmail.com
> >> >> >>> >> >>> >
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> > I recently worked on an issue that had to be
> >> implemented
> >> >> in
> >> >> >>> >> >>> parquet-cpp
> >> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> >> >> >>> >> (ARROW-2585,
> >> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
> >> confusing
> >> >> and
> >> >> >>> >> hard to
> >> >> >>> >> >>> work
> >> >> >>> >> >>> >>> > with. For example, I still have a PR open in
> >> parquet-cpp
> >> >> >>> >> (created on
> >> >> >>> >> >>> May
> >> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
> >> was
> >> >> >>> >> recently
> >> >> >>> >> >>> >>> merged.
> >> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
> >> >> the
> >> >> >>> >> change in
> >> >> >>> >> >>> >>> arrow
> >> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
> >> >> >>> >> >>> >>> run_clang_format.py
> >> >> >>> >> >>> >>> > script in the arrow project only to find out later that
> >> >> there
> >> >> >>> >> was an
> >> >> >>> >> >>> >>> exact
> >> >> >>> >> >>> >>> > copy of it in parquet-cpp.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
> >> sense
> >> >> in
> >> >> >>> the
> >> >> >>> >> long
> >> >> >>> >> >>> >>> term.
> >> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >> arrow
> >> >> >>> and
> >> >> >>> >> >>> tying
> >> >> >>> >> >>> >>> them
> >> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
> >> other
> >> >> >>> formats
> >> >> >>> >> >>> that
> >> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
> >> (e.g. -
> >> >> >>> Orc),
> >> >> >>> >> so I
> >> >> >>> >> >>> >>> don't
> >> >> >>> >> >>> >>> > see why parquet should be special. I also think build
> >> >> tooling
> >> >> >>> >> should
> >> >> >>> >> >>> be
> >> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
> >> history
> >> >> of
> >> >> >>> >> >>> developing
> >> >> >>> >> >>> >>> open
> >> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
> >> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
> >> CI
> >> >> is a
> >> >> >>> >> good
> >> >> >>> >> >>> >>> > counter-example since there have been lots of
> >> successful
> >> >> open
> >> >> >>> >> source
> >> >> >>> >> >>> >>> > projects that have used nightly build systems that
> >> pinned
> >> >> >>> >> versions of
> >> >> >>> >> >>> >>> > dependent software.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >> codebases
> >> >> >>> >> in the
> >> >> >>> >> >>> >>> short
> >> >> >>> >> >>> >>> > term with the express purpose of separating them in the
> >> >> near
> >> >> >>> >> term.
> >> >> >>> >> >>> My
> >> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
> >> >> together,
> >> >> >>> you
> >> >> >>> >> can
> >> >> >>> >> >>> more
> >> >> >>> >> >>> >>> > easily delineate the boundaries between the API's with
> >> a
> >> >> >>> single
> >> >> >>> >> PR.
> >> >> >>> >> >>> >>> Second,
> >> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
> >> >> >>> diverge,
> >> >> >>> >> >>> which has
> >> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
> >> >> been
> >> >> >>> >> sorted
> >> >> >>> >> >>> out,
> >> >> >>> >> >>> >>> it
> >> >> >>> >> >>> >>> > should be easy to separate them back into their own
> >> >> codebases.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >> >> >>> codebases
> >> >> >>> >> for
> >> >> >>> >> >>> arrow
> >> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
> >> the
> >> >> >>> >> >>> perspective of
> >> >> >>> >> >>> >>> a
> >> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
> >> is a
> >> >> >>> large
> >> >> >>> >> tax
> >> >> >>> >> >>> to
> >> >> >>> >> >>> >>> pay
> >> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
> >> >> in the
> >> >> >>> >> 0.10.0
> >> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
> >> >> release. I
> >> >> >>> >> hope
> >> >> >>> >> >>> that
> >> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
> >> >> help
> >> >> >>> >> reduce
> >> >> >>> >> >>> the
> >> >> >>> >> >>> >>> > complexity of the build/release tooling.
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> >> >> >>> >> ted.dunning@gmail.com>
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> >
> >> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> >> >> >>> >> wesmckinn@gmail.com>
> >> >> >>> >> >>> >>> wrote:
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> >
> >> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
> >> for
> >> >> >>> >> stability
> >> >> >>> >> >>> and
> >> >> >>> >> >>> >>> API
> >> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
> >> >> HDFS
> >> >> >>> >> >>> community
> >> >> >>> >> >>> >>> took
> >> >> >>> >> >>> >>> >> a
> >> >> >>> >> >>> >>> >> > > significantly long time for the very same reason.
> >> >> >>> >> >>> >>> >> >
> >> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
> >> >> source
> >> >> >>> >> >>> community as
> >> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
> >> >> didn't
> >> >> >>> go
> >> >> >>> >> the
> >> >> >>> >> >>> way
> >> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
> >> >> >>> community
> >> >> >>> >> which
> >> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
> >> >> model.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> There are some more radical and community building
> >> >> options as
> >> >> >>> >> well.
> >> >> >>> >> >>> Take
> >> >> >>> >> >>> >>> >> the subversion project as a precedent. With
> >> subversion,
> >> >> any
> >> >> >>> >> Apache
> >> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
> >> >> large
> >> >> >>> >> >>> fraction of
> >> >> >>> >> >>> >>> >> subversion.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> So why not take this a bit further and give every
> >> parquet
> >> >> >>> >> committer
> >> >> >>> >> >>> a
> >> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
> >> >> >>> >> committers in
> >> >> >>> >> >>> >>> Arrow?
> >> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
> >> >> committer who
> >> >> >>> >> asks
> >> >> >>> >> >>> will
> >> >> >>> >> >>> >>> be
> >> >> >>> >> >>> >>> >> given committer status in Arrow.
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
> >> Parquet
> >> >> >>> >> committers
> >> >> >>> >> >>> >>> can't be
> >> >> >>> >> >>> >>> >> worried at that point whether their patches will get
> >> >> merged;
> >> >> >>> >> they
> >> >> >>> >> >>> can
> >> >> >>> >> >>> >>> just
> >> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
> >> >> in the
> >> >> >>> >> >>> Parquet
> >> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
> >> >> >>> parquet so
> >> >> >>> >> >>> why not
> >> >> >>> >> >>> >>> >> invite them in?
> >> >> >>> >> >>> >>> >>
> >> >> >>> >> >>> >>>
> >> >> >>> >> >>>
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> --
> >> >> >>> >> >> regards,
> >> >> >>> >> >> Deepak Majeti
> >> >> >>> >>
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > --
> >> >> >>> > regards,
> >> >> >>> > Deepak Majeti
> >> >> >>>
> >> >>
> >> >
> >> >
> >> > --
> >> > regards,
> >> > Deepak Majeti
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
Thanks Ryan, will do. The people I'd still like to hear from are:

* Phillip Cloud
* Uwe Korn

As ASF contributors we are responsible both for being pragmatic and for
acting in the best interests of the community's health and productivity.



On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> I don't have an opinion here, but could someone send a summary of what is
> decided to the dev list once there is consensus? This is a long thread for
> parts of the project I don't work on, so I haven't followed it very closely.
>
> On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
>
>> > It will be difficult to track parquet-cpp changes if they get mixed with
>> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
>> Can we enforce that parquet-cpp changes will not be committed without a
>> corresponding Parquet JIRA?
>>
>> I think we would use the following policy:
>>
>> * use PARQUET-XXX for issues relating to Parquet core
>> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
>> core (e.g. changes that are in parquet/arrow right now)
>>
>> We've already been dealing with annoyances relating to issues
>> straddling the two projects (debugging an issue on Arrow side to find
>> that it has to be fixed on Parquet side); this would make things
>> simpler for us
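[Editor's note: the PARQUET-XXX / ARROW-XXX convention above could even be checked mechanically, e.g. from a commit-msg hook. The sketch below is purely illustrative — the `check_commit_msg` helper and the exact policy wording are hypothetical, not anything either project has adopted.]

```shell
# Hypothetical helper for the JIRA-prefix policy sketched above:
# accept a commit subject only if it starts with a PARQUET-XXX or
# ARROW-XXX issue key.
check_commit_msg() {
  case "$1" in
    PARQUET-[0-9]*|ARROW-[0-9]*) return 0 ;;  # Parquet core / Arrow side
    *) return 1 ;;                            # no JIRA key: reject
  esac
}

# Example usage (the subjects themselves are made up):
check_commit_msg "PARQUET-1373: fix statistics for dictionary pages" \
  && echo "accepted: Parquet core change"
check_commit_msg "tweak dictionary encoding" \
  || echo "rejected: missing JIRA key"
```

[Wired up as `.git/hooks/commit-msg`, the same check would run against `head -n 1 "$1"` of the message file.]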
>>
>> > I would also like to keep changes to parquet-cpp on a separate commit to
>> simplify forking later (if needed) and be able to maintain the commit
>> history.  I don't know if it's possible to squash parquet-cpp commits and
>> arrow commits separately before merging.
>>
>> This seems rather onerous for both contributors and maintainers and
>> not in line with the goal of improving productivity. In the event that
>> we fork I see it as a traumatic event for the community. If it does
>> happen, then we can write a script (using git filter-branch and other
>> such tools) to extract commits related to the forked code.
>>
>> - Wes
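[Editor's note: the extraction script Wes alludes to can be sketched with stock git. Everything below is hypothetical — the `cpp/src/parquet` layout is an assumed monorepo location for the Parquet code, and a toy repository stands in for apache/arrow — but it illustrates that splitting out the subtree's history later is mechanical.]

```shell
#!/bin/sh
# Sketch: extract the history of a Parquet subtree from a monorepo.
# The cpp/src/parquet path and the toy stand-in repo are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip git's interactive warning pause

work=$(mktemp -d)
cd "$work"

# Stand-in monorepo with one Arrow-only commit and one Parquet commit.
git init -q mono && cd mono
git config user.email dev@example.org && git config user.name dev
mkdir -p cpp/src/arrow cpp/src/parquet
echo 'arrow platform' > cpp/src/arrow/platform.cc
git add . && git commit -qm "ARROW-9999: platform change (hypothetical)"
echo 'parquet core' > cpp/src/parquet/reader.cc
git add . && git commit -qm "PARQUET-9999: core change (hypothetical)"

# Rewrite history so cpp/src/parquet becomes the repository root and
# commits that never touched it are dropped.
git filter-branch --prune-empty --subdirectory-filter cpp/src/parquet HEAD

echo "surviving commits:"
git log --format='  %s'
```

[Only the PARQUET-tagged commit survives, with the subtree promoted to the repo root — which is why deferring a fork does not burn any bridges.]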
>>
>> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> > I have a few more logistical questions to add.
>> >
>> > It will be difficult to track parquet-cpp changes if they get mixed with
>> > Arrow changes. Will we establish some guidelines for filing Parquet
>> JIRAs?
>> > Can we enforce that parquet-cpp changes will not be committed without a
>> > corresponding Parquet JIRA?
>> >
>> > I would also like to keep changes to parquet-cpp on a separate commit to
>> > simplify forking later (if needed) and be able to maintain the commit
>> > history.  I don't know if it's possible to squash parquet-cpp commits and
>> > arrow commits separately before merging.
>> >
>> >
>> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> Do other people have opinions? I would like to undertake this work in
>> >> the near future (the next 8-10 weeks); I would be OK with taking
>> >> responsibility for the primary codebase surgery.
>> >>
>> >> Some logistical questions:
>> >>
>> >> * We have a handful of pull requests in flight in parquet-cpp that
>> >> would need to be resolved / merged
>> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
>> >> releases cut out of the new structure
>> >> * Management of shared commit rights (I can discuss with the Arrow
>> >> PMC; I believe that approving any committer who has actively
>> >> maintained parquet-cpp should be a reasonable approach per Ted's
>> >> comments)
>> >>
>> >> If working more closely together proves not to be working out after
>> >> some period of time, I will be fully supportive of a fork or something
>> >> like it
>> >>
>> >> Thanks,
>> >> Wes
>> >>
>> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >> > Thanks Tim.
>> >> >
>> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
>> >> > platform code intending to improve the performance of bit-packing in
>> >> > Parquet writes, and we resulted with 2 interdependent PRs
>> >> >
>> >> > * https://github.com/apache/parquet-cpp/pull/483
>> >> > * https://github.com/apache/arrow/pull/2355
>> >> >
>> >> > Changes that impact the Python interface to Parquet are even more
>> >> complex.
>> >> >
>> >> > Adding options to Arrow's CMake build system to only build
>> >> > Parquet-related code and dependencies (in a monorepo framework) would
>> >> > not be difficult, and would amount to writing "make parquet".
>> >> >
>> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
>> to
>> >> > build and install the Parquet core libraries and their dependencies
>> >> > would be:
>> >> >
>> >> > ninja parquet && ninja install
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>> >> > <ta...@cloudera.com.invalid> wrote:
>> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
>> >> >> successful, but I thought I'd give my two cents.
>> >> >>
>> >> >> For me, the thing that makes the biggest difference in contributing
>> to a
>> >> >> new codebase is the number of steps in the workflow for writing,
>> >> testing,
>> >> >> posting and iterating on a commit and also the number of
>> opportunities
>> >> for
>> >> >> missteps. The size of the repo and build/test times matter but are
>> >> >> secondary so long as the workflow is simple and reliable.
>> >> >>
>> >> >> I don't really know what the current state of things is, but it
>> sounds
>> >> like
>> >> >> it's not as simple as check out -> build -> test if you're doing a
>> >> >> cross-repo change. Circular dependencies are a real headache.
>> >> >>
>> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> hi,
>> >> >>>
>> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
>> >> majeti.deepak@gmail.com>
>> >> >>> wrote:
>> >> >>> > I think the circular dependency can be broken if we build a new
>> >> library
>> >> >>> for
>> >> >>> > the platform code. This will also make it easy for other projects
>> >> such as
>> >> >>> > ORC to use it.
>> >> >>> > I also remember your proposal a while ago of having a separate
>> >> project
>> >> >>> for
>> >> >>> > the platform code.  That project can live in the arrow repo.
>> >> However, one
>> >> >>> > has to clone the entire apache arrow repo but can just build the
>> >> platform
>> >> >>> > code. This will be temporary until we can find a new home for it.
>> >> >>> >
>> >> >>> > The dependency will look like:
>> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> >> >>> > libplatform(platform api)
>> >> >>> >
>> >> >>> > CI workflow will clone the arrow project twice, once for the
>> platform
>> >> >>> > library and once for the arrow-core/bindings library.
>> >> >>>
>> >> >>> This seems like an interesting proposal; the best place to work
>> toward
>> >> >>> this goal (if it is even possible; the build system interactions and
>> >> >>> ASF release management are the hard problems) is to have all of the
>> >> >>> code in a single repository. ORC could already be using Arrow if it
>> >> >>> wanted, but the ORC contributors aren't active in Arrow.
>> >> >>>
>> >> >>> >
>> >> >>> > There is no doubt that the collaborations between the Arrow and
>> >> Parquet
>> >> >>> > communities so far have been very successful.
>> >> >>> > The reason to maintain this relationship moving forward is to
>> >> continue to
>> >> >>> > reap the mutual benefits.
>> >> >>> > We should continue to take advantage of sharing code as well.
>> >> However, I
>> >> >>> > don't see any code sharing opportunities between arrow-core and
>> the
>> >> >>> > parquet-core. Both have different functions.
>> >> >>>
>> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
>> format
>> >> >>> is only one part of a project that has become quite large already
>> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>> >> >>>
>> >> >>> >
>> >> >>> > We are at a point where the parquet-cpp public API is pretty
>> stable.
>> >> We
>> >> >>> > already passed that difficult stage. My take at arrow and parquet
>> is
>> >> to
>> >> >>> > keep them nimble since we can.
>> >> >>>
>> >> >>> I believe that parquet-core still has significant progress ahead of it. We
>> >> >>> have done little work in asynchronous IO and concurrency which would
>> >> >>> yield both improved read and write throughput. This aligns well with
>> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
>> >> >>> believe that more development will happen on parquet-core once the
>> >> >>> development process issues are resolved by having a single codebase,
>> >> >>> single build system, and a single CI framework.
>> >> >>>
>> >> >>> I have some gripes about design decisions made early in parquet-cpp,
>> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>> >> >>> goal I think we should still be open to making significant changes
>> in
>> >> >>> the interest of long term progress.
>> >> >>>
>> >> >>> Having now worked on these projects for more than 2 and a half years
>> >> >>> and been the most frequent contributor to both codebases, I'm sadly far
>> >> >>> past the "breaking point" and not willing to continue contributing
>> in
>> >> >>> a significant way to parquet-cpp if the projects remain structured
>> >> >>> as they are now. It's hampering progress and not serving the
>> >> >>> community.
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
>> >
>> >> >>> wrote:
>> >> >>> >
>> >> >>> >> > The current Arrow adaptor code for parquet should live in the
>> >> arrow
>> >> >>> >> repo. That will remove a majority of the dependency issues.
>> Joshua's
>> >> >>> work
>> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>> >> the
>> >> >>> arrow
>> >> >>> >> repo.  This will be similar to the ORC adaptor.
>> >> >>> >>
>> >> >>> >> This has been suggested before, but I don't see how it would
>> >> alleviate
>> >> >>> >> any issues because of the significant dependencies on other
>> parts of
>> >> >>> >> the Arrow codebase. What you are proposing is:
>> >> >>> >>
>> >> >>> >> - (Arrow) arrow platform
>> >> >>> >> - (Parquet) parquet core
>> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> >>> >> - (Arrow) Python bindings
>> >> >>> >>
>> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> >>> >> built before invoking the Parquet core part of the build system.
>> You
>> >> >>> >> would need to pass dependent targets across different CMake build
>> >> >>> >> systems; I don't know if it's possible (I spent some time looking
>> >> into
>> >> >>> >> it earlier this year). This is what I meant by the lack of a
>> >> "concrete
>> >> >>> >> and actionable plan". The only thing that would really work
>> would be
>> >> >>> >> for the Parquet core to be "included" in the Arrow build system
>> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
>> builds
>> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>> >> build
>> >> >>> >> system because it's only depended upon by the Python bindings.
>> >> >>> >>
>> >> >>> >> And even if a solution could be devised, it would not wholly
>> resolve
>> >> >>> >> the CI workflow issues.
>> >> >>> >>
>> >> >>> >> You could make Parquet completely independent of the Arrow
>> codebase,
>> >> >>> >> but at that point there is little reason to maintain a
>> relationship
>> >> >>> >> between the projects or their communities. We have spent a great
>> >> deal
>> >> >>> >> of effort refactoring the two projects to enable as much code
>> >> sharing
>> >> >>> >> as there is now.
>> >> >>> >>
>> >> >>> >> - Wes
>> >> >>> >>
>> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
>> wesmckinn@gmail.com>
>> >> >>> wrote:
>> >> >>> >> >> If you still strongly feel that the only way forward is to
>> clone
>> >> the
>> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> Having
>> >> two
>> >> >>> >> parquet-cpp repos is in no way a better approach.
>> >> >>> >> >
>> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
>> is
>> >> to
>> >> >>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >>> >> >
>> >> >>> >> > It doesn't look like I will be able to convince you that a
>> >> monorepo is
>> >> >>> >> > a good idea; what I would ask instead is that you be willing to
>> >> give
>> >> >>> >> > it a shot, and if it turns out in the way you're describing
>> >> (which I
>> >> >>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >>> >> >
>> >> >>> >> > - Wes
>> >> >>> >> >
>> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> >> >>> majeti.deepak@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >> >> Wes,
>> >> >>> >> >>
>> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>> >> problems
>> >> >>> of a
>> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >>> >> >> Bringing in related Apache community experiences are more
>> >> meaningful
>> >> >>> >> than
>> >> >>> >> >> how mono-repos work at Google and other big organizations.
>> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
>> >> developers.
>> >> >>> >> >> You are very well aware of how difficult it has been to find
>> more
>> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
>> has
>> >> a low
>> >> >>> >> >> contribution rate to its core components.
>> >> >>> >> >>
>> >> >>> >> >> We should target to ensure that new volunteers who want to
>> >> contribute
>> >> >>> >> >> bug-fixes/features should spend the least amount of time in
>> >> figuring
>> >> >>> out
>> >> >>> >> >> the project repo. We can never come up with an automated build
>> >> system
>> >> >>> >> that
>> >> >>> >> >> caters to every possible environment.
>> >> >>> >> >> My only concern is if the mono-repo will make it harder for
>> new
>> >> >>> >> developers
>> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
>> >> build
>> >> >>> and
>> >> >>> >> test
>> >> >>> >> >> dependencies.
>> >> >>> >> >> I am not saying that the Arrow community/committers will be
>> less
>> >> >>> >> >> co-operative.
>> >> >>> >> >> I just don't think the mono-repo structure model will be
>> >> sustainable
>> >> >>> in
>> >> >>> >> an
>> >> >>> >> >> open source community unless there are long-term vested
>> >> interests. We
>> >> >>> >> can't
>> >> >>> >> >> predict that.
>> >> >>> >> >>
>> >> >>> >> >> The current circular dependency problems between Arrow and
>> >> Parquet
>> >> >>> is a
>> >> >>> >> >> major problem for the community and it is important.
>> >> >>> >> >>
>> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
>> >> arrow
>> >> >>> >> repo.
>> >> >>> >> >> That will remove a majority of the dependency issues.
>> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
>> that
>> >> >>> adapter
>> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
>> adaptor.
>> >> >>> >> >>
>> >> >>> >> >> The platform API code is pretty stable at this point. Minor
>> >> changes
>> >> >>> in
>> >> >>> >> the
>> >> >>> >> >> future to this code should not be the main reason to combine
>> the
>> >> >>> arrow
>> >> >>> >> >> parquet repos.
>> >> >>> >> >>
>> >> >>> >> >> "I question whether it's worth the community's time long term to wear
>> >> >>> >> >> ourselves out defining custom "ports" / virtual interfaces in each
>> >> >>> >> >> library to plug components together rather than utilizing common
>> >> >>> >> >> platform APIs."
>> >> >>> >> >>
>> >> >>> >> >> My answer to your question below would be "Yes".
>> >> >>> Modularity/separation
>> >> >>> >> is
>> >> >>> >> >> very important in an open source community where priorities of
>> >> >>> >> contributors
>> >> >>> >> >> are often short term.
>> >> >>> >> >> The retention is low and therefore the acquisition costs
>> should
>> >> be
>> >> >>> low
>> >> >>> >> as
>> >> >>> >> >> well. This is the community over code approach according to
>> me.
>> >> Minor
>> >> >>> >> code
>> >> >>> >> >> duplication is not a deal breaker.
>> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
>> big
>> >> >>> data
>> >> >>> >> >> space serving their own functions.
>> >> >>> >> >>
>> >> >>> >> >> If you still strongly feel that the only way forward is to
>> clone
>> >> the
>> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> >> Having
>> >> >>> two
>> >> >>> >> >> parquet-cpp repos is in no way a better approach.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >> >>
>> >> >>> >> >>> @Antoine
>> >> >>> >> >>>
>> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
>> would
>> >> >>> slightly
>> >> >>> >> >>> increase Arrow CI times (which are already too large).
>> >> >>> >> >>>
>> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >>> >> >>>
>> >> >>> >> >>> A Parquet run takes about 28 minutes:
>> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >>> >> >>>
>> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
>> >> certain
>> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >>> >> >>>
>> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>> >> could be
>> >> >>> >> >>> made substantially shorter by moving some of the slower parts
>> >> (like
>> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>> >> nightly
>> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>> >> also
>> >> >>> >> >>> improve build times (valgrind build could be moved to a
>> nightly
>> >> >>> >> >>> exhaustive test run).
>> >> >>> >> >>>
>> >> >>> >> >>> - Wes
>> >> >>> >> >>>
>> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>> >> wesmckinn@gmail.com
>> >> >>> >
>> >> >>> >> >>> wrote:
>> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> great
>> >> >>> >> example of
>> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> >> >>> codebase.
>> >> >>> >> That
>> >> >>> >> >>> gives me hope that the projects could be managed separately
>> some
>> >> >>> day.
>> >> >>> >> >>> >
>> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
>> C++
>> >> >>> codebase
>> >> >>> >> >>> > features several areas of duplicated logic which could be
>> >> >>> replaced by
>> >> >>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >>> >> >>> > interoperability:
>> >> >>> >> >>> >
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >>> >> >>> >
>> >> >>> >>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >>> >> >>> >
>> >> >>> >>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >>> >> >>> >
>> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >>> >> >>> >
>> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>> >> cause of
>> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>> >> them
>> >> >>> from
>> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
>> is
>> >> only
>> >> >>> >> >>> > available for static linking at the moment AFAIK).
>> >> >>> >> >>> >
>> >> >>> >> >>> > I question whether it's worth the community's time long
>> term
>> >> to
>> >> >>> wear
>> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
>> in
>> >> each
>> >> >>> >> >>> > library to plug components together rather than utilizing
>> >> common
>> >> >>> >> >>> > platform APIs.
>> >> >>> >> >>> >
>> >> >>> >> >>> > - Wes
>> >> >>> >> >>> >
>> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> >>> >> joshuastorck@gmail.com>
>> >> >>> >> >>> wrote:
>> >> >>> >> >>> >> Your point about the constraints of the ASF release
>> >> process is
>> >> >>> >> well
>> >> >>> >> >>> >> taken and as a developer who's trying to work in the
>> current
>> >> >>> >> >>> environment I
>> >> >>> >> >>> >> would be much happier if the codebases were merged. The
>> main
>> >> >>> issues
>> >> >>> >> I
>> >> >>> >> >>> worry
>> >> >>> >> >>> >> about when you put codebases like these together are:
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
>> >> becomes
>> >> >>> too
>> >> >>> >> >>> coupled
>> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>> >> tree are
>> >> >>> >> >>> delayed
>> >> >>> >> >>> >> by artifacts higher in the dependency tree
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> If the project/release management is structured well and
>> >> someone
>> >> >>> >> keeps
>> >> >>> >> >>> an
>> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> great
>> >> >>> >> example of
>> >> >>> >> >>> how
>> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> >> >>> codebase.
>> >> >>> >> That
>> >> >>> >> >>> >> gives me hope that the projects could be managed
>> separately
>> >> some
>> >> >>> >> day.
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> >> >>> wesmckinn@gmail.com>
>> >> >>> >> >>> wrote:
>> >> >>> >> >>> >>
>> >> >>> >> >>> >>> hi Josh,
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> >> arrow
>> >> >>> and
>> >> >>> >> >>> tying
>> >> >>> >> >>> >>> them together seems like the wrong choice.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>> >> people
>> >> >>> >> >>> >>> building these projects -- my argument (which I think you
>> >> agree
>> >> >>> >> with?)
>> >> >>> >> >>> >>> is that we should work more closely together until the
>> >> community
>> >> >>> >> grows
>> >> >>> >> >>> >>> large enough to support larger-scope process than we have
>> >> now.
>> >> >>> As
>> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
>> these
>> >> >>> >> projects.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
>> own
>> >> >>> >> codebase.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
>> into
>> >> >>> >> >>> >>> consideration the constraints imposed by the combination
>> of
>> >> the
>> >> >>> >> GitHub
>> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>> >> >>> idealistic,
>> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
>> devise
>> >> a
>> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>> >> per
>> >> >>> day
>> >> >>> >> >>> >>> which may touch both code and build system simultaneously
>> >> >>> without
>> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
>> see
>> >> how
>> >> >>> we
>> >> >>> >> can
>> >> >>> >> >>> >>> move forward.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> codebases
>> >> >>> >> in the
>> >> >>> >> >>> >>> short term with the express purpose of separating them in
>> >> the
>> >> >>> near
>> >> >>> >> >>> term.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
>> to
>> >> be
>> >> >>> >> >>> >>> practical and result in net improvements in productivity
>> and
>> >> >>> >> community
>> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
>> the
>> >> >>> >> current
>> >> >>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
>> consider
>> >> >>> >> >>> >>> development process and ASF releases separately. My
>> >> argument is
>> >> >>> as
>> >> >>> >> >>> >>> follows:
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >>> >> >>> >>> * Releases structured according to the desires of the
>> PMCs
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> - Wes
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >> >>> >> joshuastorck@gmail.com
>> >> >>> >> >>> >
>> >> >>> >> >>> >>> wrote:
>> >> >>> >> >>> >>> > I recently worked on an issue that had to be
>> implemented
>> >> in
>> >> >>> >> >>> parquet-cpp
>> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >> >>> >> (ARROW-2585,
>> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
>> confusing
>> >> and
>> >> >>> >> hard to
>> >> >>> >> >>> work
>> >> >>> >> >>> >>> > with. For example, I still have a PR open in
>> parquet-cpp
>> >> >>> >> (created on
>> >> >>> >> >>> May
>> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
>> was
>> >> >>> >> recently
>> >> >>> >> >>> >>> merged.
>> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>> >> the
>> >> >>> >> change in
>> >> >>> >> >>> >>> arrow
>> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >> >>> >> >>> >>> run_clang_format.py
>> >> >>> >> >>> >>> > script in the arrow project only to find out later that
>> >> there
>> >> >>> >> was an
>> >> >>> >> >>> >>> exact
>> >> >>> >> >>> >>> > copy of it in parquet-cpp.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
>> sense
>> >> in
>> >> >>> the
>> >> >>> >> long
>> >> >>> >> >>> >>> term.
>> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> >> arrow
>> >> >>> and
>> >> >>> >> >>> tying
>> >> >>> >> >>> >>> them
>> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
>> other
>> >> >>> formats
>> >> >>> >> >>> that
>> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
>> (e.g. -
>> >> >>> Orc),
>> >> >>> >> so I
>> >> >>> >> >>> >>> don't
>> >> >>> >> >>> >>> > see why parquet should be special. I also think build
>> >> tooling
>> >> >>> >> should
>> >> >>> >> >>> be
>> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
>> history
>> >> of
>> >> >>> >> >>> developing
>> >> >>> >> >>> >>> open
>> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
>> CI
>> >> is a
>> >> >>> >> good
>> >> >>> >> >>> >>> > counter-example since there have been lots of
>> successful
>> >> open
>> >> >>> >> source
>> >> >>> >> >>> >>> > projects that have used nightly build systems that
>> pinned
>> >> >>> >> versions of
>> >> >>> >> >>> >>> > dependent software.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> codebases
>> >> >>> >> in the
>> >> >>> >> >>> >>> short
>> >> >>> >> >>> >>> > term with the express purpose of separating them in the
>> >> near
>> >> >>> >> term.
>> >> >>> >> >>> My
>> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>> >> together,
>> >> >>> you
>> >> >>> >> can
>> >> >>> >> >>> more
>> >> >>> >> >>> >>> > easily delineate the boundaries between the API's with
>> a
>> >> >>> single
>> >> >>> >> PR.
>> >> >>> >> >>> >>> Second,
>> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
>> >> >>> diverge,
>> >> >>> >> >>> which has
>> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>> >> been
>> >> >>> >> sorted
>> >> >>> >> >>> out,
>> >> >>> >> >>> >>> it
>> >> >>> >> >>> >>> > should be easy to separate them back into their own
>> >> codebases.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> >> >>> codebases
>> >> >>> >> for
>> >> >>> >> >>> arrow
>> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
>> the
>> >> >>> >> >>> perspective of
>> >> >>> >> >>> >>> a
>> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
>> is a
>> >> >>> large
>> >> >>> >> tax
>> >> >>> >> >>> to
>> >> >>> >> >>> >>> pay
>> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
>> >> in the
>> >> >>> >> 0.10.0
>> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
>> >> release. I
>> >> >>> >> hope
>> >> >>> >> >>> that
>> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
>> >> help
>> >> >>> >> reduce
>> >> >>> >> >>> the
>> >> >>> >> >>> >>> > complexity of the build/release tooling.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >> >>> >> ted.dunning@gmail.com>
>> >> >>> >> >>> >>> wrote:
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >> >>> >> wesmckinn@gmail.com>
>> >> >>> >> >>> >>> wrote:
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> >
>> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
>> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
>> for
>> >> >>> >> stability
>> >> >>> >> >>> and
>> >> >>> >> >>> >>> API
>> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
>> >> HDFS
>> >> >>> >> >>> community
>> >> >>> >> >>> >>> took
>> >> >>> >> >>> >>> >> a
>> >> >>> >> >>> >>> >> > > significantly long time for the very same reason.
>> >> >>> >> >>> >>> >> >
>> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
>> >> source
>> >> >>> >> >>> community as
>> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
>> >> didn't
>> >> >>> go
>> >> >>> >> the
>> >> >>> >> >>> way
>> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> >> >>> community
>> >> >>> >> which
>> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
>> >> model.
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> There are some more radical and community building
>> >> options as
>> >> >>> >> well.
>> >> >>> >> >>> Take
>> >> >>> >> >>> >>> >> the subversion project as a precedent. With
>> subversion,
>> >> any
>> >> >>> >> Apache
>> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
>> >> large
>> >> >>> >> >>> fraction of
>> >> >>> >> >>> >>> >> subversion.
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> So why not take this a bit further and give every
>> parquet
>> >> >>> >> committer
>> >> >>> >> >>> a
>> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> >> >>> >> committers in
>> >> >>> >> >>> >>> Arrow?
>> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
>> >> committer who
>> >> >>> >> asks
>> >> >>> >> >>> will
>> >> >>> >> >>> >>> be
>> >> >>> >> >>> >>> >> given committer status in Arrow.
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
>> Parquet
>> >> >>> >> committers
>> >> >>> >> >>> >>> can't be
>> >> >>> >> >>> >>> >> worried at that point whether their patches will get
>> >> merged;
>> >> >>> >> they
>> >> >>> >> >>> can
>> >> >>> >> >>> >>> just
>> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
>> >> in the
>> >> >>> >> >>> Parquet
>> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> >> >>> parquet so
>> >> >>> >> >>> why not
>> >> >>> >> >>> >>> >> invite them in?
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>>
>> >> >>> >> >>>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> --
>> >> >>> >> >> regards,
>> >> >>> >> >> Deepak Majeti
>> >> >>> >>
>> >> >>> >
>> >> >>> >
>> >> >>> > --
>> >> >>> > regards,
>> >> >>> > Deepak Majeti
>> >> >>>
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
Thanks Ryan, will do. The people I'd still like to hear from are:

* Phillip Cloud
* Uwe Korn

As ASF contributors we are responsible both for being pragmatic and for
acting in the best interests of the community's health and productivity.



On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> I don't have an opinion here, but could someone send a summary of what is
> decided to the dev list once there is consensus? This is a long thread for
> parts of the project I don't work on, so I haven't followed it very closely.
>
> On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:
>
>> > It will be difficult to track parquet-cpp changes if they get mixed with
>> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
>> Can we enforce that parquet-cpp changes will not be committed without a
>> corresponding Parquet JIRA?
>>
>> I think we would use the following policy:
>>
>> * use PARQUET-XXX for issues relating to Parquet core
>> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
>> core (e.g. changes that are in parquet/arrow right now)
>>
>> We've already been dealing with annoyances relating to issues
>> straddling the two projects (debugging an issue on Arrow side to find
>> that it has to be fixed on Parquet side); this would make things
>> simpler for us
>>
>> > I would also like to keep changes to parquet-cpp on a separate commit to
>> simplify forking later (if needed) and be able to maintain the commit
>> history.  I don't know if it's possible to squash parquet-cpp commits and
>> arrow commits separately before merging.
>>
>> This seems rather onerous for both contributors and maintainers and
>> not in line with the goal of improving productivity. In the event that
>> we fork I see it as a traumatic event for the community. If it does
>> happen, then we can write a script (using git filter-branch and other
>> such tools) to extract commits related to the forked code.
>>
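[Editor's note: as an illustration of the filter-branch approach mentioned above, here is a hedged, self-contained sketch run against a throwaway repository; the `src/parquet` path and commit messages are invented, not the projects' actual layout.]

```shell
# Hypothetical demo: build a tiny monorepo, then use git filter-branch
# to carve out a branch containing only the history of one subdirectory.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "arrow: platform work"
mkdir -p src/parquet
echo "int parquet_stub;" > src/parquet/core.cc
git add src/parquet/core.cc
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "parquet: add core"
# Rewrite a dedicated branch so only commits touching src/parquet remain,
# with their original authors and messages preserved.
git branch parquet-only
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --subdirectory-filter src/parquet parquet-only >/dev/null
git log --format=%s parquet-only
```

A real extraction would run the same filter over the actual repository (or use `git subtree split`), so forking later loses no history.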
>> - Wes
>>
>> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> > I have a few more logistical questions to add.
>> >
>> > It will be difficult to track parquet-cpp changes if they get mixed with
>> > Arrow changes. Will we establish some guidelines for filing Parquet
>> JIRAs?
>> > Can we enforce that parquet-cpp changes will not be committed without a
>> > corresponding Parquet JIRA?
>> >
>> > I would also like to keep changes to parquet-cpp on a separate commit to
>> > simplify forking later (if needed) and be able to maintain the commit
>> > history.  I don't know if it's possible to squash parquet-cpp commits and
>> > arrow commits separately before merging.
>> >
>> >
>> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> Do other people have opinions? I would like to undertake this work in
>> >> the near future (the next 8-10 weeks); I would be OK with taking
>> >> responsibility for the primary codebase surgery.
>> >>
>> >> Some logistical questions:
>> >>
>> >> * We have a handful of pull requests in flight in parquet-cpp that
>> >> would need to be resolved / merged
>> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
>> >> releases cut out of the new structure
>> >> * Management of shared commit rights (I can discuss with the Arrow
>> >> PMC; I believe that approving any committer who has actively
>> >> maintained parquet-cpp should be a reasonable approach per Ted's
>> >> comments)
>> >>
>> >> If working more closely together proves to not be working out after
>> >> some period of time, I will be fully supportive of a fork or something
>> >> like it
>> >>
>> >> Thanks,
>> >> Wes
>> >>
>> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >> > Thanks Tim.
>> >> >
>> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
>> >> > platform code intending to improve the performance of bit-packing in
>> >> > Parquet writes, and we resulted with 2 interdependent PRs
>> >> >
>> >> > * https://github.com/apache/parquet-cpp/pull/483
>> >> > * https://github.com/apache/arrow/pull/2355
>> >> >
>> >> > Changes that impact the Python interface to Parquet are even more
>> >> complex.
>> >> >
>> >> > Adding options to Arrow's CMake build system to only build
>> >> > Parquet-related code and dependencies (in a monorepo framework) would
>> >> > not be difficult, and amount to writing "make parquet".
>> >> >
>> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
>> to
>> >> > build and install the Parquet core libraries and their dependencies
>> >> > would be:
>> >> >
>> >> > ninja parquet && ninja install
>> >> >
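[Editor's note: for illustration, a hedged sketch of how such an aggregate target could be declared in a monorepo CMakeLists.txt; the library and source file names are hypothetical.]

```cmake
# Hypothetical monorepo layout: the platform and Parquet libraries are
# ordinary targets in a single build, so dependencies are plain CMake edges.
add_library(arrow_platform src/arrow/io/file.cc src/arrow/memory_pool.cc)
add_library(parquet_core src/parquet/column_reader.cc
                         src/parquet/column_writer.cc)
target_link_libraries(parquet_core PUBLIC arrow_platform)

# Aggregate target: `ninja parquet` builds Parquet core and its
# dependencies without touching Python bindings or optional components.
add_custom_target(parquet DEPENDS parquet_core)
```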
>> >> > - Wes
>> >> >
>> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>> >> > <ta...@cloudera.com.invalid> wrote:
>> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
>> >> >> successful, but I thought I'd give my two cents.
>> >> >>
>> >> >> For me, the thing that makes the biggest difference in contributing
>> to a
>> >> >> new codebase is the number of steps in the workflow for writing,
>> >> testing,
>> >> >> posting and iterating on a commit and also the number of
>> opportunities
>> >> for
>> >> >> missteps. The size of the repo and build/test times matter but are
>> >> >> secondary so long as the workflow is simple and reliable.
>> >> >>
>> >> >> I don't really know what the current state of things is, but it
>> sounds
>> >> like
>> >> >> it's not as simple as check out -> build -> test if you're doing a
>> >> >> cross-repo change. Circular dependencies are a real headache.
>> >> >>
>> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> hi,
>> >> >>>
>> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
>> >> majeti.deepak@gmail.com>
>> >> >>> wrote:
>> >> >>> > I think the circular dependency can be broken if we build a new
>> >> library
>> >> >>> for
>> >> >>> > the platform code. This will also make it easy for other projects
>> >> such as
>> >> >>> > ORC to use it.
>> >> >>> > I also remember your proposal a while ago of having a separate
>> >> project
>> >> >>> for
>> >> >>> > the platform code.  That project can live in the arrow repo.
>> >> However, one
>> >> >>> > has to clone the entire apache arrow repo but can just build the
>> >> platform
>> >> >>> > code. This will be temporary until we can find a new home for it.
>> >> >>> >
>> >> >>> > The dependency will look like:
>> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> >> >>> > libplatform(platform api)
>> >> >>> >
>> >> >>> > CI workflow will clone the arrow project twice, once for the
>> platform
>> >> >>> > library and once for the arrow-core/bindings library.
>> >> >>>
>> >> >>> This seems like an interesting proposal; the best place to work
>> toward
>> >> >>> this goal (if it is even possible; the build system interactions and
>> >> >>> ASF release management are the hard problems) is to have all of the
>> >> >>> code in a single repository. ORC could already be using Arrow if it
>> >> >>> wanted, but the ORC contributors aren't active in Arrow.
>> >> >>>
>> >> >>> >
>> >> >>> > There is no doubt that the collaborations between the Arrow and
>> >> Parquet
>> >> >>> > communities so far have been very successful.
>> >> >>> > The reason to maintain this relationship moving forward is to
>> >> continue to
>> >> >>> > reap the mutual benefits.
>> >> >>> > We should continue to take advantage of sharing code as well.
>> >> However, I
>> >> >>> > don't see any code sharing opportunities between arrow-core and
>> the
>> >> >>> > parquet-core. Both have different functions.
>> >> >>>
>> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
>> format
>> >> >>> is only one part of a project that has become quite large already
>> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>> >> >>>
>> >> >>> >
>> >> >>> > We are at a point where the parquet-cpp public API is pretty
>> stable.
>> >> We
>> >> >>> > already passed that difficult stage. My take at arrow and parquet
>> is
>> >> to
>> >> >>> > keep them nimble since we can.
>> >> >>>
>> >> >>> I believe that parquet-core still has significant progress ahead of it. We
>> >> >>> have done little work in asynchronous IO and concurrency which would
>> >> >>> yield both improved read and write throughput. This aligns well with
>> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
>> >> >>> believe that more development will happen on parquet-core once the
>> >> >>> development process issues are resolved by having a single codebase,
>> >> >>> single build system, and a single CI framework.
>> >> >>>
>> >> >>> I have some gripes about design decisions made early in parquet-cpp,
>> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>> >> >>> goal I think we should still be open to making significant changes
>> in
>> >> >>> the interest of long term progress.
>> >> >>>
>> >> >>> Having now worked on these projects for more than 2 and a half years
>> >> >>> and the most frequent contributor to both codebases, I'm sadly far
>> >> >>> past the "breaking point" and not willing to continue contributing
>> in
>> >> >>> a significant way to parquet-cpp if the projects remained structured
>> >> >>> as they are now. It's hampering progress and not serving the
>> >> >>> community.
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
>> >
>> >> >>> wrote:
>> >> >>> >
>> >> >>> >> > The current Arrow adaptor code for parquet should live in the
>> >> arrow
>> >> >>> >> repo. That will remove a majority of the dependency issues.
>> Joshua's
>> >> >>> work
>> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>> >> the
>> >> >>> arrow
>> >> >>> >> repo.  This will be similar to the ORC adaptor.
>> >> >>> >>
>> >> >>> >> This has been suggested before, but I don't see how it would
>> >> alleviate
>> >> >>> >> any issues because of the significant dependencies on other
>> parts of
>> >> >>> >> the Arrow codebase. What you are proposing is:
>> >> >>> >>
>> >> >>> >> - (Arrow) arrow platform
>> >> >>> >> - (Parquet) parquet core
>> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> >>> >> - (Arrow) Python bindings
>> >> >>> >>
>> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> >>> >> built before invoking the Parquet core part of the build system.
>> You
>> >> >>> >> would need to pass dependent targets across different CMake build
>> >> >>> >> systems; I don't know if it's possible (I spent some time looking
>> >> into
>> >> >>> >> it earlier this year). This is what I meant by the lack of a
>> >> "concrete
>> >> >>> >> and actionable plan". The only thing that would really work
>> would be
>> >> >>> >> for the Parquet core to be "included" in the Arrow build system
>> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
>> builds
>> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>> >> build
>> >> >>> >> system because it's only depended upon by the Python bindings.
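[A minimal sketch of the ExternalProject pattern described above, as it
might appear in a CMake file; the target names, URL, and options here are
illustrative, not the actual parquet-cpp build code:]

```cmake
# Sketch: how parquet-cpp consumes Arrow today via ExternalProject.
# Names and arguments are illustrative; the real build system differs.
include(ExternalProject)

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY "https://github.com/apache/arrow.git"
  GIT_TAG "master"
  SOURCE_SUBDIR cpp
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
             -DARROW_BUILD_TESTS=OFF)

# Arrow is opaque to the parent build: only its installed artifacts are
# visible, so incompatible API changes surface late, at compile/link time.
ExternalProject_Get_Property(arrow_ep install_dir)
add_library(arrow_shared SHARED IMPORTED)
set_target_properties(arrow_shared PROPERTIES
  IMPORTED_LOCATION ${install_dir}/lib/libarrow.so)
add_dependencies(arrow_shared arrow_ep)
```

The point of the sketch is that the parent build sees Arrow only as a
pre-installed artifact, which is why dependent CMake targets cannot be
shared across the two build systems.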
>> >> >>> >>
>> >> >>> >> And even if a solution could be devised, it would not wholly
>> resolve
>> >> >>> >> the CI workflow issues.
>> >> >>> >>
>> >> >>> >> You could make Parquet completely independent of the Arrow
>> codebase,
>> >> >>> >> but at that point there is little reason to maintain a
>> relationship
>> >> >>> >> between the projects or their communities. We have spent a great
>> >> deal
>> >> >>> >> of effort refactoring the two projects to enable as much code
>> >> sharing
>> >> >>> >> as there is now.
>> >> >>> >>
>> >> >>> >> - Wes
>> >> >>> >>
>> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
>> wesmckinn@gmail.com>
>> >> >>> wrote:
>> >> >>> >> >> If you still strongly feel that the only way forward is to
>> clone
>> >> the
>> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> Having
>> >> two
>> >> >>> >> parquet-cpp repos is in no way a better approach.
>> >> >>> >> >
>> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
>> is
>> >> to
>> >> >>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >>> >> >
>> >> >>> >> > It doesn't look like I will be able to convince you that a
>> >> monorepo is
>> >> >>> >> > a good idea; what I would ask instead is that you be willing to
>> >> give
>> >> >>> >> > it a shot, and if it turns out in the way you're describing
>> >> (which I
>> >> >>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >>> >> >
>> >> >>> >> > - Wes
>> >> >>> >> >
>> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> >> >>> majeti.deepak@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >> >> Wes,
>> >> >>> >> >>
>> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>> >> problems
>> >> >>> of a
>> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >>> >> >> Bringing in related Apache community experiences is more
>> >> meaningful
>> >> >>> >> than
>> >> >>> >> >> how mono-repos work at Google and other big organizations.
>> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
>> >> developers.
>> >> >>> >> >> You are very well aware of how difficult it has been to find
>> more
>> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already
>> has
>> >> a low
>> >> >>> >> >> contribution rate to its core components.
>> >> >>> >> >>
>> >> >>> >> >> We should ensure that new volunteers who want to
>> >> contribute
>> >> >>> >> >> bug-fixes/features spend the least amount of time
>> >> figuring
>> >> >>> out
>> >> >>> >> >> the project repo. We can never come up with an automated build
>> >> system
>> >> >>> >> that
>> >> >>> >> >> caters to every possible environment.
>> >> >>> >> >> My only concern is if the mono-repo will make it harder for
>> new
>> >> >>> >> developers
>> >> >>> >> >> to work on parquet-cpp core just due to the additional code,
>> >> build
>> >> >>> and
>> >> >>> >> test
>> >> >>> >> >> dependencies.
>> >> >>> >> >> I am not saying that the Arrow community/committers will be
>> less
>> >> >>> >> >> co-operative.
>> >> >>> >> >> I just don't think the mono-repo structure model will be
>> >> sustainable
>> >> >>> in
>> >> >>> >> an
>> >> >>> >> >> open source community unless there are long-term vested
>> >> interests. We
>> >> >>> >> can't
>> >> >>> >> >> predict that.
>> >> >>> >> >>
>> >> >>> >> >> The current circular dependency problem between Arrow and
>> >> >>> >> >> Parquet is a major one for the community, and solving it is
>> >> >>> >> >> important.
>> >> >>> >> >>
>> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
>> >> arrow
>> >> >>> >> repo.
>> >> >>> >> >> That will remove a majority of the dependency issues.
>> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if
>> that
>> >> >>> adapter
>> >> >>> >> >> was in the arrow repo.  This will be similar to the ORC
>> adaptor.
>> >> >>> >> >>
>> >> >>> >> >> The platform API code is pretty stable at this point. Minor
>> >> changes
>> >> >>> in
>> >> >>> >> the
>> >> >>> >> >> future to this code should not be the main reason to combine
>> >> >>> >> >> the Arrow and Parquet repos.
>> >> >>> >> >>
>> >> >>> >> >> "I question whether it's worth the community's time long term
>> >> >>> >> >> to wear ourselves out defining custom 'ports' / virtual
>> >> >>> >> >> interfaces in each library to plug components together rather
>> >> >>> >> >> than utilizing common platform APIs."
>> >> >>> >> >>
>> >> >>> >> >> My answer to your question below would be "Yes".
>> >> >>> Modularity/separation
>> >> >>> >> is
>> >> >>> >> >> very important in an open source community where priorities of
>> >> >>> >> contributors
>> >> >>> >> >> are often short term.
>> >> >>> >> >> The retention is low and therefore the acquisition costs
>> should
>> >> be
>> >> >>> low
>> >> >>> >> as
>> >> >>> >> >> well. This is the community over code approach, in my view.
>> >> Minor
>> >> >>> >> code
>> >> >>> >> >> duplication is not a deal breaker.
>> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
>> big
>> >> >>> data
>> >> >>> >> >> space serving their own functions.
>> >> >>> >> >>
>> >> >>> >> >> If you still strongly feel that the only way forward is to
>> clone
>> >> the
>> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> >> Having
>> >> >>> two
>> >> >>> >> >> parquet-cpp repos is in no way a better approach.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >> >>
>> >> >>> >> >>> @Antoine
>> >> >>> >> >>>
>> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
>> would
>> >> >>> slightly
>> >> >>> >> >>> increase Arrow CI times (which are already too large).
>> >> >>> >> >>>
>> >> >>> A typical CI run in Arrow takes about 45 minutes:
>> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >>> >> >>>
>> >> >>> A Parquet run takes about 28 minutes:
>> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >>> >> >>>
>> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
>> >> certain
>> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >>> >> >>>
>> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>> >> could be
>> >> >>> >> >>> made substantially shorter by moving some of the slower parts
>> >> (like
>> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>> >> nightly
>> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>> >> also
>> >> >>> improve build times (the valgrind build could be moved to a
>> nightly
>> >> >>> exhaustive test run).
>> >> >>> >> >>>
>> >> >>> >> >>> - Wes
>> >> >>> >> >>>
>> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>> >> wesmckinn@gmail.com
>> >> >>> >
>> >> >>> >> >>> wrote:
>> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> great
>> >> >>> >> example of
>> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> >> >>> codebase.
>> >> >>> >> That
>> >> >>> >> >>> gives me hope that the projects could be managed separately
>> some
>> >> >>> day.
>> >> >>> >> >>> >
>> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
>> C++
>> >> >>> codebase
>> >> >>> >> >>> > features several areas of duplicated logic which could be
>> >> >>> replaced by
>> >> >>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >>> >> >>> > interoperability:
>> >> >>> >> >>> >
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >>> >> >>> >
>> >> >>> >>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >>> >> >>> >
>> >> >>> >>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >>> >> >>> >
>> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >>> >> >>> >
>> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>> >> cause of
>> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>> >> them
>> >> >>> from
>> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC
>> is
>> >> only
>> >> >>> >> >>> > available for static linking at the moment AFAIK).
>> >> >>> >> >>> >
>> >> >>> >> >>> > I question whether it's worth the community's time long
>> term
>> >> to
>> >> >>> wear
>> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces
>> in
>> >> each
>> >> >>> >> >>> > library to plug components together rather than utilizing
>> >> common
>> >> >>> >> >>> > platform APIs.
>> >> >>> >> >>> >
>> >> >>> >> >>> > - Wes
>> >> >>> >> >>> >
>> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> >>> >> joshuastorck@gmail.com>
>> >> >>> >> >>> wrote:
>> >> >>> >> Your point about the constraints of the ASF release process is
>> >> >>> >> well
>> >> >>> >> >>> >> taken and as a developer who's trying to work in the
>> current
>> >> >>> >> >>> environment I
>> >> >>> >> >>> >> would be much happier if the codebases were merged. The
>> main
>> >> >>> issues
>> >> >>> >> I
>> >> >>> >> >>> worry
>> >> >>> >> >>> >> about when you put codebases like these together are:
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
>> >> becomes
>> >> >>> too
>> >> >>> >> >>> coupled
>> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>> >> tree are
>> >> >>> >> >>> delayed
>> >> >>> >> >>> >> by artifacts higher in the dependency tree
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> If the project/release management is structured well and
>> >> someone
>> >> >>> >> keeps
>> >> >>> >> >>> an
>> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
>> great
>> >> >>> >> example of
>> >> >>> >> >>> how
>> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> >> >>> codebase.
>> >> >>> >> That
>> >> >>> >> >>> >> gives me hope that the projects could be managed
>> separately
>> >> some
>> >> >>> >> day.
>> >> >>> >> >>> >>
>> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> >> >>> wesmckinn@gmail.com>
>> >> >>> >> >>> wrote:
>> >> >>> >> >>> >>
>> >> >>> >> >>> >>> hi Josh,
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> >> arrow
>> >> >>> and
>> >> >>> >> >>> tying
>> >> >>> >> >>> >>> them together seems like the wrong choice.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>> >> people
>> >> >>> >> >>> >>> building these projects -- my argument (which I think you
>> >> agree
>> >> >>> >> with?)
>> >> >>> >> >>> >>> is that we should work more closely together until the
>> >> community
>> >> >>> >> grows
>> >> >>> >> >>> >>> large enough to support larger-scope process than we have
>> >> now.
>> >> >>> As
>> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
>> these
>> >> >>> >> projects.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
>> own
>> >> >>> >> codebase.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
>> into
>> >> >>> >> >>> >>> consideration the constraints imposed by the combination
>> of
>> >> the
>> >> >>> >> GitHub
>> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>> >> >>> idealistic,
>> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
>> devise
>> >> a
>> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>> >> per
>> >> >>> day
>> >> >>> >> >>> >>> which may touch both code and build system simultaneously
>> >> >>> without
>> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
>> see
>> >> how
>> >> >>> we
>> >> >>> >> can
>> >> >>> >> >>> >>> move forward.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> codebases
>> >> >>> >> in the
>> >> >>> >> >>> >>> short term with the express purpose of separating them in
>> >> the
>> >> >>> near
>> >> >>> >> >>> term.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
>> to
>> >> be
>> >> >>> >> >>> >>> practical and result in net improvements in productivity
>> and
>> >> >>> >> community
>> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
>> the
>> >> >>> >> current
>> >> >>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
>> consider
>> >> >>> >> >>> >>> development process and ASF releases separately. My
>> >> argument is
>> >> >>> as
>> >> >>> >> >>> >>> follows:
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >>> >> >>> >>> * Releases structured according to the desires of the
>> PMCs
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> - Wes
>> >> >>> >> >>> >>>
>> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >> >>> >> joshuastorck@gmail.com
>> >> >>> >> >>> >
>> >> >>> >> >>> >>> wrote:
>> >> >>> >> >>> >>> > I recently worked on an issue that had to be
>> implemented
>> >> in
>> >> >>> >> >>> parquet-cpp
>> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >> >>> >> (ARROW-2585,
>> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
>> confusing
>> >> and
>> >> >>> >> hard to
>> >> >>> >> >>> work
>> >> >>> >> >>> >>> > with. For example, I still have a PR open in
>> parquet-cpp
>> >> >>> >> (created on
>> >> >>> >> >>> May
>> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
>> was
>> >> >>> >> recently
>> >> >>> >> >>> >>> merged.
>> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>> >> the
>> >> >>> >> change in
>> >> >>> >> >>> >>> arrow
>> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >> >>> >> >>> >>> run_clang_format.py
>> >> >>> >> >>> >>> > script in the arrow project only to find out later that
>> >> there
>> >> >>> >> was an
>> >> >>> >> >>> >>> exact
>> >> >>> >> >>> >>> > copy of it in parquet-cpp.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
>> sense
>> >> in
>> >> >>> the
>> >> >>> >> long
>> >> >>> >> >>> >>> term.
>> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> >> arrow
>> >> >>> and
>> >> >>> >> >>> tying
>> >> >>> >> >>> >>> them
>> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
>> other
>> >> >>> formats
>> >> >>> >> >>> that
>> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
>> (e.g. -
>> >> >>> Orc),
>> >> >>> >> so I
>> >> >>> >> >>> >>> don't
>> >> >>> >> >>> >>> > see why parquet should be special. I also think build
>> >> tooling
>> >> >>> >> should
>> >> >>> >> >>> be
>> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
>> history
>> >> of
>> >> >>> >> >>> developing
>> >> >>> >> >>> >>> open
>> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
>> CI
>> >> is a
>> >> >>> >> good
>> >> >>> >> >>> >>> > counter-example since there have been lots of
>> successful
>> >> open
>> >> >>> >> source
>> >> >>> >> >>> >>> > projects that have used nightly build systems that
>> pinned
>> >> >>> >> versions of
>> >> >>> >> >>> >>> > dependent software.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> >> codebases
>> >> >>> >> in the
>> >> >>> >> >>> >>> short
>> >> >>> >> >>> >>> > term with the express purpose of separating them in the
>> >> near
>> >> >>> >> term.
>> >> >>> >> >>> My
>> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>> >> together,
>> >> >>> you
>> >> >>> >> can
>> >> >>> >> >>> more
>> >> >>> >> >>> >>> > easily delineate the boundaries between the API's with
>> a
>> >> >>> single
>> >> >>> >> PR.
>> >> >>> >> >>> >>> Second,
>> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
>> >> >>> diverge,
>> >> >>> >> >>> which has
>> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>> >> been
>> >> >>> >> sorted
>> >> >>> >> >>> out,
>> >> >>> >> >>> >>> it
>> >> >>> >> >>> >>> > should be easy to separate them back into their own
>> >> codebases.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> >> >>> codebases
>> >> >>> >> for
>> >> >>> >> >>> arrow
>> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
>> the
>> >> >>> >> >>> perspective of
>> >> >>> >> >>> >>> a
>> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
>> is a
>> >> >>> large
>> >> >>> >> tax
>> >> >>> >> >>> to
>> >> >>> >> >>> >>> pay
>> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
>> >> in the
>> >> >>> >> 0.10.0
>> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
>> >> release. I
>> >> >>> >> hope
>> >> >>> >> >>> that
>> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
>> >> help
>> >> >>> >> reduce
>> >> >>> >> >>> the
>> >> >>> >> >>> >>> > complexity of the build/release tooling.
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >> >>> >> ted.dunning@gmail.com>
>> >> >>> >> >>> >>> wrote:
>> >> >>> >> >>> >>> >
>> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >> >>> >> wesmckinn@gmail.com>
>> >> >>> >> >>> >>> wrote:
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> >
>> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
>> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
>> for
>> >> >>> >> stability
>> >> >>> >> >>> and
>> >> >>> >> >>> >>> API
>> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
>> >> HDFS
>> >> >>> >> >>> community
>> >> >>> >> >>> >>> took
>> >> >>> >> >>> >>> >> a
>> >> >>> >>> >> > > significant amount of time for the very same reason.
>> >> >>> >> >>> >>> >> >
>> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
>> >> source
>> >> >>> >> >>> community as
>> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
>> >> didn't
>> >> >>> go
>> >> >>> >> the
>> >> >>> >> >>> way
>> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> >> >>> community
>> >> >>> >> which
>> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
>> >> model.
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> There are some more radical and community building
>> >> options as
>> >> >>> >> well.
>> >> >>> >> >>> Take
>> >> >>> >> >>> >>> >> the subversion project as a precedent. With
>> subversion,
>> >> any
>> >> >>> >> Apache
>> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
>> >> large
>> >> >>> >> >>> fraction of
>> >> >>> >> >>> >>> >> subversion.
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> So why not take this a bit further and give every
>> parquet
>> >> >>> >> committer
>> >> >>> >> >>> a
>> >> >>> >> >> commit bit in Arrow? Or even make them first-class
>> >> >>> >> committers in
>> >> >>> >> >>> >>> Arrow?
>> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
>> >> committer who
>> >> >>> >> asks
>> >> >>> >> >>> will
>> >> >>> >> >>> >>> be
>> >> >>> >> >>> >>> >> given committer status in Arrow.
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
>> Parquet
>> >> >>> >> committers
>> >> >>> >> >>> >>> can't be
>> >> >>> >> >>> >>> >> worried at that point whether their patches will get
>> >> merged;
>> >> >>> >> they
>> >> >>> >> >>> can
>> >> >>> >> >>> >>> just
>> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
>> >> in the
>> >> >>> >> >>> Parquet
>> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> >> >>> parquet so
>> >> >>> >> >>> why not
>> >> >>> >> >>> >>> >> invite them in?
>> >> >>> >> >>> >>> >>
>> >> >>> >> >>> >>>
>> >> >>> >> >>>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> --
>> >> >>> >> >> regards,
>> >> >>> >> >> Deepak Majeti
>> >> >>> >>
>> >> >>> >
>> >> >>> >
>> >> >>> > --
>> >> >>> > regards,
>> >> >>> > Deepak Majeti
>> >> >>>
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don't have an opinion here, but could someone send a summary of what is
decided to the dev list once there is consensus? This is a long thread for
parts of the project I don't work on, so I haven't followed it very closely.

On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:

> > It will be difficult to track parquet-cpp changes if they get mixed with
> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> Can we enforce that parquet-cpp changes will not be committed without a
> corresponding Parquet JIRA?
>
> I think we would use the following policy:
>
> * use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> core (e.g. changes that are in parquet/arrow right now)
>
> We've already been dealing with annoyances relating to issues
> straddling the two projects (debugging an issue on Arrow side to find
> that it has to be fixed on Parquet side); this would make things
> simpler for us
>
> > I would also like to keep changes to parquet-cpp on a separate commit to
> simplify forking later (if needed) and be able to maintain the commit
> history. I don't know if it's possible to squash parquet-cpp commits and
> arrow commits separately before merging.
>
> This seems rather onerous for both contributors and maintainers and
> not in line with the goal of improving productivity. In the event that
> we fork I see it as a traumatic event for the community. If it does
> happen, then we can write a script (using git filter-branch and other
> such tools) to extract commits related to the forked code.
>
> - Wes
>
> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
> wrote:
> > I have a few more logistical questions to add.
> >
> > It will be difficult to track parquet-cpp changes if they get mixed with
> > Arrow changes. Will we establish some guidelines for filing Parquet
> JIRAs?
> > Can we enforce that parquet-cpp changes will not be committed without a
> > corresponding Parquet JIRA?
> >
> > I would also like to keep changes to parquet-cpp on a separate commit to
> > simplify forking later (if needed) and be able to maintain the commit
> > history. I don't know if it's possible to squash parquet-cpp commits and
> > arrow commits separately before merging.
> >
> >
> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
> >
> >> Do other people have opinions? I would like to undertake this work in
> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> responsibility for the primary codebase surgery.
> >>
> >> Some logistical questions:
> >>
> >> * We have a handful of pull requests in flight in parquet-cpp that
> >> would need to be resolved / merged
> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >> releases cut out of the new structure
> >> * Management of shared commit rights (I can discuss with the Arrow
> >> PMC; I believe that approving any committer who has actively
> >> maintained parquet-cpp should be a reasonable approach per Ted's
> >> comments)
> >>
> >> If working more closely together proves to not be working out after
> >> some period of time, I will be fully supportive of a fork or something
> >> like it
> >>
> >> Thanks,
> >> Wes
> >>
> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >> > Thanks Tim.
> >> >
> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> > platform code intending to improve the performance of bit-packing in
> >> > Parquet writes, and we ended up with 2 interdependent PRs:
> >> >
> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> > * https://github.com/apache/arrow/pull/2355
> >> >
> >> > Changes that impact the Python interface to Parquet are even more
> >> complex.
> >> >
> >> > Adding options to Arrow's CMake build system to only build
> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> > not be difficult, and would amount to writing "make parquet".
> >> >
> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
> to
> >> > build and install the Parquet core libraries and their dependencies
> >> > would be:
> >> >
> >> > ninja parquet && ninja install
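[The convenience target being described could be sketched as an umbrella
target in the monorepo's top-level CMake; the target names below are
hypothetical, not the actual build code:]

```cmake
# Sketch: an umbrella "parquet" target so Parquet-only contributors can
# build just the Parquet core and its platform dependencies.
# Target names are illustrative.
add_custom_target(parquet)
add_dependencies(parquet
  parquet_shared      # libparquet
  parquet_static
  arrow_shared)       # platform dependency built by the same CMake tree

# Then, from the build directory:
#   ninja parquet && ninja install
```

Because everything lives in one CMake tree, dependent targets resolve
normally and no ExternalProject indirection is needed.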
> >> >
> >> > - Wes
> >> >
> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> > <ta...@cloudera.com.invalid> wrote:
> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> successful, but I thought I'd give my two cents.
> >> >>
> >> >> For me, the thing that makes the biggest difference in contributing
> to a
> >> >> new codebase is the number of steps in the workflow for writing,
> >> testing,
> >> >> posting and iterating on a commit and also the number of
> opportunities
> >> for
> >> >> missteps. The size of the repo and build/test times matter but are
> >> >> secondary so long as the workflow is simple and reliable.
> >> >>
> >> >> I don't really know what the current state of things is, but it
> sounds
> >> like
> >> >> it's not as simple as check out -> build -> test if you're doing a
> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >>
> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >>
> >> >>> hi,
> >> >>>
> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> >> majeti.deepak@gmail.com>
> >> >>> wrote:
> >> >>> > I think the circular dependency can be broken if we build a new
> >> library
> >> >>> for
> >> >>> > the platform code. This will also make it easy for other projects
> >> such as
> >> >>> > ORC to use it.
> >> >>> > I also remember your proposal a while ago of having a separate
> >> project
> >> >>> for
> >> >>> > the platform code.  That project can live in the arrow repo.
> >> However, one
> >> >>> > has to clone the entire apache arrow repo but can just build the
> >> platform
> >> >>> > code. This will be temporary until we can find a new home for it.
> >> >>> >
> >> >>> > The dependency will look like:
> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >>> > libplatform(platform api)
> >> >>> >
> >> >>> > CI workflow will clone the arrow project twice, once for the
> platform
> >> >>> > library and once for the arrow-core/bindings library.
> >> >>>
> >> >>> This seems like an interesting proposal; the best place to work
> toward
> >> >>> this goal (if it is even possible; the build system interactions and
> >> >>> ASF release management are the hard problems) is to have all of the
> >> >>> code in a single repository. ORC could already be using Arrow if it
> >> >>> wanted, but the ORC contributors aren't active in Arrow.
> >> >>>
> >> >>> >
> >> >>> > There is no doubt that the collaborations between the Arrow and
> >> Parquet
> >> >>> > communities so far have been very successful.
> >> >>> > The reason to maintain this relationship moving forward is to
> >> continue to
> >> >>> > reap the mutual benefits.
> >> >>> > We should continue to take advantage of sharing code as well.
> >> However, I
> >> >>> > don't see any code sharing opportunities between arrow-core and
> the
> >> >>> > parquet-core. Both have different functions.
> >> >>>
> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
> format
> >> >>> is only one part of a project that has become quite large already
> >> >>> (
>> >>> https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >> >>>
> >> >>> >
> >> >>> > We are at a point where the parquet-cpp public API is pretty
> stable.
> >> We
> >> >>> > already passed that difficult stage. My take on arrow and parquet
> is
> >> to
> >> >>> > keep them nimble since we can.
> >> >>>
> >> >>> I believe that parquet-core still has significant progress ahead of it. We
> >> >>> have done little work in asynchronous IO and concurrency which would
> >> >>> yield both improved read and write throughput. This aligns well with
> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >>> believe that more development will happen on parquet-core once the
> >> >>> development process issues are resolved by having a single codebase,
> >> >>> single build system, and a single CI framework.
> >> >>>
> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >>> goal I think we should still be open to making significant changes
> in
> >> >>> the interest of long term progress.
> >> >>>
> >> >>> Having now worked on these projects for more than 2 and a half years
> >> >>> as the most frequent contributor to both codebases, I'm sadly far
> >> >>> past the "breaking point" and not willing to continue contributing
> in
> >> >>> a significant way to parquet-cpp if the projects remain structured
> >> >>> as they are now. It's hampering progress and not serving the
> >> >>> community.
> >> >>>
> >> >>> - Wes
> >> >>>
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
> >
> >> >>> wrote:
> >> >>> >
> >> >>> >> > The current Arrow adaptor code for parquet should live in the
> >> arrow
> >> >>> >> repo. That will remove a majority of the dependency issues.
> Joshua's
> >> >>> work
> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
> >> the
> >> >>> arrow
> >> >>> >> repo.  This will be similar to the ORC adaptor.
> >> >>> >>
> >> >>> >> This has been suggested before, but I don't see how it would
> >> alleviate
> >> >>> >> any issues because of the significant dependencies on other
> parts of
> >> >>> >> the Arrow codebase. What you are proposing is:
> >> >>> >>
> >> >>> >> - (Arrow) arrow platform
> >> >>> >> - (Parquet) parquet core
> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >>> >> - (Arrow) Python bindings
> >> >>> >>
> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >>> >> built before invoking the Parquet core part of the build system.
> You
> >> >>> >> would need to pass dependent targets across different CMake build
> >> >>> >> systems; I don't know if it's possible (I spent some time looking
> >> into
> >> >>> >> it earlier this year). This is what I meant by the lack of a
> >> "concrete
> >> >>> >> and actionable plan". The only thing that would really work
> would be
> >> >>> >> for the Parquet core to be "included" in the Arrow build system
> >> >>> >> somehow rather than using ExternalProject. Currently Parquet
> builds
> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
> >> build
> >> >>> >> system because it's only depended upon by the Python bindings.
> >> >>> >>
> >> >>> >> And even if a solution could be devised, it would not wholly
> resolve
> >> >>> >> the CI workflow issues.
> >> >>> >>
> >> >>> >> You could make Parquet completely independent of the Arrow
> codebase,
> >> >>> >> but at that point there is little reason to maintain a
> relationship
> >> >>> >> between the projects or their communities. We have spent a great
> >> deal
> >> >>> >> of effort refactoring the two projects to enable as much code
> >> sharing
> >> >>> >> as there is now.
> >> >>> >>
> >> >>> >> - Wes
> >> >>> >>
> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <
> wesmckinn@gmail.com>
> >> >>> wrote:
> >> >>> >> >> If you still strongly feel that the only way forward is to
> clone
> >> the
> >> >>> >> parquet-cpp repo and part ways, I will withdraw my concern.
> Having
> >> two
> >> >>> >> parquet-cpp repos is no way a better approach.
> >> >>> >> >
> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo
> is
> >> to
> >> >>> >> > fork. That would obviously be a bad outcome for the community.
> >> >>> >> >
> >> >>> >> > It doesn't look like I will be able to convince you that a
> >> monorepo is
> >> >>> >> > a good idea; what I would ask instead is that you be willing to
> >> give
> >> >>> >> > it a shot, and if it turns out in the way you're describing
> >> (which I
> >> >>> >> > don't think it will) then I suggest that we fork at that point.
> >> >>> >> >
> >> >>> >> > - Wes
> >> >>> >> >
> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> >> >>> majeti.deepak@gmail.com>
> >> >>> >> wrote:
> >> >>> >> >> Wes,
> >> >>> >> >>
> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based problems
> >> >>> >> >> of a non-existent Arrow-Parquet mono-repo.
> >> >>> >> >> Bringing in related Apache community experiences is more meaningful
> >> >>> >> >> than how mono-repos work at Google and other big organizations.
> >> >>> >> >> We solely depend on volunteers and cannot hire full-time developers.
> >> >>> >> >> You are very well aware of how difficult it has been to find more
> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a
> >> >>> >> >> low contribution rate to its core components.
> >> >>> >> >>
> >> >>> >> >> We should ensure that new volunteers who want to contribute
> >> >>> >> >> bug-fixes/features spend the least amount of time figuring out the
> >> >>> >> >> project repo. We can never come up with an automated build system
> >> >>> >> >> that caters to every possible environment.
> >> >>> >> >> My only concern is that the mono-repo will make it harder for new
> >> >>> >> >> developers to work on parquet-cpp core just due to the additional
> >> >>> >> >> code, build, and test dependencies.
> >> >>> >> >> I am not saying that the Arrow community/committers will be less
> >> >>> >> >> co-operative.
> >> >>> >> >> I just don't think the mono-repo structure model will be sustainable
> >> >>> >> >> in an open source community unless there are long-term vested
> >> >>> >> >> interests. We can't predict that.
> >> >>> >> >>
> >> >>> >> >> The current circular dependency problems between Arrow and Parquet
> >> >>> >> >> are a major problem for the community, and it is important to
> >> >>> >> >> address them.
> >> >>> >> >>
> >> >>> >> >> The current Arrow adaptor code for Parquet should live in the arrow
> >> >>> >> >> repo. That will remove a majority of the dependency issues.
> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
> >> >>> >> >> adapter was in the arrow repo. This would be similar to the ORC
> >> >>> >> >> adaptor.
> >> >>> >> >>
> >> >>> >> >> The platform API code is pretty stable at this point. Minor future
> >> >>> >> >> changes to this code should not be the main reason to combine the
> >> >>> >> >> arrow and parquet repos.
> >> >>> >> >>
> >> >>> >> >> "
> >> >>> >> >> *I question whether it's worth the community's time long term
> to
> >> >>> wear*
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
> >> >>> >> eachlibrary
> >> >>> >> >> to plug components together rather than utilizing
> commonplatform
> >> >>> APIs.*"
> >> >>> >> >>
> >> >>> >> >> My answer to your question below would be "Yes".
> >> >>> >> >> Modularity/separation is very important in an open source community
> >> >>> >> >> where the priorities of contributors are often short term.
> >> >>> >> >> Retention is low, and therefore the acquisition costs should be low
> >> >>> >> >> as well. To me, this is the community-over-code approach. Minor
> >> >>> >> >> code duplication is not a deal breaker.
> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
> >> >>> >> >> data space serving their own functions.
> >> >>> >> >>
> >> >>> >> >> If you still strongly feel that the only way forward is to clone the
> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
> >> >>> >> >> two parquet-cpp repos is in no way a better approach.
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
> >> wesmckinn@gmail.com>
> >> >>> >> wrote:
> >> >>> >> >>
> >> >>> >> >>> @Antoine
> >> >>> >> >>>
> >> >>> >> >>> > By the way, one concern with the monorepo approach: it
> would
> >> >>> slightly
> >> >>> >> >>> increase Arrow CI times (which are already too large).
> >> >>> >> >>>
> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >>> >> >>>
> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >>> >> >>>
> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
> >> certain
> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
> >> >>> >> >>>
> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
> >> could be
> >> >>> >> >>> made substantially shorter by moving some of the slower parts
> >> (like
> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
> >> nightly
> >> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
> >> also
> >> >>> >> >>> improve build times (valgrind build could be moved to a
> nightly
> >> >>> >> >>> exhaustive test run)
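The split described above (fast sanitizer builds on every commit, valgrind nightly) could be expressed with Travis CI's conditional builds. A hypothetical sketch only — job names and environment variables are assumptions, not Arrow's actual .travis.yml:

```yaml
# Hypothetical .travis.yml fragment: run ASAN on every commit and
# restrict the slow valgrind job to the nightly cron build.
matrix:
  include:
    - name: "C++ with ASAN (every commit)"
      env: ARROW_USE_ASAN=ON
    - name: "C++ with valgrind (nightly only)"
      if: type = cron    # Travis condition: only triggered by cron builds
      env: ARROW_TEST_MEMCHECK=ON
```

The `if: type = cron` condition is what keeps the exhaustive job out of the per-commit critical path.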
> >> >>> >> >>>
> >> >>> >> >>> - Wes
> >> >>> >> >>>
> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
> >> wesmckinn@gmail.com
> >> >>> >
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> great
> >> >>> >> example of
> >> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
> >> >>> codebase.
> >> >>> >> That
> >> >>> >> >>> gives me hope that the projects could be managed separately
> some
> >> >>> day.
> >> >>> >> >>> >
> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC
> C++
> >> >>> codebase
> >> >>> >> >>> > features several areas of duplicated logic which could be
> >> >>> replaced by
> >> >>> >> >>> > components from the Arrow platform for better platform-wide
> >> >>> >> >>> > interoperability:
> >> >>> >> >>> >
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >> >>> >> >>> >
> >> >>> >>
> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >> >>> >> >>> >
> >> >>> >>
> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >>> >> >>> >
> >> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >> >>> >> >>> >
> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause
> >> >>> >> >>> > of bugs that we had to fix in Arrow's build system to prevent
> >> >>> >> >>> > them from leaking to third-party linkers when statically linked
> >> >>> >> >>> > (ORC is only available for static linking at the moment AFAIK).
> >> >>> >> >>> >
> >> >>> >> >>> > I question whether it's worth the community's time long term to
> >> >>> >> >>> > wear ourselves out defining custom "ports" / virtual interfaces
> >> >>> >> >>> > in each library to plug components together rather than
> >> >>> >> >>> > utilizing common platform APIs.
> >> >>> >> >>> >
> >> >>> >> >>> > - Wes
> >> >>> >> >>> >
> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> >> >>> >> joshuastorck@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> Your point about the constraints of the ASF release process is
> >> >>> >> >>> >> well taken, and as a developer who's trying to work in the
> >> >>> >> >>> >> current environment I would be much happier if the codebases
> >> >>> >> >>> >> were merged. The main issues I worry about when you put
> >> >>> >> >>> >> codebases like these together are:
> >> >>> >> >>> >>
> >> >>> >> >>> >> 1. The delineation of API's become blurred and the code
> >> becomes
> >> >>> too
> >> >>> >> >>> coupled
> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
> >> tree are
> >> >>> >> >>> delayed
> >> >>> >> >>> >> by artifacts higher in the dependency tree
> >> >>> >> >>> >>
> >> >>> >> >>> >> If the project/release management is structured well and
> >> someone
> >> >>> >> keeps
> >> >>> >> >>> an
> >> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
> >> >>> >> >>> >>
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> great
> >> >>> >> example of
> >> >>> >> >>> how
> >> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
> >> >>> codebase.
> >> >>> >> That
> >> >>> >> >>> >> gives me hope that the projects could be managed
> separately
> >> some
> >> >>> >> day.
> >> >>> >> >>> >>
> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
> >> >>> wesmckinn@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >>
> >> >>> >> >>> >>> hi Josh,
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> arrow
> >> >>> and
> >> >>> >> >>> tying
> >> >>> >> >>> >>> them together seems like the wrong choice.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
> >> people
> >> >>> >> >>> >>> building these projects -- my argument (which I think you
> >> agree
> >> >>> >> with?)
> >> >>> >> >>> >>> is that we should work more closely together until the
> >> community
> >> >>> >> grows
> >> >>> >> >>> >>> large enough to support larger-scope process than we have
> >> now.
> >> >>> As
> >> >>> >> >>> >>> you've seen, our process isn't serving developers of
> these
> >> >>> >> projects.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I also think build tooling should be pulled into its
> own
> >> >>> >> codebase.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I don't see how this can possibly be practical taking
> into
> >> >>> >> >>> >>> consideration the constraints imposed by the combination
> of
> >> the
> >> >>> >> GitHub
> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being
> >> >>> idealistic,
> >> >>> >> >>> >>> but right now we need to be practical. Unless we can
> devise
> >> a
> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
> >> per
> >> >>> day
> >> >>> >> >>> >>> which may touch both code and build system simultaneously
> >> >>> without
> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't
> see
> >> how
> >> >>> we
> >> >>> >> can
> >> >>> >> >>> >>> move forward.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> codebases
> >> >>> >> in the
> >> >>> >> >>> >>> short term with the express purpose of separating them in
> >> the
> >> >>> near
> >> >>> >> >>> term.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
> to
> >> be
> >> >>> >> >>> >>> practical and result in net improvements in productivity
> and
> >> >>> >> community
> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that
> the
> >> >>> >> current
> >> >>> >> >>> >>> separation is impractical, and is causing problems.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
> consider
> >> >>> >> >>> >>> development process and ASF releases separately. My
> >> argument is
> >> >>> as
> >> >>> >> >>> >>> follows:
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >>> >> >>> >>> * Releases structured according to the desires of the
> PMCs
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> - Wes
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> >> >>> >> joshuastorck@gmail.com
> >> >>> >> >>> >
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> > I recently worked on an issue that had to be
> implemented
> >> in
> >> >>> >> >>> parquet-cpp
> >> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> >> >>> >> (ARROW-2585,
> >> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies
> confusing
> >> and
> >> >>> >> hard to
> >> >>> >> >>> work
> >> >>> >> >>> >>> > with. For example, I still have a PR open in
> parquet-cpp
> >> >>> >> (created on
> >> >>> >> >>> May
> >> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that
> was
> >> >>> >> recently
> >> >>> >> >>> >>> merged.
> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
> >> the
> >> >>> >> change in
> >> >>> >> >>> >>> arrow
> >> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
> >> >>> >> >>> >>> run_clang_format.py
> >> >>> >> >>> >>> > script in the arrow project only to find out later that
> >> there
> >> >>> >> was an
> >> >>> >> >>> >>> exact
> >> >>> >> >>> >>> > copy of it in parquet-cpp.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > However, I don't think merging the codebases makes
> sense
> >> in
> >> >>> the
> >> >>> >> long
> >> >>> >> >>> >>> term.
> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> arrow
> >> >>> and
> >> >>> >> >>> tying
> >> >>> >> >>> >>> them
> >> >>> >> >>> >>> > together seems like the wrong choice. There will be
> other
> >> >>> formats
> >> >>> >> >>> that
> >> >>> >> >>> >>> > arrow needs to support that will be kept separate
> (e.g. -
> >> >>> Orc),
> >> >>> >> so I
> >> >>> >> >>> >>> don't
> >> >>> >> >>> >>> > see why parquet should be special. I also think build
> >> tooling
> >> >>> >> should
> >> >>> >> >>> be
> >> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long
> history
> >> of
> >> >>> >> >>> developing
> >> >>> >> >>> >>> open
> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think
> CI
> >> is a
> >> >>> >> good
> >> >>> >> >>> >>> > counter-example since there have been lots of
> successful
> >> open
> >> >>> >> source
> >> >>> >> >>> >>> > projects that have used nightly build systems that
> pinned
> >> >>> >> versions of
> >> >>> >> >>> >>> > dependent software.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> codebases
> >> >>> >> in the
> >> >>> >> >>> >>> short
> >> >>> >> >>> >>> > term with the express purpose of separating them in the
> >> near
> >> >>> >> term.
> >> >>> >> >>> My
> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
> >> together,
> >> >>> you
> >> >>> >> can
> >> >>> >> >>> more
> >> >>> >> >>> >>> > easily delineate the boundaries between the API's with
> a
> >> >>> single
> >> >>> >> PR.
> >> >>> >> >>> >>> Second,
> >> >>> >> >>> >>> > it will force the build tooling to converge instead of
> >> >>> diverge,
> >> >>> >> >>> which has
> >> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
> >> been
> >> >>> >> sorted
> >> >>> >> >>> out,
> >> >>> >> >>> >>> it
> >> >>> >> >>> >>> > should be easy to separate them back into their own
> >> codebases.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >> >>> codebases
> >> >>> >> for
> >> >>> >> >>> arrow
> >> >>> >> >>> >>> > be separated from other languages. Looking at it from
> the
> >> >>> >> >>> perspective of
> >> >>> >> >>> >>> a
> >> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java
> is a
> >> >>> large
> >> >>> >> tax
> >> >>> >> >>> to
> >> >>> >> >>> >>> pay
> >> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
> >> in the
> >> >>> >> 0.10.0
> >> >>> >> >>> >>> > release of arrow, many of which were holding up the
> >> release. I
> >> >>> >> hope
> >> >>> >> >>> that
> >> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
> >> help
> >> >>> >> reduce
> >> >>> >> >>> the
> >> >>> >> >>> >>> > complexity of the build/release tooling.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> >> >>> >> ted.dunning@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> >> >>> >> wesmckinn@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches
> for
> >> >>> >> stability
> >> >>> >> >>> and
> >> >>> >> >>> >>> API
> >> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
> >> HDFS
> >> >>> >> >>> community
> >> >>> >> >>> >>> took
> >> >>> >> >>> >>> >> a
> >> >>> >> >>> >>> >> > > significantly long time for the very same reason.
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
> >> source
> >> >>> >> >>> community as
> >> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
> >> didn't
> >> >>> go
> >> >>> >> the
> >> >>> >> >>> way
> >> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
> >> >>> community
> >> >>> >> which
> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
> >> model.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> There are some more radical and community building
> >> options as
> >> >>> >> well.
> >> >>> >> >>> Take
> >> >>> >> >>> >>> >> the subversion project as a precedent. With
> subversion,
> >> any
> >> >>> >> Apache
> >> >>> >> >>> >>> >> committer can request and receive a commit bit on some
> >> large
> >> >>> >> >>> fraction of
> >> >>> >> >>> >>> >> subversion.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> So why not take this a bit further and give every
> parquet
> >> >>> >> committer
> >> >>> >> >>> a
> >> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
> >> >>> >> committers in
> >> >>> >> >>> >>> Arrow?
> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
> >> committer who
> >> >>> >> asks
> >> >>> >> >>> will
> >> >>> >> >>> >>> be
> >> >>> >> >>> >>> >> given committer status in Arrow.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here.
> Parquet
> >> >>> >> committers
> >> >>> >> >>> >>> can't be
> >> >>> >> >>> >>> >> worried at that point whether their patches will get
> >> merged;
> >> >>> >> they
> >> >>> >> >>> can
> >> >>> >> >>> >>> just
> >> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
> >> in the
> >> >>> >> >>> Parquet
> >> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
> >> >>> parquet so
> >> >>> >> >>> why not
> >> >>> >> >>> >>> >> invite them in?
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>>
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> regards,
> >> >>> >> >> Deepak Majeti
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > regards,
> >> >>> > Deepak Majeti
> >> >>>
> >>
> >
> >
> > --
> > regards,
> > Deepak Majeti
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don't have an opinion here, but could someone send a summary of what is
decided to the dev list once there is consensus? This is a long thread for
parts of the project I don't work on, so I haven't followed it very closely.

On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <we...@gmail.com> wrote:

> > It will be difficult to track parquet-cpp changes if they get mixed with
> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> Can we enforce that parquet-cpp changes will not be committed without a
> corresponding Parquet JIRA?
>
> I think we would use the following policy:
>
> * use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> core (e.g. changes that are in parquet/arrow right now)
>
> We've already been dealing with annoyances relating to issues
> straddling the two projects (debugging an issue on Arrow side to find
> that it has to be fixed on Parquet side); this would make things
> simpler for us.
>
> > I would also like to keep changes to parquet-cpp on a separate commit to
> simplify forking later (if needed) and be able to maintain the commit
> history. I don't know if it's possible to squash parquet-cpp commits and
> arrow commits separately before merging.
>
> This seems rather onerous for both contributors and maintainers and
> not in line with the goal of improving productivity. In the event that
> we fork I see it as a traumatic event for the community. If it does
> happen, then we can write a script (using git filter-branch and other
> such tools) to extract commits related to the forked code.
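The extraction described above could be sketched roughly as follows; this is a hypothetical illustration run from an existing clone, and the path cpp/src/parquet is an assumed layout, not the actual one:

```shell
# Hypothetical sketch: split one subdirectory's commit history out of
# a monorepo clone. The path cpp/src/parquet is illustrative only.

# Rewrite history so that only commits touching cpp/src/parquet
# remain, with that directory promoted to the repository root:
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --prune-empty --subdirectory-filter cpp/src/parquet HEAD

# Alternatively, 'git subtree split' produces the same result on a new
# branch without rewriting the current one:
# git subtree split --prefix=cpp/src/parquet -b parquet-only
```

Either command preserves the commit history of the extracted directory, which is what makes a later fork non-destructive.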
>
> - Wes
>
> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com>
> wrote:
> > I have a few more logistical questions to add.
> >
> > It will be difficult to track parquet-cpp changes if they get mixed with
> > Arrow changes. Will we establish some guidelines for filing Parquet
> JIRAs?
> > Can we enforce that parquet-cpp changes will not be committed without a
> > corresponding Parquet JIRA?
> >
> > I would also like to keep changes to parquet-cpp on a separate commit to
> > simplify forking later (if needed) and be able to maintain the commit
> > history. I don't know if it's possible to squash parquet-cpp commits and
> > arrow commits separately before merging.
> >
> >
> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
> >
> >> Do other people have opinions? I would like to undertake this work in
> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> responsibility for the primary codebase surgery.
> >>
> >> Some logistical questions:
> >>
> >> * We have a handful of pull requests in flight in parquet-cpp that
> >> would need to be resolved / merged
> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >> releases cut out of the new structure
> >> * Management of shared commit rights (I can discuss with the Arrow
> >> PMC; I believe that approving any committer who has actively
> >> maintained parquet-cpp should be a reasonable approach per Ted's
> >> comments)
> >>
> >> If working more closely together proves to not be working out after
> >> some period of time, I will be fully supportive of a fork or something
> >> like it
> >>
> >> Thanks,
> >> Wes
> >>
> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >> > Thanks Tim.
> >> >
> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> > platform code intending to improve the performance of bit-packing in
> >> > Parquet writes, and we resulted with 2 interdependent PRs
> >> >
> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> > * https://github.com/apache/arrow/pull/2355
> >> >
> >> > Changes that impact the Python interface to Parquet are even more
> >> complex.
> >> >
> >> > Adding options to Arrow's CMake build system to only build
> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> > not be difficult, and amount to writing "make parquet".
> >> >
> >> > See e.g. https://stackoverflow.com/a/17201375. The desired commands
> to
> >> > build and install the Parquet core libraries and their dependencies
> >> > would be:
> >> >
> >> > ninja parquet && ninja install
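In a monorepo, the target grouping behind `ninja parquet` could look roughly like the following CMake sketch. The option, target, and file names here are illustrative assumptions, not Arrow's actual build files:

```cmake
# Illustrative sketch only. The idea: make Parquet core a first-class
# target of the single build system, so `ninja parquet` builds just
# libparquet plus the platform pieces it depends on.
option(ARROW_PARQUET "Build the Parquet libraries" ON)

add_library(arrow_platform src/arrow/io/file.cc src/arrow/memory_pool.cc)

if(ARROW_PARQUET)
  add_library(parquet src/parquet/file_reader.cc src/parquet/column_reader.cc)
  # A plain target dependency replaces the ExternalProject indirection:
  # building `parquet` automatically builds `arrow_platform` first.
  target_link_libraries(parquet PRIVATE arrow_platform)
endif()
```

With this shape, `ninja parquet && ninja install` resolves the dependency ordering that currently requires two separate CMake builds.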
> >> >
> >> > - Wes
> >> >
> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> > <ta...@cloudera.com.invalid> wrote:
> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> successful, but I thought I'd give my two cents.
> >> >>
> >> >> For me, the thing that makes the biggest difference in contributing
> to a
> >> >> new codebase is the number of steps in the workflow for writing,
> >> testing,
> >> >> posting and iterating on a commit and also the number of
> opportunities
> >> for
> >> >> missteps. The size of the repo and build/test times matter but are
> >> >> secondary so long as the workflow is simple and reliable.
> >> >>
> >> >> I don't really know what the current state of things is, but it
> sounds
> >> like
> >> >> it's not as simple as check out -> build -> test if you're doing a
> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >>
> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >>
> >> >>> hi,
> >> >>>
> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> >> majeti.deepak@gmail.com>
> >> >>> wrote:
> >> >>> > I think the circular dependency can be broken if we build a new
> >> >>> > library for the platform code. This will also make it easy for other
> >> >>> > projects such as ORC to use it.
> >> >>> > I also remember your proposal a while ago of having a separate project
> >> >>> > for the platform code. That project can live in the arrow repo.
> >> >>> > However, one would have to clone the entire apache arrow repo but
> >> >>> > could build just the platform code. This would be temporary until we
> >> >>> > can find a new home for it.
> >> >>> >
> >> >>> > The dependency will look like:
> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >>> > libplatform(platform api)
> >> >>> >
> >> >>> > CI workflow will clone the arrow project twice, once for the
> platform
> >> >>> > library and once for the arrow-core/bindings library.
> >> >>>
> >> >>> This seems like an interesting proposal; the best place to work
> toward
> >> >>> this goal (if it is even possible; the build system interactions and
> >> >>> ASF release management are the hard problems) is to have all of the
> >> >>> code in a single repository. ORC could already be using Arrow if it
> >> >>> wanted, but the ORC contributors aren't active in Arrow.
> >> >>>
> >> >>> >
> >> >>> > There is no doubt that the collaborations between the Arrow and
> >> >>> > Parquet communities so far have been very successful.
> >> >>> > The reason to maintain this relationship moving forward is to continue
> >> >>> > to reap the mutual benefits.
> >> >>> > We should continue to take advantage of sharing code as well. However,
> >> >>> > I don't see any code-sharing opportunities between arrow-core and
> >> >>> > parquet-core. Both have different functions.
> >> >>>
> >> >>> I think you mean the Arrow columnar format. The Arrow columnar
> format
> >> >>> is only one part of a project that has become quite large already
> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >> >>>
> >> >>> >
> >> >>> > We are at a point where the parquet-cpp public API is pretty stable.
> >> >>> > We already passed that difficult stage. My take on arrow and parquet
> >> >>> > is to keep them nimble since we can.
> >> >>>
> >> >>> I believe that parquet-core still has significant progress ahead of it.
> >> >>> We have done little work on asynchronous IO and concurrency, which would
> >> >>> yield both improved read and write throughput. This aligns well with
> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >>> believe that more development will happen on parquet-core once the
> >> >>> development process issues are resolved by having a single codebase,
> >> >>> single build system, and a single CI framework.
> >> >>>
> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >>> goal I think we should still be open to making significant changes
> >> >>> in the interest of long-term progress.
> >> >>>
> >> >>> Having now worked on these projects for more than 2 and a half years
> >> >>> and been the most frequent contributor to both codebases, I'm sadly
> >> >>> far past the "breaking point" and not willing to continue contributing
> >> >>> in a significant way to parquet-cpp if the projects remain structured
> >> >>> as they are now. It's hampering progress and not serving the
> >> >>> community.
> >> >>>
> >> >>> - Wes
> >> >>>
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
> >
> >> >>> wrote:
> >> >>> >
> >> >>> >> > The current Arrow adaptor code for parquet should live in the
> >> arrow
> >> >>> >> repo. That will remove a majority of the dependency issues.
> Joshua's
> >> >>> work
> >> >>> >> would not have been blocked in parquet-cpp if that adapter was in
> >> the
> >> >>> arrow
> >> >>> >> repo.  This will be similar to the ORC adaptor.
> >> >>> >>
> >> >>> >> This has been suggested before, but I don't see how it would
> >> alleviate
> >> >>> >> any issues because of the significant dependencies on other
> parts of
> >> >>> >> the Arrow codebase. What you are proposing is:
> >> >>> >>
> >> >>> >> - (Arrow) arrow platform
> >> >>> >> - (Parquet) parquet core
> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >>> >> - (Arrow) Python bindings
> >> >>> >>
> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >>> >> built before invoking the Parquet core part of the build system.
> >> >>> >> You would need to pass dependent targets across different CMake
> >> >>> >> build systems; I don't know if it's possible (I spent some time
> >> >>> >> looking into it earlier this year). This is what I meant by the
> >> >>> >> lack of a "concrete and actionable plan". The only thing that would
> >> >>> >> really work would be for the Parquet core to be "included" in the
> >> >>> >> Arrow build system somehow rather than using ExternalProject.
> >> >>> >> Currently Parquet builds Arrow using ExternalProject, and Parquet
> >> >>> >> is unknown to the Arrow build system because it's only depended
> >> >>> >> upon by the Python bindings.
> >> >>> >>
> >> >>> >> And even if a solution could be devised, it would not wholly
> >> >>> >> resolve the CI workflow issues.
> >> >>> >>
> >> >>> >> You could make Parquet completely independent of the Arrow
> >> >>> >> codebase, but at that point there is little reason to maintain a
> >> >>> >> relationship between the projects or their communities. We have
> >> >>> >> spent a great deal of effort refactoring the two projects to enable
> >> >>> >> as much code sharing as there is now.
> >> >>> >>
> >> >>> >> - Wes
> >> >>> >>
> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmckinn@gmail.com>
> >> >>> wrote:
> >> >>> >> >> If you still strongly feel that the only way forward is to clone
> >> >>> >> the parquet-cpp repo and part ways, I will withdraw my concern.
> >> >>> >> Having two parquet-cpp repos is no way a better approach.
> >> >>> >> >
> >> >>> >> > Yes, indeed. In my view, the next best option after a monorepo is
> >> >>> >> > to fork. That would obviously be a bad outcome for the community.
> >> >>> >> >
> >> >>> >> > It doesn't look like I will be able to convince you that a
> >> >>> >> > monorepo is a good idea; what I would ask instead is that you be
> >> >>> >> > willing to give it a shot, and if it turns out in the way you're
> >> >>> >> > describing (which I don't think it will) then I suggest that we
> >> >>> >> > fork at that point.
> >> >>> >> >
> >> >>> >> > - Wes
> >> >>> >> >
> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.deepak@gmail.com>
> >> >>> >> wrote:
> >> >>> >> >> Wes,
> >> >>> >> >>
> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> >> >>> >> >> problems of a non-existent Arrow-Parquet mono-repo.
> >> >>> >> >> Bringing in related Apache community experiences is more
> >> >>> >> >> meaningful than how mono-repos work at Google and other big
> >> >>> >> >> organizations.
> >> >>> >> >> We solely depend on volunteers and cannot hire full-time
> >> >>> >> >> developers.
> >> >>> >> >> You are very well aware of how difficult it has been to find
> >> >>> >> >> more contributors and maintainers for Arrow. parquet-cpp already
> >> >>> >> >> has a low contribution rate to its core components.
> >> >>> >> >>
> >> >>> >> >> We should target to ensure that new volunteers who want to
> >> >>> >> >> contribute bug-fixes/features should spend the least amount of
> >> >>> >> >> time in figuring out the project repo. We can never come up with
> >> >>> >> >> an automated build system that caters to every possible
> >> >>> >> >> environment.
> >> >>> >> >> My only concern is if the mono-repo will make it harder for new
> >> >>> >> >> developers to work on parquet-cpp core just due to the
> >> >>> >> >> additional code, build and test dependencies.
> >> >>> >> >> I am not saying that the Arrow community/committers will be less
> >> >>> >> >> co-operative.
> >> >>> >> >> I just don't think the mono-repo structure model will be
> >> >>> >> >> sustainable in an open source community unless there are
> >> >>> >> >> long-term vested interests. We can't predict that.
> >> >>> >> >>
> >> >>> >> >> The current circular dependency problems between Arrow and
> >> >>> >> >> Parquet are a major problem for the community and it is
> >> >>> >> >> important.
> >> >>> >> >>
> >> >>> >> >> The current Arrow adaptor code for parquet should live in the
> >> >>> >> >> arrow repo. That will remove a majority of the dependency
> >> >>> >> >> issues. Joshua's work would not have been blocked in parquet-cpp
> >> >>> >> >> if that adapter was in the arrow repo.  This will be similar to
> >> >>> >> >> the ORC adaptor.
> >> >>> >> >>
> >> >>> >> >> The platform API code is pretty stable at this point. Minor
> >> >>> >> >> changes in the future to this code should not be the main reason
> >> >>> >> >> to combine the arrow and parquet repos.
> >> >>> >> >>
> >> >>> >> >> "*I question whether it's worth the community's time long term
> >> >>> >> >> to wear ourselves out defining custom "ports" / virtual
> >> >>> >> >> interfaces in each library to plug components together rather
> >> >>> >> >> than utilizing common platform APIs.*"
> >> >>> >> >>
> >> >>> >> >> My answer to your question below would be "Yes".
> >> >>> >> >> Modularity/separation is very important in an open source
> >> >>> >> >> community where priorities of contributors are often short term.
> >> >>> >> >> The retention is low and therefore the acquisition costs should
> >> >>> >> >> be low as well. This is the community over code approach
> >> >>> >> >> according to me. Minor code duplication is not a deal breaker.
> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the
> >> >>> >> >> big data space serving their own functions.
> >> >>> >> >>
> >> >>> >> >> If you still strongly feel that the only way forward is to
> >> >>> >> >> clone the parquet-cpp repo and part ways, I will withdraw my
> >> >>> >> >> concern. Having two parquet-cpp repos is no way a better
> >> >>> >> >> approach.
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmckinn@gmail.com>
> >> >>> >> wrote:
> >> >>> >> >>
> >> >>> >> >>> @Antoine
> >> >>> >> >>>
> >> >>> >> >>> > By the way, one concern with the monorepo approach: it would
> >> >>> >> >>> slightly increase Arrow CI times (which are already too large).
> >> >>> >> >>>
> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >>> >> >>>
> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >>> >> >>>
> >> >>> >> >>> Inevitably we will need to create some kind of bot to run
> >> >>> >> >>> certain builds on-demand based on commit / PR metadata or on
> >> >>> >> >>> request.
> >> >>> >> >>>
> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
> >> >>> >> >>> made substantially shorter by moving some of the slower parts
> >> >>> >> >>> (like the Python ASV benchmarks) from being tested every commit
> >> >>> >> >>> to nightly or on demand. Using ASAN instead of valgrind in
> >> >>> >> >>> Travis would also improve build times (the valgrind build could
> >> >>> >> >>> be moved to a nightly exhaustive test run)
> >> >>> >> >>>
> >> >>> >> >>> - Wes
> >> >>> >> >>>
> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmckinn@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> >>> >> >>> >> great example of how it would be possible to manage
> >> >>> >> >>> >> parquet-cpp as a separate codebase. That gives me hope that
> >> >>> >> >>> >> the projects could be managed separately some day.
> >> >>> >> >>> >
> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
> >> >>> >> >>> > codebase features several areas of duplicated logic which
> >> >>> >> >>> > could be replaced by components from the Arrow platform for
> >> >>> >> >>> > better platform-wide interoperability:
> >> >>> >> >>> >
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >> >>> >> >>> >
> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
> >> >>> >> >>> > cause of bugs that we had to fix in Arrow's build system to
> >> >>> >> >>> > prevent them from leaking to third party linkers when
> >> >>> >> >>> > statically linked (ORC is only available for static linking
> >> >>> >> >>> > at the moment AFAIK).
> >> >>> >> >>> >
> >> >>> >> >>> > I question whether it's worth the community's time long term
> >> >>> >> >>> > to wear ourselves out defining custom "ports" / virtual
> >> >>> >> >>> > interfaces in each library to plug components together
> >> >>> >> >>> > rather than utilizing common platform APIs.
> >> >>> >> >>> >
> >> >>> >> >>> > - Wes
> >> >>> >> >>> >
> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuastorck@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> Your point about the constraints of the ASF release process
> >> >>> >> >>> >> is well taken, and as a developer who's trying to work in
> >> >>> >> >>> >> the current environment I would be much happier if the
> >> >>> >> >>> >> codebases were merged. The main issues I worry about when
> >> >>> >> >>> >> you put codebases like these together are:
> >> >>> >> >>> >>
> >> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code
> >> >>> >> >>> >> becomes too coupled
> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
> >> >>> >> >>> >> tree are delayed by artifacts higher in the dependency tree
> >> >>> >> >>> >>
> >> >>> >> >>> >> If the project/release management is structured well and
> >> >>> >> >>> >> someone keeps an eye on the coupling, then I don't have any
> >> >>> >> >>> >> concerns.
> >> >>> >> >>> >>
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a
> >> >>> >> >>> >> great example of how it would be possible to manage
> >> >>> >> >>> >> parquet-cpp as a separate codebase. That gives me hope that
> >> >>> >> >>> >> the projects could be managed separately some day.
> >> >>> >> >>> >>
> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmckinn@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >>
> >> >>> >> >>> >>> hi Josh,
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >>> >> >>> >>> > arrow and tying them together seems like the wrong
> >> >>> >> >>> >>> > choice.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
> >> >>> >> >>> >>> people building these projects -- my argument (which I
> >> >>> >> >>> >>> think you agree with?) is that we should work more closely
> >> >>> >> >>> >>> together until the community grows large enough to support
> >> >>> >> >>> >>> larger-scope process than we have now. As you've seen, our
> >> >>> >> >>> >>> process isn't serving developers of these projects.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I also think build tooling should be pulled into its own
> >> >>> >> >>> >>> > codebase.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I don't see how this can possibly be practical taking into
> >> >>> >> >>> >>> consideration the constraints imposed by the combination
> >> >>> >> >>> >>> of the GitHub platform and the ASF release process. I'm
> >> >>> >> >>> >>> all for being idealistic, but right now we need to be
> >> >>> >> >>> >>> practical. Unless we can devise a practical procedure that
> >> >>> >> >>> >>> can accommodate at least 1 patch per day which may touch
> >> >>> >> >>> >>> both code and build system simultaneously without being a
> >> >>> >> >>> >>> hindrance to contributor or maintainer, I don't see how we
> >> >>> >> >>> >>> can move forward.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >>> >> >>> >>> > codebases in the short term with the express purpose of
> >> >>> >> >>> >>> > separating them in the near term.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated
> >> >>> >> >>> >>> to be practical and result in net improvements in
> >> >>> >> >>> >>> productivity and community growth. I think experience has
> >> >>> >> >>> >>> clearly demonstrated that the current separation is
> >> >>> >> >>> >>> impractical, and is causing problems.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to
> >> >>> >> >>> >>> consider development process and ASF releases separately.
> >> >>> >> >>> >>> My argument is as follows:
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >>> >> >>> >>> * Releases structured according to the desires of the PMCs
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> - Wes
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> > I recently worked on an issue that had to be implemented
> >> >>> >> >>> >>> > in parquet-cpp (ARROW-1644, ARROW-1599) but required
> >> >>> >> >>> >>> > changes in arrow (ARROW-2585, ARROW-2586). I found the
> >> >>> >> >>> >>> > circular dependencies confusing and hard to work with.
> >> >>> >> >>> >>> > For example, I still have a PR open in parquet-cpp
> >> >>> >> >>> >>> > (created on May 10) because of a PR that it depended on
> >> >>> >> >>> >>> > in arrow that was recently merged. I couldn't even
> >> >>> >> >>> >>> > address any CI issues in the PR because the change in
> >> >>> >> >>> >>> > arrow was not yet in master. In a separate PR, I changed
> >> >>> >> >>> >>> > the run_clang_format.py script in the arrow project only
> >> >>> >> >>> >>> > to find out later that there was an exact copy of it in
> >> >>> >> >>> >>> > parquet-cpp.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > However, I don't think merging the codebases makes sense
> >> >>> >> >>> >>> > in the long term. I can imagine use cases for parquet
> >> >>> >> >>> >>> > that don't involve arrow, and tying them together seems
> >> >>> >> >>> >>> > like the wrong choice. There will be other formats that
> >> >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g.
> >> >>> >> >>> >>> > Orc), so I don't see why parquet should be special. I
> >> >>> >> >>> >>> > also think build tooling should be pulled into its own
> >> >>> >> >>> >>> > codebase. GNU has had a long history of developing open
> >> >>> >> >>> >>> > source C/C++ projects that way and made projects like
> >> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI
> >> >>> >> >>> >>> > is a good counter-example since there have been lots of
> >> >>> >> >>> >>> > successful open source projects that have used nightly
> >> >>> >> >>> >>> > build systems that pinned versions of dependent software.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >>> >> >>> >>> > codebases in the short term with the express purpose of
> >> >>> >> >>> >>> > separating them in the near term. My reasoning is as
> >> >>> >> >>> >>> > follows. By putting the codebases together, you can more
> >> >>> >> >>> >>> > easily delineate the boundaries between the APIs with a
> >> >>> >> >>> >>> > single PR. Second, it will force the build tooling to
> >> >>> >> >>> >>> > converge instead of diverge, which has already happened.
> >> >>> >> >>> >>> > Once the boundaries and tooling have been sorted out, it
> >> >>> >> >>> >>> > should be easy to separate them back into their own
> >> >>> >> >>> >>> > codebases.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >> >>> >> >>> >>> > codebases for arrow be separated from other languages.
> >> >>> >> >>> >>> > Looking at it from the perspective of a parquet-cpp
> >> >>> >> >>> >>> > library user, having a dependency on Java is a large tax
> >> >>> >> >>> >>> > to pay if you don't need it. For example, there were 25
> >> >>> >> >>> >>> > JIRAs in the 0.10.0 release of arrow, many of which were
> >> >>> >> >>> >>> > holding up the release. I hope that seems like a
> >> >>> >> >>> >>> > reasonable compromise, and I think it will help reduce
> >> >>> >> >>> >>> > the complexity of the build/release tooling.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunning@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmckinn@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for
> >> >>> >> >>> >>> >> > > stability and API convergence. Our contributions to
> >> >>> >> >>> >>> >> > > Libhdfs++ in the HDFS community took a significantly
> >> >>> >> >>> >>> >> > > long time for the very same reason.
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open
> >> >>> >> >>> >>> >> > source community as leverage in this discussion. I'm
> >> >>> >> >>> >>> >> > sorry that things didn't go the way you wanted in
> >> >>> >> >>> >>> >> > Apache Hadoop but this is a distinct community which
> >> >>> >> >>> >>> >> > happens to operate under a similar open governance
> >> >>> >> >>> >>> >> > model.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> There are some more radical and community building
> >> >>> >> >>> >>> >> options as well. Take the subversion project as a
> >> >>> >> >>> >>> >> precedent. With subversion, any Apache committer can
> >> >>> >> >>> >>> >> request and receive a commit bit on some large fraction
> >> >>> >> >>> >>> >> of subversion.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> So why not take this a bit further and give every
> >> >>> >> >>> >>> >> parquet committer a commit bit in Arrow? Or even make
> >> >>> >> >>> >>> >> them be first class committers in Arrow? Possibly even
> >> >>> >> >>> >>> >> make it policy that every Parquet committer who asks
> >> >>> >> >>> >>> >> will be given committer status in Arrow.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> >> >>> >> >>> >>> >> committers can't be worried at that point whether their
> >> >>> >> >>> >>> >> patches will get merged; they can just merge them.
> >> >>> >> >>> >>> >> Arrow shouldn't worry much about inviting in the
> >> >>> >> >>> >>> >> Parquet committers. After all, Arrow already depends a
> >> >>> >> >>> >>> >> lot on parquet so why not invite them in?
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>>
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> regards,
> >> >>> >> >> Deepak Majeti
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > regards,
> >> >>> > Deepak Majeti
> >> >>>
> >>
> >
> >
> > --
> > regards,
> > Deepak Majeti
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
> It will be difficult to track parquet-cpp changes if they get mixed with Arrow changes. Will we establish some guidelines for filing Parquet JIRAs? Can we enforce that parquet-cpp changes will not be committed without a corresponding Parquet JIRA?

I think we would use the following policy:

* use PARQUET-XXX for issues relating to Parquet core
* use ARROW-XXX for issues relating to Arrow's consumption of Parquet
core (e.g. changes that are in parquet/arrow right now)

We've already been dealing with annoyances relating to issues
straddling the two projects (debugging an issue on the Arrow side only
to find that it has to be fixed on the Parquet side); this would make
things simpler for us.

> I would also like to keep changes to parquet-cpp on a separate commit to simplify forking later (if needed) and be able to maintain the commit history.  I don't know if it's possible to squash parquet-cpp commits and arrow commits separately before merging.

This seems rather onerous for both contributors and maintainers and
not in line with the goal of improving productivity. In the event that
we fork I see it as a traumatic event for the community. If it does
happen, then we can write a script (using git filter-branch and other
such tools) to extract commits related to the forked code.
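The filter-branch approach Wes mentions can be sketched end to end. The layout (`cpp/src/parquet` inside a monorepo) and the commit subjects below are hypothetical, purely for illustration of how a subdirectory's history could be split out:

```shell
#!/bin/sh
# Sketch: recover a standalone parquet-cpp history from a monorepo with
# git filter-branch. Paths and commit messages are made up for the demo;
# a real split would use the actual directory layout.
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q monorepo
cd monorepo
git config user.email "dev@example.org"
git config user.name "dev"

# Two hypothetical components living side by side in one repository.
mkdir -p cpp/src/parquet cpp/src/arrow
echo 'parquet core' > cpp/src/parquet/reader.cc
echo 'arrow core'   > cpp/src/arrow/array.cc
git add -A
git commit -qm "ARROW-1: import both codebases"

echo 'fix' >> cpp/src/parquet/reader.cc
git commit -qam "PARQUET-1: fix reader"

# Rewrite a branch so it contains only the history touching
# cpp/src/parquet; this branch could then seed a forked repository.
git branch parquet-only
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch -f --subdirectory-filter cpp/src/parquet parquet-only

# Lists the two commit subjects, newest first; the subdirectory contents
# now sit at the root of the rewritten branch.
git log --format='%s' parquet-only
```

A newer alternative would be `git subtree split --prefix=cpp/src/parquet`, which produces a similar synthetic branch without rewriting refs in place.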

- Wes

On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com> wrote:
> I have a few more logistical questions to add.
>
> It will be difficult to track parquet-cpp changes if they get mixed with
> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> Can we enforce that parquet-cpp changes will not be committed without a
> corresponding Parquet JIRA?
>
> I would also like to keep changes to parquet-cpp on a separate commit to
> simplify forking later (if needed) and be able to maintain the commit
> history.  I don't know if it's possible to squash parquet-cpp commits and
> arrow commits separately before merging.
>
>
> On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>
>> Do other people have opinions? I would like to undertake this work in
>> the near future (the next 8-10 weeks); I would be OK with taking
>> responsibility for the primary codebase surgery.
>>
>> Some logistical questions:
>>
>> * We have a handful of pull requests in flight in parquet-cpp that
>> would need to be resolved / merged
>> * We should probably cut a status-quo cpp-1.5.0 release, with future
>> releases cut out of the new structure
>> * Management of shared commit rights (I can discuss with the Arrow
>> PMC; I believe that approving any committer who has actively
>> maintained parquet-cpp should be a reasonable approach per Ted's
>> comments)
>>
>> If working more closely together proves to not be working out after
>> some period of time, I will be fully supportive of a fork or something
>> like it.
>>
>> Thanks,
>> Wes
>>
>> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
>> > Thanks Tim.
>> >
>> > Indeed, it's not very simple. Just today Antoine cleaned up some
>> > platform code intending to improve the performance of bit-packing in
>> > Parquet writes, and we resulted with 2 interdependent PRs
>> >
>> > * https://github.com/apache/parquet-cpp/pull/483
>> > * https://github.com/apache/arrow/pull/2355
>> >
>> > Changes that impact the Python interface to Parquet are even more
>> complex.
>> >
>> > Adding options to Arrow's CMake build system to only build
>> > Parquet-related code and dependencies (in a monorepo framework) would
>> > not be difficult, and amount to writing "make parquet".
>> >
>> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
>> > build and install the Parquet core libraries and their dependencies
>> > would be:
>> >
>> > ninja parquet && ninja install
>> >
>> > - Wes
>> >
>> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>> > <ta...@cloudera.com.invalid> wrote:
>> >> I don't have a direct stake in this beyond wanting to see Parquet be
>> >> successful, but I thought I'd give my two cents.
>> >>
>> >> For me, the thing that makes the biggest difference in contributing to a
>> >> new codebase is the number of steps in the workflow for writing,
>> testing,
>> >> posting and iterating on a commit and also the number of opportunities
>> for
>> >> missteps. The size of the repo and build/test times matter but are
>> >> secondary so long as the workflow is simple and reliable.
>> >>
>> >> I don't really know what the current state of things is, but it sounds
>> like
>> >> it's not as simple as check out -> build -> test if you're doing a
>> >> cross-repo change. Circular dependencies are a real headache.
>> >>
>> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >>> hi,
>> >>>
>> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.deepak@gmail.com>
>> >>> wrote:
>> >>> > I think the circular dependency can be broken if we build a new
>> >>> > library for the platform code. This will also make it easy for other
>> >>> > projects such as ORC to use it.
>> >>> > I also remember your proposal a while ago of having a separate
>> >>> > project for the platform code.  That project can live in the arrow
>> >>> > repo. However, one has to clone the entire apache arrow repo but can
>> >>> > just build the platform code. This will be temporary until we can
>> >>> > find a new home for it.
>> >>> >
>> >>> > The dependency will look like:
>> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> >>> > libplatform(platform api)
>> >>> >
>> >>> > CI workflow will clone the arrow project twice, once for the platform
>> >>> > library and once for the arrow-core/bindings library.
>> >>>
>> >>> This seems like an interesting proposal; the best place to work toward
>> >>> this goal (if it is even possible; the build system interactions and
>> >>> ASF release management are the hard problems) is to have all of the
>> >>> code in a single repository. ORC could already be using Arrow if it
>> >>> wanted, but the ORC contributors aren't active in Arrow.
>> >>>
>> >>> >
>> >>> > There is no doubt that the collaborations between the Arrow and
>> >>> > Parquet communities so far have been very successful.
>> >>> > The reason to maintain this relationship moving forward is to
>> >>> > continue to reap the mutual benefits.
>> >>> > We should continue to take advantage of sharing code as well.
>> >>> > However, I don't see any code sharing opportunities between
>> >>> > arrow-core and parquet-core. Both have different functions.
>> >>>
>> >>> I think you mean the Arrow columnar format. The Arrow columnar format
>> >>> is only one part of a project that has become quite large already
>> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>> >>>
>> >>> >
>> >>> > We are at a point where the parquet-cpp public API is pretty stable.
>> >>> > We already passed that difficult stage. My take at arrow and parquet
>> >>> > is to keep them nimble since we can.
>> >>>
>> >>> I believe that parquet-core still has progress to make ahead of it. We
>> >>> have done little work in asynchronous IO and concurrency which would
>> >>> yield both improved read and write throughput. This aligns well with
>> >>> other concurrency and async-IO work planned in the Arrow platform. I
>> >>> believe that more development will happen on parquet-core once the
>> >>> development process issues are resolved by having a single codebase,
>> >>> single build system, and a single CI framework.
>> >>>
>> >>> I have some gripes about design decisions made early in parquet-cpp,
>> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>> >>> goal I think we should still be open to making significant changes in
>> >>> the interest of long term progress.
>> >>>
>> >>> Having now worked on these projects for more than 2 and a half years
>> >>> and been the most frequent contributor to both codebases, I'm sadly
>> >>> far past the "breaking point" and not willing to continue contributing
>> >>> in a significant way to parquet-cpp if the projects remained
>> >>> structured as they are now. It's hampering progress and not serving
>> >>> the community.
>> >>>
>> >>> - Wes
>> >>>
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >> > The current Arrow adaptor code for parquet should live in the arrow
>> >>> >> repo. That will remove a majority of the dependency issues. Joshua's
>> >>> >> work would not have been blocked in parquet-cpp if that adapter was in
>> >>> >> the arrow repo.  This will be similar to the ORC adaptor.
>> >>> >>
>> >>> >> This has been suggested before, but I don't see how it would
>> >>> >> alleviate any issues because of the significant dependencies on other
>> >>> >> parts of the Arrow codebase. What you are proposing is:
>> >>> >>
>> >>> >> - (Arrow) arrow platform
>> >>> >> - (Parquet) parquet core
>> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >>> >> - (Arrow) Python bindings
>> >>> >>
>> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >>> >> built before invoking the Parquet core part of the build system. You
>> >>> >> would need to pass dependent targets across different CMake build
>> >>> >> systems; I don't know if it's possible (I spent some time looking
>> >>> >> into it earlier this year). This is what I meant by the lack of a
>> >>> >> "concrete and actionable plan". The only thing that would really work
>> >>> >> would be for the Parquet core to be "included" in the Arrow build
>> >>> >> system somehow rather than using ExternalProject. Currently Parquet
>> >>> >> builds Arrow using ExternalProject, and Parquet is unknown to the
>> >>> >> Arrow build system because it's only depended upon by the Python
>> >>> >> bindings.
>> >>> >>
>> >>> >> And even if a solution could be devised, it would not wholly resolve
>> >>> >> the CI workflow issues.
>> >>> >>
>> >>> >> You could make Parquet completely independent of the Arrow codebase,
>> >>> >> but at that point there is little reason to maintain a relationship
>> >>> >> between the projects or their communities. We have spent a great deal
>> >>> >> of effort refactoring the two projects to enable as much code sharing
>> >>> >> as there is now.
>> >>> >>
>> >>> >> - Wes
>> >>> >>
>> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >> >> If you still strongly feel that the only way forward is to clone
>> >>> >> the parquet-cpp repo and part ways, I will withdraw my concern.
>> >>> >> Having two parquet-cpp repos is no way a better approach.
>> >>> >> >
>> >>> >> > Yes, indeed. In my view, the next best option after a monorepo is
>> >>> >> > to fork. That would obviously be a bad outcome for the community.
>> >>> >> >
>> >>> >> > It doesn't look like I will be able to convince you that a monorepo
>> >>> >> > is a good idea; what I would ask instead is that you be willing to
>> >>> >> > give it a shot, and if it turns out in the way you're describing
>> >>> >> > (which I don't think it will) then I suggest that we fork at that
>> >>> >> > point.
>> >>> >> >
>> >>> >> > - Wes
>> >>> >> >
>> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> >>> majeti.deepak@gmail.com>
>> >>> >> wrote:
>> >>> >> >> Wes,
>> >>> >> >>
>> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>> problems
>> >>> of a
>> >>> >> >> non-existent Arrow-Parquet mono-repo.
>> >>> >> >> Bringing in related Apache community experiences is more
>> >>> >> >> meaningful than how mono-repos work at Google and other big
>> >>> >> >> organizations.
>> >>> >> >> We solely depend on volunteers and cannot hire full-time
>> developers.
>> >>> >> >> You are very well aware of how difficult it has been to find more
>> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has
>> a low
>> >>> >> >> contribution rate to its core components.
>> >>> >> >>
>> >>> >> >> We should aim to ensure that new volunteers who want to
>> >>> >> >> contribute bug fixes/features spend the least amount of time
>> >>> >> >> figuring out
>> >>> >> >> the project repo. We can never come up with an automated build
>> system
>> >>> >> that
>> >>> >> >> caters to every possible environment.
>> >>> >> >> My only concern is if the mono-repo will make it harder for new
>> >>> >> developers
>> >>> >> >> to work on parquet-cpp core just due to the additional code,
>> build
>> >>> and
>> >>> >> test
>> >>> >> >> dependencies.
>> >>> >> >> I am not saying that the Arrow community/committers will be less
>> >>> >> >> co-operative.
>> >>> >> >> I just don't think the mono-repo structure model will be
>> sustainable
>> >>> in
>> >>> >> an
>> >>> >> >> open source community unless there are long-term vested
>> interests. We
>> >>> >> can't
>> >>> >> >> predict that.
>> >>> >> >>
>> >>> >> >> The current circular dependency problems between Arrow and
>> >>> >> >> Parquet are a major problem for the community, and solving
>> >>> >> >> them is important.
>> >>> >> >>
>> >>> >> >> The current Arrow adaptor code for parquet should live in the
>> arrow
>> >>> >> repo.
>> >>> >> >> That will remove a majority of the dependency issues.
>> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>> >>> adapter
>> >>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>> >>> >> >>
>> >>> >> >> The platform API code is pretty stable at this point. Minor
>> changes
>> >>> in
>> >>> >> the
>> >>> >> >> future to this code should not be the main reason to combine the
>> >>> arrow
>> >>> >> >> parquet repos.
>> >>> >> >>
>> >>> >> >> "I question whether it's worth the community's time long term to
>> >>> >> >> wear ourselves out defining custom 'ports' / virtual interfaces
>> >>> >> >> in each library to plug components together rather than utilizing
>> >>> >> >> common platform APIs."
>> >>> >> >>
>> >>> >> >> My answer to your question below would be "Yes".
>> >>> Modularity/separation
>> >>> >> is
>> >>> >> >> very important in an open source community where priorities of
>> >>> >> contributors
>> >>> >> >> are often short term.
>> >>> >> >> The retention is low and therefore the acquisition costs should
>> be
>> >>> low
>> >>> >> as
>> >>> >> >> well. This is the community over code approach according to me.
>> Minor
>> >>> >> code
>> >>> >> >> duplication is not a deal breaker.
>> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>> >>> data
>> >>> >> >> space serving their own functions.
>> >>> >> >>
>> >>> >> >> If you still strongly feel that the only way forward is to clone
>> the
>> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> Having
>> >>> two
>> >>> >> >> parquet-cpp repos is no way a better approach.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>> wesmckinn@gmail.com>
>> >>> >> wrote:
>> >>> >> >>
>> >>> >> >>> @Antoine
>> >>> >> >>>
>> >>> >> >>> > By the way, one concern with the monorepo approach: it would
>> >>> slightly
>> >>> >> >>> increase Arrow CI times (which are already too large).
>> >>> >> >>>
>> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >>> >> >>>
>> >>> >> >>> A Parquet run takes about 28 minutes:
>> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >>> >> >>>
>> >>> >> >>> Inevitably we will need to create some kind of bot to run
>> certain
>> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >>> >> >>>
>> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>> could be
>> >>> >> >>> made substantially shorter by moving some of the slower parts
>> (like
>> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>> nightly
>> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>> also
>> >>> >> >>> improve build times (valgrind build could be moved to a nightly
>> >>> >> >>> exhaustive test run)
>> >>> >> >>>
>> >>> >> >>> - Wes
>> >>> >> >>>
>> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>> wesmckinn@gmail.com
>> >>> >
>> >>> >> >>> wrote:
>> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >>> >> example of
>> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> >>> codebase.
>> >>> >> That
>> >>> >> >>> gives me hope that the projects could be managed separately some
>> >>> day.
>> >>> >> >>> >
>> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>> >>> codebase
>> >>> >> >>> > features several areas of duplicated logic which could be
>> >>> replaced by
>> >>> >> >>> > components from the Arrow platform for better platform-wide
>> >>> >> >>> > interoperability:
>> >>> >> >>> >
>> >>> >> >>> >
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >>> >> >>> >
>> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>> cause of
>> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>> them
>> >>> from
>> >>> >> >>> > leaking to third party linkers when statically linked (ORC is
>> only
>> >>> >> >>> > available for static linking at the moment AFAIK).
>> >>> >> >>> >
>> >>> >> >>> > I question whether it's worth the community's time long term
>> to
>> >>> wear
>> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in
>> each
>> >>> >> >>> > library to plug components together rather than utilizing
>> common
>> >>> >> >>> > platform APIs.
>> >>> >> >>> >
>> >>> >> >>> > - Wes
>> >>> >> >>> >
>> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >>> >> joshuastorck@gmail.com>
>> >>> >> >>> wrote:
>> >>> >> >>> >> Your point about the constraints of the ASF release
>> process is
>> >>> >> well
>> >>> >> >>> >> taken and as a developer who's trying to work in the current
>> >>> >> >>> environment I
>> >>> >> >>> >> would be much happier if the codebases were merged. The main
>> >>> issues
>> >>> >> I
>> >>> >> >>> worry
>> >>> >> >>> >> about when you put codebases like these together are:
>> >>> >> >>> >>
>> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code
>> becomes
>> >>> too
>> >>> >> >>> coupled
>> >>> >> >>> >> 2. Releases of artifacts that are lower in the dependency
>> tree are
>> >>> >> >>> delayed
>> >>> >> >>> >> by artifacts higher in the dependency tree
>> >>> >> >>> >>
>> >>> >> >>> >> If the project/release management is structured well and
>> someone
>> >>> >> keeps
>> >>> >> >>> an
>> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >>> >> >>> >>
>> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >>> >> example of
>> >>> >> >>> how
>> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> >>> codebase.
>> >>> >> That
>> >>> >> >>> >> gives me hope that the projects could be managed separately
>> some
>> >>> >> day.
>> >>> >> >>> >>
>> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> >>> wesmckinn@gmail.com>
>> >>> >> >>> wrote:
>> >>> >> >>> >>
>> >>> >> >>> >>> hi Josh,
>> >>> >> >>> >>>
>> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> arrow
>> >>> and
>> >>> >> >>> tying
>> >>> >> >>> >>> them together seems like the wrong choice.
>> >>> >> >>> >>>
>> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>> people
>> >>> >> >>> >>> building these projects -- my argument (which I think you
>> agree
>> >>> >> with?)
>> >>> >> >>> >>> is that we should work more closely together until the
>> community
>> >>> >> grows
>> >>> >> >>> >>> large enough to support larger-scope process than we have
>> now.
>> >>> As
>> >>> >> >>> >>> you've seen, our process isn't serving developers of these
>> >>> >> projects.
>> >>> >> >>> >>>
>> >>> >> >>> >>> > I also think build tooling should be pulled into its own
>> >>> >> codebase.
>> >>> >> >>> >>>
>> >>> >> >>> >>> I don't see how this can possibly be practical taking into
>> >>> >> >>> >>> consideration the constraints imposed by the combination of
>> the
>> >>> >> GitHub
>> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>> >>> idealistic,
>> >>> >> >>> >>> but right now we need to be practical. Unless we can devise
>> a
>> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>> per
>> >>> day
>> >>> >> >>> >>> which may touch both code and build system simultaneously
>> >>> without
>> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see
>> how
>> >>> we
>> >>> >> can
>> >>> >> >>> >>> move forward.
>> >>> >> >>> >>>
>> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> codebases
>> >>> >> in the
>> >>> >> >>> >>> short term with the express purpose of separating them in
>> the
>> >>> near
>> >>> >> >>> term.
>> >>> >> >>> >>>
>> >>> >> >>> >>> I would agree but only if separation can be demonstrated to
>> be
>> >>> >> >>> >>> practical and result in net improvements in productivity and
>> >>> >> community
>> >>> >> >>> >>> growth. I think experience has clearly demonstrated that the
>> >>> >> current
>> >>> >> >>> >>> separation is impractical, and is causing problems.
>> >>> >> >>> >>>
>> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >>> >> >>> >>> development process and ASF releases separately. My
>> argument is
>> >>> as
>> >>> >> >>> >>> follows:
>> >>> >> >>> >>>
>> >>> >> >>> >>> * Monorepo for development (for practicality)
>> >>> >> >>> >>> * Releases structured according to the desires of the PMCs
>> >>> >> >>> >>>
>> >>> >> >>> >>> - Wes
>> >>> >> >>> >>>
>> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >>> >> joshuastorck@gmail.com
>> >>> >> >>> >
>> >>> >> >>> >>> wrote:
>> >>> >> >>> >>> > I recently worked on an issue that had to be implemented
>> in
>> >>> >> >>> parquet-cpp
>> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >>> >> (ARROW-2585,
>> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing
>> and
>> >>> >> hard to
>> >>> >> >>> work
>> >>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> >>> >> (created on
>> >>> >> >>> May
>> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> >>> >> recently
>> >>> >> >>> >>> merged.
>> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>> the
>> >>> >> change in
>> >>> >> >>> >>> arrow
>> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >>> >> >>> >>> run_clang_format.py
>> >>> >> >>> >>> > script in the arrow project only to find out later that
>> there
>> >>> >> was an
>> >>> >> >>> >>> exact
>> >>> >> >>> >>> > copy of it in parquet-cpp.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > However, I don't think merging the codebases makes sense
>> in
>> >>> the
>> >>> >> long
>> >>> >> >>> >>> term.
>> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> arrow
>> >>> and
>> >>> >> >>> tying
>> >>> >> >>> >>> them
>> >>> >> >>> >>> > together seems like the wrong choice. There will be other
>> >>> formats
>> >>> >> >>> that
>> >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
>> >>> Orc),
>> >>> >> so I
>> >>> >> >>> >>> don't
>> >>> >> >>> >>> > see why parquet should be special. I also think build
>> tooling
>> >>> >> should
>> >>> >> >>> be
>> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long history
>> of
>> >>> >> >>> developing
>> >>> >> >>> >>> open
>> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI
>> is a
>> >>> >> good
>> >>> >> >>> >>> > counter-example since there have been lots of successful
>> open
>> >>> >> source
>> >>> >> >>> >>> > projects that have used nightly build systems that pinned
>> >>> >> versions of
>> >>> >> >>> >>> > dependent software.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> codebases
>> >>> >> in the
>> >>> >> >>> >>> short
>> >>> >> >>> >>> > term with the express purpose of separating them in the
>> near
>> >>> >> term.
>> >>> >> >>> My
>> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>> together,
>> >>> you
>> >>> >> can
>> >>> >> >>> more
>> >>> >> >>> >>> > easily delineate the boundaries between the API's with a
>> >>> single
>> >>> >> PR.
>> >>> >> >>> >>> Second,
>> >>> >> >>> >>> > it will force the build tooling to converge instead of
>> >>> diverge,
>> >>> >> >>> which has
>> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>> been
>> >>> >> sorted
>> >>> >> >>> out,
>> >>> >> >>> >>> it
>> >>> >> >>> >>> > should be easy to separate them back into their own
>> codebases.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> >>> codebases
>> >>> >> for
>> >>> >> >>> arrow
>> >>> >> >>> >>> > be separated from other languages. Looking at it from the
>> >>> >> >>> perspective of
>> >>> >> >>> >>> a
>> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
>> >>> large
>> >>> >> tax
>> >>> >> >>> to
>> >>> >> >>> >>> pay
>> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
>> in the
>> >>> >> 0.10.0
>> >>> >> >>> >>> > release of arrow, many of which were holding up the
>> release. I
>> >>> >> hope
>> >>> >> >>> that
>> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
>> help
>> >>> >> reduce
>> >>> >> >>> the
>> >>> >> >>> >>> > complexity of the build/release tooling.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >>> >> ted.dunning@gmail.com>
>> >>> >> >>> >>> wrote:
>> >>> >> >>> >>> >
>> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >>> >> wesmckinn@gmail.com>
>> >>> >> >>> >>> wrote:
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> >
>> >>> >> >>> >>> >> > > The community will be less willing to accept large
>> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>> >>> >> stability
>> >>> >> >>> and
>> >>> >> >>> >>> API
>> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
>> HDFS
>> >>> >> >>> community
>> >>> >> >>> >>> took
>> >>> >> >>> >>> >> a
>> >>> >> >>> >>> >> > > significantly long time for the very same reason.
>> >>> >> >>> >>> >> >
>> >>> >> >>> >>> >> > Please don't use bad experiences from another open
>> source
>> >>> >> >>> community as
>> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
>> didn't
>> >>> go
>> >>> >> the
>> >>> >> >>> way
>> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> >>> community
>> >>> >> which
>> >>> >> >>> >>> >> > happens to operate under a similar open governance
>> model.
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> There are some more radical and community building
>> options as
>> >>> >> well.
>> >>> >> >>> Take
>> >>> >> >>> >>> >> the subversion project as a precedent. With subversion,
>> any
>> >>> >> Apache
>> >>> >> >>> >>> >> committer can request and receive a commit bit on some
>> large
>> >>> >> >>> fraction of
>> >>> >> >>> >>> >> subversion.
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> So why not take this a bit further and give every parquet
>> >>> >> committer
>> >>> >> >>> a
>> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> >>> >> committers in
>> >>> >> >>> >>> Arrow?
>> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
>> committer who
>> >>> >> asks
>> >>> >> >>> will
>> >>> >> >>> >>> be
>> >>> >> >>> >>> >> given committer status in Arrow.
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> >>> >> committers
>> >>> >> >>> >>> can't be
>> >>> >> >>> >>> >> worried at that point whether their patches will get
>> merged;
>> >>> >> they
>> >>> >> >>> can
>> >>> >> >>> >>> just
>> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
>> in the
>> >>> >> >>> Parquet
>> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> >>> parquet so
>> >>> >> >>> why not
>> >>> >> >>> >>> >> invite them in?
>> >>> >> >>> >>> >>
>> >>> >> >>> >>>
>> >>> >> >>>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> regards,
>> >>> >> >> Deepak Majeti
>> >>> >>
>> >>> >
>> >>> >
>> >>> > --
>> >>> > regards,
>> >>> > Deepak Majeti
>> >>>
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
> It will be difficult to track parquet-cpp changes if they get mixed with Arrow changes. Will we establish some guidelines for filing Parquet JIRAs? Can we enforce that parquet-cpp changes will not be committed without a corresponding Parquet JIRA?

I think we would use the following policy:

* use PARQUET-XXX for issues relating to Parquet core
* use ARROW-XXX for issues relating to Arrow's consumption of Parquet
core (e.g. changes that are in parquet/arrow right now)

We've already been dealing with annoyances relating to issues
straddling the two projects (debugging an issue on the Arrow side only
to find that it has to be fixed on the Parquet side); this would make
things simpler for us.
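
As a concrete illustration of that policy: with JIRA-key prefixes in
commit subjects, the Parquet-core history stays easy to isolate even in a
combined repository. A minimal sketch on a throwaway repo (the repo name
and all commit subjects below are hypothetical):

```shell
# Hypothetical demo: in a combined repository, commits prefixed with
# PARQUET-XXX can be filtered out of the shared history with git log.
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo && cd demo
git config user.email dev@example.org
git config user.name dev
git commit -q --allow-empty -m 'ARROW-100: platform change'
git commit -q --allow-empty -m 'PARQUET-200: parquet core change'
git commit -q --allow-empty -m 'ARROW-101: Python binding change'
# List only the Parquet-core commits.
git log --oneline --grep='^PARQUET-'
```

The same filter works for release notes or for auditing how much activity
each JIRA project sees in the shared tree.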

> I would also like to keep changes to parquet-cpp in separate commits to simplify forking later (if needed) and be able to maintain the commit history. I don't know if it's possible to squash parquet-cpp commits and arrow commits separately before merging.

This seems rather onerous for both contributors and maintainers and
not in line with the goal of improving productivity. In the event that
we fork I see it as a traumatic event for the community. If it does
happen, then we can write a script (using git filter-branch and other
such tools) to extract commits related to the forked code.
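
For concreteness, a rough sketch of the kind of extraction script this
refers to, exercised on a throwaway demo repository (the src/parquet
path, branch name, and commit subjects are assumptions for illustration,
not the actual layout):

```shell
# Sketch: extract the history of an assumed src/parquet subdirectory
# into a standalone branch, as one would do before a fork.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
cd "$work"
git init -q monorepo && cd monorepo
git config user.email dev@example.org
git config user.name dev
mkdir -p src/parquet src/arrow
echo 'reader' > src/parquet/reader.cc
git add . && git commit -qm 'PARQUET-1: add reader'
echo 'array' > src/arrow/array.cc
git add . && git commit -qm 'ARROW-1: add array'
echo 'writer' > src/parquet/writer.cc
git add . && git commit -qm 'PARQUET-2: add writer'
# Rewrite a dedicated branch so only commits touching src/parquet
# survive, with that directory promoted to the repository root.
git checkout -qb parquet-extract
git filter-branch --prune-empty --subdirectory-filter src/parquet parquet-extract
git log --oneline
```

After the rewrite, the parquet-extract branch contains only the two
PARQUET commits, with reader.cc and writer.cc at the root, ready to be
pushed to a new repository.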

- Wes

On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <ma...@gmail.com> wrote:
> I have a few more logistical questions to add.
>
> It will be difficult to track parquet-cpp changes if they get mixed with
> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> Can we enforce that parquet-cpp changes will not be committed without a
> corresponding Parquet JIRA?
>
> I would also like to keep changes to parquet-cpp in separate commits to
> simplify forking later (if needed) and be able to maintain the commit
> history. I don't know if it's possible to squash parquet-cpp commits and
> arrow commits separately before merging.
>
>
> On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:
>
>> Do other people have opinions? I would like to undertake this work in
>> the near future (the next 8-10 weeks); I would be OK with taking
>> responsibility for the primary codebase surgery.
>>
>> Some logistical questions:
>>
>> * We have a handful of pull requests in flight in parquet-cpp that
>> would need to be resolved / merged
>> * We should probably cut a status-quo cpp-1.5.0 release, with future
>> releases cut out of the new structure
>> * Management of shared commit rights (I can discuss with the Arrow
>> PMC; I believe that approving any committer who has actively
>> maintained parquet-cpp should be a reasonable approach per Ted's
>> comments)
>>
>> If working more closely together proves to not be working out after
>> some period of time, I will be fully supportive of a fork or something
>> like it
>>
>> Thanks,
>> Wes
>>
>> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
>> > Thanks Tim.
>> >
>> > Indeed, it's not very simple. Just today Antoine cleaned up some
>> > platform code intending to improve the performance of bit-packing in
>> > Parquet writes, and we ended up with 2 interdependent PRs:
>> >
>> > * https://github.com/apache/parquet-cpp/pull/483
>> > * https://github.com/apache/arrow/pull/2355
>> >
>> > Changes that impact the Python interface to Parquet are even more
>> complex.
>> >
>> > Adding options to Arrow's CMake build system to only build
>> > Parquet-related code and dependencies (in a monorepo framework) would
>> > not be difficult, and amount to writing "make parquet".
>> >
>> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
>> > build and install the Parquet core libraries and their dependencies
>> > would be:
>> >
>> > ninja parquet && ninja install
>> >
>> > - Wes
>> >
>> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
>> > <ta...@cloudera.com.invalid> wrote:
>> >> I don't have a direct stake in this beyond wanting to see Parquet be
>> >> successful, but I thought I'd give my two cents.
>> >>
>> >> For me, the thing that makes the biggest difference in contributing to a
>> >> new codebase is the number of steps in the workflow for writing,
>> testing,
>> >> posting and iterating on a commit and also the number of opportunities
>> for
>> >> missteps. The size of the repo and build/test times matter but are
>> >> secondary so long as the workflow is simple and reliable.
>> >>
>> >> I don't really know what the current state of things is, but it sounds
>> like
>> >> it's not as simple as check out -> build -> test if you're doing a
>> >> cross-repo change. Circular dependencies are a real headache.
>> >>
>> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >>> hi,
>> >>>
>> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
>> majeti.deepak@gmail.com>
>> >>> wrote:
>> >>> > I think the circular dependency can be broken if we build a new
>> library
>> >>> for
>> >>> > the platform code. This will also make it easy for other projects
>> such as
>> >>> > ORC to use it.
>> >>> > I also remember your proposal a while ago of having a separate
>> project
>> >>> for
>> >>> > the platform code. That project can live in the arrow repo.
>> >>> > However, one would have to clone the entire Apache Arrow repo even
>> >>> > when building just the platform code. This would be temporary until
>> >>> > we can find a new home for it.
>> >>> >
>> >>> > The dependency will look like:
>> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> >>> > libplatform(platform api)
>> >>> >
>> >>> > CI workflow will clone the arrow project twice, once for the platform
>> >>> > library and once for the arrow-core/bindings library.
>> >>>
>> >>> This seems like an interesting proposal; the best place to work toward
>> >>> this goal (if it is even possible; the build system interactions and
>> >>> ASF release management are the hard problems) is to have all of the
>> >>> code in a single repository. ORC could already be using Arrow if it
>> >>> wanted, but the ORC contributors aren't active in Arrow.
>> >>>
>> >>> >
>> >>> > There is no doubt that the collaborations between the Arrow and
>> Parquet
>> >>> > communities so far have been very successful.
>> >>> > The reason to maintain this relationship moving forward is to
>> continue to
>> >>> > reap the mutual benefits.
>> >>> > We should continue to take advantage of sharing code as well.
>> However, I
>> >>> > don't see any code-sharing opportunities between arrow-core and
>> >>> > parquet-core. Both have different functions.
>> >>>
>> >>> I think you mean the Arrow columnar format. The Arrow columnar format
>> >>> is only one part of a project that has become quite large already
>> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>> >>>
>> >>> >
>> >>> > We are at a point where the parquet-cpp public API is pretty stable.
>> We
>> >>> > already passed that difficult stage. My take at arrow and parquet is
>> to
>> >>> > keep them nimble since we can.
>> >>>
>> >>> I believe that parquet-core still has significant progress ahead of it. We
>> >>> have done little work in asynchronous IO and concurrency which would
>> >>> yield both improved read and write throughput. This aligns well with
>> >>> other concurrency and async-IO work planned in the Arrow platform. I
>> >>> believe that more development will happen on parquet-core once the
>> >>> development process issues are resolved by having a single codebase,
>> >>> single build system, and a single CI framework.
>> >>>
>> >>> I have some gripes about design decisions made early in parquet-cpp,
>> >>> like the use of C++ exceptions. So while "stability" is a reasonable
>> >>> goal I think we should still be open to making significant changes in
>> >>> the interest of long term progress.
>> >>>
>> >>> Having now worked on these projects for more than two and a half
>> >>> years, and having been the most frequent contributor to both
>> >>> codebases, I'm sadly far past the "breaking point" and not willing to
>> >>> continue contributing in a significant way to parquet-cpp if the
>> >>> projects remain structured as they are now. It's hampering progress
>> >>> and not serving the
>> >>> community.
>> >>>
>> >>> - Wes
>> >>>
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >> > The current Arrow adaptor code for parquet should live in the
>> arrow
>> >>> >> repo. That will remove a majority of the dependency issues. Joshua's
>> >>> work
>> >>> >> would not have been blocked in parquet-cpp if that adapter was in
>> the
>> >>> arrow
>> >>> >> repo.  This will be similar to the ORC adaptor.
>> >>> >>
>> >>> >> This has been suggested before, but I don't see how it would
>> alleviate
>> >>> >> any issues because of the significant dependencies on other parts of
>> >>> >> the Arrow codebase. What you are proposing is:
>> >>> >>
>> >>> >> - (Arrow) arrow platform
>> >>> >> - (Parquet) parquet core
>> >>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >>> >> - (Arrow) Python bindings
>> >>> >>
>> >>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >>> >> built before invoking the Parquet core part of the build system. You
>> >>> >> would need to pass dependent targets across different CMake build
>> >>> >> systems; I don't know if it's possible (I spent some time looking
>> into
>> >>> >> it earlier this year). This is what I meant by the lack of a
>> "concrete
>> >>> >> and actionable plan". The only thing that would really work would be
>> >>> >> for the Parquet core to be "included" in the Arrow build system
>> >>> >> somehow rather than using ExternalProject. Currently Parquet builds
>> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
>> build
>> >>> >> system because it's only depended upon by the Python bindings.
>> >>> >>
>> >>> >> And even if a solution could be devised, it would not wholly resolve
>> >>> >> the CI workflow issues.
>> >>> >>
>> >>> >> You could make Parquet completely independent of the Arrow codebase,
>> >>> >> but at that point there is little reason to maintain a relationship
>> >>> >> between the projects or their communities. We have spent a great
>> deal
>> >>> >> of effort refactoring the two projects to enable as much code
>> sharing
>> >>> >> as there is now.
>> >>> >>
>> >>> >> - Wes
>> >>> >>
>> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >> >> If you still strongly feel that the only way forward is to clone
>> the
>> >>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
>> two
>> >>> >> parquet-cpp repos is no way a better approach.
>> >>> >> >
>> >>> >> > Yes, indeed. In my view, the next best option after a monorepo is
>> to
>> >>> >> > fork. That would obviously be a bad outcome for the community.
>> >>> >> >
>> >>> >> > It doesn't look like I will be able to convince you that a
>> monorepo is
>> >>> >> > a good idea; what I would ask instead is that you be willing to
>> give
>> >>> >> > it a shot, and if it turns out in the way you're describing
>> (which I
>> >>> >> > don't think it will) then I suggest that we fork at that point.
>> >>> >> >
>> >>> >> > - Wes
>> >>> >> >
>> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> >>> majeti.deepak@gmail.com>
>> >>> >> wrote:
>> >>> >> >> Wes,
>> >>> >> >>
>> >>> >> >> Unfortunately, I cannot show you any practical fact-based
>> problems
>> >>> of a
>> >>> >> >> non-existent Arrow-Parquet mono-repo.
>> >>> >> >> Bringing in related Apache community experiences is more
>> >>> >> >> meaningful than how mono-repos work at Google and other big
>> >>> >> >> organizations.
>> >>> >> >> We solely depend on volunteers and cannot hire full-time
>> developers.
>> >>> >> >> You are very well aware of how difficult it has been to find more
>> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has
>> a low
>> >>> >> >> contribution rate to its core components.
>> >>> >> >>
>> >>> >> >> We should aim to ensure that new volunteers who want to
>> >>> >> >> contribute bug fixes/features spend the least amount of time
>> >>> >> >> figuring out
>> >>> >> >> the project repo. We can never come up with an automated build
>> system
>> >>> >> that
>> >>> >> >> caters to every possible environment.
>> >>> >> >> My only concern is if the mono-repo will make it harder for new
>> >>> >> developers
>> >>> >> >> to work on parquet-cpp core just due to the additional code,
>> build
>> >>> and
>> >>> >> test
>> >>> >> >> dependencies.
>> >>> >> >> I am not saying that the Arrow community/committers will be less
>> >>> >> >> co-operative.
>> >>> >> >> I just don't think the mono-repo structure model will be
>> sustainable
>> >>> in
>> >>> >> an
>> >>> >> >> open source community unless there are long-term vested
>> interests. We
>> >>> >> can't
>> >>> >> >> predict that.
>> >>> >> >>
>> >>> >> >> The current circular dependency problems between Arrow and
>> >>> >> >> Parquet are a major problem for the community, and solving
>> >>> >> >> them is important.
>> >>> >> >>
>> >>> >> >> The current Arrow adaptor code for parquet should live in the
>> arrow
>> >>> >> repo.
>> >>> >> >> That will remove a majority of the dependency issues.
>> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>> >>> adapter
>> >>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>> >>> >> >>
>> >>> >> >> The platform API code is pretty stable at this point. Minor
>> changes
>> >>> in
>> >>> >> the
>> >>> >> >> future to this code should not be the main reason to combine the
>> >>> arrow
>> >>> >> >> parquet repos.
>> >>> >> >>
>> >>> >> >> "*I question whether it's worth the community's time long term to wear
>> >>> >> >> ourselves out defining custom "ports" / virtual interfaces in each
>> >>> >> >> library to plug components together rather than utilizing common
>> >>> >> >> platform APIs.*"
>> >>> >> >>
>> >>> >> >> My answer to your question below would be "Yes".
>> >>> Modularity/separation
>> >>> >> is
>> >>> >> >> very important in an open source community where priorities of
>> >>> >> contributors
>> >>> >> >> are often short term.
>> >>> >> >> The retention is low and therefore the acquisition costs should
>> be
>> >>> low
>> >>> >> as
>> >>> >> >> well. This is the community over code approach, in my opinion.
>> Minor
>> >>> >> code
>> >>> >> >> duplication is not a deal breaker.
>> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>> >>> data
>> >>> >> >> space serving their own functions.
>> >>> >> >>
>> >>> >> >> If you still strongly feel that the only way forward is to clone
>> the
>> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
>> Having
>> >>> two
>> >>> >> >> parquet-cpp repos is in no way a better approach.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
>> wesmckinn@gmail.com>
>> >>> >> wrote:
>> >>> >> >>
>> >>> >> >>> @Antoine
>> >>> >> >>>
>> >>> >> >>> > By the way, one concern with the monorepo approach: it would
>> >>> slightly
>> >>> >> >>> increase Arrow CI times (which are already too large).
>> >>> >> >>>
>> >>> >> >>> A typical CI run in Arrow takes about 45 minutes:
>> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >>> >> >>>
>> >>> >> >>> A Parquet run takes about 28 minutes:
>> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >>> >> >>>
>> >>> >> >>> Inevitably we will need to create some kind of bot to run
>> certain
>> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >>> >> >>>
>> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
>> could be
>> >>> >> >>> made substantially shorter by moving some of the slower parts
>> (like
>> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
>> nightly
>> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
>> also
>> >>> >> >>> improve build times (valgrind build could be moved to a nightly
>> >>> >> >>> exhaustive test run)
>> >>> >> >>>
>> >>> >> >>> - Wes
>> >>> >> >>>
>> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
>> wesmckinn@gmail.com
>> >>> >
>> >>> >> >>> wrote:
>> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >>> >> example of
>> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> >>> codebase.
>> >>> >> That
>> >>> >> >>> gives me hope that the projects could be managed separately some
>> >>> day.
>> >>> >> >>> >
>> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>> >>> codebase
>> >>> >> >>> > features several areas of duplicated logic which could be
>> >>> replaced by
>> >>> >> >>> > components from the Arrow platform for better platform-wide
>> >>> >> >>> > interoperability:
>> >>> >> >>> >
>> >>> >> >>> >
>> >>> >> >>>
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >>> >> >>> >
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >>> >> >>> >
>> >>> >> >>>
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >>> >> >>> >
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >>> >> >>> >
>> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >>> >> >>> >
>> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
>> cause of
>> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
>> them
>> >>> from
>> >>> >> >>> > leaking to third party linkers when statically linked (ORC is
>> only
>> >>> >> >>> > available for static linking at the moment AFAIK).
>> >>> >> >>> >
>> >>> >> >>> > I question whether it's worth the community's time long term
>> to
>> >>> wear
>> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in
>> each
>> >>> >> >>> > library to plug components together rather than utilizing
>> common
>> >>> >> >>> > platform APIs.
>> >>> >> >>> >
>> >>> >> >>> > - Wes
>> >>> >> >>> >
>> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >>> >> joshuastorck@gmail.com>
>> >>> >> >>> wrote:
>> >>> >> >>> >> Your point about the constraints of the ASF release
>> process are
>> >>> >> well
>> >>> >> >>> >> taken and as a developer who's trying to work in the current
>> >>> >> >>> environment I
>> >>> >> >>> >> would be much happier if the codebases were merged. The main
>> >>> issues
>> >>> >> I
>> >>> >> >>> worry
>> >>> >> >>> >> about when you put codebases like these together are:
>> >>> >> >>> >>
>> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code
>> becomes
>> >>> too
>> >>> >> >>> coupled
>> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency
>> tree are
>> >>> >> >>> delayed
>> >>> >> >>> >> by artifacts higher in the dependency tree
>> >>> >> >>> >>
>> >>> >> >>> >> If the project/release management is structured well and
>> someone
>> >>> >> keeps
>> >>> >> >>> an
>> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >>> >> >>> >>
>> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >>> >> example of
>> >>> >> >>> how
>> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> >>> codebase.
>> >>> >> That
>> >>> >> >>> >> gives me hope that the projects could be managed separately
>> some
>> >>> >> day.
>> >>> >> >>> >>
>> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> >>> wesmckinn@gmail.com>
>> >>> >> >>> wrote:
>> >>> >> >>> >>
>> >>> >> >>> >>> hi Josh,
>> >>> >> >>> >>>
>> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> arrow
>> >>> and
>> >>> >> >>> tying
>> >>> >> >>> >>> them together seems like the wrong choice.
>> >>> >> >>> >>>
>> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
>> people
>> >>> >> >>> >>> building these projects -- my argument (which I think you
>> agree
>> >>> >> with?)
>> >>> >> >>> >>> is that we should work more closely together until the
>> community
>> >>> >> grows
>> >>> >> >>> >>> large enough to support larger-scope process than we have
>> now.
>> >>> As
>> >>> >> >>> >>> you've seen, our process isn't serving developers of these
>> >>> >> projects.
>> >>> >> >>> >>>
>> >>> >> >>> >>> > I also think build tooling should be pulled into its own
>> >>> >> codebase.
>> >>> >> >>> >>>
>> >>> >> >>> >>> I don't see how this can possibly be practical taking into
>> >>> >> >>> >>> consideration the constraints imposed by the combination of
>> the
>> >>> >> GitHub
>> >>> >> >>> >>> platform and the ASF release process. I'm all for being
>> >>> idealistic,
>> >>> >> >>> >>> but right now we need to be practical. Unless we can devise
>> a
>> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
>> per
>> >>> day
>> >>> >> >>> >>> which may touch both code and build system simultaneously
>> >>> without
>> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see
>> how
>> >>> we
>> >>> >> can
>> >>> >> >>> >>> move forward.
>> >>> >> >>> >>>
>> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> codebases
>> >>> >> in the
>> >>> >> >>> >>> short term with the express purpose of separating them in
>> the
>> >>> near
>> >>> >> >>> term.
>> >>> >> >>> >>>
>> >>> >> >>> >>> I would agree but only if separation can be demonstrated to
>> be
>> >>> >> >>> >>> practical and result in net improvements in productivity and
>> >>> >> community
>> >>> >> >>> >>> growth. I think experience has clearly demonstrated that the
>> >>> >> current
>> >>> >> >>> >>> separation is impractical, and is causing problems.
>> >>> >> >>> >>>
>> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >>> >> >>> >>> development process and ASF releases separately. My
>> argument is
>> >>> as
>> >>> >> >>> >>> follows:
>> >>> >> >>> >>>
>> >>> >> >>> >>> * Monorepo for development (for practicality)
>> >>> >> >>> >>> * Releases structured according to the desires of the PMCs
>> >>> >> >>> >>>
>> >>> >> >>> >>> - Wes
>> >>> >> >>> >>>
>> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >>> >> joshuastorck@gmail.com
>> >>> >> >>> >
>> >>> >> >>> >>> wrote:
>> >>> >> >>> >>> > I recently worked on an issue that had to be implemented
>> in
>> >>> >> >>> parquet-cpp
>> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >>> >> (ARROW-2585,
>> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing
>> and
>> >>> >> hard to
>> >>> >> >>> work
>> >>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> >>> >> (created on
>> >>> >> >>> May
>> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> >>> >> recently
>> >>> >> >>> >>> merged.
>> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
>> the
>> >>> >> change in
>> >>> >> >>> >>> arrow
>> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >>> >> >>> >>> run_clang_format.py
>> >>> >> >>> >>> > script in the arrow project only to find out later that
>> there
>> >>> >> was an
>> >>> >> >>> >>> exact
>> >>> >> >>> >>> > copy of it in parquet-cpp.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > However, I don't think merging the codebases makes sense
>> in
>> >>> the
>> >>> >> long
>> >>> >> >>> >>> term.
>> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
>> arrow
>> >>> and
>> >>> >> >>> tying
>> >>> >> >>> >>> them
>> >>> >> >>> >>> > together seems like the wrong choice. There will be other
>> >>> formats
>> >>> >> >>> that
>> >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
>> >>> Orc),
>> >>> >> so I
>> >>> >> >>> >>> don't
>> >>> >> >>> >>> > see why parquet should be special. I also think build
>> tooling
>> >>> >> should
>> >>> >> >>> be
>> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long history
>> of
>> >>> >> >>> developing
>> >>> >> >>> >>> open
>> >>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI
>> is a
>> >>> >> good
>> >>> >> >>> >>> > counter-example since there have been lots of successful
>> open
>> >>> >> source
>> >>> >> >>> >>> > projects that have used nightly build systems that pinned
>> >>> >> versions of
>> >>> >> >>> >>> > dependent software.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > That being said, I think it makes sense to merge the
>> codebases
>> >>> >> in the
>> >>> >> >>> >>> short
>> >>> >> >>> >>> > term with the express purpose of separating them in the
>> near
>> >>> >> term.
>> >>> >> >>> My
>> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
>> together,
>> >>> you
>> >>> >> can
>> >>> >> >>> more
>> >>> >> >>> >>> > easily delineate the boundaries between the API's with a
>> >>> single
>> >>> >> PR.
>> >>> >> >>> >>> Second,
>> >>> >> >>> >>> > it will force the build tooling to converge instead of
>> >>> diverge,
>> >>> >> >>> which has
>> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
>> been
>> >>> >> sorted
>> >>> >> >>> out,
>> >>> >> >>> >>> it
>> >>> >> >>> >>> > should be easy to separate them back into their own
>> codebases.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> >>> codebases
>> >>> >> for
>> >>> >> >>> arrow
>> >>> >> >>> >>> > be separated from other languages. Looking at it from the
>> >>> >> >>> perspective of
>> >>> >> >>> >>> a
>> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
>> >>> large
>> >>> >> tax
>> >>> >> >>> to
>> >>> >> >>> >>> pay
>> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
>> in the
>> >>> >> 0.10.0
>> >>> >> >>> >>> > release of arrow, many of which were holding up the
>> release. I
>> >>> >> hope
>> >>> >> >>> that
>> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
>> help
>> >>> >> reduce
>> >>> >> >>> the
>> >>> >> >>> >>> > complexity of the build/release tooling.
>> >>> >> >>> >>> >
>> >>> >> >>> >>> >
>> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >>> >> ted.dunning@gmail.com>
>> >>> >> >>> >>> wrote:
>> >>> >> >>> >>> >
>> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >>> >> wesmckinn@gmail.com>
>> >>> >> >>> >>> wrote:
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> >
>> >>> >> >>> >>> >> > > The community will be less willing to accept large
>> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>> >>> >> stability
>> >>> >> >>> and
>> >>> >> >>> >>> API
>> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
>> HDFS
>> >>> >> >>> community
>> >>> >> >>> >>> took
>> >>> >> >>> >>> >> a
>> >>> >> >>> >>> >> > > significantly long time for the very same reason.
>> >>> >> >>> >>> >> >
>> >>> >> >>> >>> >> > Please don't use bad experiences from another open
>> source
>> >>> >> >>> community as
>> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
>> didn't
>> >>> go
>> >>> >> the
>> >>> >> >>> way
>> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> >>> community
>> >>> >> which
>> >>> >> >>> >>> >> > happens to operate under a similar open governance
>> model.
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> There are some more radical and community building
>> options as
>> >>> >> well.
>> >>> >> >>> Take
>> >>> >> >>> >>> >> the subversion project as a precedent. With subversion,
>> any
>> >>> >> Apache
>> >>> >> >>> >>> >> committer can request and receive a commit bit on some
>> large
>> >>> >> >>> fraction of
>> >>> >> >>> >>> >> subversion.
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> So why not take this a bit further and give every parquet
>> >>> >> committer
>> >>> >> >>> a
>> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> >>> >> committers in
>> >>> >> >>> >>> Arrow?
>> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
>> committer who
>> >>> >> asks
>> >>> >> >>> will
>> >>> >> >>> >>> be
>> >>> >> >>> >>> >> given committer status in Arrow.
>> >>> >> >>> >>> >>
>> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> >>> >> committers
>> >>> >> >>> >>> can't be
>> >>> >> >>> >>> >> worried at that point whether their patches will get
>> merged;
>> >>> >> they
>> >>> >> >>> can
>> >>> >> >>> >>> just
>> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
>> in the
>> >>> >> >>> Parquet
>> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> >>> parquet so
>> >>> >> >>> why not
>> >>> >> >>> >>> >> invite them in?
>> >>> >> >>> >>> >>
>> >>> >> >>> >>>
>> >>> >> >>>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> regards,
>> >>> >> >> Deepak Majeti
>> >>> >>
>> >>> >
>> >>> >
>> >>> > --
>> >>> > regards,
>> >>> > Deepak Majeti
>> >>>
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
I have a few more logistical questions to add.

It will be difficult to track parquet-cpp changes if they get mixed with
Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
Can we enforce that parquet-cpp changes will not be committed without a
corresponding Parquet JIRA?

I would also like to keep parquet-cpp changes in separate commits to
simplify forking later (if needed) and to preserve the commit history.
I don't know if it's possible to squash parquet-cpp commits and arrow
commits separately before merging.
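For what it's worth, as long as parquet-cpp changes stay in their own commits and touch only a parquet subdirectory, plain git can recover that history later even from a merged repo. A minimal sketch (the `cpp/src/parquet` layout and JIRA keys here are hypothetical, just to illustrate the mechanism):

```shell
# Sketch: path-scoped history in a monorepo (hypothetical layout cpp/src/parquet).
# Build a throwaway repo with one arrow-only commit and one parquet-only commit.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email "dev@example.com"
git config user.name "dev"
mkdir -p cpp/src/arrow cpp/src/parquet
echo "arrow" > cpp/src/arrow/a.cc
git add -A && git commit -qm "ARROW-0000: arrow-only change"
echo "parquet" > cpp/src/parquet/p.cc
git add -A && git commit -qm "PARQUET-0000: parquet-only change"
# Only commits touching the parquet subtree show up here, so the parquet
# history stays recoverable for a future fork (e.g. via
# 'git filter-branch --subdirectory-filter cpp/src/parquet'):
git log --oneline -- cpp/src/parquet
```

So the squashing question matters mostly for commits that mix arrow and parquet changes; path-scoped commits can always be filtered out afterwards.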


On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:

> Do other people have opinions? I would like to undertake this work in
> the near future (the next 8-10 weeks); I would be OK with taking
> responsibility for the primary codebase surgery.
>
> Some logistical questions:
>
> * We have a handful of pull requests in flight in parquet-cpp that
> would need to be resolved / merged
> * We should probably cut a status-quo cpp-1.5.0 release, with future
> releases cut out of the new structure
> * Management of shared commit rights (I can discuss with the Arrow
> PMC; I believe that approving any committer who has actively
> maintained parquet-cpp should be a reasonable approach per Ted's
> comments)
>
> If working more closely together proves not to work out after
> some period of time, I will be fully supportive of a fork or something
> like it.
>
> Thanks,
> Wes
>
> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
> > Thanks Tim.
> >
> > Indeed, it's not very simple. Just today Antoine cleaned up some
> > platform code intending to improve the performance of bit-packing in
> > Parquet writes, and we ended up with 2 interdependent PRs:
> >
> > * https://github.com/apache/parquet-cpp/pull/483
> > * https://github.com/apache/arrow/pull/2355
> >
> > Changes that impact the Python interface to Parquet are even more
> complex.
> >
> > Adding options to Arrow's CMake build system to only build
> > Parquet-related code and dependencies (in a monorepo framework) would
> > not be difficult, and would amount to writing "make parquet".
> >
> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
> > build and install the Parquet core libraries and their dependencies
> > would be:
> >
> > ninja parquet && ninja install
> >
> > - Wes
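A rough sketch of what such a monorepo CMakeLists.txt could look like (the option name, target names, and source paths below are assumptions for illustration, not the actual Arrow build files):

```cmake
# Hypothetical sketch: a monorepo build where "ninja parquet" builds only
# the Parquet core plus the platform pieces it needs.
cmake_minimum_required(VERSION 3.2)
project(arrow_monorepo CXX)

option(ARROW_PARQUET "Build the Parquet core libraries" ON)

# Platform code (memory pools, file IO, etc.) is an ordinary target.
add_library(arrow_platform src/arrow/memory_pool.cc src/arrow/io/file.cc)

if(ARROW_PARQUET)
  add_library(parquet src/parquet/reader.cc src/parquet/writer.cc)
  # Inside one build system the dependency is a plain target link, so
  # "ninja parquet" transitively builds arrow_platform first:
  target_link_libraries(parquet PUBLIC arrow_platform)
endif()
```

Because CMake tracks target dependencies, `ninja parquet` would build `arrow_platform` automatically without touching unrelated Arrow components.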
> >
> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> > <ta...@cloudera.com.invalid> wrote:
> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> successful, but I thought I'd give my two cents.
> >>
> >> For me, the thing that makes the biggest difference in contributing to a
> >> new codebase is the number of steps in the workflow for writing,
> testing,
> >> posting and iterating on a commit and also the number of opportunities
> for
> >> missteps. The size of the repo and build/test times matter but are
> >> secondary so long as the workflow is simple and reliable.
> >>
> >> I don't really know what the current state of things is, but it sounds
> like
> >> it's not as simple as check out -> build -> test if you're doing a
> >> cross-repo change. Circular dependencies are a real headache.
> >>
> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >>> hi,
> >>>
> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> majeti.deepak@gmail.com>
> >>> wrote:
> >>> > I think the circular dependency can be broken if we build a new
> library
> >>> for
> >>> > the platform code. This will also make it easy for other projects
> such as
> >>> > ORC to use it.
> >>> > I also remember your proposal a while ago of having a separate
> project
> >>> for
> >>> > the platform code.  That project can live in the arrow repo.
> However, one
> >>> > has to clone the entire apache arrow repo but can just build the
> platform
> >>> > code. This will be temporary until we can find a new home for it.
> >>> >
> >>> > The dependency will look like:
> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >>> > libplatform(platform api)
> >>> >
> >>> > CI workflow will clone the arrow project twice, once for the platform
> >>> > library and once for the arrow-core/bindings library.
> >>>
> >>> This seems like an interesting proposal; the best place to work toward
> >>> this goal (if it is even possible; the build system interactions and
> >>> ASF release management are the hard problems) is to have all of the
> >>> code in a single repository. ORC could already be using Arrow if it
> >>> wanted, but the ORC contributors aren't active in Arrow.
> >>>
> >>> >
> >>> > There is no doubt that the collaborations between the Arrow and
> Parquet
> >>> > communities so far have been very successful.
> >>> > The reason to maintain this relationship moving forward is to
> continue to
> >>> > reap the mutual benefits.
> >>> > We should continue to take advantage of sharing code as well.
> However, I
> >>> > don't see any code sharing opportunities between arrow-core and the
> >>> > parquet-core. Both have different functions.
> >>>
> >>> I think you mean the Arrow columnar format. The Arrow columnar format
> >>> is only one part of a project that has become quite large already
> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >>>
> >>> >
> >>> > We are at a point where the parquet-cpp public API is pretty stable.
> We
> >>> > have already passed that difficult stage. My take on arrow and parquet is
> to
> >>> > keep them nimble since we can.
> >>>
> >>> I believe that parquet-core still has progress to make. We
> >>> have done little work on asynchronous IO and concurrency, which would
> >>> yield both improved read and write throughput. This aligns well with
> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >>> believe that more development will happen on parquet-core once the
> >>> development process issues are resolved by having a single codebase,
> >>> single build system, and a single CI framework.
> >>>
> >>> I have some gripes about design decisions made early in parquet-cpp,
> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >>> goal I think we should still be open to making significant changes in
> >>> the interest of long term progress.
> >>>
> >>> Having now worked on these projects for more than 2 and a half years
> >>> and been the most frequent contributor to both codebases, I'm sadly far
> >>> past the "breaking point" and not willing to continue contributing in
> >>> a significant way to parquet-cpp if the projects remained structured
> >>> as they are now. It's hampering progress and not serving the
> >>> community.
> >>>
> >>> - Wes
> >>>
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> > The current Arrow adaptor code for parquet should live in the
> arrow
> >>> >> repo. That will remove a majority of the dependency issues. Joshua's
> >>> work
> >>> >> would not have been blocked in parquet-cpp if that adapter was in
> the
> >>> arrow
> >>> >> repo.  This will be similar to the ORC adaptor.
> >>> >>
> >>> >> This has been suggested before, but I don't see how it would
> alleviate
> >>> >> any issues because of the significant dependencies on other parts of
> >>> >> the Arrow codebase. What you are proposing is:
> >>> >>
> >>> >> - (Arrow) arrow platform
> >>> >> - (Parquet) parquet core
> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >>> >> - (Arrow) Python bindings
> >>> >>
> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >>> >> built before invoking the Parquet core part of the build system. You
> >>> >> would need to pass dependent targets across different CMake build
> >>> >> systems; I don't know if it's possible (I spent some time looking
> into
> >>> >> it earlier this year). This is what I meant by the lack of a
> "concrete
> >>> >> and actionable plan". The only thing that would really work would be
> >>> >> for the Parquet core to be "included" in the Arrow build system
> >>> >> somehow rather than using ExternalProject. Currently Parquet builds
> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
> build
> >>> >> system because it's only depended upon by the Python bindings.
> >>> >>
> >>> >> And even if a solution could be devised, it would not wholly resolve
> >>> >> the CI workflow issues.
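For readers unfamiliar with the ExternalProject pattern being described, a rough sketch of the current arrangement (the paths, tag, and options here are illustrative, not the real parquet-cpp build files):

```cmake
# Hypothetical sketch of how parquet-cpp pulls in Arrow via ExternalProject:
# Arrow is configured, built, and installed as an opaque sub-build, so its
# CMake targets are invisible to this build system.
include(ExternalProject)

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  GIT_TAG apache-arrow-0.10.0
  SOURCE_SUBDIR cpp
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/arrow_install)

# The dependency has to be reconstructed by hand as an imported library;
# none of Arrow's own targets can be consumed directly across the boundary.
add_library(arrow_static STATIC IMPORTED)
set_target_properties(arrow_static PROPERTIES
  IMPORTED_LOCATION ${CMAKE_BINARY_DIR}/arrow_install/lib/libarrow.a)
add_dependencies(arrow_static arrow_ep)
```

This is exactly the handoff that makes passing dependent targets across the two CMake builds so awkward: the outer build sees only installed artifacts, not targets.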
> >>> >>
> >>> >> You could make Parquet completely independent of the Arrow codebase,
> >>> >> but at that point there is little reason to maintain a relationship
> >>> >> between the projects or their communities. We have spent a great
> deal
> >>> >> of effort refactoring the two projects to enable as much code
> sharing
> >>> >> as there is now.
> >>> >>
> >>> >> - Wes
> >>> >>
> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >> >> If you still strongly feel that the only way forward is to clone
> the
> >>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
> two
> >>> >> parquet-cpp repos is in no way a better approach.
> >>> >> >
> >>> >> > Yes, indeed. In my view, the next best option after a monorepo is
> to
> >>> >> > fork. That would obviously be a bad outcome for the community.
> >>> >> >
> >>> >> > It doesn't look like I will be able to convince you that a
> monorepo is
> >>> >> > a good idea; what I would ask instead is that you be willing to
> give
> >>> >> > it a shot, and if it turns out in the way you're describing
> (which I
> >>> >> > don't think it will) then I suggest that we fork at that point.
> >>> >> >
> >>> >> > - Wes
> >>> >> >
> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> >>> majeti.deepak@gmail.com>
> >>> >> wrote:
> >>> >> >> Wes,
> >>> >> >>
> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> problems
> >>> of a
> >>> >> >> non-existent Arrow-Parquet mono-repo.
> >>> >> >> Bringing in related Apache community experiences is more
> meaningful
> >>> >> than
> >>> >> >> how mono-repos work at Google and other big organizations.
> >>> >> >> We solely depend on volunteers and cannot hire full-time
> developers.
> >>> >> >> You are very well aware of how difficult it has been to find more
> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has
> a low
> >>> >> >> contribution rate to its core components.
> >>> >> >>
> >>> >> >> We should aim to ensure that new volunteers who want to
> contribute
> >>> >> >> bug-fixes/features should spend the least amount of time in
> figuring
> >>> out
> >>> >> >> the project repo. We can never come up with an automated build
> system
> >>> >> that
> >>> >> >> caters to every possible environment.
> >>> >> >> My only concern is if the mono-repo will make it harder for new
> >>> >> developers
> >>> >> >> to work on parquet-cpp core just due to the additional code,
> build
> >>> and
> >>> >> test
> >>> >> >> dependencies.
> >>> >> >> I am not saying that the Arrow community/committers will be less
> >>> >> >> co-operative.
> >>> >> >> I just don't think the mono-repo structure model will be
> sustainable
> >>> in
> >>> >> an
> >>> >> >> open source community unless there are long-term vested
> interests. We
> >>> >> can't
> >>> >> >> predict that.
> >>> >> >>
> >>> >> >> The current circular dependency between Arrow and Parquet is a
> >>> >> >> major problem for the community.
> >>> >> >>
> >>> >> >> The current Arrow adaptor code for parquet should live in the
> arrow
> >>> >> repo.
> >>> >> >> That will remove a majority of the dependency issues.
> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
> >>> adapter
> >>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
> >>> >> >>
> >>> >> >> The platform API code is pretty stable at this point. Minor
> changes
> >>> in
> >>> >> the
> >>> >> >> future to this code should not be the main reason to combine the
> >>> arrow
> >>> >> >> parquet repos.
> >>> >> >>
> >>> >> >> "*I question whether it's worth the community's time long term to wear
> >>> >> >> ourselves out defining custom "ports" / virtual interfaces in each
> >>> >> >> library to plug components together rather than utilizing common
> >>> >> >> platform APIs.*"
> >>> >> >>
> >>> >> >> My answer to your question below would be "Yes".
> >>> Modularity/separation
> >>> >> is
> >>> >> >> very important in an open source community where priorities of
> >>> >> contributors
> >>> >> >> are often short term.
> >>> >> >> The retention is low and therefore the acquisition costs should
> be
> >>> low
> >>> >> as
> >>> >> >> well. This is the community over code approach, in my opinion.
> Minor
> >>> >> code
> >>> >> >> duplication is not a deal breaker.
> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
> >>> data
> >>> >> >> space serving their own functions.
> >>> >> >>
> >>> >> >> If you still strongly feel that the only way forward is to clone
> the
> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
> Having
> >>> two
> >>> >> >> parquet-cpp repos is in no way a better approach.
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
> wesmckinn@gmail.com>
> >>> >> wrote:
> >>> >> >>
> >>> >> >>> @Antoine
> >>> >> >>>
> >>> >> >>> > By the way, one concern with the monorepo approach: it would
> >>> slightly
> >>> >> >>> increase Arrow CI times (which are already too large).
> >>> >> >>>
> >>> >> >>> A typical CI run in Arrow takes about 45 minutes:
> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >>> >> >>>
> >>> >> >>> A Parquet run takes about 28 minutes:
> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >>> >> >>>
> >>> >> >>> Inevitably we will need to create some kind of bot to run
> certain
> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
> >>> >> >>>
> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build
> could be
> >>> >> >>> made substantially shorter by moving some of the slower parts
> (like
> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to
> nightly
> >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would
> also
> >>> >> >>> improve build times (valgrind build could be moved to a nightly
> >>> >> >>> exhaustive test run)
> >>> >> >>>
> >>> >> >>> - Wes
> >>> >> >>>
> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <
> wesmckinn@gmail.com
> >>> >
> >>> >> >>> wrote:
> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
> >>> >> example of
> >>> >> >>> how it would be possible to manage parquet-cpp as a separate
> >>> codebase.
> >>> >> That
> >>> >> >>> gives me hope that the projects could be managed separately some
> >>> day.
> >>> >> >>> >
> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
> >>> codebase
> >>> >> >>> > features several areas of duplicated logic which could be
> >>> replaced by
> >>> >> >>> > components from the Arrow platform for better platform-wide
> >>> >> >>> > interoperability:
> >>> >> >>> >
> >>> >> >>> >
> >>> >> >>>
> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >>> >> >>> >
> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >>> >> >>> >
> >>> >> >>>
> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >>> >> >>> >
> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >>> >> >>> >
> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >>> >> >>> >
> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
> cause of
> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent
> them
> >>> from
> >>> >> >>> > leaking to third party linkers when statically linked (ORC is
> only
> >>> >> >>> > available for static linking at the moment AFAIK).
> >>> >> >>> >
> >>> >> >>> > I question whether it's worth the community's time long term
> to
> >>> wear
> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in
> each
> >>> >> >>> > library to plug components together rather than utilizing
> common
> >>> >> >>> > platform APIs.
> >>> >> >>> >
> >>> >> >>> > - Wes
> >>> >> >>> >
> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> >>> >> joshuastorck@gmail.com>
> >>> >> >>> wrote:
> >>> >> >>> >> Your point about the constraints of the ASF release process is
> >>> >> >>> >> well taken, and as a developer who's trying to work in the
> >>> >> >>> >> current environment I would be much happier if the codebases
> >>> >> >>> >> were merged. The main issues I worry about when you put
> >>> >> >>> >> codebases like these together are:
> >>> >> >>> >>
> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes
> >>> >> >>> >> too coupled
> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree
> >>> >> >>> >> are delayed by artifacts higher in the dependency tree
> >>> >> >>> >>
> >>> >> >>> >> If the project/release management is structured well and
> someone
> >>> >> keeps
> >>> >> >>> an
> >>> >> >>> >> eye on the coupling, then I don't have any concerns.
> >>> >> >>> >>
> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
> >>> >> example of
> >>> >> >>> how
> >>> >> >>> >> it would be possible to manage parquet-cpp as a separate
> >>> codebase.
> >>> >> That
> >>> >> >>> >> gives me hope that the projects could be managed separately
> some
> >>> >> day.
> >>> >> >>> >>
> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
> >>> wesmckinn@gmail.com>
> >>> >> >>> wrote:
> >>> >> >>> >>
> >>> >> >>> >>> hi Josh,
> >>> >> >>> >>>
> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> arrow
> >>> and
> >>> >> >>> tying
> >>> >> >>> >>> them together seems like the wrong choice.
> >>> >> >>> >>>
> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same
> people
> >>> >> >>> >>> building these projects -- my argument (which I think you
> agree
> >>> >> with?)
> >>> >> >>> >>> is that we should work more closely together until the
> community
> >>> >> grows
> >>> >> >>> >>> large enough to support larger-scope process than we have
> now.
> >>> As
> >>> >> >>> >>> you've seen, our process isn't serving developers of these
> >>> >> projects.
> >>> >> >>> >>>
> >>> >> >>> >>> > I also think build tooling should be pulled into its own
> >>> >> codebase.
> >>> >> >>> >>>
> >>> >> >>> >>> I don't see how this can possibly be practical taking into
> >>> >> >>> >>> consideration the constraints imposed by the combination of
> the
> >>> >> GitHub
> >>> >> >>> >>> platform and the ASF release process. I'm all for being
> >>> idealistic,
> >>> >> >>> >>> but right now we need to be practical. Unless we can devise
> a
> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch
> per
> >>> day
> >>> >> >>> >>> which may touch both code and build system simultaneously
> >>> without
> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see
> how
> >>> we
> >>> >> can
> >>> >> >>> >>> move forward.
> >>> >> >>> >>>
> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> codebases
> >>> >> in the
> >>> >> >>> >>> short term with the express purpose of separating them in
> the
> >>> near
> >>> >> >>> term.
> >>> >> >>> >>>
> >>> >> >>> >>> I would agree but only if separation can be demonstrated to
> be
> >>> >> >>> >>> practical and result in net improvements in productivity and
> >>> >> community
> >>> >> >>> >>> growth. I think experience has clearly demonstrated that the
> >>> >> current
> >>> >> >>> >>> separation is impractical, and is causing problems.
> >>> >> >>> >>>
> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
> >>> >> >>> >>> development process and ASF releases separately. My
> argument is
> >>> as
> >>> >> >>> >>> follows:
> >>> >> >>> >>>
> >>> >> >>> >>> * Monorepo for development (for practicality)
> >>> >> >>> >>> * Releases structured according to the desires of the PMCs
> >>> >> >>> >>>
> >>> >> >>> >>> - Wes
> >>> >> >>> >>>
> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> >>> >> joshuastorck@gmail.com
> >>> >> >>> >
> >>> >> >>> >>> wrote:
> >>> >> >>> >>> > I recently worked on an issue that had to be implemented
> in
> >>> >> >>> parquet-cpp
> >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> >>> >> (ARROW-2585,
> >>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing
> and
> >>> >> hard to
> >>> >> >>> work
> >>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
> >>> >> (created on
> >>> >> >>> May
> >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
> >>> >> recently
> >>> >> >>> >>> merged.
> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because
> the
> >>> >> change in
> >>> >> >>> >>> arrow
> >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
> >>> >> >>> >>> run_clang_format.py
> >>> >> >>> >>> > script in the arrow project only to find out later that
> there
> >>> >> was an
> >>> >> >>> >>> exact
> >>> >> >>> >>> > copy of it in parquet-cpp.
> >>> >> >>> >>> >
> >>> >> >>> >>> > However, I don't think merging the codebases makes sense
> in
> >>> the
> >>> >> long
> >>> >> >>> >>> term.
> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve
> arrow
> >>> and
> >>> >> >>> tying
> >>> >> >>> >>> them
> >>> >> >>> >>> > together seems like the wrong choice. There will be other
> >>> formats
> >>> >> >>> that
> >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
> >>> Orc),
> >>> >> so I
> >>> >> >>> >>> don't
> >>> >> >>> >>> > see why parquet should be special. I also think build
> tooling
> >>> >> should
> >>> >> >>> be
> >>> >> >>> >>> > pulled into its own codebase. GNU has had a long history
> of
> >>> >> >>> developing
> >>> >> >>> >>> open
> >>> >> >>> >>> > source C/C++ projects that way and made projects like
> >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI
> is a
> >>> >> good
> >>> >> >>> >>> > counter-example since there have been lots of successful
> open
> >>> >> source
> >>> >> >>> >>> > projects that have used nightly build systems that pinned
> >>> >> versions of
> >>> >> >>> >>> > dependent software.
> >>> >> >>> >>> >
> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> codebases
> >>> >> in the
> >>> >> >>> >>> short
> >>> >> >>> >>> > term with the express purpose of separating them in the
> near
> >>> >> term.
> >>> >> >>> My
> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
> together,
> >>> you
> >>> >> can
> >>> >> >>> more
> >>> >> >>> >>> > easily delineate the boundaries between the API's with a
> >>> single
> >>> >> PR.
> >>> >> >>> >>> Second,
> >>> >> >>> >>> > it will force the build tooling to converge instead of
> >>> diverge,
> >>> >> >>> which has
> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
> been
> >>> >> sorted
> >>> >> >>> out,
> >>> >> >>> >>> it
> >>> >> >>> >>> > should be easy to separate them back into their own
> codebases.
> >>> >> >>> >>> >
> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >>> codebases
> >>> >> for
> >>> >> >>> arrow
> >>> >> >>> >>> > be separated from other languages. Looking at it from the
> >>> >> >>> perspective of
> >>> >> >>> >>> a
> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
> >>> large
> >>> >> tax
> >>> >> >>> to
> >>> >> >>> >>> pay
> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRAs
> in the
> >>> >> 0.10.0
> >>> >> >>> >>> > release of arrow, many of which were holding up the
> release. I
> >>> >> hope
> >>> >> >>> that
> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
> help
> >>> >> reduce
> >>> >> >>> the
> >>> >> >>> >>> > complexity of the build/release tooling.
> >>> >> >>> >>> >
> >>> >> >>> >>> >
> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> >>> >> ted.dunning@gmail.com>
> >>> >> >>> >>> wrote:
> >>> >> >>> >>> >
> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> >>> >> wesmckinn@gmail.com>
> >>> >> >>> >>> wrote:
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> >
> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for
> >>> >> stability
> >>> >> >>> and
> >>> >> >>> >>> API
> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
> HDFS
> >>> >> >>> community
> >>> >> >>> >>> took
> >>> >> >>> >>> >> a
> >>> >> >>> >>> >> > > significantly long time for the very same reason.
> >>> >> >>> >>> >> >
> >>> >> >>> >>> >> > Please don't use bad experiences from another open
> source
> >>> >> >>> community as
> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
> didn't
> >>> go
> >>> >> the
> >>> >> >>> way
> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
> >>> community
> >>> >> which
> >>> >> >>> >>> >> > happens to operate under a similar open governance
> model.
> >>> >> >>> >>> >>
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> There are some more radical and community building
> options as
> >>> >> well.
> >>> >> >>> Take
> >>> >> >>> >>> >> the subversion project as a precedent. With subversion,
> any
> >>> >> Apache
> >>> >> >>> >>> >> committer can request and receive a commit bit on some
> large
> >>> >> >>> fraction of
> >>> >> >>> >>> >> subversion.
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> So why not take this a bit further and give every parquet
> >>> >> committer
> >>> >> >>> a
> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
> >>> >> committers in
> >>> >> >>> >>> Arrow?
> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
> committer who
> >>> >> asks
> >>> >> >>> will
> >>> >> >>> >>> be
> >>> >> >>> >>> >> given committer status in Arrow.
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> >>> >> committers
> >>> >> >>> >>> can't be
> >>> >> >>> >>> >> worried at that point whether their patches will get
> merged;
> >>> >> they
> >>> >> >>> can
> >>> >> >>> >>> just
> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
> in the
> >>> >> >>> Parquet
> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
> >>> parquet so
> >>> >> >>> why not
> >>> >> >>> >>> >> invite them in?
> >>> >> >>> >>> >>
> >>> >> >>> >>>
> >>> >> >>>
> >>> >> >>
> >>> >> >>
> >>> >> >> --
> >>> >> >> regards,
> >>> >> >> Deepak Majeti
> >>> >>
> >>> >
> >>> >
> >>> > --
> >>> > regards,
> >>> > Deepak Majeti
> >>>
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
I have a few more logistical questions to add.

It will be difficult to track parquet-cpp changes if they get mixed with
Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
Can we enforce that parquet-cpp changes will not be committed without a
corresponding Parquet JIRA?

I would also like to keep changes to parquet-cpp on a separate commit to
simplify forking later (if needed) and be able to maintain the commit
history. I don't know if it's possible to squash parquet-cpp commits and
arrow commits separately before merging.


On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <we...@gmail.com> wrote:

> Do other people have opinions? I would like to undertake this work in
> the near future (the next 8-10 weeks); I would be OK with taking
> responsibility for the primary codebase surgery.
>
> Some logistical questions:
>
> * We have a handful of pull requests in flight in parquet-cpp that
> would need to be resolved / merged
> * We should probably cut a status-quo cpp-1.5.0 release, with future
> releases cut out of the new structure
> * Management of shared commit rights (I can discuss with the Arrow
> PMC; I believe that approving any committer who has actively
> maintained parquet-cpp should be a reasonable approach per Ted's
> comments)
>
> If working more closely together proves to not be working out after
> some period of time, I will be fully supportive of a fork or something
> like it.
>
> Thanks,
> Wes
>
> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
> > Thanks Tim.
> >
> > Indeed, it's not very simple. Just today Antoine cleaned up some
> > platform code intending to improve the performance of bit-packing in
> > Parquet writes, and we resulted with 2 interdependent PRs
> >
> > * https://github.com/apache/parquet-cpp/pull/483
> > * https://github.com/apache/arrow/pull/2355
> >
> > Changes that impact the Python interface to Parquet are even more
> complex.
> >
> > Adding options to Arrow's CMake build system to only build
> > Parquet-related code and dependencies (in a monorepo framework) would
> > not be difficult, and amount to writing "make parquet".
> >
> > See e.g. https://stackoverflow.com/a/17201375. The desired commands to
> > build and install the Parquet core libraries and their dependencies
> > would be:
> >
> > ninja parquet && ninja install
> >
> > - Wes
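[Editor's note] The opt-in build described above can be illustrated with a small CMake fragment. This is only a sketch under assumptions: the `ARROW_PARQUET` option name and the `src/arrow` / `src/parquet` directory layout are hypothetical, not the actual Arrow build files.

```cmake
# Sketch of a top-level CMakeLists.txt for an Arrow/Parquet monorepo.
# Parquet targets are opt-in; the Arrow platform code is always built.
cmake_minimum_required(VERSION 3.2)
project(arrow_monorepo CXX)

option(ARROW_PARQUET "Build the Parquet C++ libraries" OFF)

add_subdirectory(src/arrow)        # platform + columnar code

if(ARROW_PARQUET)
  # Defines the 'parquet' library target, linking directly against the
  # 'arrow' targets above -- one build system, no ExternalProject step.
  add_subdirectory(src/parquet)
endif()
```

Configured with `cmake -GNinja -DARROW_PARQUET=ON ..`, the commands given above (`ninja parquet && ninja install`) would then build only the Parquet targets and whatever they depend on.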
> >
> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> > <ta...@cloudera.com.invalid> wrote:
> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> successful, but I thought I'd give my two cents.
> >>
> >> For me, the thing that makes the biggest difference in contributing to a
> >> new codebase is the number of steps in the workflow for writing,
> testing,
> >> posting and iterating on a commit and also the number of opportunities
> for
> >> missteps. The size of the repo and build/test times matter but are
> >> secondary so long as the workflow is simple and reliable.
> >>
> >> I don't really know what the current state of things is, but it sounds
> like
> >> it's not as simple as check out -> build -> test if you're doing a
> >> cross-repo change. Circular dependencies are a real headache.
> >>
> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >>> hi,
> >>>
> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <
> majeti.deepak@gmail.com>
> >>> wrote:
> >>> > I think the circular dependency can be broken if we build a new
> library
> >>> for
> >>> > the platform code. This will also make it easy for other projects
> such as
> >>> > ORC to use it.
> >>> > I also remember your proposal a while ago of having a separate
> project
> >>> for
> >>> > the platform code.  That project can live in the arrow repo.
> However, one
> >>> > has to clone the entire apache arrow repo but can just build the
> platform
> >>> > code. This will be temporary until we can find a new home for it.
> >>> >
> >>> > The dependency will look like:
> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >>> > libplatform(platform api)
> >>> >
> >>> > CI workflow will clone the arrow project twice, once for the platform
> >>> > library and once for the arrow-core/bindings library.
> >>>
> >>> This seems like an interesting proposal; the best place to work toward
> >>> this goal (if it is even possible; the build system interactions and
> >>> ASF release management are the hard problems) is to have all of the
> >>> code in a single repository. ORC could already be using Arrow if it
> >>> wanted, but the ORC contributors aren't active in Arrow.
> >>>
> >>> >
> >>> > There is no doubt that the collaborations between the Arrow and
> Parquet
> >>> > communities so far have been very successful.
> >>> > The reason to maintain this relationship moving forward is to
> continue to
> >>> > reap the mutual benefits.
> >>> > We should continue to take advantage of sharing code as well.
> However, I
> >>> > don't see any code sharing opportunities between arrow-core and the
> >>> > parquet-core. Both have different functions.
> >>>
> >>> I think you mean the Arrow columnar format. The Arrow columnar format
> >>> is only one part of a project that has become quite large already
> >>> (
> https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-
> >>> platform-for-inmemory-data-105427919).
> >>>
> >>> >
> >>> > We are at a point where the parquet-cpp public API is pretty stable.
> We
> >>> > already passed that difficult stage. My take on arrow and parquet is
> to
> >>> > keep them nimble since we can.
> >>>
> >>> I believe that parquet-core still has significant progress ahead of it. We
> >>> have done little work in asynchronous IO and concurrency which would
> >>> yield both improved read and write throughput. This aligns well with
> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >>> believe that more development will happen on parquet-core once the
> >>> development process issues are resolved by having a single codebase,
> >>> single build system, and a single CI framework.
> >>>
> >>> I have some gripes about design decisions made early in parquet-cpp,
> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >>> goal I think we should still be open to making significant changes in
> >>> the interest of long term progress.
> >>>
> >>> Having now worked on these projects for more than 2 and a half years
> >>> and been the most frequent contributor to both codebases, I'm sadly far
> >>> past the "breaking point" and not willing to continue contributing in
> >>> a significant way to parquet-cpp if the projects remained structured
> >>> as they are now. It's hampering progress and not serving the
> >>> community.
> >>>
> >>> - Wes
> >>>
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> > The current Arrow adaptor code for parquet should live in the
> arrow
> >>> >> repo. That will remove a majority of the dependency issues. Joshua's
> >>> work
> >>> >> would not have been blocked in parquet-cpp if that adapter was in
> the
> >>> arrow
> >>> >> repo.  This will be similar to the ORC adaptor.
> >>> >>
> >>> >> This has been suggested before, but I don't see how it would
> alleviate
> >>> >> any issues because of the significant dependencies on other parts of
> >>> >> the Arrow codebase. What you are proposing is:
> >>> >>
> >>> >> - (Arrow) arrow platform
> >>> >> - (Parquet) parquet core
> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >>> >> - (Arrow) Python bindings
> >>> >>
> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >>> >> built before invoking the Parquet core part of the build system. You
> >>> >> would need to pass dependent targets across different CMake build
> >>> >> systems; I don't know if it's possible (I spent some time looking
> into
> >>> >> it earlier this year). This is what I meant by the lack of a
> "concrete
> >>> >> and actionable plan". The only thing that would really work would be
> >>> >> for the Parquet core to be "included" in the Arrow build system
> >>> >> somehow rather than using ExternalProject. Currently Parquet builds
> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow
> build
> >>> >> system because it's only depended upon by the Python bindings.
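[Editor's note] For context on why ExternalProject blocks target sharing: the external build runs as a separate CMake invocation at build time, so its targets never enter the outer project and everything must be re-declared by hand. A simplified sketch of this pattern follows; the paths and target names are illustrative, not parquet-cpp's actual build files.

```cmake
# Simplified sketch: an outer project building Arrow via ExternalProject.
include(ExternalProject)

set(ARROW_PREFIX ${CMAKE_BINARY_DIR}/arrow_ep-install)

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${ARROW_PREFIX})

# The outer project sees only an opaque build step, not Arrow's CMake
# targets, so headers and libraries must be wired up by hand:
include_directories(${ARROW_PREFIX}/include)
add_library(arrow_shared SHARED IMPORTED)
set_target_properties(arrow_shared PROPERTIES
  IMPORTED_LOCATION ${ARROW_PREFIX}/lib/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX})

# Consuming targets must additionally be ordered after the external step,
# e.g.: add_dependencies(parquet_shared arrow_ep)
```

Because dependency information stops at this boundary, a patch touching both codebases cannot be configured, built, and tested in one pass, which is what produces the interdependent PRs described above.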
> >>> >>
> >>> >> And even if a solution could be devised, it would not wholly resolve
> >>> >> the CI workflow issues.
> >>> >>
> >>> >> You could make Parquet completely independent of the Arrow codebase,
> >>> >> but at that point there is little reason to maintain a relationship
> >>> >> between the projects or their communities. We have spent a great
> deal
> >>> >> of effort refactoring the two projects to enable as much code
> sharing
> >>> >> as there is now.
> >>> >>
> >>> >> - Wes
> >>> >>
> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >> >> If you still strongly feel that the only way forward is to clone
> the
> >>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
> two
> >>> >> parquet-cpp repos is no way a better approach.
> >>> >> >
> >>> >> > Yes, indeed. In my view, the next best option after a monorepo is
> to
> >>> >> > fork. That would obviously be a bad outcome for the community.
> >>> >> >
> >>> >> > It doesn't look like I will be able to convince you that a
> monorepo is
> >>> >> > a good idea; what I would ask instead is that you be willing to
> give
> >>> >> > it a shot, and if it turns out in the way you're describing
> (which I
> >>> >> > don't think it will) then I suggest that we fork at that point.
> >>> >> >
> >>> >> > - Wes
> >>> >> >
> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> >>> majeti.deepak@gmail.com>
> >>> >> wrote:
> >>> >> >> Wes,
> >>> >> >>
> >>> >> >> Unfortunately, I cannot show you any practical fact-based
> problems
> >>> of a
> >>> >> >> non-existent Arrow-Parquet mono-repo.
> >>> >> >> Bringing in related Apache community experiences is more
> meaningful
> >>> >> than
> >>> >> >> how mono-repos work at Google and other big organizations.
> >>> >> >> We solely depend on volunteers and cannot hire full-time
> developers.
> >>> >> >> You are very well aware of how difficult it has been to find more
> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has
> a low
> >>> >> >> contribution rate to its core components.
> >>> >> >>
> >>> >> >> We should ensure that new volunteers who want to contribute
> >>> >> >> bug-fixes/features spend the least amount of time figuring out
> >>> >> >> the project repo. We can never come up with an automated build
> >>> >> >> system that caters to every possible environment.
> >>> >> >> My only concern is if the mono-repo will make it harder for new
> >>> >> developers
> >>> >> >> to work on parquet-cpp core just due to the additional code,
> build
> >>> and
> >>> >> test
> >>> >> >> dependencies.
> >>> >> >> I am not saying that the Arrow community/committers will be less
> >>> >> >> co-operative.
> >>> >> >> I just don't think the mono-repo structure model will be
> sustainable
> >>> in
> >>> >> an
> >>> >> >> open source community unless there are long-term vested
> interests. We
> >>> >> can't
> >>> >> >> predict that.
> >>> >> >>
> >>> >> >> The current circular dependency problems between Arrow and
> Parquet
> >>> is a
> >>> >> >> major problem for the community and it is important.
> >>> >> >>
> >>> >> >> The current Arrow adaptor code for parquet should live in the
> arrow
> >>> >> repo.
> >>> >> >> That will remove a majority of the dependency issues.
> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
> >>> adapter
> >>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
> >>> >> >>
> >>> >> >> The platform API code is pretty stable at this point. Minor
> changes
> >>> in
> >>> >> the
> >>> >> >> future to this code should not be the main reason to combine the
> >>> arrow
> >>> >> >> parquet repos.
> >>> >> >>
> >>> >> >> "
> >>> >> >> *I question whether it's worth the community's time long term to
> >>> wear*
> >>> >> >>
> >>> >> >>
> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
> >>> >> eachlibrary
> >>> >> >> to plug components together rather than utilizing commonplatform
> >>> APIs.*"
> >>> >> >>
> >>> >> >> My answer to your question below would be "Yes".
> >>> Modularity/separation
> >>> >> is
> >>> >> >> very important in an open source community where priorities of
> >>> >> contributors
> >>> >> >> are often short term.
> >>> >> >> The retention is low and therefore the acquisition costs should
> be
> >>> low
> >>> >> as
> >>> >> >> well. This is the community over code approach according to me.
> Minor
> >>> >> code
> >>> >> >> duplication is not a deal breaker.
> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
> >>> data
> >>> >> >> space serving their own functions.
> >>> >> >>
> >>> >> >> If you still strongly feel that the only way forward is to clone
> the
> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern.
> Having
> >>> two
> >>> >> >> parquet-cpp repos is no way a better approach.
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <
> wesmckinn@gmail.com>
> >>> >> wrote:
> >>> >> >>
> >>> >> >>> @Antoine
> >>> >> >>>
> >>> >> >>> > By the way, one concern with the monorepo approach: it would
> >>> slightly
> >>> >> >>> increase Arrow CI times (which are already too large).
> >>> >> >>>
> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >>> >> >>>
> >>> >> >>> A Parquet CI run takes about 28 minutes:
> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >>> >> >>>
> open
> >>> >> source
> >>> >> >>> >>> > projects that have used nightly build systems that pinned
> >>> >> versions of
> >>> >> >>> >>> > dependent software.
> >>> >> >>> >>> >
> >>> >> >>> >>> > That being said, I think it makes sense to merge the
> codebases
> >>> >> in the
> >>> >> >>> >>> short
> >>> >> >>> >>> > term with the express purpose of separating them in the
> near
> >>> >> term.
> >>> >> >>> My
> >>> >> >>> >>> > reasoning is as follows. By putting the codebases
> together,
> >>> you
> >>> >> can
> >>> >> >>> more
> >>> >> >>> >>> > easily delineate the boundaries between the API's with a
> >>> single
> >>> >> PR.
> >>> >> >>> >>> Second,
> >>> >> >>> >>> > it will force the build tooling to converge instead of
> >>> diverge,
> >>> >> >>> which has
> >>> >> >>> >>> > already happened. Once the boundaries and tooling have
> been
> >>> >> sorted
> >>> >> >>> out,
> >>> >> >>> >>> it
> >>> >> >>> >>> > should be easy to separate them back into their own
> codebases.
> >>> >> >>> >>> >
> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >>> codebases
> >>> >> for
> >>> >> >>> arrow
> >>> >> >>> >>> > be separated from other languages. Looking at it from the
> >>> >> >>> perspective of
> >>> >> >>> >>> a
> >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
> >>> large
> >>> >> tax
> >>> >> >>> to
> >>> >> >>> >>> pay
> >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's
> in the
> >>> >> 0.10.0
> >>> >> >>> >>> > release of arrow, many of which were holding up the
> release. I
> >>> >> hope
> >>> >> >>> that
> >>> >> >>> >>> > seems like a reasonable compromise, and I think it will
> help
> >>> >> reduce
> >>> >> >>> the
> >>> >> >>> >>> > complexity of the build/release tooling.
> >>> >> >>> >>> >
> >>> >> >>> >>> >
> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> >>> >> ted.dunning@gmail.com>
> >>> >> >>> >>> wrote:
> >>> >> >>> >>> >
> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> >>> >> wesmckinn@gmail.com>
> >>> >> >>> >>> wrote:
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> >
> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for
> >>> >> stability
> >>> >> >>> and
> >>> >> >>> >>> API
> >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the
> HDFS
> >>> >> >>> community
> >>> >> >>> >>> took
> >>> >> >>> >>> >> a
> >>> >> >>> >>> >> > > significantly long time for the very same reason.
> >>> >> >>> >>> >> >
> >>> >> >>> >>> >> > Please don't use bad experiences from another open
> source
> >>> >> >>> community as
> >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things
> didn't
> >>> go
> >>> >> the
> >>> >> >>> way
> >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
> >>> community
> >>> >> which
> >>> >> >>> >>> >> > happens to operate under a similar open governance
> model.
> >>> >> >>> >>> >>
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> There are some more radical and community building
> options as
> >>> >> well.
> >>> >> >>> Take
> >>> >> >>> >>> >> the subversion project as a precedent. With subversion,
> any
> >>> >> Apache
> >>> >> >>> >>> >> committer can request and receive a commit bit on some
> large
> >>> >> >>> fraction of
> >>> >> >>> >>> >> subversion.
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> So why not take this a bit further and give every parquet
> >>> >> committer
> >>> >> >>> a
> >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
> >>> >> committers in
> >>> >> >>> >>> Arrow?
> >>> >> >>> >>> >> Possibly even make it policy that every Parquet
> committer who
> >>> >> asks
> >>> >> >>> will
> >>> >> >>> >>> be
> >>> >> >>> >>> >> given committer status in Arrow.
> >>> >> >>> >>> >>
> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> >>> >> committers
> >>> >> >>> >>> can't be
> >>> >> >>> >>> >> worried at that point whether their patches will get
> merged;
> >>> >> they
> >>> >> >>> can
> >>> >> >>> >>> just
> >>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting
> in the
> >>> >> >>> Parquet
> >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
> >>> parquet so
> >>> >> >>> why not
> >>> >> >>> >>> >> invite them in?
> >>> >> >>> >>> >>
> >>> >> >>> >>>
> >>> >> >>>
> >>> >> >>
> >>> >> >>
> >>> >> >> --
> >>> >> >> regards,
> >>> >> >> Deepak Majeti
> >>> >>
> >>> >
> >>> >
> >>> > --
> >>> > regards,
> >>> > Deepak Majeti
> >>>
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
Do other people have opinions? I would like to undertake this work in
the near future (the next 8-10 weeks); I would be OK with taking
responsibility for the primary codebase surgery.

Some logistical questions:

* We have a handful of pull requests in flight in parquet-cpp that
would need to be resolved / merged
* We should probably cut a status-quo cpp-1.5.0 release, with future
releases cut out of the new structure
* Management of shared commit rights (I can discuss with the Arrow
PMC; I believe that approving any committer who has actively
maintained parquet-cpp should be a reasonable approach per Ted's
comments)

If working more closely together does not work out after some period
of time, I will be fully supportive of a fork or something like it.

Thanks,
Wes

On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
> Thanks Tim.
>
> Indeed, it's not very simple. Just today Antoine cleaned up some
> platform code intending to improve the performance of bit-packing in
> Parquet writes, and we ended up with 2 interdependent PRs
>
> * https://github.com/apache/parquet-cpp/pull/483
> * https://github.com/apache/arrow/pull/2355
>
> Changes that impact the Python interface to Parquet are even more complex.
>
> Adding options to Arrow's CMake build system to only build
> Parquet-related code and dependencies (in a monorepo framework) would
> not be difficult, and amount to writing "make parquet".
>
> See e.g. https://stackoverflow.com/a/17201375. The desired commands to
> build and install the Parquet core libraries and their dependencies
> would be:
>
> ninja parquet && ninja install
>
> - Wes
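
[The "make parquet" idea above could look roughly like the following in a
monorepo's top-level CMakeLists.txt. This is a hypothetical sketch, not the
actual Arrow build system; the option and target names are illustrative.]

```cmake
# Hypothetical monorepo CMakeLists.txt fragment. ARROW_PARQUET, ARROW_SOURCES,
# and PARQUET_SOURCES are illustrative names, not real Arrow build variables.
option(ARROW_PARQUET "Build the Parquet libraries" OFF)

# The platform + columnar code is always built.
add_library(arrow ${ARROW_SOURCES})

if(ARROW_PARQUET)
  # parquet-core becomes an ordinary target in the same build graph,
  # so "ninja parquet" builds only it and its arrow dependency.
  add_library(parquet ${PARQUET_SOURCES})
  target_link_libraries(parquet PRIVATE arrow)
  install(TARGETS parquet arrow LIBRARY DESTINATION lib)
endif()
```

[With something along these lines, `cmake -DARROW_PARQUET=ON ..` followed by
`ninja parquet && ninja install` would build and install just the Parquet
core libraries and their dependencies, as described above.]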
>
> On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> <ta...@cloudera.com.invalid> wrote:
>> I don't have a direct stake in this beyond wanting to see Parquet be
>> successful, but I thought I'd give my two cents.
>>
>> For me, the thing that makes the biggest difference in contributing to a
>> new codebase is the number of steps in the workflow for writing, testing,
>> posting and iterating on a commit and also the number of opportunities for
>> missteps. The size of the repo and build/test times matter but are
>> secondary so long as the workflow is simple and reliable.
>>
>> I don't really know what the current state of things is, but it sounds like
>> it's not as simple as check out -> build -> test if you're doing a
>> cross-repo change. Circular dependencies are a real headache.
>>
>> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>>> hi,
>>>
>>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <ma...@gmail.com>
>>> wrote:
>>> > I think the circular dependency can be broken if we build a new library
>>> for
>>> > the platform code. This will also make it easy for other projects such as
>>> > ORC to use it.
>>> > I also remember your proposal a while ago of having a separate project
>>> for
>>> > the platform code.  That project can live in the arrow repo. However, one
>>> > has to clone the entire apache arrow repo but can just build the platform
>>> > code. This will be temporary until we can find a new home for it.
>>> >
>>> > The dependency will look like:
>>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>>> > libplatform(platform api)
>>> >
>>> > CI workflow will clone the arrow project twice, once for the platform
>>> > library and once for the arrow-core/bindings library.
>>>
>>> This seems like an interesting proposal; the best place to work toward
>>> this goal (if it is even possible; the build system interactions and
>>> ASF release management are the hard problems) is to have all of the
>>> code in a single repository. ORC could already be using Arrow if it
>>> wanted, but the ORC contributors aren't active in Arrow.
>>>
>>> >
>>> > There is no doubt that the collaborations between the Arrow and Parquet
>>> > communities so far have been very successful.
>>> > The reason to maintain this relationship moving forward is to continue to
>>> > reap the mutual benefits.
>>> > We should continue to take advantage of sharing code as well. However, I
>>> > don't see any code sharing opportunities between arrow-core and the
>>> > parquet-core. Both have different functions.
>>>
>>> I think you mean the Arrow columnar format. The Arrow columnar format
>>> is only one part of a project that has become quite large already
> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>>
>>> >
>>> > We are at a point where the parquet-cpp public API is pretty stable. We
>>> > already passed that difficult stage. My take at arrow and parquet is to
>>> > keep them nimble since we can.
>>>
> >>> I believe that parquet-core still has significant progress ahead of it. We
>>> have done little work in asynchronous IO and concurrency which would
>>> yield both improved read and write throughput. This aligns well with
>>> other concurrency and async-IO work planned in the Arrow platform. I
>>> believe that more development will happen on parquet-core once the
>>> development process issues are resolved by having a single codebase,
>>> single build system, and a single CI framework.
>>>
>>> I have some gripes about design decisions made early in parquet-cpp,
>>> like the use of C++ exceptions. So while "stability" is a reasonable
>>> goal I think we should still be open to making significant changes in
>>> the interest of long term progress.
>>>
>>> Having now worked on these projects for more than 2 and a half years
> >>> and been the most frequent contributor to both codebases, I'm sadly far
>>> past the "breaking point" and not willing to continue contributing in
>>> a significant way to parquet-cpp if the projects remained structured
>>> as they are now. It's hampering progress and not serving the
>>> community.
>>>
>>> - Wes
>>>
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >
>>> >> > The current Arrow adaptor code for parquet should live in the arrow
>>> >> repo. That will remove a majority of the dependency issues. Joshua's
>>> work
>>> >> would not have been blocked in parquet-cpp if that adapter was in the
>>> arrow
>>> >> repo.  This will be similar to the ORC adaptor.
>>> >>
>>> >> This has been suggested before, but I don't see how it would alleviate
>>> >> any issues because of the significant dependencies on other parts of
>>> >> the Arrow codebase. What you are proposing is:
>>> >>
>>> >> - (Arrow) arrow platform
>>> >> - (Parquet) parquet core
>>> >> - (Arrow) arrow columnar-parquet adapter interface
>>> >> - (Arrow) Python bindings
>>> >>
>>> >> To make this work, somehow Arrow core / libarrow would have to be
>>> >> built before invoking the Parquet core part of the build system. You
>>> >> would need to pass dependent targets across different CMake build
>>> >> systems; I don't know if it's possible (I spent some time looking into
>>> >> it earlier this year). This is what I meant by the lack of a "concrete
>>> >> and actionable plan". The only thing that would really work would be
>>> >> for the Parquet core to be "included" in the Arrow build system
>>> >> somehow rather than using ExternalProject. Currently Parquet builds
>>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>>> >> system because it's only depended upon by the Python bindings.
>>> >>
>>> >> And even if a solution could be devised, it would not wholly resolve
>>> >> the CI workflow issues.
>>> >>
>>> >> You could make Parquet completely independent of the Arrow codebase,
>>> >> but at that point there is little reason to maintain a relationship
>>> >> between the projects or their communities. We have spent a great deal
>>> >> of effort refactoring the two projects to enable as much code sharing
>>> >> as there is now.
>>> >>
>>> >> - Wes
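> >>>
> >>> [For readers unfamiliar with the constraint described above: an
> >>> ExternalProject runs a separate CMake configure/build at build time,
> >>> so its targets are invisible to the outer project. A simplified,
> >>> illustrative sketch of the pattern — not the actual parquet-cpp
> >>> CMake code; names and paths are assumptions:]

```cmake
# Illustrative sketch of the ExternalProject pattern parquet-cpp uses to
# build Arrow. The nested build is a *separate* CMake invocation, so the
# outer project cannot depend on Arrow's CMake targets directly.
include(ExternalProject)

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/arrow_install)

# Only the *installed* artifacts can be consumed, via an imported target:
add_library(arrow_shared SHARED IMPORTED)
set_target_properties(arrow_shared PROPERTIES
  IMPORTED_LOCATION ${CMAKE_BINARY_DIR}/arrow_install/lib/libarrow.so)
add_dependencies(arrow_shared arrow_ep)
```

> >>> [An in-tree build (add_subdirectory) exposes real targets that can be
> >>> passed as dependencies, which is essentially what the monorepo
> >>> proposal buys.]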
>>> >>
>>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >> >> If you still strongly feel that the only way forward is to clone the
>>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>>> >> parquet-cpp repos is no way a better approach.
>>> >> >
>>> >> > Yes, indeed. In my view, the next best option after a monorepo is to
>>> >> > fork. That would obviously be a bad outcome for the community.
>>> >> >
>>> >> > It doesn't look like I will be able to convince you that a monorepo is
>>> >> > a good idea; what I would ask instead is that you be willing to give
>>> >> > it a shot, and if it turns out in the way you're describing (which I
>>> >> > don't think it will) then I suggest that we fork at that point.
>>> >> >
>>> >> > - Wes
>>> >> >
>>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>>> majeti.deepak@gmail.com>
>>> >> wrote:
>>> >> >> Wes,
>>> >> >>
>>> >> >> Unfortunately, I cannot show you any practical fact-based problems
>>> of a
>>> >> >> non-existent Arrow-Parquet mono-repo.
> >>> >> >> Bringing in related Apache community experiences is more meaningful
>>> >> than
>>> >> >> how mono-repos work at Google and other big organizations.
>>> >> >> We solely depend on volunteers and cannot hire full-time developers.
>>> >> >> You are very well aware of how difficult it has been to find more
>>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low
>>> >> >> contribution rate to its core components.
>>> >> >>
>>> >> >> We should target to ensure that new volunteers who want to contribute
>>> >> >> bug-fixes/features should spend the least amount of time in figuring
>>> out
>>> >> >> the project repo. We can never come up with an automated build system
>>> >> that
>>> >> >> caters to every possible environment.
>>> >> >> My only concern is if the mono-repo will make it harder for new
>>> >> developers
>>> >> >> to work on parquet-cpp core just due to the additional code, build
>>> and
>>> >> test
>>> >> >> dependencies.
>>> >> >> I am not saying that the Arrow community/committers will be less
>>> >> >> co-operative.
>>> >> >> I just don't think the mono-repo structure model will be sustainable
>>> in
>>> >> an
>>> >> >> open source community unless there are long-term vested interests. We
>>> >> can't
>>> >> >> predict that.
>>> >> >>
>>> >> >> The current circular dependency problems between Arrow and Parquet
>>> is a
>>> >> >> major problem for the community and it is important.
>>> >> >>
>>> >> >> The current Arrow adaptor code for parquet should live in the arrow
>>> >> repo.
>>> >> >> That will remove a majority of the dependency issues.
>>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>>> adapter
>>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>>> >> >>
>>> >> >> The platform API code is pretty stable at this point. Minor changes
>>> in
>>> >> the
>>> >> >> future to this code should not be the main reason to combine the
>>> arrow
>>> >> >> parquet repos.
>>> >> >>
>>> >> >> "
>>> >> >> *I question whether it's worth the community's time long term to
>>> wear*
>>> >> >>
>>> >> >>
>>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
>>> >> eachlibrary
>>> >> >> to plug components together rather than utilizing commonplatform
>>> APIs.*"
>>> >> >>
>>> >> >> My answer to your question below would be "Yes".
>>> Modularity/separation
>>> >> is
>>> >> >> very important in an open source community where priorities of
>>> >> contributors
>>> >> >> are often short term.
>>> >> >> The retention is low and therefore the acquisition costs should be
>>> low
>>> >> as
> >>> >> well. In my view, this is the community-over-code approach. Minor
>>> >> code
>>> >> >> duplication is not a deal breaker.
>>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>>> data
>>> >> >> space serving their own functions.
>>> >> >>
>>> >> >> If you still strongly feel that the only way forward is to clone the
>>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
>>> two
>>> >> >> parquet-cpp repos is no way a better approach.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
>>> >> wrote:
>>> >> >>
>>> >> >>> @Antoine
>>> >> >>>
>>> >> >>> > By the way, one concern with the monorepo approach: it would
>>> slightly
>>> >> >>> increase Arrow CI times (which are already too large).
>>> >> >>>
>>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>>> >> >>>
> >>> >> >>> A Parquet run takes about 28 minutes:
>>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>> >> >>>
>>> >> >>> Inevitably we will need to create some kind of bot to run certain
>>> >> >>> builds on-demand based on commit / PR metadata or on request.
>>> >> >>>
>>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be
>>> >> >>> made substantially shorter by moving some of the slower parts (like
>>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
>>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>>> >> >>> improve build times (valgrind build could be moved to a nightly
>>> >> >>> exhaustive test run)
>>> >> >>>
>>> >> >>> - Wes
>>> >> >>>
>>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmckinn@gmail.com
>>> >
>>> >> >>> wrote:
>>> >> >>> >> I would like to point out that arrow's use of orc is a great
>>> >> example of
>>> >> >>> how it would be possible to manage parquet-cpp as a separate
>>> codebase.
>>> >> That
>>> >> >>> gives me hope that the projects could be managed separately some
>>> day.
>>> >> >>> >
>>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>>> codebase
>>> >> >>> > features several areas of duplicated logic which could be
>>> replaced by
>>> >> >>> > components from the Arrow platform for better platform-wide
>>> >> >>> > interoperability:
>>> >> >>> >
>>> >> >>> >
>>> >> >>>
> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>> >> >>> >
>>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> >> >>> >
>>> >> >>>
> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>> >> >>> >
>>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> >> >>> >
> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>> >> >>> >
>>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them
>>> from
>>> >> >>> > leaking to third party linkers when statically linked (ORC is only
>>> >> >>> > available for static linking at the moment AFAIK).
>>> >> >>> >
>>> >> >>> > I question whether it's worth the community's time long term to
>>> wear
>>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>>> >> >>> > library to plug components together rather than utilizing common
>>> >> >>> > platform APIs.
>>> >> >>> >
>>> >> >>> > - Wes
>>> >> >>> >
>>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>>> >> joshuastorck@gmail.com>
>>> >> >>> wrote:
>>> >> >>> >> You're point about the constraints of the ASF release process are
>>> >> well
>>> >> >>> >> taken and as a developer who's trying to work in the current
>>> >> >>> environment I
>>> >> >>> >> would be much happier if the codebases were merged. The main
>>> issues
>>> >> I
>>> >> >>> worry
>>> >> >>> >> about when you put codebases like these together are:
>>> >> >>> >>
>>> >> >>> >> 1. The delineation of API's become blurred and the code becomes
>>> too
>>> >> >>> coupled
>>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree are
>>> >> >>> delayed
>>> >> >>> >> by artifacts higher in the dependency tree
>>> >> >>> >>
>>> >> >>> >> If the project/release management is structured well and someone
>>> >> keeps
>>> >> >>> an
>>> >> >>> >> eye on the coupling, then I don't have any concerns.
>>> >> >>> >>
>>> >> >>> >> I would like to point out that arrow's use of orc is a great
>>> >> example of
>>> >> >>> how
>>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>>> codebase.
>>> >> That
>>> >> >>> >> gives me hope that the projects could be managed separately some
>>> >> day.
>>> >> >>> >>
>>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>>> wesmckinn@gmail.com>
>>> >> >>> wrote:
>>> >> >>> >>
>>> >> >>> >>> hi Josh,
>>> >> >>> >>>
>>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>>> and
>>> >> >>> tying
>>> >> >>> >>> them together seems like the wrong choice.
>>> >> >>> >>>
>>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
>>> >> >>> >>> building these projects -- my argument (which I think you agree
>>> >> with?)
>>> >> >>> >>> is that we should work more closely together until the community
>>> >> grows
>>> >> >>> >>> large enough to support larger-scope process than we have now.
>>> As
>>> >> >>> >>> you've seen, our process isn't serving developers of these
>>> >> projects.
>>> >> >>> >>>
>>> >> >>> >>> > I also think build tooling should be pulled into its own
>>> >> codebase.
>>> >> >>> >>>
>>> >> >>> >>> I don't see how this can possibly be practical taking into
>>> >> >>> >>> consideration the constraints imposed by the combination of the
>>> >> GitHub
>>> >> >>> >>> platform and the ASF release process. I'm all for being
>>> idealistic,
>>> >> >>> >>> but right now we need to be practical. Unless we can devise a
>>> >> >>> >>> practical procedure that can accommodate at least 1 patch per
>>> day
>>> >> >>> >>> which may touch both code and build system simultaneously
>>> without
>>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how
>>> we
>>> >> can
>>> >> >>> >>> move forward.
>>> >> >>> >>>
>>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>>> >> in the
>>> >> >>> >>> short term with the express purpose of separating them in the
>>> near
>>> >> >>> term.
>>> >> >>> >>>
>>> >> >>> >>> I would agree but only if separation can be demonstrated to be
>>> >> >>> >>> practical and result in net improvements in productivity and
>>> >> community
>>> >> >>> >>> growth. I think experience has clearly demonstrated that the
>>> >> current
>>> >> >>> >>> separation is impractical, and is causing problems.
>>> >> >>> >>>
>>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>>> >> >>> >>> development process and ASF releases separately. My argument is
>>> as
>>> >> >>> >>> follows:
>>> >> >>> >>>
>>> >> >>> >>> * Monorepo for development (for practicality)
>>> >> >>> >>> * Releases structured according to the desires of the PMCs
>>> >> >>> >>>
>>> >> >>> >>> - Wes
>>> >> >>> >>>
>>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>>> >> joshuastorck@gmail.com
>>> >> >>> >
>>> >> >>> >>> wrote:
>>> >> >>> >>> > I recently worked on an issue that had to be implemented in
>>> >> >>> parquet-cpp
>>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>>> >> (ARROW-2585,
>>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
>>> >> hard to
>>> >> >>> work
>>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>>> >> (created on
>>> >> >>> May
>>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
>>> >> recently
>>> >> >>> >>> merged.
>>> >> >>> >>> > I couldn't even address any CI issues in the PR because the
>>> >> change in
>>> >> >>> >>> arrow
>>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>>> >> >>> >>> run_clang_format.py
>>> >> >>> >>> > script in the arrow project only to find out later that there
>>> >> was an
>>> >> >>> >>> exact
>>> >> >>> >>> > copy of it in parquet-cpp.
>>> >> >>> >>> >
>>> >> >>> >>> > However, I don't think merging the codebases makes sense in
>>> the
>>> >> long
>>> >> >>> >>> term.
>>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>>> and
>>> >> >>> tying
>>> >> >>> >>> them
>>> >> >>> >>> > together seems like the wrong choice. There will be other
>>> formats
>>> >> >>> that
>>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
>>> Orc),
>>> >> so I
>>> >> >>> >>> don't
>>> >> >>> >>> > see why parquet should be special. I also think build tooling
>>> >> should
>>> >> >>> be
>>> >> >>> >>> > pulled into its own codebase. GNU has had a long history of
>>> >> >>> developing
>>> >> >>> >>> open
>>> >> >>> >>> > source C/C++ projects that way and made projects like
>>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
>>> >> good
>>> >> >>> >>> > counter-example since there have been lots of successful open
>>> >> source
>>> >> >>> >>> > projects that have used nightly build systems that pinned
>>> >> versions of
>>> >> >>> >>> > dependent software.
>>> >> >>> >>> >
>>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>>> >> in the
>>> >> >>> >>> short
>>> >> >>> >>> > term with the express purpose of separating them in the near
>>> >> term.
>>> >> >>> My
>>> >> >>> >>> > reasoning is as follows. By putting the codebases together,
>>> you
>>> >> can
>>> >> >>> more
>>> >> >>> >>> > easily delineate the boundaries between the API's with a
>>> single
>>> >> PR.
>>> >> >>> >>> Second,
>>> >> >>> >>> > it will force the build tooling to converge instead of
>>> diverge,
>>> >> >>> which has
>>> >> >>> >>> > already happened. Once the boundaries and tooling have been
>>> >> sorted
>>> >> >>> out,
>>> >> >>> >>> it
>>> >> >>> >>> > should be easy to separate them back into their own codebases.
>>> >> >>> >>> >
>>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>>> codebases
>>> >> for
>>> >> >>> arrow
>>> >> >>> >>> > be separated from other languages. Looking at it from the
>>> >> >>> perspective of
>>> >> >>> >>> a
>>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
>>> large
>>> >> tax
>>> >> >>> to
>>> >> >>> >>> pay
>>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the
>>> >> 0.10.0
>>> >> >>> >>> > release of arrow, many of which were holding up the release. I
>>> >> hope
>>> >> >>> that
>>> >> >>> >>> > seems like a reasonable compromise, and I think it will help
>>> >> reduce
>>> >> >>> the
>>> >> >>> >>> > complexity of the build/release tooling.
>>> >> >>> >>> >
>>> >> >>> >>> >
>>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>>> >> ted.dunning@gmail.com>
>>> >> >>> >>> wrote:
>>> >> >>> >>> >
>>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>>> >> wesmckinn@gmail.com>
>>> >> >>> >>> wrote:
>>> >> >>> >>> >>
>>> >> >>> >>> >> >
>>> >> >>> >>> >> > > The community will be less willing to accept large
>>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>>> >> stability
>>> >> >>> and
>>> >> >>> >>> API
>>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>>> >> >>> community
>>> >> >>> >>> took
>>> >> >>> >>> >> a
>>> >> >>> >>> >> > > significantly long time for the very same reason.
>>> >> >>> >>> >> >
>>> >> >>> >>> >> > Please don't use bad experiences from another open source
>>> >> >>> community as
>>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't
>>> go
>>> >> the
>>> >> >>> way
>>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>>> community
>>> >> which
>>> >> >>> >>> >> > happens to operate under a similar open governance model.
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> There are some more radical and community building options as
>>> >> well.
>>> >> >>> Take
>>> >> >>> >>> >> the subversion project as a precedent. With subversion, any
>>> >> Apache
>>> >> >>> >>> >> committer can request and receive a commit bit on some large
>>> >> >>> fraction of
>>> >> >>> >>> >> subversion.
>>> >> >>> >>> >>
>>> >> >>> >>> >> So why not take this a bit further and give every parquet
>>> >> committer
>>> >> >>> a
>>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>>> >> committers in
>>> >> >>> >>> Arrow?
>>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who
>>> >> asks
>>> >> >>> will
>>> >> >>> >>> be
>>> >> >>> >>> >> given committer status in Arrow.
>>> >> >>> >>> >>
>>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>>> >> committers
>>> >> >>> >>> can't be
>>> >> >>> >>> >> worried at that point whether their patches will get merged;
>>> >> they
>>> >> >>> can
>>> >> >>> >>> just
>>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>>> >> >>> Parquet
>>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>>> parquet so
>>> >> >>> why not
>>> >> >>> >>> >> invite them in?
>>> >> >>> >>> >>
>>> >> >>> >>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> regards,
>>> >> >> Deepak Majeti
>>> >>
>>> >
>>> >
>>> > --
>>> > regards,
>>> > Deepak Majeti
>>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
Do other people have opinions? I would like to undertake this work in
the near future (the next 8-10 weeks); I would be OK with taking
responsibility for the primary codebase surgery.

Some logistical questions:

* We have a handful of pull requests in flight in parquet-cpp that
would need to be resolved / merged
* We should probably cut a status-quo cpp-1.5.0 release, with future
releases cut out of the new structure
* Management of shared commit rights (I can discuss with the Arrow
PMC; I believe that approving any committer who has actively
maintained parquet-cpp should be a reasonable approach per Ted's
comments)

If working more closely together proves not to be working out after
some period of time, I will be fully supportive of a fork or something
like it.

Thanks,
Wes

On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <we...@gmail.com> wrote:
> Thanks Tim.
>
> Indeed, it's not very simple. Just today Antoine cleaned up some
> platform code intending to improve the performance of bit-packing in
> Parquet writes, and we ended up with 2 interdependent PRs:
>
> * https://github.com/apache/parquet-cpp/pull/483
> * https://github.com/apache/arrow/pull/2355
>
> Changes that impact the Python interface to Parquet are even more complex.
>
> Adding options to Arrow's CMake build system to only build
> Parquet-related code and dependencies (in a monorepo framework) would
> not be difficult, and would amount to writing a "make parquet" target.
>
> See e.g. https://stackoverflow.com/a/17201375. The desired commands to
> build and install the Parquet core libraries and their dependencies
> would be:
>
> ninja parquet && ninja install
>
> - Wes
>
> On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> <ta...@cloudera.com.invalid> wrote:
>> I don't have a direct stake in this beyond wanting to see Parquet be
>> successful, but I thought I'd give my two cents.
>>
>> For me, the thing that makes the biggest difference in contributing to a
>> new codebase is the number of steps in the workflow for writing, testing,
>> posting and iterating on a commit and also the number of opportunities for
>> missteps. The size of the repo and build/test times matter but are
>> secondary so long as the workflow is simple and reliable.
>>
>> I don't really know what the current state of things is, but it sounds like
>> it's not as simple as check out -> build -> test if you're doing a
>> cross-repo change. Circular dependencies are a real headache.
>>
>> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>>> hi,
>>>
>>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <ma...@gmail.com>
>>> wrote:
>>> > I think the circular dependency can be broken if we build a new
>>> > library for the platform code. This will also make it easy for other
>>> > projects such as ORC to use it.
>>> > I also remember your proposal a while ago of having a separate
>>> > project for the platform code.  That project can live in the arrow
>>> > repo. However, one has to clone the entire apache arrow repo but can
>>> > just build the platform code. This will be temporary until we can
>>> > find a new home for it.
>>> >
>>> > The dependency will look like:
>>> > libarrow (arrow core / bindings) <- libparquet (parquet core) <-
>>> > libplatform (platform api)
>>> >
>>> > CI workflow will clone the arrow project twice, once for the platform
>>> > library and once for the arrow-core/bindings library.
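[Editor's note: the layering proposed above could be sketched in CMake roughly as follows. This is a hypothetical illustration only; the target and source names are invented and do not come from either project's actual build files.]

```cmake
# Bottom layer: shared platform code (file interfaces, memory pools,
# bit utilities). All names below are illustrative.
add_library(platform src/platform.cc)

# Middle layer: Parquet core, depending only on the platform library.
add_library(parquet src/parquet.cc)
target_link_libraries(parquet PUBLIC platform)

# Top layer: Arrow core / bindings, depending on the lower layers.
add_library(arrow src/arrow.cc)
target_link_libraries(arrow PUBLIC parquet platform)
```

With this layering, a consumer such as ORC could link against the platform library alone without pulling in Arrow or Parquet.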
>>>
>>> This seems like an interesting proposal; the best place to work toward
>>> this goal (if it is even possible; the build system interactions and
>>> ASF release management are the hard problems) is to have all of the
>>> code in a single repository. ORC could already be using Arrow if it
>>> wanted, but the ORC contributors aren't active in Arrow.
>>>
>>> >
>>> > There is no doubt that the collaborations between the Arrow and Parquet
>>> > communities so far have been very successful.
>>> > The reason to maintain this relationship moving forward is to continue to
>>> > reap the mutual benefits.
>>> > We should continue to take advantage of sharing code as well. However, I
>>> > don't see any code sharing opportunities between arrow-core and the
>>> > parquet-core. Both have different functions.
>>>
>>> I think you mean the Arrow columnar format. The Arrow columnar format
>>> is only one part of a project that has become quite large already
>>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>>
>>> >
>>> > We are at a point where the parquet-cpp public API is pretty stable. We
>>> > already passed that difficult stage. My take at arrow and parquet is to
>>> > keep them nimble since we can.
>>>
>>> I believe that parquet-core has progress to make yet ahead of it. We
>>> have done little work in asynchronous IO and concurrency which would
>>> yield both improved read and write throughput. This aligns well with
>>> other concurrency and async-IO work planned in the Arrow platform. I
>>> believe that more development will happen on parquet-core once the
>>> development process issues are resolved by having a single codebase,
>>> single build system, and a single CI framework.
>>>
>>> I have some gripes about design decisions made early in parquet-cpp,
>>> like the use of C++ exceptions. So while "stability" is a reasonable
>>> goal I think we should still be open to making significant changes in
>>> the interest of long term progress.
>>>
>>> Having now worked on these projects for more than 2 and a half years
>>> and been the most frequent contributor to both codebases, I'm sadly far
>>> past the "breaking point" and not willing to continue contributing in
>>> a significant way to parquet-cpp if the projects remained structured
>>> as they are now. It's hampering progress and not serving the
>>> community.
>>>
>>> - Wes
>>>
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >
>>> >> > The current Arrow adaptor code for parquet should live in the arrow
>>> >> > repo. That will remove a majority of the dependency issues. Joshua's
>>> >> > work would not have been blocked in parquet-cpp if that adapter was
>>> >> > in the arrow repo.  This will be similar to the ORC adaptor.
>>> >>
>>> >> This has been suggested before, but I don't see how it would alleviate
>>> >> any issues because of the significant dependencies on other parts of
>>> >> the Arrow codebase. What you are proposing is:
>>> >>
>>> >> - (Arrow) arrow platform
>>> >> - (Parquet) parquet core
>>> >> - (Arrow) arrow columnar-parquet adapter interface
>>> >> - (Arrow) Python bindings
>>> >>
>>> >> To make this work, somehow Arrow core / libarrow would have to be
>>> >> built before invoking the Parquet core part of the build system. You
>>> >> would need to pass dependent targets across different CMake build
>>> >> systems; I don't know if it's possible (I spent some time looking into
>>> >> it earlier this year). This is what I meant by the lack of a "concrete
>>> >> and actionable plan". The only thing that would really work would be
>>> >> for the Parquet core to be "included" in the Arrow build system
>>> >> somehow rather than using ExternalProject. Currently Parquet builds
>>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>>> >> system because it's only depended upon by the Python bindings.
>>> >>
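[Editor's note: for readers unfamiliar with the mechanism under discussion, ExternalProject builds a dependency as an opaque sub-build, which is why targets cannot be shared across the two build systems. A simplified sketch follows; the options shown are illustrative, not the actual parquet-cpp build code.]

```cmake
include(ExternalProject)

# Build Arrow as an opaque external sub-build. The real parquet-cpp
# toolchain used different options; this is a minimal illustration.
ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  GIT_TAG        master
  SOURCE_SUBDIR  cpp
  CMAKE_ARGS     -DCMAKE_BUILD_TYPE=Release
                 -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>)
```

Because `arrow_ep` is a single opaque step, the enclosing project cannot depend on individual CMake targets defined inside it; it can only consume the installed artifacts.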
>>> >> And even if a solution could be devised, it would not wholly resolve
>>> >> the CI workflow issues.
>>> >>
>>> >> You could make Parquet completely independent of the Arrow codebase,
>>> >> but at that point there is little reason to maintain a relationship
>>> >> between the projects or their communities. We have spent a great deal
>>> >> of effort refactoring the two projects to enable as much code sharing
>>> >> as there is now.
>>> >>
>>> >> - Wes
>>> >>
>>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >> >> If you still strongly feel that the only way forward is to clone the
>>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>>> >> parquet-cpp repos is no way a better approach.
>>> >> >
>>> >> > Yes, indeed. In my view, the next best option after a monorepo is to
>>> >> > fork. That would obviously be a bad outcome for the community.
>>> >> >
>>> >> > It doesn't look like I will be able to convince you that a monorepo is
>>> >> > a good idea; what I would ask instead is that you be willing to give
>>> >> > it a shot, and if it turns out in the way you're describing (which I
>>> >> > don't think it will) then I suggest that we fork at that point.
>>> >> >
>>> >> > - Wes
>>> >> >
>>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>>> majeti.deepak@gmail.com>
>>> >> wrote:
>>> >> >> Wes,
>>> >> >>
>>> >> >> Unfortunately, I cannot show you any practical fact-based problems
>>> >> >> of a non-existent Arrow-Parquet mono-repo.
>>> >> >> Bringing in related Apache community experiences are more
>>> >> >> meaningful than how mono-repos work at Google and other big
>>> >> >> organizations.
>>> >> >> We solely depend on volunteers and cannot hire full-time
>>> >> >> developers.
>>> >> >> You are very well aware of how difficult it has been to find more
>>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a
>>> >> >> low contribution rate to its core components.
>>> >> >>
>>> >> >> We should target to ensure that new volunteers who want to
>>> >> >> contribute bug-fixes/features should spend the least amount of time
>>> >> >> in figuring out the project repo. We can never come up with an
>>> >> >> automated build system that caters to every possible environment.
>>> >> >> My only concern is if the mono-repo will make it harder for new
>>> >> >> developers to work on parquet-cpp core just due to the additional
>>> >> >> code, build and test dependencies.
>>> >> >> I am not saying that the Arrow community/committers will be less
>>> >> >> co-operative.
>>> >> >> I just don't think the mono-repo structure model will be
>>> >> >> sustainable in an open source community unless there are long-term
>>> >> >> vested interests. We can't predict that.
>>> >> >>
>>> >> >> The current circular dependency problems between Arrow and Parquet
>>> >> >> is a major problem for the community and it is important.
>>> >> >>
>>> >> >> The current Arrow adaptor code for parquet should live in the arrow
>>> >> >> repo. That will remove a majority of the dependency issues.
>>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>>> >> >> adapter was in the arrow repo.  This will be similar to the ORC
>>> >> >> adaptor.
>>> >> >>
>>> >> >> The platform API code is pretty stable at this point. Minor changes
>>> >> >> in the future to this code should not be the main reason to combine
>>> >> >> the arrow parquet repos.
>>> >> >>
>>> >> >> "I question whether it's worth the community's time long term to
>>> >> >> wear ourselves out defining custom "ports" / virtual interfaces in
>>> >> >> each library to plug components together rather than utilizing
>>> >> >> common platform APIs."
>>> >> >>
>>> >> >> My answer to your question below would be "Yes".
>>> >> >> Modularity/separation is very important in an open source community
>>> >> >> where priorities of contributors are often short term.
>>> >> >> The retention is low and therefore the acquisition costs should be
>>> >> >> low as well. This is the community over code approach according to
>>> >> >> me. Minor code duplication is not a deal breaker.
>>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>>> >> >> data space serving their own functions.
>>> >> >>
>>> >> >> If you still strongly feel that the only way forward is to clone
>>> >> >> the parquet-cpp repo and part ways, I will withdraw my concern.
>>> >> >> Having two parquet-cpp repos is no way a better approach.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
>>> >> wrote:
>>> >> >>
>>> >> >>> @Antoine
>>> >> >>>
>>> >> >>> > By the way, one concern with the monorepo approach: it would
>>> slightly
>>> >> >>> increase Arrow CI times (which are already too large).
>>> >> >>>
>>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>>> >> >>>
>>> >> >>> Parquet run takes about 28
>>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>> >> >>>
>>> >> >>> Inevitably we will need to create some kind of bot to run certain
>>> >> >>> builds on-demand based on commit / PR metadata or on request.
>>> >> >>>
>>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be
>>> >> >>> made substantially shorter by moving some of the slower parts (like
>>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
>>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>>> >> >>> improve build times (valgrind build could be moved to a nightly
>>> >> >>> exhaustive test run)
>>> >> >>>
>>> >> >>> - Wes
>>> >> >>>
>>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmckinn@gmail.com
>>> >
>>> >> >>> wrote:
>>> >> >>> >> I would like to point out that arrow's use of orc is a great
>>> >> >>> >> example of how it would be possible to manage parquet-cpp as a
>>> >> >>> >> separate codebase. That gives me hope that the projects could be
>>> >> >>> >> managed separately some day.
>>> >> >>> >
>>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>>> >> >>> > codebase features several areas of duplicated logic which could be
>>> >> >>> > replaced by components from the Arrow platform for better
>>> >> >>> > platform-wide interoperability:
>>> >> >>> >
>>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>> >> >>> >
>>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them
>>> >> >>> > from leaking to third party linkers when statically linked (ORC is
>>> >> >>> > only
>>> >> >>> > available for static linking at the moment AFAIK).
>>> >> >>> >
>>> >> >>> > I question whether it's worth the community's time long term to wear
>>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>>> >> >>> > library to plug components together rather than utilizing common
>>> >> >>> > platform APIs.
>>> >> >>> >
>>> >> >>> > - Wes
>>> >> >>> >
>>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>>> >> joshuastorck@gmail.com>
>>> >> >>> wrote:
>>> >> >>> >> Your point about the constraints of the ASF release process is
>>> >> >>> >> well taken and as a developer who's trying to work in the
>>> >> >>> >> current environment I would be much happier if the codebases were
>>> >> >>> >> merged. The main issues I worry about when you put codebases like
>>> >> >>> >> these together are:
>>> >> >>> >>
>>> >> >>> >> 1. The delineation of API's becomes blurred and the code becomes
>>> >> >>> >> too coupled
>>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree are
>>> >> >>> >> delayed by artifacts higher in the dependency tree
>>> >> >>> >>
>>> >> >>> >> If the project/release management is structured well and someone
>>> >> >>> >> keeps an eye on the coupling, then I don't have any concerns.
>>> >> >>> >>
>>> >> >>> >> I would like to point out that arrow's use of orc is a great
>>> >> >>> >> example of how it would be possible to manage parquet-cpp as a
>>> >> >>> >> separate codebase. That gives me hope that the projects could be
>>> >> >>> >> managed separately some day.
>>> >> >>> >>
>>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>>> wesmckinn@gmail.com>
>>> >> >>> wrote:
>>> >> >>> >>
>>> >> >>> >>> hi Josh,
>>> >> >>> >>>
>>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>>> >> >>> >>> > and tying them together seems like the wrong choice.
>>> >> >>> >>>
>>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
>>> >> >>> >>> building these projects -- my argument (which I think you agree
>>> >> >>> >>> with?) is that we should work more closely together until the
>>> >> >>> >>> community grows large enough to support larger-scope process
>>> >> >>> >>> than we have now. As you've seen, our process isn't serving
>>> >> >>> >>> developers of these projects.
>>> >> >>> >>>
>>> >> >>> >>> > I also think build tooling should be pulled into its own
>>> >> >>> >>> > codebase.
>>> >> >>> >>>
>>> >> >>> >>> I don't see how this can possibly be practical taking into
>>> >> >>> >>> consideration the constraints imposed by the combination of the
>>> >> >>> >>> GitHub platform and the ASF release process. I'm all for being
>>> >> >>> >>> idealistic, but right now we need to be practical. Unless we can
>>> >> >>> >>> devise a practical procedure that can accommodate at least 1
>>> >> >>> >>> patch per day which may touch both code and build system
>>> >> >>> >>> simultaneously without being a hindrance to contributor or
>>> >> >>> >>> maintainer, I don't see how we can move forward.
>>> >> >>> >>>
>>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>>> >> >>> >>> > in the short term with the express purpose of separating them
>>> >> >>> >>> > in the near term.
>>> >> >>> >>>
>>> >> >>> >>> I would agree but only if separation can be demonstrated to be
>>> >> >>> >>> practical and result in net improvements in productivity and
>>> >> >>> >>> community growth. I think experience has clearly demonstrated
>>> >> >>> >>> that the current separation is impractical, and is causing
>>> >> >>> >>> problems.
>>> >> >>> >>>
>>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>>> >> >>> >>> development process and ASF releases separately. My argument is
>>> >> >>> >>> as follows:
>>> >> >>> >>>
>>> >> >>> >>> * Monorepo for development (for practicality)
>>> >> >>> >>> * Releases structured according to the desires of the PMCs
>>> >> >>> >>>
>>> >> >>> >>> - Wes
>>> >> >>> >>>
>>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>>> >> joshuastorck@gmail.com
>>> >> >>> >
>>> >> >>> >>> wrote:
>>> >> >>> >>> > I recently worked on an issue that had to be implemented in
>>> >> >>> >>> > parquet-cpp (ARROW-1644, ARROW-1599) but required changes in
>>> >> >>> >>> > arrow (ARROW-2585, ARROW-2586). I found the circular
>>> >> >>> >>> > dependencies confusing and hard to work with. For example, I
>>> >> >>> >>> > still have a PR open in parquet-cpp (created on May 10)
>>> >> >>> >>> > because of a PR that it depended on in arrow that was recently
>>> >> >>> >>> > merged. I couldn't even address any CI issues in the PR
>>> >> >>> >>> > because the change in arrow was not yet in master. In a
>>> >> >>> >>> > separate PR, I changed the run_clang_format.py script in the
>>> >> >>> >>> > arrow project only to find out later that there was an exact
>>> >> >>> >>> > copy of it in parquet-cpp.
>>> >> >>> >>> >
>>> >> >>> >>> > However, I don't think merging the codebases makes sense in
>>> >> >>> >>> > the long term. I can imagine use cases for parquet that don't
>>> >> >>> >>> > involve arrow and tying them together seems like the wrong
>>> >> >>> >>> > choice. There will be other formats that arrow needs to
>>> >> >>> >>> > support that will be kept separate (e.g. - Orc), so I don't
>>> >> >>> >>> > see why parquet should be special. I also think build tooling
>>> >> >>> >>> > should be pulled into its own codebase. GNU has had a long
>>> >> >>> >>> > history of developing open source C/C++ projects that way and
>>> >> >>> >>> > made projects like autoconf/automake/make to support them. I
>>> >> >>> >>> > don't think CI is a good counter-example since there have been
>>> >> >>> >>> > lots of successful open source projects that have used nightly
>>> >> >>> >>> > build systems that pinned versions of dependent software.
>>> >> >>> >>> >
>>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>>> >> >>> >>> > in the short term with the express purpose of separating them
>>> >> >>> >>> > in the near term. My reasoning is as follows. By putting the
>>> >> >>> >>> > codebases together, you can more easily delineate the
>>> >> >>> >>> > boundaries between the API's with a single PR. Second, it will
>>> >> >>> >>> > force the build tooling to converge instead of diverge, which
>>> >> >>> >>> > has already happened. Once the boundaries and tooling have
>>> >> >>> >>> > been sorted out, it should be easy to separate them back into
>>> >> >>> >>> > their own codebases.
>>> >> >>> >>> >
>>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>>> >> >>> >>> > codebases for arrow be separated from other languages. Looking
>>> >> >>> >>> > at it from the perspective of a parquet-cpp library user,
>>> >> >>> >>> > having a dependency on Java is a large tax to pay if you don't
>>> >> >>> >>> > need it. For example, there were 25 JIRA's in the 0.10.0
>>> >> >>> >>> > release of arrow, many of which were holding up the release. I
>>> >> >>> >>> > hope that seems like a reasonable compromise, and I think it
>>> >> >>> >>> > will help reduce the complexity of the build/release tooling.
>>> >> >>> >>> >
>>> >> >>> >>> >
>>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>>> >> ted.dunning@gmail.com>
>>> >> >>> >>> wrote:
>>> >> >>> >>> >
>>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>>> >> wesmckinn@gmail.com>
>>> >> >>> >>> wrote:
>>> >> >>> >>> >>
>>> >> >>> >>> >> >
>>> >> >>> >>> >> > > The community will be less willing to accept large
>>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>>> >> >>> >>> >> > > stability and API convergence. Our contributions to
>>> >> >>> >>> >> > > Libhdfs++ in the HDFS community took a significantly long
>>> >> >>> >>> >> > > time for the very same reason.
>>> >> >>> >>> >> >
>>> >> >>> >>> >> > Please don't use bad experiences from another open source
>>> >> >>> >>> >> > community as leverage in this discussion. I'm sorry that
>>> >> >>> >>> >> > things didn't go the way you wanted in Apache Hadoop but
>>> >> >>> >>> >> > this is a distinct community which happens to operate under
>>> >> >>> >>> >> > a similar open governance model.
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> There are some more radical and community building options as
>>> >> >>> >>> >> well. Take the subversion project as a precedent. With
>>> >> >>> >>> >> subversion, any Apache committer can request and receive a
>>> >> >>> >>> >> commit bit on some large fraction of subversion.
>>> >> >>> >>> >>
>>> >> >>> >>> >> So why not take this a bit further and give every parquet
>>> >> >>> >>> >> committer a commit bit in Arrow? Or even make them be first
>>> >> >>> >>> >> class committers in Arrow? Possibly even make it policy that
>>> >> >>> >>> >> every Parquet committer who asks will be given committer
>>> >> >>> >>> >> status in Arrow.
>>> >> >>> >>> >>
>>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>>> >> >>> >>> >> committers can't be worried at that point whether their
>>> >> >>> >>> >> patches will get merged; they can just merge them. Arrow
>>> >> >>> >>> >> shouldn't worry much about inviting in the Parquet
>>> >> >>> >>> >> committers. After all, Arrow already depends a lot on parquet
>>> >> >>> >>> >> so why not invite them in?
>>> >> >>> >>> >>
>>> >> >>> >>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> regards,
>>> >> >> Deepak Majeti
>>> >>
>>> >
>>> >
>>> > --
>>> > regards,
>>> > Deepak Majeti
>>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
Thanks Tim.

Indeed, it's not very simple. Just today Antoine cleaned up some
platform code intending to improve the performance of bit-packing in
Parquet writes, and we ended up with 2 interdependent PRs:

* https://github.com/apache/parquet-cpp/pull/483
* https://github.com/apache/arrow/pull/2355

Changes that impact the Python interface to Parquet are even more complex.

Adding options to Arrow's CMake build system to only build
Parquet-related code and dependencies (in a monorepo framework) would
not be difficult, and would amount to writing a "make parquet" target.

See e.g. https://stackoverflow.com/a/17201375. The desired commands to
build and install the Parquet core libraries and their dependencies
would be:

ninja parquet && ninja install
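[Editor's note: a minimal sketch of such an umbrella target follows. The target names are assumed for illustration; the actual Arrow build files may differ.]

```cmake
# Hypothetical umbrella target: "ninja parquet" builds only the Parquet
# libraries plus whatever they transitively depend on.
add_custom_target(parquet)
add_dependencies(parquet parquet_shared parquet_static)
```

Ninja builds exactly the named target and its transitive dependencies, so the rest of the monorepo is skipped.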

- Wes

On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
<ta...@cloudera.com.invalid> wrote:
> I don't have a direct stake in this beyond wanting to see Parquet be
> successful, but I thought I'd give my two cents.
>
> For me, the thing that makes the biggest difference in contributing to a
> new codebase is the number of steps in the workflow for writing, testing,
> posting and iterating on a commit and also the number of opportunities for
> missteps. The size of the repo and build/test times matter but are
> secondary so long as the workflow is simple and reliable.
>
> I don't really know what the current state of things is, but it sounds like
> it's not as simple as check out -> build -> test if you're doing a
> cross-repo change. Circular dependencies are a real headache.
>
> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi,
>>
>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> > I think the circular dependency can be broken if we build a new
>> > library for the platform code. This will also make it easy for other
>> > projects such as ORC to use it.
>> > I also remember your proposal a while ago of having a separate
>> > project for the platform code.  That project can live in the arrow
>> > repo. However, one has to clone the entire apache arrow repo but can
>> > just build the platform code. This will be temporary until we can
>> > find a new home for it.
>> >
>> > The dependency will look like:
>> > libarrow (arrow core / bindings) <- libparquet (parquet core) <-
>> > libplatform (platform api)
>> >
>> > CI workflow will clone the arrow project twice, once for the platform
>> > library and once for the arrow-core/bindings library.
>>
>> This seems like an interesting proposal; the best place to work toward
>> this goal (if it is even possible; the build system interactions and
>> ASF release management are the hard problems) is to have all of the
>> code in a single repository. ORC could already be using Arrow if it
>> wanted, but the ORC contributors aren't active in Arrow.
>>
>> >
>> > There is no doubt that the collaborations between the Arrow and Parquet
>> > communities so far have been very successful.
>> > The reason to maintain this relationship moving forward is to continue to
>> > reap the mutual benefits.
>> > We should continue to take advantage of sharing code as well. However, I
>> > don't see any code sharing opportunities between arrow-core and the
>> > parquet-core. Both have different functions.
>>
>> I think you mean the Arrow columnar format. The Arrow columnar format
>> is only one part of a project that has become quite large already
>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>
>> >
>> > We are at a point where the parquet-cpp public API is pretty stable. We
>> > already passed that difficult stage. My take at arrow and parquet is to
>> > keep them nimble since we can.
>>
>> I believe that parquet-core has progress to make yet ahead of it. We
>> have done little work in asynchronous IO and concurrency which would
>> yield both improved read and write throughput. This aligns well with
>> other concurrency and async-IO work planned in the Arrow platform. I
>> believe that more development will happen on parquet-core once the
>> development process issues are resolved by having a single codebase,
>> single build system, and a single CI framework.
>>
>> I have some gripes about design decisions made early in parquet-cpp,
>> like the use of C++ exceptions. So while "stability" is a reasonable
>> goal I think we should still be open to making significant changes in
>> the interest of long term progress.
>>
>> Having now worked on these projects for more than 2 and a half years
>> and been the most frequent contributor to both codebases, I'm sadly far
>> past the "breaking point" and not willing to continue contributing in
>> a significant way to parquet-cpp if the projects remained structured
>> as they are now. It's hampering progress and not serving the
>> community.
>>
>> - Wes
>>
>> >
>> >
>> >
>> >
>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> > The current Arrow adaptor code for parquet should live in the arrow
>> >> > repo. That will remove a majority of the dependency issues. Joshua's
>> >> > work would not have been blocked in parquet-cpp if that adapter was
>> >> > in the arrow repo.  This will be similar to the ORC adaptor.
>> >>
>> >> This has been suggested before, but I don't see how it would alleviate
>> >> any issues because of the significant dependencies on other parts of
>> >> the Arrow codebase. What you are proposing is:
>> >>
>> >> - (Arrow) arrow platform
>> >> - (Parquet) parquet core
>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> - (Arrow) Python bindings
>> >>
>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> built before invoking the Parquet core part of the build system. You
>> >> would need to pass dependent targets across different CMake build
>> >> systems; I don't know if it's possible (I spent some time looking into
>> >> it earlier this year). This is what I meant by the lack of a "concrete
>> >> and actionable plan". The only thing that would really work would be
>> >> for the Parquet core to be "included" in the Arrow build system
>> >> somehow rather than using ExternalProject. Currently Parquet builds
>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>> >> system because it's only depended upon by the Python bindings.
>> >>
>> >> And even if a solution could be devised, it would not wholly resolve
>> >> the CI workflow issues.
>> >>
>> >> You could make Parquet completely independent of the Arrow codebase,
>> >> but at that point there is little reason to maintain a relationship
>> >> between the projects or their communities. We have spent a great deal
>> >> of effort refactoring the two projects to enable as much code sharing
>> >> as there is now.
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> >> parquet-cpp repos is in no way a better approach.
>> >> >
>> >> > Yes, indeed. In my view, the next best option after a monorepo is to
>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >
>> >> > It doesn't look like I will be able to convince you that a monorepo is
>> >> > a good idea; what I would ask instead is that you be willing to give
>> >> > it a shot, and if it turns out in the way you're describing (which I
>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> majeti.deepak@gmail.com>
>> >> wrote:
>> >> >> Wes,
>> >> >>
>> >> >> Unfortunately, I cannot show you any practical fact-based problems
>> of a
>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >> Bringing in related Apache community experiences is more meaningful than
>> >> >> how mono-repos work at Google and other big organizations.
>> >> >> We solely depend on volunteers and cannot hire full-time developers.
>> >> >> You are very well aware of how difficult it has been to find more
>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low
>> >> >> contribution rate to its core components.
>> >> >>
>> >> >> We should aim to ensure that new volunteers who want to contribute
>> >> >> bug-fixes/features spend the least amount of time figuring out
>> >> >> the project repo. We can never come up with an automated build system
>> >> that
>> >> >> caters to every possible environment.
>> >> >> My only concern is whether the mono-repo will make it harder for new
>> >> developers
>> >> >> to work on parquet-cpp core just due to the additional code, build
>> and
>> >> test
>> >> >> dependencies.
>> >> >> I am not saying that the Arrow community/committers will be less
>> >> >> co-operative.
>> >> >> I just don't think the mono-repo structure model will be sustainable
>> in
>> >> an
>> >> >> open source community unless there are long-term vested interests. We
>> >> can't
>> >> >> predict that.
>> >> >>
>> >> >> The current circular dependency problems between Arrow and Parquet are a
>> >> >> major problem for the community, and resolving them is important.
>> >> >>
>> >> >> The current Arrow adaptor code for parquet should live in the arrow
>> >> repo.
>> >> >> That will remove a majority of the dependency issues.
>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>> adapter
>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>> >> >>
>> >> >> The platform API code is pretty stable at this point. Minor changes
>> in
>> >> the
>> >> >> future to this code should not be the main reason to combine the
>> arrow
>> >> >> parquet repos.
>> >> >>
>> >> >> "I question whether it's worth the community's time long term to wear
>> >> >> ourselves out defining custom "ports" / virtual interfaces in each
>> >> >> library to plug components together rather than utilizing common
>> >> >> platform APIs."
>> >> >>
>> >> >> My answer to your question below would be "Yes".
>> Modularity/separation
>> >> is
>> >> >> very important in an open source community where priorities of
>> >> contributors
>> >> >> are often short term.
>> >> >> The retention is low and therefore the acquisition costs should be
>> low
>> >> as
>> >> >> well. This is the community over code approach according to me. Minor
>> >> code
>> >> >> duplication is not a deal breaker.
>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>> data
>> >> >> space serving their own functions.
>> >> >>
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> >> >> parquet-cpp repos is in no way a better approach.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> @Antoine
>> >> >>>
>> >> >>> > By the way, one concern with the monorepo approach: it would
>> slightly
>> >> >>> increase Arrow CI times (which are already too large).
>> >> >>>
>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >>>
>> >> >>> A Parquet run takes about 28 minutes:
>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >>>
>> >> >>> Inevitably we will need to create some kind of bot to run certain
>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >>>
>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be
>> >> >>> made substantially shorter by moving some of the slower parts (like
>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>> >> >>> improve build times (valgrind build could be moved to a nightly
>> >> >>> exhaustive test run)
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmckinn@gmail.com
>> >
>> >> >>> wrote:
>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >> example of
>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> codebase.
>> >> That
>> >> >>> gives me hope that the projects could be managed separately some
>> day.
>> >> >>> >
>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>> codebase
>> >> >>> > features several areas of duplicated logic which could be
>> replaced by
>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >>> > interoperability:
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >>> >
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >>> >
>> >> >>>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >>> >
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >>> >
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >>> >
>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them
>> from
>> >> >>> > leaking to third party linkers when statically linked (ORC is only
>> >> >>> > available for static linking at the moment AFAIK).
>> >> >>> >
>> >> >>> > I question whether it's worth the community's time long term to
>> wear
>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>> >> >>> > library to plug components together rather than utilizing common
>> >> >>> > platform APIs.
>> >> >>> >
>> >> >>> > - Wes
>> >> >>> >
>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> joshuastorck@gmail.com>
>> >> >>> wrote:
>> >> >>> >> Your point about the constraints of the ASF release process is well
>> >> >>> >> taken and as a developer who's trying to work in the current
>> >> >>> environment I
>> >> >>> >> would be much happier if the codebases were merged. The main
>> issues
>> >> I
>> >> >>> worry
>> >> >>> >> about when you put codebases like these together are:
>> >> >>> >>
>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes
>> too
>> >> >>> coupled
>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree are
>> >> >>> delayed
>> >> >>> >> by artifacts higher in the dependency tree
>> >> >>> >>
>> >> >>> >> If the project/release management is structured well and someone
>> >> keeps
>> >> >>> an
>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >>> >>
>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >> example of
>> >> >>> how
>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> codebase.
>> >> That
>> >> >>> >> gives me hope that the projects could be managed separately some
>> >> day.
>> >> >>> >>
>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >> >>> wrote:
>> >> >>> >>
>> >> >>> >>> hi Josh,
>> >> >>> >>>
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>> and
>> >> >>> tying
>> >> >>> >>> them together seems like the wrong choice.
>> >> >>> >>>
>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
>> >> >>> >>> building these projects -- my argument (which I think you agree
>> >> with?)
>> >> >>> >>> is that we should work more closely together until the community
>> >> grows
>> >> >>> >>> large enough to support larger-scope process than we have now.
>> As
>> >> >>> >>> you've seen, our process isn't serving developers of these
>> >> projects.
>> >> >>> >>>
>> >> >>> >>> > I also think build tooling should be pulled into its own
>> >> codebase.
>> >> >>> >>>
>> >> >>> >>> I don't see how this can possibly be practical taking into
>> >> >>> >>> consideration the constraints imposed by the combination of the
>> >> GitHub
>> >> >>> >>> platform and the ASF release process. I'm all for being
>> idealistic,
>> >> >>> >>> but right now we need to be practical. Unless we can devise a
>> >> >>> >>> practical procedure that can accommodate at least 1 patch per
>> day
>> >> >>> >>> which may touch both code and build system simultaneously
>> without
>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how
>> we
>> >> can
>> >> >>> >>> move forward.
>> >> >>> >>>
>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>> >> in the
>> >> >>> >>> short term with the express purpose of separating them in the
>> near
>> >> >>> term.
>> >> >>> >>>
>> >> >>> >>> I would agree but only if separation can be demonstrated to be
>> >> >>> >>> practical and result in net improvements in productivity and
>> >> community
>> >> >>> >>> growth. I think experience has clearly demonstrated that the
>> >> current
>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >>> >>>
>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >> >>> >>> development process and ASF releases separately. My argument is
>> as
>> >> >>> >>> follows:
>> >> >>> >>>
>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >>> >>> * Releases structured according to the desires of the PMCs
>> >> >>> >>>
>> >> >>> >>> - Wes
>> >> >>> >>>
>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >> joshuastorck@gmail.com
>> >> >>> >
>> >> >>> >>> wrote:
>> >> >>> >>> > I recently worked on an issue that had to be implemented in
>> >> >>> parquet-cpp
>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >> (ARROW-2585,
>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
>> >> hard to
>> >> >>> work
>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> >> (created on
>> >> >>> May
>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> >> recently
>> >> >>> >>> merged.
>> >> >>> >>> > I couldn't even address any CI issues in the PR because the
>> >> change in
>> >> >>> >>> arrow
>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >> >>> >>> run_clang_format.py
>> >> >>> >>> > script in the arrow project only to find out later that there
>> >> was an
>> >> >>> >>> exact
>> >> >>> >>> > copy of it in parquet-cpp.
>> >> >>> >>> >
>> >> >>> >>> > However, I don't think merging the codebases makes sense in
>> the
>> >> long
>> >> >>> >>> term.
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>> and
>> >> >>> tying
>> >> >>> >>> them
>> >> >>> >>> > together seems like the wrong choice. There will be other
>> formats
>> >> >>> that
>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
>> Orc),
>> >> so I
>> >> >>> >>> don't
>> >> >>> >>> > see why parquet should be special. I also think build tooling
>> >> should
>> >> >>> be
>> >> >>> >>> > pulled into its own codebase. GNU has had a long history of
>> >> >>> developing
>> >> >>> >>> open
>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
>> >> good
>> >> >>> >>> > counter-example since there have been lots of successful open
>> >> source
>> >> >>> >>> > projects that have used nightly build systems that pinned
>> >> versions of
>> >> >>> >>> > dependent software.
>> >> >>> >>> >
>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>> >> in the
>> >> >>> >>> short
>> >> >>> >>> > term with the express purpose of separating them in the near
>> >> term.
>> >> >>> My
>> >> >>> >>> > reasoning is as follows. By putting the codebases together,
>> you
>> >> can
>> >> >>> more
>> >> >>> >>> > easily delineate the boundaries between the API's with a
>> single
>> >> PR.
>> >> >>> >>> Second,
>> >> >>> >>> > it will force the build tooling to converge instead of
>> diverge,
>> >> >>> which has
>> >> >>> >>> > already happened. Once the boundaries and tooling have been
>> >> sorted
>> >> >>> out,
>> >> >>> >>> it
>> >> >>> >>> > should be easy to separate them back into their own codebases.
>> >> >>> >>> >
>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> codebases
>> >> for
>> >> >>> arrow
>> >> >>> >>> > be separated from other languages. Looking at it from the
>> >> >>> perspective of
>> >> >>> >>> a
>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
>> large
>> >> tax
>> >> >>> to
>> >> >>> >>> pay
>> >> >>> >>> > if you don't need it. For example, there were 25 JIRAs in the
>> >> 0.10.0
>> >> >>> >>> > release of arrow, many of which were holding up the release. I
>> >> hope
>> >> >>> that
>> >> >>> >>> > seems like a reasonable compromise, and I think it will help
>> >> reduce
>> >> >>> the
>> >> >>> >>> > complexity of the build/release tooling.
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >> ted.dunning@gmail.com>
>> >> >>> >>> wrote:
>> >> >>> >>> >
>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> >>> >>> wrote:
>> >> >>> >>> >>
>> >> >>> >>> >> >
>> >> >>> >>> >> > > The community will be less willing to accept large
>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>> >> stability
>> >> >>> and
>> >> >>> >>> API
>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>> >> >>> community
>> >> >>> >>> took
>> >> >>> >>> >> a
>> >> >>> >>> >> > > significantly long time for the very same reason.
>> >> >>> >>> >> >
>> >> >>> >>> >> > Please don't use bad experiences from another open source
>> >> >>> community as
>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't
>> go
>> >> the
>> >> >>> way
>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> community
>> >> which
>> >> >>> >>> >> > happens to operate under a similar open governance model.
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> There are some more radical and community building options as
>> >> well.
>> >> >>> Take
>> >> >>> >>> >> the subversion project as a precedent. With subversion, any
>> >> Apache
>> >> >>> >>> >> committer can request and receive a commit bit on some large
>> >> >>> fraction of
>> >> >>> >>> >> subversion.
>> >> >>> >>> >>
>> >> >>> >>> >> So why not take this a bit further and give every parquet
>> >> committer
>> >> >>> a
>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> >> committers in
>> >> >>> >>> Arrow?
>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who
>> >> asks
>> >> >>> will
>> >> >>> >>> be
>> >> >>> >>> >> given committer status in Arrow.
>> >> >>> >>> >>
>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> >> committers
>> >> >>> >>> can't be
>> >> >>> >>> >> worried at that point whether their patches will get merged;
>> >> they
>> >> >>> can
>> >> >>> >>> just
>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>> >> >>> Parquet
>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> parquet so
>> >> >>> why not
>> >> >>> >>> >> invite them in?
>> >> >>> >>> >>
>> >> >>> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> regards,
>> >> >> Deepak Majeti
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
Thanks Tim.

Indeed, it's not very simple. Just today Antoine cleaned up some
platform code intending to improve the performance of bit-packing in
Parquet writes, and we ended up with 2 interdependent PRs:

* https://github.com/apache/parquet-cpp/pull/483
* https://github.com/apache/arrow/pull/2355

Changes that impact the Python interface to Parquet are even more complex.

Adding options to Arrow's CMake build system to only build
Parquet-related code and dependencies (in a monorepo framework) would
not be difficult, and amount to writing "make parquet".

See e.g. https://stackoverflow.com/a/17201375. The desired commands to
build and install the Parquet core libraries and their dependencies
would be:

ninja parquet && ninja install
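In a monorepo layout, the option described above could be as small as a flag gating the Parquet subdirectory (a hypothetical fragment; the option and target names are illustrative, not actual Arrow build code):

```cmake
option(ARROW_PARQUET "Build the Parquet core libraries" OFF)

if(ARROW_PARQUET)
  # Defines the 'parquet' library target alongside the Arrow targets
  add_subdirectory(src/parquet)
endif()
```

so that configuring with `-DARROW_PARQUET=ON` and running `ninja parquet` builds only Parquet core and its transitive Arrow dependencies.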

- Wes

On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
<ta...@cloudera.com.invalid> wrote:
> I don't have a direct stake in this beyond wanting to see Parquet be
> successful, but I thought I'd give my two cents.
>
> For me, the thing that makes the biggest difference in contributing to a
> new codebase is the number of steps in the workflow for writing, testing,
> posting and iterating on a commit and also the number of opportunities for
> missteps. The size of the repo and build/test times matter but are
> secondary so long as the workflow is simple and reliable.
>
> I don't really know what the current state of things is, but it sounds like
> it's not as simple as check out -> build -> test if you're doing a
> cross-repo change. Circular dependencies are a real headache.
>
> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi,
>>
>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> > I think the circular dependency can be broken if we build a new library
>> for
>> > the platform code. This will also make it easy for other projects such as
>> > ORC to use it.
>> > I also remember your proposal a while ago of having a separate project
>> for
>> > the platform code. That project can live in the arrow repo. However, one
>> > would have to clone the entire apache arrow repo but could build just the
>> > platform code. This would be temporary until we can find a new home for it.
>> >
>> > The dependency will look like:
>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> > libplatform(platform api)
>> >
>> > CI workflow will clone the arrow project twice, once for the platform
>> > library and once for the arrow-core/bindings library.
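One reading of this proposed layering, expressed as CMake targets (hypothetical target names and source files, shown only to illustrate the dependency direction):

```cmake
# libplatform: the shared platform API (file interfaces, memory pools, ...)
add_library(platform src/platform/io.cc src/platform/memory_pool.cc)

# libparquet: Parquet core depends only on the platform layer
add_library(parquet src/parquet/reader.cc src/parquet/writer.cc)
target_link_libraries(parquet PUBLIC platform)

# libarrow: Arrow core / bindings sit at the top and may link Parquet
add_library(arrow src/arrow/array.cc src/arrow/adapters/parquet.cc)
target_link_libraries(arrow PUBLIC parquet)
```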
>>
>> This seems like an interesting proposal; the best place to work toward
>> this goal (if it is even possible; the build system interactions and
>> ASF release management are the hard problems) is to have all of the
>> code in a single repository. ORC could already be using Arrow if it
>> wanted, but the ORC contributors aren't active in Arrow.
>>
>> >
>> > There is no doubt that the collaborations between the Arrow and Parquet
>> > communities so far have been very successful.
>> > The reason to maintain this relationship moving forward is to continue to
>> > reap the mutual benefits.
>> > We should continue to take advantage of sharing code as well. However, I
>> > don't see any code sharing opportunities between arrow-core and the
>> > parquet-core. Both have different functions.
>>
>> I think you mean the Arrow columnar format. The Arrow columnar format
>> is only one part of a project that has become quite large already
>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>
>> >
>> > We are at a point where the parquet-cpp public API is pretty stable. We
>> > already passed that difficult stage. My take on arrow and parquet is to
>> > keep them nimble since we can.
>>
>> I believe that parquet-core still has significant progress ahead of it. We
>> have done little work in asynchronous IO and concurrency which would
>> yield both improved read and write throughput. This aligns well with
>> other concurrency and async-IO work planned in the Arrow platform. I
>> believe that more development will happen on parquet-core once the
>> development process issues are resolved by having a single codebase,
>> single build system, and a single CI framework.
>>
>> I have some gripes about design decisions made early in parquet-cpp,
>> like the use of C++ exceptions. So while "stability" is a reasonable
>> goal I think we should still be open to making significant changes in
>> the interest of long term progress.
>>
>> Having now worked on these projects for more than 2 and a half years
>> and been the most frequent contributor to both codebases, I'm sadly far
>> past the "breaking point" and not willing to continue contributing in
>> a significant way to parquet-cpp if the projects remained structured
>> as they are now. It's hampering progress and not serving the
>> community.
>>
>> - Wes
>>
>> >
>> >
>> >
>> >
>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> > The current Arrow adaptor code for parquet should live in the arrow
>> >> repo. That will remove a majority of the dependency issues. Joshua's
>> work
>> >> would not have been blocked in parquet-cpp if that adapter was in the
>> arrow
>> >> repo.  This will be similar to the ORC adaptor.
>> >>
>> >> This has been suggested before, but I don't see how it would alleviate
>> >> any issues because of the significant dependencies on other parts of
>> >> the Arrow codebase. What you are proposing is:
>> >>
>> >> - (Arrow) arrow platform
>> >> - (Parquet) parquet core
>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> - (Arrow) Python bindings
>> >>
>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> built before invoking the Parquet core part of the build system. You
>> >> would need to pass dependent targets across different CMake build
>> >> systems; I don't know if it's possible (I spent some time looking into
>> >> it earlier this year). This is what I meant by the lack of a "concrete
>> >> and actionable plan". The only thing that would really work would be
>> >> for the Parquet core to be "included" in the Arrow build system
>> >> somehow rather than using ExternalProject. Currently Parquet builds
>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>> >> system because it's only depended upon by the Python bindings.
>> >>
>> >> And even if a solution could be devised, it would not wholly resolve
>> >> the CI workflow issues.
>> >>
>> >> You could make Parquet completely independent of the Arrow codebase,
>> >> but at that point there is little reason to maintain a relationship
>> >> between the projects or their communities. We have spent a great deal
>> >> of effort refactoring the two projects to enable as much code sharing
>> >> as there is now.
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> >> parquet-cpp repos is no way a better approach.
>> >> >
>> >> > Yes, indeed. In my view, the next best option after a monorepo is to
>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >
>> >> > It doesn't look like I will be able to convince you that a monorepo is
>> >> > a good idea; what I would ask instead is that you be willing to give
>> >> > it a shot, and if it turns out in the way you're describing (which I
>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> majeti.deepak@gmail.com>
>> >> wrote:
>> >> >> Wes,
>> >> >>
>> >> >> Unfortunately, I cannot show you any practical fact-based problems
>> of a
>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >> Bringing in related Apache community experiences are more meaningful
>> >> than
>> >> >> how mono-repos work at Google and other big organizations.
>> >> >> We solely depend on volunteers and cannot hire full-time developers.
>> >> >> You are very well aware of how difficult it has been to find more
>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low
>> >> >> contribution rate to its core components.
>> >> >>
>> >> >> We should target to ensure that new volunteers who want to contribute
>> >> >> bug-fixes/features should spend the least amount of time in figuring
>> out
>> >> >> the project repo. We can never come up with an automated build system
>> >> that
>> >> >> caters to every possible environment.
>> >> >> My only concern is if the mono-repo will make it harder for new
>> >> developers
>> >> >> to work on parquet-cpp core just due to the additional code, build
>> and
>> >> test
>> >> >> dependencies.
>> >> >> I am not saying that the Arrow community/committers will be less
>> >> >> co-operative.
>> >> >> I just don't think the mono-repo structure model will be sustainable
>> in
>> >> an
>> >> >> open source community unless there are long-term vested interests. We
>> >> can't
>> >> >> predict that.
>> >> >>
>> >> >> The current circular dependency problems between Arrow and Parquet
>> is a
>> >> >> major problem for the community and it is important.
>> >> >>
>> >> >> The current Arrow adaptor code for parquet should live in the arrow
>> >> repo.
>> >> >> That will remove a majority of the dependency issues.
>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>> adapter
>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>> >> >>
>> >> >> The platform API code is pretty stable at this point. Minor changes
>> in
>> >> the
>> >> >> future to this code should not be the main reason to combine the
>> arrow
>> >> >> parquet repos.
>> >> >>
>> >> >> "
>> >> >> *I question whether it's worth the community's time long term to
>> wear*
>> >> >>
>> >> >>
>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
>> >> eachlibrary
>> >> >> to plug components together rather than utilizing commonplatform
>> APIs.*"
>> >> >>
>> >> >> My answer to your question below would be "Yes".
>> Modularity/separation
>> >> is
>> >> >> very important in an open source community where priorities of
>> >> contributors
>> >> >> are often short term.
>> >> >> The retention is low and therefore the acquisition costs should be
>> low
>> >> as
>> >> >> well. This is the community over code approach according to me. Minor
>> >> code
>> >> >> duplication is not a deal breaker.
>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>> data
>> >> >> space serving their own functions.
>> >> >>
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
>> two
>> >> >> parquet-cpp repos is no way a better approach.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> @Antoine
>> >> >>>
>> >> >>> > By the way, one concern with the monorepo approach: it would
>> slightly
>> >> >>> increase Arrow CI times (which are already too large).
>> >> >>>
>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >>>
>> >> >>> Parquet run takes about 28
>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >>>
>> >> >>> Inevitably we will need to create some kind of bot to run certain
>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >>>
>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be
>> >> >>> made substantially shorter by moving some of the slower parts (like
>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>> >> >>> improve build times (valgrind build could be moved to a nightly
>> >> >>> exhaustive test run)
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmckinn@gmail.com
>> >
>> >> >>> wrote:
>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >> example of
>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> codebase.
>> >> That
>> >> >>> gives me hope that the projects could be managed separately some
>> day.
>> >> >>> >
>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>> codebase
>> >> >>> > features several areas of duplicated logic which could be
>> replaced by
>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >>> > interoperability:
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
>> orc/OrcFile.hh#L37
>> >> >>> >
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >>> >
>> >> >>>
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/
>> orc/MemoryPool.hh
>> >> >>> >
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >>> >
>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/
>> OutputStream.hh
>> >> >>> >
>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them
>> from
>> >> >>> > leaking to third party linkers when statically linked (ORC is only
>> >> >>> > available for static linking at the moment AFAIK).
>> >> >>> >
>> >> >>> > I question whether it's worth the community's time long term to
>> wear
>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>> >> >>> > library to plug components together rather than utilizing common
>> >> >>> > platform APIs.
>> >> >>> >
>> >> >>> > - Wes
>> >> >>> >
>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> joshuastorck@gmail.com>
>> >> >>> wrote:
>> >> >>> >> You're point about the constraints of the ASF release process are
>> >> well
>> >> >>> >> taken and as a developer who's trying to work in the current
>> >> >>> environment I
>> >> >>> >> would be much happier if the codebases were merged. The main
>> issues
>> >> I
>> >> >>> worry
>> >> >>> >> about when you put codebases like these together are:
>> >> >>> >>
>> >> >>> >> 1. The delineation of API's become blurred and the code becomes
>> too
>> >> >>> coupled
>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree are
>> >> >>> delayed
>> >> >>> >> by artifacts higher in the dependency tree
>> >> >>> >>
>> >> >>> >> If the project/release management is structured well and someone
>> >> keeps
>> >> >>> an
>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >>> >>
>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >> example of
>> >> >>> how
>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> codebase.
>> >> That
>> >> >>> >> gives me hope that the projects could be managed separately some
>> >> day.
>> >> >>> >>
>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >> >>> wrote:
>> >> >>> >>
>> >> >>> >>> hi Josh,
>> >> >>> >>>
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>> and
>> >> >>> tying
>> >> >>> >>> them together seems like the wrong choice.
>> >> >>> >>>
>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
>> >> >>> >>> building these projects -- my argument (which I think you agree
>> >> with?)
>> >> >>> >>> is that we should work more closely together until the community
>> >> grows
>> >> >>> >>> large enough to support a larger-scope process than we have now.
>> As
>> >> >>> >>> you've seen, our process isn't serving developers of these
>> >> projects.
>> >> >>> >>>
>> >> >>> >>> > I also think build tooling should be pulled into its own
>> >> codebase.
>> >> >>> >>>
>> >> >>> >>> I don't see how this can possibly be practical taking into
>> >> >>> >>> consideration the constraints imposed by the combination of the
>> >> GitHub
>> >> >>> >>> platform and the ASF release process. I'm all for being
>> idealistic,
>> >> >>> >>> but right now we need to be practical. Unless we can devise a
>> >> >>> >>> practical procedure that can accommodate at least 1 patch per
>> day
>> >> >>> >>> which may touch both code and build system simultaneously
>> without
>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how
>> we
>> >> can
>> >> >>> >>> move forward.
>> >> >>> >>>
>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>> >> in the
>> >> >>> >>> short term with the express purpose of separating them in the
>> near
>> >> >>> term.
>> >> >>> >>>
>> >> >>> >>> I would agree but only if separation can be demonstrated to be
>> >> >>> >>> practical and result in net improvements in productivity and
>> >> community
>> >> >>> >>> growth. I think experience has clearly demonstrated that the
>> >> current
>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >>> >>>
>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >> >>> >>> development process and ASF releases separately. My argument is
>> as
>> >> >>> >>> follows:
>> >> >>> >>>
>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >>> >>> * Releases structured according to the desires of the PMCs
>> >> >>> >>>
>> >> >>> >>> - Wes
>> >> >>> >>>
>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >> joshuastorck@gmail.com
>> >> >>> >
>> >> >>> >>> wrote:
>> >> >>> >>> > I recently worked on an issue that had to be implemented in
>> >> >>> parquet-cpp
>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >> (ARROW-2585,
>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
>> >> hard to
>> >> >>> work
>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> >> (created on
>> >> >>> May
>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> >> recently
>> >> >>> >>> merged.
>> >> >>> >>> > I couldn't even address any CI issues in the PR because the
>> >> change in
>> >> >>> >>> arrow
>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >> >>> >>> run_clang_format.py
>> >> >>> >>> > script in the arrow project only to find out later that there
>> >> was an
>> >> >>> >>> exact
>> >> >>> >>> > copy of it in parquet-cpp.
>> >> >>> >>> >
>> >> >>> >>> > However, I don't think merging the codebases makes sense in
>> the
>> >> long
>> >> >>> >>> term.
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>> and
>> >> >>> tying
>> >> >>> >>> them
>> >> >>> >>> > together seems like the wrong choice. There will be other
>> formats
>> >> >>> that
>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
>> Orc),
>> >> so I
>> >> >>> >>> don't
>> >> >>> >>> > see why parquet should be special. I also think build tooling
>> >> should
>> >> >>> be
>> >> >>> >>> > pulled into its own codebase. GNU has had a long history of
>> >> >>> developing
>> >> >>> >>> open
>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
>> >> good
>> >> >>> >>> > counter-example since there have been lots of successful open
>> >> source
>> >> >>> >>> > projects that have used nightly build systems that pinned
>> >> versions of
>> >> >>> >>> > dependent software.
>> >> >>> >>> >
>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>> >> in the
>> >> >>> >>> short
>> >> >>> >>> > term with the express purpose of separating them in the near
>> >> term.
>> >> >>> My
>> >> >>> >>> > reasoning is as follows. By putting the codebases together,
>> you
>> >> can
>> >> >>> more
>> >> >>> >>> > easily delineate the boundaries between the APIs with a
>> single
>> >> PR.
>> >> >>> >>> Second,
>> >> >>> >>> > it will force the build tooling to converge instead of
>> diverge,
>> >> >>> which has
>> >> >>> >>> > already happened. Once the boundaries and tooling have been
>> >> sorted
>> >> >>> out,
>> >> >>> >>> it
>> >> >>> >>> > should be easy to separate them back into their own codebases.
>> >> >>> >>> >
>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> codebases
>> >> for
>> >> >>> arrow
>> >> >>> >>> > be separated from other languages. Looking at it from the
>> >> >>> perspective of
>> >> >>> >>> a
>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
>> large
>> >> tax
>> >> >>> to
>> >> >>> >>> pay
>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the
>> >> 0.10.0
>> >> >>> >>> > release of arrow, many of which were holding up the release. I
>> >> hope
>> >> >>> that
>> >> >>> >>> > seems like a reasonable compromise, and I think it will help
>> >> reduce
>> >> >>> the
>> >> >>> >>> > complexity of the build/release tooling.
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >> ted.dunning@gmail.com>
>> >> >>> >>> wrote:
>> >> >>> >>> >
>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> >>> >>> wrote:
>> >> >>> >>> >>
>> >> >>> >>> >> >
>> >> >>> >>> >> > > The community will be less willing to accept large
>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>> >> stability
>> >> >>> and
>> >> >>> >>> API
>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>> >> >>> community
>> >> >>> >>> took
>> >> >>> >>> >> a
>> >> >>> >>> >> > > significantly long time for the very same reason.
>> >> >>> >>> >> >
>> >> >>> >>> >> > Please don't use bad experiences from another open source
>> >> >>> community as
>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't
>> go
>> >> the
>> >> >>> way
>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> community
>> >> which
>> >> >>> >>> >> > happens to operate under a similar open governance model.
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> There are some more radical and community building options as
>> >> well.
>> >> >>> Take
>> >> >>> >>> >> the subversion project as a precedent. With subversion, any
>> >> Apache
>> >> >>> >>> >> committer can request and receive a commit bit on some large
>> >> >>> fraction of
>> >> >>> >>> >> subversion.
>> >> >>> >>> >>
>> >> >>> >>> >> So why not take this a bit further and give every parquet
>> >> committer
>> >> >>> a
>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> >> committers in
>> >> >>> >>> Arrow?
>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who
>> >> asks
>> >> >>> will
>> >> >>> >>> be
>> >> >>> >>> >> given committer status in Arrow.
>> >> >>> >>> >>
>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> >> committers
>> >> >>> >>> can't be
>> >> >>> >>> >> worried at that point whether their patches will get merged;
>> >> they
>> >> >>> can
>> >> >>> >>> just
>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>> >> >>> Parquet
>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> parquet so
>> >> >>> why not
>> >> >>> >>> >> invite them in?
>> >> >>> >>> >>
>> >> >>> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> regards,
>> >> >> Deepak Majeti
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
I don't have a direct stake in this beyond wanting to see Parquet be
successful, but I thought I'd give my two cents.

For me, the thing that makes the biggest difference in contributing to a
new codebase is the number of steps in the workflow for writing, testing,
posting and iterating on a commit and also the number of opportunities for
missteps. The size of the repo and build/test times matter but are
secondary so long as the workflow is simple and reliable.

I don't really know what the current state of things is, but it sounds like
it's not as simple as check out -> build -> test if you're doing a
cross-repo change. Circular dependencies are a real headache.

On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <we...@gmail.com> wrote:

> hi,
>
> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <ma...@gmail.com>
> wrote:
> > I think the circular dependency can be broken if we build a new library
> for
> > the platform code. This will also make it easy for other projects such as
> > ORC to use it.
> > I also remember your proposal a while ago of having a separate project
> for
> > the platform code.  That project can live in the arrow repo. However, one
> > has to clone the entire apache arrow repo but can just build the platform
> > code. This will be temporary until we can find a new home for it.
> >
> > The dependency will look like:
> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> > libplatform(platform api)
> >
> > CI workflow will clone the arrow project twice, once for the platform
> > library and once for the arrow-core/bindings library.
>
> This seems like an interesting proposal; the best place to work toward
> this goal (if it is even possible; the build system interactions and
> ASF release management are the hard problems) is to have all of the
> code in a single repository. ORC could already be using Arrow if it
> wanted, but the ORC contributors aren't active in Arrow.
>
> >
> > There is no doubt that the collaborations between the Arrow and Parquet
> > communities so far have been very successful.
> > The reason to maintain this relationship moving forward is to continue to
> > reap the mutual benefits.
> > We should continue to take advantage of sharing code as well. However, I
> > don't see any code sharing opportunities between arrow-core and the
> > parquet-core. Both have different functions.
>
> I think you mean the Arrow columnar format. The Arrow columnar format
> is only one part of a project that has become quite large already
> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>
> >
> > We are at a point where the parquet-cpp public API is pretty stable. We
> already passed that difficult stage. My take on arrow and parquet is to
> > keep them nimble since we can.
>
> I believe that parquet-core still has significant progress ahead of it. We
> have done little work in asynchronous IO and concurrency which would
> yield both improved read and write throughput. This aligns well with
> other concurrency and async-IO work planned in the Arrow platform. I
> believe that more development will happen on parquet-core once the
> development process issues are resolved by having a single codebase,
> single build system, and a single CI framework.
>
> I have some gripes about design decisions made early in parquet-cpp,
> like the use of C++ exceptions. So while "stability" is a reasonable
> goal I think we should still be open to making significant changes in
> the interest of long term progress.
>
> Having now worked on these projects for more than two and a half years
> and been the most frequent contributor to both codebases, I'm sadly far
> past the "breaking point" and not willing to continue contributing in
> a significant way to parquet-cpp if the projects remain structured
> as they are now. It's hampering progress and not serving the
> community.
>
> - Wes
>
> >
> >
> >
> >
> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> > The current Arrow adaptor code for parquet should live in the arrow
> >> repo. That will remove a majority of the dependency issues. Joshua's
> work
> >> would not have been blocked in parquet-cpp if that adapter was in the
> arrow
> >> repo.  This will be similar to the ORC adaptor.
> >>
> >> This has been suggested before, but I don't see how it would alleviate
> >> any issues because of the significant dependencies on other parts of
> >> the Arrow codebase. What you are proposing is:
> >>
> >> - (Arrow) arrow platform
> >> - (Parquet) parquet core
> >> - (Arrow) arrow columnar-parquet adapter interface
> >> - (Arrow) Python bindings
> >>
> >> To make this work, somehow Arrow core / libarrow would have to be
> >> built before invoking the Parquet core part of the build system. You
> >> would need to pass dependent targets across different CMake build
> >> systems; I don't know if it's possible (I spent some time looking into
> >> it earlier this year). This is what I meant by the lack of a "concrete
> >> and actionable plan". The only thing that would really work would be
> >> for the Parquet core to be "included" in the Arrow build system
> >> somehow rather than using ExternalProject. Currently Parquet builds
> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
> >> system because it's only depended upon by the Python bindings.
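
[Editor's note: a rough CMake sketch of the two approaches described above. All target, path, and repository names here are hypothetical illustrations, not the actual build files of either project.]

```cmake
# Status quo (sketch): parquet-cpp drives an Arrow build via ExternalProject.
# Arrow's targets are opaque to this build, so Parquet targets cannot declare
# ordinary target-level dependencies on them; install paths, ordering, and
# compile flags all have to be wired up by hand.
include(ExternalProject)
ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>")

# Monorepo alternative (sketch): Parquet core included directly in Arrow's
# build tree, so a normal dependency edge works and CMake derives the order.
# add_subdirectory(src/parquet)
# target_link_libraries(parquet_static PRIVATE arrow_static)
```

With `add_subdirectory`, the dependency is a first-class edge in one CMake graph; with `ExternalProject_Add`, it is an opaque build step whose outputs must be imported manually.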
> >>
> >> And even if a solution could be devised, it would not wholly resolve
> >> the CI workflow issues.
> >>
> >> You could make Parquet completely independent of the Arrow codebase,
> >> but at that point there is little reason to maintain a relationship
> >> between the projects or their communities. We have spent a great deal
> >> of effort refactoring the two projects to enable as much code sharing
> >> as there is now.
> >>
> >> - Wes
> >>
> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >> >> If you still strongly feel that the only way forward is to clone the
> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> >> parquet-cpp repos is in no way a better approach.
> >> >
> >> > Yes, indeed. In my view, the next best option after a monorepo is to
> >> > fork. That would obviously be a bad outcome for the community.
> >> >
> >> > It doesn't look like I will be able to convince you that a monorepo is
> >> > a good idea; what I would ask instead is that you be willing to give
> >> > it a shot, and if it turns out in the way you're describing (which I
> >> > don't think it will) then I suggest that we fork at that point.
> >> >
> >> > - Wes
> >> >
> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
> majeti.deepak@gmail.com>
> >> wrote:
> >> >> Wes,
> >> >>
> >> >> Unfortunately, I cannot show you any practical fact-based problems
> of a
> >> >> non-existent Arrow-Parquet mono-repo.
> >> >> Bringing in related Apache community experiences is more meaningful
> >> than
> >> >> how mono-repos work at Google and other big organizations.
> >> >> We solely depend on volunteers and cannot hire full-time developers.
> >> >> You are very well aware of how difficult it has been to find more
> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low
> >> >> contribution rate to its core components.
> >> >>
> >> >> We should ensure that new volunteers who want to contribute
> >> >> bug-fixes/features should spend the least amount of time in figuring
> out
> >> >> the project repo. We can never come up with an automated build system
> >> that
> >> >> caters to every possible environment.
> >> >> My only concern is if the mono-repo will make it harder for new
> >> developers
> >> >> to work on parquet-cpp core just due to the additional code, build
> and
> >> test
> >> >> dependencies.
> >> >> I am not saying that the Arrow community/committers will be less
> >> >> co-operative.
> >> >> I just don't think the mono-repo structure model will be sustainable
> in
> >> an
> >> >> open source community unless there are long-term vested interests. We
> >> can't
> >> >> predict that.
> >> >>
> >> >> The current circular dependency problems between Arrow and Parquet
> are a
> >> >> major issue for the community, and addressing them is important.
> >> >>
> >> >> The current Arrow adaptor code for parquet should live in the arrow
> >> repo.
> >> >> That will remove a majority of the dependency issues.
> >> >> Joshua's work would not have been blocked in parquet-cpp if that
> adapter
> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
> >> >>
> >> >> The platform API code is pretty stable at this point. Minor changes
> in
> >> the
> >> >> future to this code should not be the main reason to combine the
> arrow
> >> >> parquet repos.
> >> >>
> >> >> "
> >> >> *I question whether it's worth the community's time long term to
> wear*
> >> >>
> >> >>
> >> >> *ourselves out defining custom "ports" / virtual interfaces in
> >> eachlibrary
> >> >> to plug components together rather than utilizing commonplatform
> APIs.*"
> >> >>
> >> >> My answer to your question below would be "Yes".
> Modularity/separation
> >> is
> >> >> very important in an open source community where priorities of
> >> contributors
> >> >> are often short term.
> >> >> The retention is low and therefore the acquisition costs should be
> low
> >> as
> >> >> well. This is the community over code approach according to me. Minor
> >> code
> >> >> duplication is not a deal breaker.
> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
> data
> >> >> space serving their own functions.
> >> >>
> >> >> If you still strongly feel that the only way forward is to clone the
> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
> two
> >> >> parquet-cpp repos is in no way a better approach.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >>
> >> >>> @Antoine
> >> >>>
> >> >>> > By the way, one concern with the monorepo approach: it would
> slightly
> >> >>> increase Arrow CI times (which are already too large).
> >> >>>
> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >>>
> >> >>> A Parquet run takes about 28 minutes:
> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >>>
> >> >>> Inevitably we will need to create some kind of bot to run certain
> >> >>> builds on-demand based on commit / PR metadata or on request.
> >> >>>
> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be
> >> >>> made substantially shorter by moving some of the slower parts (like
> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
> >> >>> improve build times (valgrind build could be moved to a nightly
> >> >>> exhaustive test run)
> >> >>>
> >> >>> - Wes
> >> >>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi,

On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <ma...@gmail.com> wrote:
> I think the circular dependency can be broken if we build a new library for
> the platform code. This will also make it easy for other projects such as
> ORC to use it.
> I also remember your proposal a while ago of having a separate project for
> the platform code.  That project can live in the arrow repo. However, one
> has to clone the entire apache arrow repo but can just build the platform
> code. This will be temporary until we can find a new home for it.
>
> The dependency will look like:
> libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> libplatform(platform api)
>
> CI workflow will clone the arrow project twice, once for the platform
> library and once for the arrow-core/bindings library.
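
[Editor's note: in CMake terms, the layering proposed above might look roughly like the following. The option, directory, and target names are hypothetical, not an actual proposal from the thread.]

```cmake
# Hypothetical top-level CMakeLists.txt in the arrow repo under this
# proposal: the platform library is buildable on its own, while parquet
# core and arrow core/bindings layer on top of it.
option(ARROW_PLATFORM_ONLY "Build only the platform library" OFF)

add_subdirectory(src/platform)    # libplatform: file IO, memory, utilities

if(NOT ARROW_PLATFORM_ONLY)
  add_subdirectory(src/parquet)   # libparquet depends only on libplatform
  add_subdirectory(src/arrow)     # libarrow core / bindings on top
endif()
```

A CI job building only the platform library would configure with `-DARROW_PLATFORM_ONLY=ON`, while the full build would leave it off, which matches the "clone once, build part of it" workflow described above.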

This seems like an interesting proposal; the best place to work toward
this goal (if it is even possible; the build system interactions and
ASF release management are the hard problems) is to have all of the
code in a single repository. ORC could already be using Arrow if it
wanted, but the ORC contributors aren't active in Arrow.

>
> There is no doubt that the collaborations between the Arrow and Parquet
> communities so far have been very successful.
> The reason to maintain this relationship moving forward is to continue to
> reap the mutual benefits.
> We should continue to take advantage of sharing code as well. However, I
> don't see any code-sharing opportunities between arrow-core and
> parquet-core. The two serve different functions.

I think you mean the Arrow columnar format. The Arrow columnar format
is only one part of a project that has become quite large already
(https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).

>
> We are at a point where the parquet-cpp public API is pretty stable. We
> already passed that difficult stage. My take on arrow and parquet is to
> keep them nimble while we can.

I believe that parquet-core still has significant progress ahead of it.
We have done little work on asynchronous IO and concurrency, which would
yield improved read and write throughput. This aligns well with
other concurrency and async-IO work planned in the Arrow platform. I
believe more development will happen on parquet-core once the
development process issues are resolved by having a single codebase, a
single build system, and a single CI framework.

I have some gripes about design decisions made early in parquet-cpp,
like the use of C++ exceptions. So while "stability" is a reasonable
goal, I think we should still be open to making significant changes in
the interest of long-term progress.
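
[Editor's note: to make the design point above concrete, parquet-cpp reports
errors by throwing exceptions, while Arrow's platform code returns status
objects. The following is a minimal standalone sketch of the two styles; the
`Status` class here is a simplified stand-in, not the actual arrow::Status
API, and the parsing functions are invented for illustration.]

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Exception style (as in parquet-cpp): errors propagate by throwing.
int ParseVersionThrowing(const std::string& s) {
  if (s.empty()) {
    throw std::runtime_error("empty version string");
  }
  return std::stoi(s);
}

// Status style (as in Arrow's platform code): errors are returned as
// values, and callers check them explicitly.
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status Invalid(std::string msg) { return Status(false, std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }
 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

Status ParseVersionStatus(const std::string& s, int* out) {
  if (s.empty()) {
    return Status::Invalid("empty version string");
  }
  *out = std::stoi(s);
  return Status::OK();
}
```

One reason the status style is common in platform-level code is that it is
friendlier to C consumers and language bindings that cannot catch C++
exceptions across the library boundary.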

Having now worked on these projects for more than 2 and a half years
and the most frequent contributor to both codebases, I'm sadly far
past the "breaking point" and not willing to continue contributing in
a significant way to parquet-cpp if the projects remained structured
as they are now. It's hampering progress and not serving the
community.

- Wes

>
>
>
>
> On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>
>> > The current Arrow adaptor code for parquet should live in the arrow
>> repo. That will remove a majority of the dependency issues. Joshua's work
>> would not have been blocked in parquet-cpp if that adapter was in the arrow
>> repo.  This will be similar to the ORC adaptor.
>>
>> This has been suggested before, but I don't see how it would alleviate
>> any issues because of the significant dependencies on other parts of
>> the Arrow codebase. What you are proposing is:
>>
>> - (Arrow) arrow platform
>> - (Parquet) parquet core
>> - (Arrow) arrow columnar-parquet adapter interface
>> - (Arrow) Python bindings
>>
>> To make this work, somehow Arrow core / libarrow would have to be
>> built before invoking the Parquet core part of the build system. You
>> would need to pass dependent targets across different CMake build
>> systems; I don't know if it's possible (I spent some time looking into
>> it earlier this year). This is what I meant by the lack of a "concrete
>> and actionable plan". The only thing that would really work would be
>> for the Parquet core to be "included" in the Arrow build system
>> somehow rather than using ExternalProject. Currently Parquet builds
>> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>> system because it's only depended upon by the Python bindings.
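
[Editor's note: for readers unfamiliar with the distinction above,
ExternalProject builds a dependency as an opaque, separately-configured
project, so its CMake targets cannot be linked against directly; only the
installed artifacts are visible. A rough sketch, with hypothetical paths:]

```cmake
# Current pattern: Parquet builds Arrow via ExternalProject. Arrow's
# targets are invisible to Parquet's CMake build.
include(ExternalProject)
ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/arrow_install)

# In-tree alternative (only possible with a shared repository/build):
# add_subdirectory(arrow/cpp) would expose Arrow's targets directly,
# allowing target_link_libraries(parquet PRIVATE arrow_shared).
```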
>>
>> And even if a solution could be devised, it would not wholly resolve
>> the CI workflow issues.
>>
>> You could make Parquet completely independent of the Arrow codebase,
>> but at that point there is little reason to maintain a relationship
>> between the projects or their communities. We have spent a great deal
>> of effort refactoring the two projects to enable as much code sharing
>> as there is now.
>>
>> - Wes
>>
>> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com> wrote:
>> >> If you still strongly feel that the only way forward is to clone the
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> parquet-cpp repos is in no way a better approach.
>> >
>> > Yes, indeed. In my view, the next best option after a monorepo is to
>> > fork. That would obviously be a bad outcome for the community.
>> >
>> > It doesn't look like I will be able to convince you that a monorepo is
>> > a good idea; what I would ask instead is that you be willing to give
>> > it a shot, and if it turns out the way you're describing (which I
>> > don't think it will), then I suggest that we fork at that point.
>> >
>> > - Wes
>> >
>> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> >> Wes,
>> >>
>> >> Unfortunately, I cannot show you any practical fact-based problems of a
>> >> non-existent Arrow-Parquet mono-repo.
>> >> Bringing in related Apache community experiences is more meaningful
>> than
>> >> how mono-repos work at Google and other big organizations.
>> >> We solely depend on volunteers and cannot hire full-time developers.
>> >> You are very well aware of how difficult it has been to find more
>> >> contributors and maintainers for Arrow. parquet-cpp already has a low
>> >> contribution rate to its core components.
>> >>
>> >> We should ensure that new volunteers who want to contribute
>> >> bug-fixes/features spend the least amount of time figuring out
>> >> the project repo. We can never come up with an automated build system
>> that
>> >> caters to every possible environment.
>> >> My only concern is if the mono-repo will make it harder for new
>> developers
>> >> to work on parquet-cpp core just due to the additional code, build and
>> test
>> >> dependencies.
>> >> I am not saying that the Arrow community/committers will be less
>> >> co-operative.
>> >> I just don't think the mono-repo structure model will be sustainable in
>> an
>> >> open source community unless there are long-term vested interests. We
>> can't
>> >> predict that.
>> >>
>> >> The current circular dependency problems between Arrow and Parquet are a
>> >> major issue for the community, and addressing them is important.
>> >>
>> >> The current Arrow adaptor code for parquet should live in the arrow
>> repo.
>> >> That will remove a majority of the dependency issues.
>> >> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>> >>
>> >> The platform API code is pretty stable at this point. Minor changes in
>> the
>> >> future to this code should not be the main reason to combine the arrow
>> >> parquet repos.
>> >>
>> >> "*I question whether it's worth the community's time long term to wear
>> >> ourselves out defining custom "ports" / virtual interfaces in each
>> >> library to plug components together rather than utilizing common
>> >> platform APIs.*"
>> >>
>> >> My answer to your question below would be "Yes". Modularity/separation
>> is
>> >> very important in an open source community where priorities of
>> contributors
>> >> are often short term.
>> >> The retention is low and therefore the acquisition costs should be low
>> as
>> >> well. To me, this is the community-over-code approach. Minor
>> code
>> >> duplication is not a deal breaker.
>> >> ORC, Parquet, Arrow, etc. are all different components in the big data
>> >> space serving their own functions.
>> >>
>> >> If you still strongly feel that the only way forward is to clone the
>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> >> parquet-cpp repos is in no way a better approach.
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >>> @Antoine
>> >>>
>> >>> > By the way, one concern with the monorepo approach: it would slightly
>> >>> increase Arrow CI times (which are already too large).
>> >>>
>> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >>>
>> >>> A Parquet run takes about 28 minutes:
>> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >>>
>> >>> Inevitably we will need to create some kind of bot to run certain
>> >>> builds on-demand based on commit / PR metadata or on request.
>> >>>
>> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
>> >>> made substantially shorter by moving some of the slower parts (like
>> >>> the Python ASV benchmarks) from being tested every commit to nightly
>> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>> >>> improve build times (the valgrind build could be moved to a nightly
>> >>> exhaustive test run).
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >> I would like to point out that arrow's use of orc is a great
>> example of
>> >>> how it would be possible to manage parquet-cpp as a separate codebase.
>> That
>> >>> gives me hope that the projects could be managed separately some day.
>> >>> >
>> >>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>> >>> > features several areas of duplicated logic which could be replaced by
>> >>> > components from the Arrow platform for better platform-wide
>> >>> > interoperability:
>> >>> >
>> >>> >
>> >>>
>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >>> >
>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >>> >
>> >>>
>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >>> >
>> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >>> >
>> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >>> >
>> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> >>> > bugs that we had to fix in Arrow's build system to prevent them from
>> >>> > leaking to third party linkers when statically linked (ORC is only
>> >>> > available for static linking at the moment AFAIK).
>> >>> >
>> >>> > I question whether it's worth the community's time long term to wear
>> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>> >>> > library to plug components together rather than utilizing common
>> >>> > platform APIs.
>> >>> >
>> >>> > - Wes
>> >>> >
>> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> joshuastorck@gmail.com>
>> >>> wrote:
>> >>> >> Your point about the constraints of the ASF release process is
>> well
>> >>> >> taken, and as a developer who's trying to work in the current
>> >>> environment I
>> >>> >> would be much happier if the codebases were merged. The main issues
>> I
>> >>> worry
>> >>> >> about when you put codebases like these together are:
>> >>> >>
>> >>> >> 1. The delineation of APIs becomes blurred and the code becomes too
>> >>> coupled
>> >>> >> 2. Releases of artifacts that are lower in the dependency tree are
>> >>> delayed
>> >>> >> by artifacts higher in the dependency tree
>> >>> >>
>> >>> >> If the project/release management is structured well and someone
>> keeps
>> >>> an
>> >>> >> eye on the coupling, then I don't have any concerns.
>> >>> >>
>> >>> >> I would like to point out that arrow's use of orc is a great
>> example of
>> >>> how
>> >>> >> it would be possible to manage parquet-cpp as a separate codebase.
>> That
>> >>> >> gives me hope that the projects could be managed separately some
>> day.
>> >>> >>
>> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >>
>> >>> >>> hi Josh,
>> >>> >>>
>> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> >>> tying
>> >>> >>> them together seems like the wrong choice.
>> >>> >>>
>> >>> >>> Apache is "Community over Code"; right now it's the same people
>> >>> >>> building these projects -- my argument (which I think you agree
>> with?)
>> >>> >>> is that we should work more closely together until the community
>> grows
>> >>> >>> large enough to support a larger-scope process than we have now. As
>> >>> >>> you've seen, our process isn't serving developers of these
>> projects.
>> >>> >>>
>> >>> >>> > I also think build tooling should be pulled into its own
>> codebase.
>> >>> >>>
>> >>> >>> I don't see how this can possibly be practical taking into
>> >>> >>> consideration the constraints imposed by the combination of the
>> GitHub
>> >>> >>> platform and the ASF release process. I'm all for being idealistic,
>> >>> >>> but right now we need to be practical. Unless we can devise a
>> >>> >>> practical procedure that can accommodate at least 1 patch per day
>> >>> >>> which may touch both code and build system simultaneously without
>> >>> >>> being a hindrance to contributor or maintainer, I don't see how we
>> can
>> >>> >>> move forward.
>> >>> >>>
>> >>> >>> > That being said, I think it makes sense to merge the codebases
>> in the
>> >>> >>> short term with the express purpose of separating them in the near
>> >>> term.
>> >>> >>>
>> >>> >>> I would agree but only if separation can be demonstrated to be
>> >>> >>> practical and result in net improvements in productivity and
>> community
>> >>> >>> growth. I think experience has clearly demonstrated that the
>> current
>> >>> >>> separation is impractical, and is causing problems.
>> >>> >>>
>> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >>> >>> development process and ASF releases separately. My argument is as
>> >>> >>> follows:
>> >>> >>>
>> >>> >>> * Monorepo for development (for practicality)
>> >>> >>> * Releases structured according to the desires of the PMCs
>> >>> >>>
>> >>> >>> - Wes
>> >>> >>>
>> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> joshuastorck@gmail.com
>> >>> >
>> >>> >>> wrote:
>> >>> >>> > I recently worked on an issue that had to be implemented in
>> >>> parquet-cpp
>> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> (ARROW-2585,
>> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
>> hard to
>> >>> work
>> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> (created on
>> >>> May
>> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> recently
>> >>> >>> merged.
>> >>> >>> > I couldn't even address any CI issues in the PR because the
>> change in
>> >>> >>> arrow
>> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >>> >>> run_clang_format.py
>> >>> >>> > script in the arrow project only to find out later that there
>> was an
>> >>> >>> exact
>> >>> >>> > copy of it in parquet-cpp.
>> >>> >>> >
>> >>> >>> > However, I don't think merging the codebases makes sense in the
>> long
>> >>> >>> term.
>> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> >>> tying
>> >>> >>> them
>> >>> >>> > together seems like the wrong choice. There will be other formats
>> >>> that
>> >>> >>> > arrow needs to support that will be kept separate (e.g. - Orc),
>> so I
>> >>> >>> don't
>> >>> >>> > see why parquet should be special. I also think build tooling
>> should
>> >>> be
>> >>> >>> > pulled into its own codebase. GNU has had a long history of
>> >>> developing
>> >>> >>> open
>> >>> >>> > source C/C++ projects that way and created projects like
>> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
>> good
>> >>> >>> > counter-example since there have been lots of successful open
>> source
>> >>> >>> > projects that have used nightly build systems that pinned
>> versions of
>> >>> >>> > dependent software.
>> >>> >>> >
>> >>> >>> > That being said, I think it makes sense to merge the codebases
>> in the
>> >>> >>> short
>> >>> >>> > term with the express purpose of separating them in the near
>> term.
>> >>> My
>> >>> >>> > reasoning is as follows. By putting the codebases together, you
>> can
>> >>> more
>> >>> >>> > easily delineate the boundaries between the API's with a single
>> PR.
>> >>> >>> Second,
>> >>> >>> > it will force the build tooling to converge instead of diverge,
>> >>> which has
>> >>> >>> > already happened. Once the boundaries and tooling have been
>> sorted
>> >>> out,
>> >>> >>> it
>> >>> >>> > should be easy to separate them back into their own codebases.
>> >>> >>> >
>> >>> >>> > If the codebases are merged, I would ask that the C++ codebases
>> for
>> >>> arrow
>> >>> >>> > be separated from other languages. Looking at it from the
>> >>> perspective of
>> >>> >>> a
>> >>> >>> > parquet-cpp library user, having a dependency on Java is a large
>> tax
>> >>> to
>> >>> >>> pay
>> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the
>> 0.10.0
>> >>> >>> > release of arrow, many of which were holding up the release. I
>> hope
>> >>> that
>> >>> >>> > seems like a reasonable compromise, and I think it will help
>> reduce
>> >>> the
>> >>> >>> > complexity of the build/release tooling.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> ted.dunning@gmail.com>
>> >>> >>> wrote:
>> >>> >>> >
>> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >>> >>> wrote:
>> >>> >>> >>
>> >>> >>> >> >
>> >>> >>> >> > > The community will be less willing to accept large
>> >>> >>> >> > > changes that require multiple rounds of patches for
>> stability
>> >>> and
>> >>> >>> API
>> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>> >>> community
>> >>> >>> took
>> >>> >>> >> a
>> >>> >>> >> > > significantly long time for the very same reason.
>> >>> >>> >> >
>> >>> >>> >> > Please don't use bad experiences from another open source
>> >>> community as
>> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't go
>> the
>> >>> way
>> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct community
>> which
>> >>> >>> >> > happens to operate under a similar open governance model.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> There are some more radical and community building options as
>> well.
>> >>> Take
>> >>> >>> >> the subversion project as a precedent. With subversion, any
>> Apache
>> >>> >>> >> committer can request and receive a commit bit on some large
>> >>> fraction of
>> >>> >>> >> subversion.
>> >>> >>> >>
>> >>> >>> >> So why not take this a bit further and give every parquet
>> committer
>> >>> a
>> >>> >>> >> commit bit in Arrow? Or even make them first-class
>> committers in
>> >>> >>> Arrow?
>> >>> >>> >> Possibly even make it policy that every Parquet committer who
>> asks
>> >>> will
>> >>> >>> be
>> >>> >>> >> given committer status in Arrow.
>> >>> >>> >>
>> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> committers
>> >>> >>> can't be
>> >>> >>> >> worried at that point whether their patches will get merged;
>> they
>> >>> can
>> >>> >>> just
>> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>> >>> Parquet
>> >>> >>> >> committers. After all, Arrow already depends a lot on parquet so
>> >>> why not
>> >>> >>> >> invite them in?
>> >>> >>> >>
>> >>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> regards,
>> >> Deepak Majeti
>>
>
>
> --
> regards,
> Deepak Majeti

>> >>> >>>
>> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> joshuastorck@gmail.com
>> >>> >
>> >>> >>> wrote:
>> >>> >>> > I recently worked on an issue that had to be implemented in
>> >>> parquet-cpp
>> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> (ARROW-2585,
>> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
>> hard to
>> >>> work
>> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> (created on
>> >>> May
>> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> recently
>> >>> >>> merged.
>> >>> >>> > I couldn't even address any CI issues in the PR because the
>> change in
>> >>> >>> arrow
>> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >>> >>> run_clang_format.py
>> >>> >>> > script in the arrow project only to find out later that there
>> was an
>> >>> >>> exact
>> >>> >>> > copy of it in parquet-cpp.
>> >>> >>> >
>> >>> >>> > However, I don't think merging the codebases makes sense in the
>> long
>> >>> >>> term.
>> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> >>> tying
>> >>> >>> them
>> >>> >>> > together seems like the wrong choice. There will be other formats
>> >>> that
>> >>> >>> > arrow needs to support that will be kept separate (e.g. - Orc),
>> so I
>> >>> >>> don't
>> >>> >>> > see why parquet should be special. I also think build tooling
>> should
>> >>> be
>> >>> >>> > pulled into its own codebase. GNU has had a long history of
>> >>> developing
>> >>> >>> open
>> >>> >>> > source C/C++ projects that way and made projects like
>> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
>> good
>> >>> >>> > counter-example since there have been lots of successful open
>> source
>> >>> >>> > projects that have used nightly build systems that pinned
>> versions of
>> >>> >>> > dependent software.
>> >>> >>> >
>> >>> >>> > That being said, I think it makes sense to merge the codebases
>> in the
>> >>> >>> short
>> >>> >>> > term with the express purpose of separating them in the near
>> term.
>> >>> My
>> >>> >>> > reasoning is as follows. By putting the codebases together, you
>> can
>> >>> more
>> >>> >>> > easily delineate the boundaries between the API's with a single
>> PR.
>> >>> >>> Second,
>> >>> >>> > it will force the build tooling to converge instead of diverge,
>> >>> which has
>> >>> >>> > already happened. Once the boundaries and tooling have been
>> sorted
>> >>> out,
>> >>> >>> it
>> >>> >>> > should be easy to separate them back into their own codebases.
>> >>> >>> >
>> >>> >>> > If the codebases are merged, I would ask that the C++ codebases
>> for
>> >>> arrow
>> >>> >>> > be separated from other languages. Looking at it from the
>> >>> perspective of
>> >>> >>> a
>> >>> >>> > parquet-cpp library user, having a dependency on Java is a large
>> tax
>> >>> to
>> >>> >>> pay
>> >>> >>> > if you don't need it. For example, there were 25 JIRAs in the
>> 0.10.0
>> >>> >>> > release of arrow, many of which were holding up the release. I
>> hope
>> >>> that
>> >>> >>> > seems like a reasonable compromise, and I think it will help
>> reduce
>> >>> the
>> >>> >>> > complexity of the build/release tooling.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> ted.dunning@gmail.com>
>> >>> >>> wrote:
>> >>> >>> >
>> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >>> >>> wrote:
>> >>> >>> >>
>> >>> >>> >> >
>> >>> >>> >> > > The community will be less willing to accept large
>> >>> >>> >> > > changes that require multiple rounds of patches for
>> stability
>> >>> and
>> >>> >>> API
>> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>> >>> community
>> >>> >>> took
>> >>> >>> >> a
>> >>> >>> >> > > significantly long time for the very same reason.
>> >>> >>> >> >
>> >>> >>> >> > Please don't use bad experiences from another open source
>> >>> community as
>> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't go
>> the
>> >>> way
>> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct community
>> which
>> >>> >>> >> > happens to operate under a similar open governance model.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> There are some more radical and community building options as
>> well.
>> >>> Take
>> >>> >>> >> the subversion project as a precedent. With subversion, any
>> Apache
>> >>> >>> >> committer can request and receive a commit bit on some large
>> >>> fraction of
>> >>> >>> >> subversion.
>> >>> >>> >>
>> >>> >>> >> So why not take this a bit further and give every parquet
>> committer
>> >>> a
>> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> committers in
>> >>> >>> Arrow?
>> >>> >>> >> Possibly even make it policy that every Parquet committer who
>> asks
>> >>> will
>> >>> >>> be
>> >>> >>> >> given committer status in Arrow.
>> >>> >>> >>
>> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> committers
>> >>> >>> can't be
>> >>> >>> >> worried at that point whether their patches will get merged;
>> they
>> >>> can
>> >>> >>> just
>> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>> >>> Parquet
>> >>> >>> >> committers. After all, Arrow already depends a lot on parquet so
>> >>> why not
>> >>> >>> >> invite them in?
>> >>> >>> >>
>> >>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> regards,
>> >> Deepak Majeti
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
I think the circular dependency can be broken if we build a new library for
the platform code. This will also make it easy for other projects such as
ORC to use it.
I also remember your proposal a while ago of having a separate project for
the platform code. That project can live in the arrow repo. However, one
would have to clone the entire apache arrow repo but could build just the
platform code. This would be temporary until we find a new home for it.

The dependency will look like:
libarrow(arrow core / bindings) <- libparquet (parquet core) <-
libplatform(platform api)

CI workflow will clone the arrow project twice, once for the platform
library and once for the arrow-core/bindings library.
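The layering described above could be sketched roughly in CMake as follows. This is only an illustration of the proposed dependency direction; the target and source names here are hypothetical and do not come from either project's actual build system:

```cmake
# Hypothetical layering sketch; real target names in the Arrow/Parquet
# CMake systems may differ.

# 1) Platform layer: file interfaces, memory pools, shared utilities.
add_library(platform src/platform/io.cc src/platform/memory_pool.cc)
target_include_directories(platform PUBLIC ${CMAKE_SOURCE_DIR}/src)

# 2) Parquet core depends only on the platform layer, not on arrow core.
add_library(parquet src/parquet/reader.cc src/parquet/writer.cc)
target_link_libraries(parquet PUBLIC platform)

# 3) Arrow core / bindings sit on top; the columnar<->Parquet adapter
#    links against both sides.
add_library(arrow src/arrow/array.cc src/arrow/table.cc)
target_link_libraries(arrow PUBLIC platform)

add_library(parquet_arrow src/parquet/arrow/adapter.cc)
target_link_libraries(parquet_arrow PUBLIC arrow parquet)
```

With this shape, nothing in libparquet links against libarrow, so the circular dependency disappears.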

There is no doubt that the collaborations between the Arrow and Parquet
communities so far have been very successful.
The reason to maintain this relationship moving forward is to continue to
reap the mutual benefits.
We should continue to take advantage of sharing code as well. However, I
don't see any code sharing opportunities between arrow-core and the
parquet-core. Both have different functions.

We are at a point where the parquet-cpp public API is pretty stable. We
already passed that difficult stage. My take on arrow and parquet is to
keep them nimble while we can.




On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:

> > The current Arrow adaptor code for parquet should live in the arrow
> repo. That will remove a majority of the dependency issues. Joshua's work
> would not have been blocked in parquet-cpp if that adapter was in the arrow
> repo.  This will be similar to the ORC adaptor.
>
> This has been suggested before, but I don't see how it would alleviate
> any issues because of the significant dependencies on other parts of
> the Arrow codebase. What you are proposing is:
>
> - (Arrow) arrow platform
> - (Parquet) parquet core
> - (Arrow) arrow columnar-parquet adapter interface
> - (Arrow) Python bindings
>
> To make this work, somehow Arrow core / libarrow would have to be
> built before invoking the Parquet core part of the build system. You
> would need to pass dependent targets across different CMake build
> systems; I don't know if it's possible (I spent some time looking into
> it earlier this year). This is what I meant by the lack of a "concrete
> and actionable plan". The only thing that would really work would be
> for the Parquet core to be "included" in the Arrow build system
> somehow rather than using ExternalProject. Currently Parquet builds
> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
> system because it's only depended upon by the Python bindings.
>
> And even if a solution could be devised, it would not wholly resolve
> the CI workflow issues.
>
> You could make Parquet completely independent of the Arrow codebase,
> but at that point there is little reason to maintain a relationship
> between the projects or their communities. We have spent a great deal
> of effort refactoring the two projects to enable as much code sharing
> as there is now.
>
> - Wes
>
> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com> wrote:
> >> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is in no way a better approach.
> >
> > Yes, indeed. In my view, the next best option after a monorepo is to
> > fork. That would obviously be a bad outcome for the community.
> >
> > It doesn't look like I will be able to convince you that a monorepo is
> > a good idea; what I would ask instead is that you be willing to give
> > it a shot, and if it turns out in the way you're describing (which I
> > don't think it will) then I suggest that we fork at that point.
> >
> > - Wes
> >
> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <ma...@gmail.com>
> wrote:
> >> Wes,
> >>
> >> Unfortunately, I cannot show you any practical fact-based problems of a
> >> non-existent Arrow-Parquet mono-repo.
> >> Bringing in related Apache community experiences is more meaningful
> than
> >> how mono-repos work at Google and other big organizations.
> >> We solely depend on volunteers and cannot hire full-time developers.
> >> You are very well aware of how difficult it has been to find more
> >> contributors and maintainers for Arrow. parquet-cpp already has a low
> >> contribution rate to its core components.
> >>
> >> We should target to ensure that new volunteers who want to contribute
> >> bug-fixes/features should spend the least amount of time in figuring out
> >> the project repo. We can never come up with an automated build system
> that
> >> caters to every possible environment.
> >> My only concern is if the mono-repo will make it harder for new
> developers
> >> to work on parquet-cpp core just due to the additional code, build and
> test
> >> dependencies.
> >> I am not saying that the Arrow community/committers will be less
> >> co-operative.
> >> I just don't think the mono-repo structure model will be sustainable in
> an
> >> open source community unless there are long-term vested interests. We
> can't
> >> predict that.
> >>
> >> The current circular dependency problems between Arrow and Parquet are a
> >> major problem for the community, and addressing them is important.
> >>
> >> The current Arrow adaptor code for parquet should live in the arrow
> repo.
> >> That will remove a majority of the dependency issues.
> >> Joshua's work would not have been blocked in parquet-cpp if that adapter
> >> was in the arrow repo.  This will be similar to the ORC adaptor.
> >>
> >> The platform API code is pretty stable at this point. Minor changes in
> the
> >> future to this code should not be the main reason to combine the arrow
> >> parquet repos.
> >>
> >> "I question whether it's worth the community's time long term to wear
> >> ourselves out defining custom 'ports' / virtual interfaces in each
> >> library to plug components together rather than utilizing common
> >> platform APIs."
> >>
> >> My answer to your question below would be "Yes". Modularity/separation
> is
> >> very important in an open source community where priorities of
> contributors
> >> are often short term.
> >> The retention is low and therefore the acquisition costs should be low
> as
> >> well. This is the community over code approach according to me. Minor
> code
> >> duplication is not a deal breaker.
> >> ORC, Parquet, Arrow, etc. are all different components in the big data
> >> space serving their own functions.
> >>
> >> If you still strongly feel that the only way forward is to clone the
> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> >> parquet-cpp repos is in no way a better approach.
> >>
> >>
> >>
> >>
> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >>> @Antoine
> >>>
> >>> > By the way, one concern with the monorepo approach: it would slightly
> >>> increase Arrow CI times (which are already too large).
> >>>
> >>> A typical CI run in Arrow is taking about 45 minutes:
> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >>>
> >>> A Parquet run takes about 28 minutes:
> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >>>
> >>> Inevitably we will need to create some kind of bot to run certain
> >>> builds on-demand based on commit / PR metadata or on request.
> >>>
> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be
> >>> made substantially shorter by moving some of the slower parts (like
> >>> the Python ASV benchmarks) from being tested every-commit to nightly
> >>> or on demand. Using ASAN instead of valgrind in Travis would also
> >>> improve build times (valgrind build could be moved to a nightly
> >>> exhaustive test run)
> >>>
> >>> - Wes
> >>>
> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >> I would like to point out that arrow's use of orc is a great
> example of
> >>> how it would be possible to manage parquet-cpp as a separate codebase.
> That
> >>> gives me hope that the projects could be managed separately some day.
> >>> >
> >>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
> >>> > features several areas of duplicated logic which could be replaced by
> >>> > components from the Arrow platform for better platform-wide
> >>> > interoperability:
> >>> >
> >>> >
> >>>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >>> >
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >>> >
> >>>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >>> >
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >>> >
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >>> >
> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
> >>> > bugs that we had to fix in Arrow's build system to prevent them from
> >>> > leaking to third party linkers when statically linked (ORC is only
> >>> > available for static linking at the moment AFAIK).
> >>> >
> >>> > I question whether it's worth the community's time long term to wear
> >>> > ourselves out defining custom "ports" / virtual interfaces in each
> >>> > library to plug components together rather than utilizing common
> >>> > platform APIs.
> >>> >
> >>> > - Wes
> >>> >
> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> joshuastorck@gmail.com>
> >>> wrote:
> >>> >> Your point about the constraints of the ASF release process is
> well
> >>> >> taken and as a developer who's trying to work in the current
> >>> environment I
> >>> >> would be much happier if the codebases were merged. The main issues
> I
> >>> worry
> >>> >> about when you put codebases like these together are:
> >>> >>
> >>> >> 1. The delineation of API's become blurred and the code becomes too
> >>> coupled
> >>> >> 2. Release of artifacts that are lower in the dependency tree are
> >>> delayed
> >>> >> by artifacts higher in the dependency tree
> >>> >>
> >>> >> If the project/release management is structured well and someone
> keeps
> >>> an
> >>> >> eye on the coupling, then I don't have any concerns.
> >>> >>
> >>> >> I would like to point out that arrow's use of orc is a great
> example of
> >>> how
> >>> >> it would be possible to manage parquet-cpp as a separate codebase.
> That
> >>> >> gives me hope that the projects could be managed separately some
> day.
> >>> >>
> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >>> hi Josh,
> >>> >>>
> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
> >>> tying
> >>> >>> them together seems like the wrong choice.
> >>> >>>
> >>> >>> Apache is "Community over Code"; right now it's the same people
> >>> >>> building these projects -- my argument (which I think you agree
> with?)
> >>> >>> is that we should work more closely together until the community
> grows
> >>> >>> large enough to support larger-scope process than we have now. As
> >>> >>> you've seen, our process isn't serving developers of these
> projects.
> >>> >>>
> >>> >>> > I also think build tooling should be pulled into its own
> codebase.
> >>> >>>
> >>> >>> I don't see how this can possibly be practical taking into
> >>> >>> consideration the constraints imposed by the combination of the
> GitHub
> >>> >>> platform and the ASF release process. I'm all for being idealistic,
> >>> >>> but right now we need to be practical. Unless we can devise a
> >>> >>> practical procedure that can accommodate at least 1 patch per day
> >>> >>> which may touch both code and build system simultaneously without
> >>> >>> being a hindrance to contributor or maintainer, I don't see how we
> can
> >>> >>> move forward.
> >>> >>>
> >>> >>> > That being said, I think it makes sense to merge the codebases
> in the
> >>> >>> short term with the express purpose of separating them in the near
> >>> term.
> >>> >>>
> >>> >>> I would agree but only if separation can be demonstrated to be
> >>> >>> practical and result in net improvements in productivity and
> community
> >>> >>> growth. I think experience has clearly demonstrated that the
> current
> >>> >>> separation is impractical, and is causing problems.
> >>> >>>
> >>> >>> Per Julian's and Ted's comments, I think we need to consider
> >>> >>> development process and ASF releases separately. My argument is as
> >>> >>> follows:
> >>> >>>
> >>> >>> * Monorepo for development (for practicality)
> >>> >>> * Releases structured according to the desires of the PMCs
> >>> >>>
> >>> >>> - Wes
> >>> >>>
> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> joshuastorck@gmail.com
> >>> >
> >>> >>> wrote:
> >>> >>> > I recently worked on an issue that had to be implemented in
> >>> parquet-cpp
> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> (ARROW-2585,
> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
> hard to
> >>> work
> >>> >>> > with. For example, I still have a PR open in parquet-cpp
> (created on
> >>> May
> >>> >>> > 10) because of a PR that it depended on in arrow that was
> recently
> >>> >>> merged.
> >>> >>> > I couldn't even address any CI issues in the PR because the
> change in
> >>> >>> arrow
> >>> >>> > was not yet in master. In a separate PR, I changed the
> >>> >>> run_clang_format.py
> >>> >>> > script in the arrow project only to find out later that there
> was an
> >>> >>> exact
> >>> >>> > copy of it in parquet-cpp.
> >>> >>> >
> >>> >>> > However, I don't think merging the codebases makes sense in the
> long
> >>> >>> term.
> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
> >>> tying
> >>> >>> them
> >>> >>> > together seems like the wrong choice. There will be other formats
> >>> that
> >>> >>> > arrow needs to support that will be kept separate (e.g. - Orc),
> so I
> >>> >>> don't
> >>> >>> > see why parquet should be special. I also think build tooling
> should
> >>> be
> >>> >>> > pulled into its own codebase. GNU has had a long history of
> >>> developing
> >>> >>> open
> >>> >>> > source C/C++ projects that way and made projects like
> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
> good
> >>> >>> > counter-example since there have been lots of successful open
> source
> >>> >>> > projects that have used nightly build systems that pinned
> versions of
> >>> >>> > dependent software.
> >>> >>> >
> >>> >>> > That being said, I think it makes sense to merge the codebases
> in the
> >>> >>> short
> >>> >>> > term with the express purpose of separating them in the near
> term.
> >>> My
> >>> >>> > reasoning is as follows. By putting the codebases together, you
> can
> >>> more
> >>> >>> > easily delineate the boundaries between the API's with a single
> PR.
> >>> >>> Second,
> >>> >>> > it will force the build tooling to converge instead of diverge,
> >>> which has
> >>> >>> > already happened. Once the boundaries and tooling have been
> sorted
> >>> out,
> >>> >>> it
> >>> >>> > should be easy to separate them back into their own codebases.
> >>> >>> >
> >>> >>> > If the codebases are merged, I would ask that the C++ codebases
> for
> >>> arrow
> >>> >>> > be separated from other languages. Looking at it from the
> >>> perspective of
> >>> >>> a
> >>> >>> > parquet-cpp library user, having a dependency on Java is a large
> tax
> >>> to
> >>> >>> pay
> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the
> 0.10.0
> >>> >>> > release of arrow, many of which were holding up the release. I
> hope
> >>> that
> >>> >>> > seems like a reasonable compromise, and I think it will help
> reduce
> >>> the
> >>> >>> > complexity of the build/release tooling.
> >>> >>> >
> >>> >>> >
> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> ted.dunning@gmail.com>
> >>> >>> wrote:
> >>> >>> >
> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> wesmckinn@gmail.com>
> >>> >>> wrote:
> >>> >>> >>
> >>> >>> >> >
> >>> >>> >> > > The community will be less willing to accept large
> >>> >>> >> > > changes that require multiple rounds of patches for
> stability
> >>> and
> >>> >>> API
> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
> >>> community
> >>> >>> took
> >>> >>> >> a
> >>> >>> >> > > significantly long time for the very same reason.
> >>> >>> >> >
> >>> >>> >> > Please don't use bad experiences from another open source
> >>> community as
> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't go
> the
> >>> way
> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct community
> which
> >>> >>> >> > happens to operate under a similar open governance model.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> There are some more radical and community building options as
> well.
> >>> Take
> >>> >>> >> the subversion project as a precedent. With subversion, any
> Apache
> >>> >>> >> committer can request and receive a commit bit on some large
> >>> fraction of
> >>> >>> >> subversion.
> >>> >>> >>
> >>> >>> >> So why not take this a bit further and give every parquet
> committer
> >>> a
> >>> >>> >> commit bit in Arrow? Or even make them be first class
> committers in
> >>> >>> Arrow?
> >>> >>> >> Possibly even make it policy that every Parquet committer who
> asks
> >>> will
> >>> >>> be
> >>> >>> >> given committer status in Arrow.
> >>> >>> >>
> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> committers
> >>> >>> can't be
> >>> >>> >> worried at that point whether their patches will get merged;
> they
> >>> can
> >>> >>> just
> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
> >>> Parquet
> >>> >>> >> committers. After all, Arrow already depends a lot on parquet so
> >>> why not
> >>> >>> >> invite them in?
> >>> >>> >>
> >>> >>>
> >>>
> >>
> >>
> >> --
> >> regards,
> >> Deepak Majeti
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
I think the circular dependency can be broken if we build a new library for
the platform code. This will also make it easy for other projects such as
ORC to use it.
I also remember your proposal a while ago of having a separate project for
the platform code.  That project can live in the arrow repo. However, one
has to clone the entire apache arrow repo but can just build the platform
code. This will be temporary until we can find a new home for it.

The dependency will look like:
libarrow(arrow core / bindings) <- libparquet (parquet core) <-
libplatform(platform api)

CI workflow will clone the arrow project twice, once for the platform
library and once for the arrow-core/bindings library.

There is no doubt that the collaborations between the Arrow and Parquet
communities so far have been very successful.
The reason to maintain this relationship moving forward is to continue to
reap the mutual benefits.
We should continue to take advantage of sharing code as well. However, I
don't see any code sharing opportunities between arrow-core and the
parquet-core. Both have different functions.

We are at a point where the parquet-cpp public API is pretty stable. We
already passed that difficult stage. My take at arrow and parquet is to
keep them nimble since we can.




On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:

> > The current Arrow adaptor code for parquet should live in the arrow
> repo. That will remove a majority of the dependency issues. Joshua's work
> would not have been blocked in parquet-cpp if that adapter was in the arrow
> repo.  This will be similar to the ORC adaptor.
>
> This has been suggested before, but I don't see how it would alleviate
> any issues because of the significant dependencies on other parts of
> the Arrow codebase. What you are proposing is:
>
> - (Arrow) arrow platform
> - (Parquet) parquet core
> - (Arrow) arrow columnar-parquet adapter interface
> - (Arrow) Python bindings
>
> To make this work, somehow Arrow core / libarrow would have to be
> built before invoking the Parquet core part of the build system. You
> would need to pass dependent targets across different CMake build
> systems; I don't know if it's possible (I spent some time looking into
> it earlier this year). This is what I meant by the lack of a "concrete
> and actionable plan". The only thing that would really work would be
> for the Parquet core to be "included" in the Arrow build system
> somehow rather than using ExternalProject. Currently Parquet builds
> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
> system because it's only depended upon by the Python bindings.
>
> And even if a solution could be devised, it would not wholly resolve
> the CI workflow issues.
>
> You could make Parquet completely independent of the Arrow codebase,
> but at that point there is little reason to maintain a relationship
> between the projects or their communities. We have spent a great deal
> of effort refactoring the two projects to enable as much code sharing
> as there is now.
>
> - Wes
>
> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com> wrote:
> >> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is no way a better approach.
> >
> > Yes, indeed. In my view, the next best option after a monorepo is to
> > fork. That would obviously be a bad outcome for the community.
> >
> > It doesn't look like I will be able to convince you that a monorepo is
> > a good idea; what I would ask instead is that you be willing to give
> > it a shot, and if it turns out in the way you're describing (which I
> > don't think it will) then I suggest that we fork at that point.
> >
> > - Wes
> >
> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <ma...@gmail.com>
> wrote:
> >> Wes,
> >>
> >> Unfortunately, I cannot show you any practical fact-based problems of a
> >> non-existent Arrow-Parquet mono-repo.
> >> Bringing in related Apache community experiences are more meaningful
> than
> >> how mono-repos work at Google and other big organizations.
> >> We solely depend on volunteers and cannot hire full-time developers.
> >> You are very well aware of how difficult it has been to find more
> >> contributors and maintainers for Arrow. parquet-cpp already has a low
> >> contribution rate to its core components.
> >>
> >> We should target to ensure that new volunteers who want to contribute
> >> bug-fixes/features should spend the least amount of time in figuring out
> >> the project repo. We can never come up with an automated build system
> that
> >> caters to every possible environment.
> >> My only concern is if the mono-repo will make it harder for new
> developers
> >> to work on parquet-cpp core just due to the additional code, build and
> test
> >> dependencies.
> >> I am not saying that the Arrow community/committers will be less
> >> co-operative.
> >> I just don't think the mono-repo structure model will be sustainable in
> an
> >> open source community unless there are long-term vested interests. We
> can't
> >> predict that.
> >>
> >> The current circular dependency problems between Arrow and Parquet is a
> >> major problem for the community and it is important.
> >>
> >> The current Arrow adaptor code for parquet should live in the arrow
> repo.
> >> That will remove a majority of the dependency issues.
> >> Joshua's work would not have been blocked in parquet-cpp if that adapter
> >> was in the arrow repo.  This will be similar to the ORC adaptor.
> >>
> >> The platform API code is pretty stable at this point. Minor changes in
> the
> >> future to this code should not be the main reason to combine the arrow
> >> parquet repos.
> >>
> >> "
> >> *I question whether it's worth the community's time long term to wear*
> >>
> >>
> >> *ourselves out defining custom "ports" / virtual interfaces in
> eachlibrary
> >> to plug components together rather than utilizing commonplatform APIs.*"
> >>
> >> My answer to your question below would be "Yes". Modularity/separation
> >> is very important in an open source community where priorities of
> >> contributors are often short term.
> >> Contributor retention is low, and therefore the cost of getting new
> >> contributors up to speed should be low as well. That is the "community
> >> over code" approach, in my view. Minor code duplication is not a deal
> >> breaker.
> >> ORC, Parquet, Arrow, etc. are all different components in the big data
> >> space serving their own functions.
> >>
> >> If you still strongly feel that the only way forward is to clone the
> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> >> parquet-cpp repos is in no way a better approach.
> >>
> >>
> >>
> >>
> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com> wrote:
> >>
> >>> @Antoine
> >>>
> >>> > By the way, one concern with the monorepo approach: it would slightly
> >>> increase Arrow CI times (which are already too large).
> >>>
> >>> A typical CI run in Arrow is taking about 45 minutes:
> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >>>
> >>> A Parquet run takes about 28 minutes:
> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >>>
> >>> Inevitably we will need to create some kind of bot to run certain
> >>> builds on-demand based on commit / PR metadata or on request.
> >>>
> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
> >>> made substantially shorter by moving some of the slower parts (like
> >>> the Python ASV benchmarks) from being tested every-commit to nightly
> >>> or on demand. Using ASAN instead of valgrind in Travis would also
> >>> improve build times (valgrind build could be moved to a nightly
> >>> exhaustive test run)
> >>>
> >>> - Wes
> >>>
> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >> I would like to point out that arrow's use of orc is a great
> example of
> >>> how it would be possible to manage parquet-cpp as a separate codebase.
> That
> >>> gives me hope that the projects could be managed separately some day.
> >>> >
> >>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
> >>> > features several areas of duplicated logic which could be replaced by
> >>> > components from the Arrow platform for better platform-wide
> >>> > interoperability:
> >>> >
> >>> >
> >>>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >>> >
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >>> >
> >>>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >>> >
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >>> >
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >>> >
> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
> >>> > bugs that we had to fix in Arrow's build system to prevent them from
> >>> > leaking to third party linkers when statically linked (ORC is only
> >>> > available for static linking at the moment AFAIK).
> >>> >
> >>> > I question whether it's worth the community's time long term to wear
> >>> > ourselves out defining custom "ports" / virtual interfaces in each
> >>> > library to plug components together rather than utilizing common
> >>> > platform APIs.
> >>> >
> >>> > - Wes
> >>> >
> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
> joshuastorck@gmail.com>
> >>> wrote:
> >>> >> Your point about the constraints of the ASF release process is well
> >>> >> taken, and as a developer who's trying to work in the current
> >>> environment I
> >>> >> would be much happier if the codebases were merged. The main issues
> I
> >>> worry
> >>> >> about when you put codebases like these together are:
> >>> >>
> >>> >> 1. The delineation of APIs becomes blurred and the code becomes too
> >>> coupled
> >>> >> 2. Release of artifacts that are lower in the dependency tree are
> >>> delayed
> >>> >> by artifacts higher in the dependency tree
> >>> >>
> >>> >> If the project/release management is structured well and someone
> keeps
> >>> an
> >>> >> eye on the coupling, then I don't have any concerns.
> >>> >>
> >>> >> I would like to point out that arrow's use of orc is a great
> example of
> >>> how
> >>> >> it would be possible to manage parquet-cpp as a separate codebase.
> That
> >>> >> gives me hope that the projects could be managed separately some
> day.
> >>> >>
> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >>> hi Josh,
> >>> >>>
> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
> >>> tying
> >>> >>> them together seems like the wrong choice.
> >>> >>>
> >>> >>> Apache is "Community over Code"; right now it's the same people
> >>> >>> building these projects -- my argument (which I think you agree
> with?)
> >>> >>> is that we should work more closely together until the community
> grows
> >>> >>> large enough to support larger-scope process than we have now. As
> >>> >>> you've seen, our process isn't serving developers of these
> projects.
> >>> >>>
> >>> >>> > I also think build tooling should be pulled into its own
> codebase.
> >>> >>>
> >>> >>> I don't see how this can possibly be practical taking into
> >>> >>> consideration the constraints imposed by the combination of the
> GitHub
> >>> >>> platform and the ASF release process. I'm all for being idealistic,
> >>> >>> but right now we need to be practical. Unless we can devise a
> >>> >>> practical procedure that can accommodate at least 1 patch per day
> >>> >>> which may touch both code and build system simultaneously without
> >>> >>> being a hindrance to contributor or maintainer, I don't see how we
> can
> >>> >>> move forward.
> >>> >>>
> >>> >>> > That being said, I think it makes sense to merge the codebases
> in the
> >>> >>> short term with the express purpose of separating them in the near
> >>> term.
> >>> >>>
> >>> >>> I would agree but only if separation can be demonstrated to be
> >>> >>> practical and result in net improvements in productivity and
> community
> >>> >>> growth. I think experience has clearly demonstrated that the
> current
> >>> >>> separation is impractical, and is causing problems.
> >>> >>>
> >>> >>> Per Julian's and Ted's comments, I think we need to consider
> >>> >>> development process and ASF releases separately. My argument is as
> >>> >>> follows:
> >>> >>>
> >>> >>> * Monorepo for development (for practicality)
> >>> >>> * Releases structured according to the desires of the PMCs
> >>> >>>
> >>> >>> - Wes
> >>> >>>
> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
> joshuastorck@gmail.com
> >>> >
> >>> >>> wrote:
> >>> >>> > I recently worked on an issue that had to be implemented in
> >>> parquet-cpp
> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
> (ARROW-2585,
> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
> hard to
> >>> work
> >>> >>> > with. For example, I still have a PR open in parquet-cpp
> (created on
> >>> May
> >>> >>> > 10) because of a PR that it depended on in arrow that was
> recently
> >>> >>> merged.
> >>> >>> > I couldn't even address any CI issues in the PR because the
> change in
> >>> >>> arrow
> >>> >>> > was not yet in master. In a separate PR, I changed the
> >>> >>> run_clang_format.py
> >>> >>> > script in the arrow project only to find out later that there
> was an
> >>> >>> exact
> >>> >>> > copy of it in parquet-cpp.
> >>> >>> >
> >>> >>> > However, I don't think merging the codebases makes sense in the
> long
> >>> >>> term.
> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
> >>> tying
> >>> >>> them
> >>> >>> > together seems like the wrong choice. There will be other formats
> >>> that
> >>> >>> > arrow needs to support that will be kept separate (e.g. - Orc),
> so I
> >>> >>> don't
> >>> >>> > see why parquet should be special. I also think build tooling
> should
> >>> be
> >>> >>> > pulled into its own codebase. GNU has had a long history of
> >>> developing
> >>> >>> open
> >>> >>> > source C/C++ projects that way and made projects like
> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
> good
> >>> >>> > counter-example since there have been lots of successful open
> source
> >>> >>> > projects that have used nightly build systems that pinned
> versions of
> >>> >>> > dependent software.
> >>> >>> >
> >>> >>> > That being said, I think it makes sense to merge the codebases
> in the
> >>> >>> short
> >>> >>> > term with the express purpose of separating them in the near
> term.
> >>> My
> >>> >>> > reasoning is as follows. By putting the codebases together, you
> can
> >>> more
> >>> >>> > easily delineate the boundaries between the API's with a single
> PR.
> >>> >>> Second,
> >>> >>> > it will force the build tooling to converge instead of diverge,
> >>> which has
> >>> >>> > already happened. Once the boundaries and tooling have been
> sorted
> >>> out,
> >>> >>> it
> >>> >>> > should be easy to separate them back into their own codebases.
> >>> >>> >
> >>> >>> > If the codebases are merged, I would ask that the C++ codebases
> for
> >>> arrow
> >>> >>> > be separated from other languages. Looking at it from the
> >>> perspective of
> >>> >>> a
> >>> >>> > parquet-cpp library user, having a dependency on Java is a large
> tax
> >>> to
> >>> >>> pay
> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the
> 0.10.0
> >>> >>> > release of arrow, many of which were holding up the release. I
> hope
> >>> that
> >>> >>> > seems like a reasonable compromise, and I think it will help
> reduce
> >>> the
> >>> >>> > complexity of the build/release tooling.
> >>> >>> >
> >>> >>> >
> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
> ted.dunning@gmail.com>
> >>> >>> wrote:
> >>> >>> >
> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
> wesmckinn@gmail.com>
> >>> >>> wrote:
> >>> >>> >>
> >>> >>> >> >
> >>> >>> >> > > The community will be less willing to accept large
> >>> >>> >> > > changes that require multiple rounds of patches for
> stability
> >>> and
> >>> >>> API
> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
> >>> community
> >>> >>> took
> >>> >>> >> a
> >>> >>> >> > > significantly long time for the very same reason.
> >>> >>> >> >
> >>> >>> >> > Please don't use bad experiences from another open source
> >>> community as
> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't go
> the
> >>> way
> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct community
> which
> >>> >>> >> > happens to operate under a similar open governance model.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> There are some more radical and community building options as
> well.
> >>> Take
> >>> >>> >> the subversion project as a precedent. With subversion, any
> Apache
> >>> >>> >> committer can request and receive a commit bit on some large
> >>> fraction of
> >>> >>> >> subversion.
> >>> >>> >>
> >>> >>> >> So why not take this a bit further and give every parquet
> committer
> >>> a
> >>> >>> >> commit bit in Arrow? Or even make them be first class
> committers in
> >>> >>> Arrow?
> >>> >>> >> Possibly even make it policy that every Parquet committer who
> asks
> >>> will
> >>> >>> be
> >>> >>> >> given committer status in Arrow.
> >>> >>> >>
> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> committers
> >>> >>> can't be
> >>> >>> >> worried at that point whether their patches will get merged;
> they
> >>> can
> >>> >>> just
> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
> >>> Parquet
> >>> >>> >> committers. After all, Arrow already depends a lot on parquet so
> >>> why not
> >>> >>> >> invite them in?
> >>> >>> >>
> >>> >>>
> >>>
> >>
> >>
> >> --
> >> regards,
> >> Deepak Majeti
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
> The current Arrow adaptor code for parquet should live in the arrow repo. That will remove a majority of the dependency issues. Joshua's work would not have been blocked in parquet-cpp if that adapter was in the arrow repo.  This will be similar to the ORC adaptor.

This has been suggested before, but I don't see how it would alleviate
any issues because of the significant dependencies on other parts of
the Arrow codebase. What you are proposing is:

- (Arrow) arrow platform
- (Parquet) parquet core
- (Arrow) arrow columnar-parquet adapter interface
- (Arrow) Python bindings

To make this work, somehow Arrow core / libarrow would have to be
built before invoking the Parquet core part of the build system. You
would need to pass dependent targets across different CMake build
systems; I don't know if it's possible (I spent some time looking into
it earlier this year). This is what I meant by the lack of a "concrete
and actionable plan". The only thing that would really work would be
for the Parquet core to be "included" in the Arrow build system
somehow rather than using ExternalProject. Currently Parquet builds
Arrow using ExternalProject, and Parquet is unknown to the Arrow build
system because it's only depended upon by the Python bindings.
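
To make the mechanics concrete, here is a minimal sketch of the
ExternalProject arrangement described above. All target and path names
here are illustrative, not taken from the actual parquet-cpp build
files:

```cmake
# Sketch only: parquet-cpp drives Arrow as an external project, so
# Arrow's build runs in a separate CMake invocation and none of its
# targets are visible to this one.
include(ExternalProject)

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/arrow_prefix)

# Because arrow_ep is opaque to this build, Parquet cannot link against
# an Arrow CMake target directly; it has to import the installed library
# by path and wire up the dependency edge by hand.
add_library(arrow_imported SHARED IMPORTED)
set_target_properties(arrow_imported PROPERTIES
  IMPORTED_LOCATION ${CMAKE_BINARY_DIR}/arrow_prefix/lib/libarrow.so)
add_dependencies(arrow_imported arrow_ep)

# In a single build system the same relationship would be ordinary
# target propagation (include paths, flags, transitive dependencies):
#   add_subdirectory(arrow/cpp)
#   target_link_libraries(parquet_shared PRIVATE arrow_imported)
```

The imported-target workaround is exactly where things break down:
usage requirements do not propagate across the ExternalProject
boundary, which is why "including" Parquet core in the Arrow build
system is the only arrangement that really works.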

And even if a solution could be devised, it would not wholly resolve
the CI workflow issues.

You could make Parquet completely independent of the Arrow codebase,
but at that point there is little reason to maintain a relationship
between the projects or their communities. We have spent a great deal
of effort refactoring the two projects to enable as much code sharing
as there is now.

- Wes

On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com> wrote:
>> If you still strongly feel that the only way forward is to clone the parquet-cpp repo and part ways, I will withdraw my concern. Having two parquet-cpp repos is in no way a better approach.
>
> Yes, indeed. In my view, the next best option after a monorepo is to
> fork. That would obviously be a bad outcome for the community.
>
> It doesn't look like I will be able to convince you that a monorepo is
> a good idea; what I would ask instead is that you be willing to give
> it a shot, and if it turns out in the way you're describing (which I
> don't think it will) then I suggest that we fork at that point.
>
> - Wes
>
> On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <ma...@gmail.com> wrote:
>> Wes,
>>
>> Unfortunately, I cannot show you any practical fact-based problems of a
>> non-existent Arrow-Parquet mono-repo.
>> Bringing in related Apache community experiences is more meaningful than
>> citing how mono-repos work at Google and other big organizations.
>> We solely depend on volunteers and cannot hire full-time developers.
>> You are very well aware of how difficult it has been to find more
>> contributors and maintainers for Arrow. parquet-cpp already has a low
>> contribution rate to its core components.
>>
>> We should aim to ensure that new volunteers who want to contribute
>> bug-fixes/features spend the least amount of time figuring out the
>> project repo. We can never come up with an automated build system that
>> caters to every possible environment.
>> My only concern is if the mono-repo will make it harder for new developers
>> to work on parquet-cpp core just due to the additional code, build and test
>> dependencies.
>> I am not saying that the Arrow community/committers will be less
>> co-operative.
>> I just don't think the mono-repo structure model will be sustainable in an
>> open source community unless there are long-term vested interests. We can't
>> predict that.
>>
>> The current circular dependency problems between Arrow and Parquet are a
>> major problem for the community, and fixing them is important.
>>
>> The current Arrow adaptor code for parquet should live in the arrow repo.
>> That will remove a majority of the dependency issues.
>> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> was in the arrow repo.  This will be similar to the ORC adaptor.
>>
>> The platform API code is pretty stable at this point. Minor changes in the
>> future to this code should not be the main reason to combine the arrow
>> parquet repos.
>>
>> "
>> *I question whether it's worth the community's time long term to wear*
>>
>>
>> *ourselves out defining custom "ports" / virtual interfaces in eachlibrary
>> to plug components together rather than utilizing commonplatform APIs.*"
>>
>> My answer to your question below would be "Yes". Modularity/separation is
>> very important in an open source community where priorities of contributors
>> are often short term.
>> Contributor retention is low, and therefore the cost of getting new
>> contributors up to speed should be low as well. That is the "community
>> over code" approach, in my view. Minor code duplication is not a deal
>> breaker.
>> ORC, Parquet, Arrow, etc. are all different components in the big data
>> space serving their own functions.
>>
>> If you still strongly feel that the only way forward is to clone the
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> parquet-cpp repos is in no way a better approach.
>>
>>
>>
>>
>> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com> wrote:
>>
>>> @Antoine
>>>
>>> > By the way, one concern with the monorepo approach: it would slightly
>>> increase Arrow CI times (which are already too large).
>>>
>>> A typical CI run in Arrow is taking about 45 minutes:
>>> https://travis-ci.org/apache/arrow/builds/410119750
>>>
>>> A Parquet run takes about 28 minutes:
>>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>>
>>> Inevitably we will need to create some kind of bot to run certain
>>> builds on-demand based on commit / PR metadata or on request.
>>>
>>> The slowest build in Arrow (the Arrow C++/Python one) could be
>>> made substantially shorter by moving some of the slower parts (like
>>> the Python ASV benchmarks) from being tested every-commit to nightly
>>> or on demand. Using ASAN instead of valgrind in Travis would also
>>> improve build times (valgrind build could be moved to a nightly
>>> exhaustive test run)
>>>
>>> - Wes
>>>
>>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >> I would like to point out that arrow's use of orc is a great example of
>>> how it would be possible to manage parquet-cpp as a separate codebase. That
>>> gives me hope that the projects could be managed separately some day.
>>> >
>>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>>> > features several areas of duplicated logic which could be replaced by
>>> > components from the Arrow platform for better platform-wide
>>> > interoperability:
>>> >
>>> >
>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> >
>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>> >
>>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>>> > bugs that we had to fix in Arrow's build system to prevent them from
>>> > leaking to third party linkers when statically linked (ORC is only
>>> > available for static linking at the moment AFAIK).
>>> >
>>> > I question whether it's worth the community's time long term to wear
>>> > ourselves out defining custom "ports" / virtual interfaces in each
>>> > library to plug components together rather than utilizing common
>>> > platform APIs.
>>> >
>>> > - Wes
>>> >
>>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com>
>>> wrote:
>>> >> Your point about the constraints of the ASF release process is well
>>> >> taken, and as a developer who's trying to work in the current
>>> environment I
>>> >> would be much happier if the codebases were merged. The main issues I
>>> worry
>>> >> about when you put codebases like these together are:
>>> >>
>>> >> 1. The delineation of APIs becomes blurred and the code becomes too
>>> coupled
>>> >> 2. Release of artifacts that are lower in the dependency tree are
>>> delayed
>>> >> by artifacts higher in the dependency tree
>>> >>
>>> >> If the project/release management is structured well and someone keeps
>>> an
>>> >> eye on the coupling, then I don't have any concerns.
>>> >>
>>> >> I would like to point out that arrow's use of orc is a great example of
>>> how
>>> >> it would be possible to manage parquet-cpp as a separate codebase. That
>>> >> gives me hope that the projects could be managed separately some day.
>>> >>
>>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >>
>>> >>> hi Josh,
>>> >>>
>>> >>> > I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>> >>> them together seems like the wrong choice.
>>> >>>
>>> >>> Apache is "Community over Code"; right now it's the same people
>>> >>> building these projects -- my argument (which I think you agree with?)
>>> >>> is that we should work more closely together until the community grows
>>> >>> large enough to support larger-scope process than we have now. As
>>> >>> you've seen, our process isn't serving developers of these projects.
>>> >>>
>>> >>> > I also think build tooling should be pulled into its own codebase.
>>> >>>
>>> >>> I don't see how this can possibly be practical taking into
>>> >>> consideration the constraints imposed by the combination of the GitHub
>>> >>> platform and the ASF release process. I'm all for being idealistic,
>>> >>> but right now we need to be practical. Unless we can devise a
>>> >>> practical procedure that can accommodate at least 1 patch per day
>>> >>> which may touch both code and build system simultaneously without
>>> >>> being a hindrance to contributor or maintainer, I don't see how we can
>>> >>> move forward.
>>> >>>
>>> >>> > That being said, I think it makes sense to merge the codebases in the
>>> >>> short term with the express purpose of separating them in the near
>>> term.
>>> >>>
>>> >>> I would agree but only if separation can be demonstrated to be
>>> >>> practical and result in net improvements in productivity and community
>>> >>> growth. I think experience has clearly demonstrated that the current
>>> >>> separation is impractical, and is causing problems.
>>> >>>
>>> >>> Per Julian's and Ted's comments, I think we need to consider
>>> >>> development process and ASF releases separately. My argument is as
>>> >>> follows:
>>> >>>
>>> >>> * Monorepo for development (for practicality)
>>> >>> * Releases structured according to the desires of the PMCs
>>> >>>
>>> >>> - Wes
>>> >>>
>>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com
>>> >
>>> >>> wrote:
>>> >>> > I recently worked on an issue that had to be implemented in
>>> parquet-cpp
>>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>>> >>> > ARROW-2586). I found the circular dependencies confusing and hard to
>>> work
>>> >>> > with. For example, I still have a PR open in parquet-cpp (created on
>>> May
>>> >>> > 10) because of a PR that it depended on in arrow that was recently
>>> >>> merged.
>>> >>> > I couldn't even address any CI issues in the PR because the change in
>>> >>> arrow
>>> >>> > was not yet in master. In a separate PR, I changed the
>>> >>> run_clang_format.py
>>> >>> > script in the arrow project only to find out later that there was an
>>> >>> exact
>>> >>> > copy of it in parquet-cpp.
>>> >>> >
>>> >>> > However, I don't think merging the codebases makes sense in the long
>>> >>> term.
>>> >>> > I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>> >>> them
>>> >>> > together seems like the wrong choice. There will be other formats
>>> that
>>> >>> > arrow needs to support that will be kept separate (e.g. - Orc), so I
>>> >>> don't
>>> >>> > see why parquet should be special. I also think build tooling should
>>> be
>>> >>> > pulled into its own codebase. GNU has had a long history of
>>> developing
>>> >>> open
>>> >>> > source C/C++ projects that way and made projects like
>>> >>> > autoconf/automake/make to support them. I don't think CI is a good
>>> >>> > counter-example since there have been lots of successful open source
>>> >>> > projects that have used nightly build systems that pinned versions of
>>> >>> > dependent software.
>>> >>> >
>>> >>> > That being said, I think it makes sense to merge the codebases in the
>>> >>> short
>>> >>> > term with the express purpose of separating them in the near term.
>>> My
>>> >>> > reasoning is as follows. By putting the codebases together, you can
>>> more
>>> >>> > easily delineate the boundaries between the API's with a single PR.
>>> >>> Second,
>>> >>> > it will force the build tooling to converge instead of diverge,
>>> which has
>>> >>> > already happened. Once the boundaries and tooling have been sorted
>>> out,
>>> >>> it
>>> >>> > should be easy to separate them back into their own codebases.
>>> >>> >
>>> >>> > If the codebases are merged, I would ask that the C++ codebases for
>>> arrow
>>> >>> > be separated from other languages. Looking at it from the
>>> perspective of
>>> >>> a
>>> >>> > parquet-cpp library user, having a dependency on Java is a large tax
>>> to
>>> >>> pay
>>> >>> > if you don't need it. For example, there were 25 JIRA's in the 0.10.0
>>> >>> > release of arrow, many of which were holding up the release. I hope
>>> that
>>> >>> > seems like a reasonable compromise, and I think it will help reduce
>>> the
>>> >>> > complexity of the build/release tooling.
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>>> >>> wrote:
>>> >>> >
>>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>>> >>> wrote:
>>> >>> >>
>>> >>> >> >
>>> >>> >> > > The community will be less willing to accept large
>>> >>> >> > > changes that require multiple rounds of patches for stability
>>> and
>>> >>> API
>>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>>> community
>>> >>> took
>>> >>> >> a
>>> >>> >> > > significantly long time for the very same reason.
>>> >>> >> >
>>> >>> >> > Please don't use bad experiences from another open source
>>> community as
>>> >>> >> > leverage in this discussion. I'm sorry that things didn't go the
>>> way
>>> >>> >> > you wanted in Apache Hadoop but this is a distinct community which
>>> >>> >> > happens to operate under a similar open governance model.
>>> >>> >>
>>> >>> >>
>>> >>> >> There are some more radical and community building options as well.
>>> Take
>>> >>> >> the subversion project as a precedent. With subversion, any Apache
>>> >>> >> committer can request and receive a commit bit on some large
>>> fraction of
>>> >>> >> subversion.
>>> >>> >>
>>> >>> >> So why not take this a bit further and give every parquet committer
>>> a
>>> >>> >> commit bit in Arrow? Or even make them be first class committers in
>>> >>> Arrow?
>>> >>> >> Possibly even make it policy that every Parquet committer who asks
>>> will
>>> >>> be
>>> >>> >> given committer status in Arrow.
>>> >>> >>
>>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
>>> >>> can't be
>>> >>> >> worried at that point whether their patches will get merged; they
>>> can
>>> >>> just
>>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>>> Parquet
>>> >>> >> committers. After all, Arrow already depends a lot on parquet so
>>> why not
>>> >>> >> invite them in?
>>> >>> >>
>>> >>>
>>>
>>
>>
>> --
>> regards,
>> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
> The current Arrow adaptor code for parquet should live in the arrow repo. That will remove a majority of the dependency issues. Joshua's work would not have been blocked in parquet-cpp if that adapter was in the arrow repo.  This will be similar to the ORC adaptor.

This has been suggested before, but I don't see how it would alleviate
any issues because of the significant dependencies on other parts of
the Arrow codebase. What you are proposing is:

- (Arrow) arrow platform
- (Parquet) parquet core
- (Arrow) arrow columnar-parquet adapter interface
- (Arrow) Python bindings

To make this work, somehow Arrow core / libarrow would have to be
built before invoking the Parquet core part of the build system. You
would need to pass dependent targets across different CMake build
systems; I don't know if it's possible (I spent some time looking into
it earlier this year). This is what I meant by the lack of a "concrete
and actionable plan". The only thing that would really work would be
for the Parquet core to be "included" in the Arrow build system
somehow rather than using ExternalProject. Currently Parquet builds
Arrow using ExternalProject, and Parquet is unknown to the Arrow build
system because it's only depended upon by the Python bindings.

And even if a solution could be devised, it would not wholly resolve
the CI workflow issues.

You could make Parquet completely independent of the Arrow codebase,
but at that point there is little reason to maintain a relationship
between the projects or their communities. We have spent a great deal
of effort refactoring the two projects to enable as much code sharing
as there is now.

- Wes

On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <we...@gmail.com> wrote:
>> If you still strongly feel that the only way forward is to clone the parquet-cpp repo and part ways, I will withdraw my concern. Having two parquet-cpp repos is no way a better approach.
>
> Yes, indeed. In my view, the next best option after a monorepo is to
> fork. That would obviously be a bad outcome for the community.
>
> It doesn't look like I will be able to convince you that a monorepo is
> a good idea; what I would ask instead is that you be willing to give
> it a shot, and if it turns out in the way you're describing (which I
> don't think it will) then I suggest that we fork at that point.
>
> - Wes
>
> On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <ma...@gmail.com> wrote:
>> Wes,
>>
>> Unfortunately, I cannot show you any practical fact-based problems of a
>> non-existent Arrow-Parquet mono-repo.
>> Bringing in related Apache community experiences are more meaningful than
>> how mono-repos work at Google and other big organizations.
>> We solely depend on volunteers and cannot hire full-time developers.
>> You are very well aware of how difficult it has been to find more
>> contributors and maintainers for Arrow. parquet-cpp already has a low
>> contribution rate to its core components.
>>
>> We should target to ensure that new volunteers who want to contribute
>> bug-fixes/features should spend the least amount of time in figuring out
>> the project repo. We can never come up with an automated build system that
>> caters to every possible environment.
>> My only concern is if the mono-repo will make it harder for new developers
>> to work on parquet-cpp core just due to the additional code, build and test
>> dependencies.
>> I am not saying that the Arrow community/committers will be less
>> co-operative.
>> I just don't think the mono-repo structure model will be sustainable in an
>> open source community unless there are long-term vested interests. We can't
>> predict that.
>>
>> The current circular dependency problems between Arrow and Parquet is a
>> major problem for the community and it is important.
>>
>> The current Arrow adaptor code for parquet should live in the arrow repo.
>> That will remove a majority of the dependency issues.
>> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> was in the arrow repo.  This will be similar to the ORC adaptor.
>>
>> The platform API code is pretty stable at this point. Minor changes in the
>> future to this code should not be the main reason to combine the arrow
>> parquet repos.
>>
>> "*I question whether it's worth the community's time long term to wear
>> ourselves out defining custom "ports" / virtual interfaces in each library
>> to plug components together rather than utilizing common platform APIs.*"
>>
>> My answer to your question below would be "Yes". Modularity/separation is
>> very important in an open source community where priorities of contributors
>> are often short term.
>> Retention is low, and therefore the acquisition costs should be low as
>> well. That, to me, is the community-over-code approach. Minor code
>> duplication is not a deal breaker.
>> ORC, Parquet, Arrow, etc. are all different components in the big data
>> space serving their own functions.
>>
>> If you still strongly feel that the only way forward is to clone the
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> parquet-cpp repos is in no way a better approach.
>>
>>
>>
>>
>> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com> wrote:
>>
>>> @Antoine
>>>
>>> > By the way, one concern with the monorepo approach: it would slightly
>>> increase Arrow CI times (which are already too large).
>>>
>>> A typical CI run in Arrow is taking about 45 minutes:
>>> https://travis-ci.org/apache/arrow/builds/410119750
>>>
>>> A Parquet run takes about 28 minutes:
>>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>>
>>> Inevitably we will need to create some kind of bot to run certain
>>> builds on-demand based on commit / PR metadata or on request.
>>>
>>> The slowest build in Arrow (the Arrow C++/Python one) could be
>>> made substantially shorter by moving some of the slower parts (like
>>> the Python ASV benchmarks) from being tested every-commit to nightly
>>> or on demand. Using ASAN instead of valgrind in Travis would also
>>> improve build times (valgrind build could be moved to a nightly
>>> exhaustive test run)
>>>
>>> - Wes
>>>
>>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >> I would like to point out that arrow's use of orc is a great example of
>>> how it would be possible to manage parquet-cpp as a separate codebase. That
>>> gives me hope that the projects could be managed separately some day.
>>> >
>>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>>> > features several areas of duplicated logic which could be replaced by
>>> > components from the Arrow platform for better platform-wide
>>> > interoperability:
>>> >
>>> >
>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>>> >
>>> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>>> >
>>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>>> > bugs that we had to fix in Arrow's build system to prevent them from
>>> > leaking to third party linkers when statically linked (ORC is only
>>> > available for static linking at the moment AFAIK).
>>> >
>>> > I question whether it's worth the community's time long term to wear
>>> > ourselves out defining custom "ports" / virtual interfaces in each
>>> > library to plug components together rather than utilizing common
>>> > platform APIs.
>>> >
>>> > - Wes
>>> >
>>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com>
>>> wrote:
>>> >> Your point about the constraints of the ASF release process is well
>>> >> taken and as a developer who's trying to work in the current
>>> environment I
>>> >> would be much happier if the codebases were merged. The main issues I
>>> worry
>>> >> about when you put codebases like these together are:
>>> >>
>>> >> 1. The delineation of APIs becomes blurred and the code becomes too
>>> coupled
>>> >> 2. Release of artifacts that are lower in the dependency tree are
>>> delayed
>>> >> by artifacts higher in the dependency tree
>>> >>
>>> >> If the project/release management is structured well and someone keeps
>>> an
>>> >> eye on the coupling, then I don't have any concerns.
>>> >>
>>> >> I would like to point out that arrow's use of orc is a great example of
>>> how
>>> >> it would be possible to manage parquet-cpp as a separate codebase. That
>>> >> gives me hope that the projects could be managed separately some day.
>>> >>
>>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >>
>>> >>> hi Josh,
>>> >>>
>>> >>> > I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>> >>> them together seems like the wrong choice.
>>> >>>
>>> >>> Apache is "Community over Code"; right now it's the same people
>>> >>> building these projects -- my argument (which I think you agree with?)
>>> >>> is that we should work more closely together until the community grows
>>> >>> large enough to support larger-scope process than we have now. As
>>> >>> you've seen, our process isn't serving developers of these projects.
>>> >>>
>>> >>> > I also think build tooling should be pulled into its own codebase.
>>> >>>
>>> >>> I don't see how this can possibly be practical taking into
>>> >>> consideration the constraints imposed by the combination of the GitHub
>>> >>> platform and the ASF release process. I'm all for being idealistic,
>>> >>> but right now we need to be practical. Unless we can devise a
>>> >>> practical procedure that can accommodate at least 1 patch per day
>>> >>> which may touch both code and build system simultaneously without
>>> >>> being a hindrance to contributor or maintainer, I don't see how we can
>>> >>> move forward.
>>> >>>
>>> >>> > That being said, I think it makes sense to merge the codebases in the
>>> >>> short term with the express purpose of separating them in the near
>>> term.
>>> >>>
>>> >>> I would agree but only if separation can be demonstrated to be
>>> >>> practical and result in net improvements in productivity and community
>>> >>> growth. I think experience has clearly demonstrated that the current
>>> >>> separation is impractical, and is causing problems.
>>> >>>
>>> >>> Per Julian's and Ted's comments, I think we need to consider
>>> >>> development process and ASF releases separately. My argument is as
>>> >>> follows:
>>> >>>
>>> >>> * Monorepo for development (for practicality)
>>> >>> * Releases structured according to the desires of the PMCs
>>> >>>
>>> >>> - Wes
>>> >>>
>>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com
>>> >
>>> >>> wrote:
>>> >>> > I recently worked on an issue that had to be implemented in
>>> parquet-cpp
>>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>>> >>> > ARROW-2586). I found the circular dependencies confusing and hard to
>>> work
>>> >>> > with. For example, I still have a PR open in parquet-cpp (created on
>>> May
>>> >>> > 10) because of a PR that it depended on in arrow that was recently
>>> >>> merged.
>>> >>> > I couldn't even address any CI issues in the PR because the change in
>>> >>> arrow
>>> >>> > was not yet in master. In a separate PR, I changed the
>>> >>> run_clang_format.py
>>> >>> > script in the arrow project only to find out later that there was an
>>> >>> exact
>>> >>> > copy of it in parquet-cpp.
>>> >>> >
>>> >>> > However, I don't think merging the codebases makes sense in the long
>>> >>> term.
>>> >>> > I can imagine use cases for parquet that don't involve arrow and
>>> tying
>>> >>> them
>>> >>> > together seems like the wrong choice. There will be other formats
>>> that
>>> > arrow needs to support that will be kept separate (e.g., ORC), so I
>>> >>> don't
>>> >>> > see why parquet should be special. I also think build tooling should
>>> be
>>> >>> > pulled into its own codebase. GNU has had a long history of
>>> developing
>>> >>> open
>>> >>> > source C/C++ projects that way and made projects like
>>> >>> > autoconf/automake/make to support them. I don't think CI is a good
>>> >>> > counter-example since there have been lots of successful open source
>>> >>> > projects that have used nightly build systems that pinned versions of
>>> >>> > dependent software.
>>> >>> >
>>> >>> > That being said, I think it makes sense to merge the codebases in the
>>> >>> short
>>> >>> > term with the express purpose of separating them in the near term.
>>> My
>>> >>> > reasoning is as follows. By putting the codebases together, you can
>>> more
>>> >>> > easily delineate the boundaries between the API's with a single PR.
>>> >>> Second,
>>> >>> > it will force the build tooling to converge instead of diverge,
>>> which has
>>> >>> > already happened. Once the boundaries and tooling have been sorted
>>> out,
>>> >>> it
>>> >>> > should be easy to separate them back into their own codebases.
>>> >>> >
>>> >>> > If the codebases are merged, I would ask that the C++ codebases for
>>> arrow
>>> >>> > be separated from other languages. Looking at it from the
>>> perspective of
>>> >>> a
>>> >>> > parquet-cpp library user, having a dependency on Java is a large tax
>>> to
>>> >>> pay
>>> >>> > if you don't need it. For example, there were 25 JIRA's in the 0.10.0
>>> >>> > release of arrow, many of which were holding up the release. I hope
>>> that
>>> >>> > seems like a reasonable compromise, and I think it will help reduce
>>> the
>>> >>> > complexity of the build/release tooling.
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>>> >>> wrote:
>>> >>> >
>>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>>> >>> wrote:
>>> >>> >>
>>> >>> >> >
>>> >>> >> > > The community will be less willing to accept large
>>> >>> >> > > changes that require multiple rounds of patches for stability
>>> and
>>> >>> API
>>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>>> community
>>> >>> took
>>> >>> >> a
>>> >>> >> > > significantly long time for the very same reason.
>>> >>> >> >
>>> >>> >> > Please don't use bad experiences from another open source
>>> community as
>>> >>> >> > leverage in this discussion. I'm sorry that things didn't go the
>>> way
>>> >>> >> > you wanted in Apache Hadoop but this is a distinct community which
>>> >>> >> > happens to operate under a similar open governance model.
>>> >>> >>
>>> >>> >>
>>> >>> >> There are some more radical and community building options as well.
>>> Take
>>> >>> >> the subversion project as a precedent. With subversion, any Apache
>>> >>> >> committer can request and receive a commit bit on some large
>>> fraction of
>>> >>> >> subversion.
>>> >>> >>
>>> >>> >> So why not take this a bit further and give every parquet committer
>>> a
>>> >>> >> commit bit in Arrow? Or even make them be first class committers in
>>> >>> Arrow?
>>> >>> >> Possibly even make it policy that every Parquet committer who asks
>>> will
>>> >>> be
>>> >>> >> given committer status in Arrow.
>>> >>> >>
>>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
>>> >>> can't be
>>> >>> >> worried at that point whether their patches will get merged; they
>>> can
>>> >>> just
>>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>>> Parquet
>>> >>> >> committers. After all, Arrow already depends a lot on parquet so
>>> why not
>>> >>> >> invite them in?
>>> >>> >>
>>> >>>
>>>
>>
>>
>> --
>> regards,
>> Deepak Majeti
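
[Editor's note] As an aside on the CI point in Wes's message above: the
valgrind-to-ASAN swap he describes could look roughly like the following.
This is a sketch only; the CMake variables and job layout are illustrative
assumptions, not the Arrow build system's actual options at the time.

```shell
# Sketch of a per-commit CI step that builds with AddressSanitizer
# instead of running test binaries under valgrind. ASAN instruments the
# code at compile time, so tests run close to native speed; valgrind
# interprets every instruction and is typically an order of magnitude slower.
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
      -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address" \
      ..
make -j4
ctest --output-on-failure

# The exhaustive valgrind pass then moves to a nightly job, e.g. using
# CTest's built-in memcheck integration:
#   ctest -T MemCheck
```

The trade-off is that ASAN and valgrind catch overlapping but not identical
error classes, which is why the sketch keeps valgrind as a nightly job
rather than dropping it entirely.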

>> >>> separation is impractical, and is causing problems.
>> >>>
>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >>> development process and ASF releases separately. My argument is as
>> >>> follows:
>> >>>
>> >>> * Monorepo for development (for practicality)
>> >>> * Releases structured according to the desires of the PMCs
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com
>> >
>> >>> wrote:
>> >>> > I recently worked on an issue that had to be implemented in
>> parquet-cpp
>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>> >>> > ARROW-2586). I found the circular dependencies confusing and hard to
>> work
>> >>> > with. For example, I still have a PR open in parquet-cpp (created on
>> May
>> >>> > 10) because of a PR that it depended on in arrow that was recently
>> >>> merged.
>> >>> > I couldn't even address any CI issues in the PR because the change in
>> >>> arrow
>> >>> > was not yet in master. In a separate PR, I changed the
>> >>> run_clang_format.py
>> >>> > script in the arrow project only to find out later that there was an
>> >>> exact
>> >>> > copy of it in parquet-cpp.
>> >>> >
>> >>> > However, I don't think merging the codebases makes sense in the long
>> >>> term.
>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> tying
>> >>> them
>> >>> > together seems like the wrong choice. There will be other formats
>> that
>> >>> > arrow needs to support that will be kept separate (e.g. - Orc), so I
>> >>> don't
>> >>> > see why parquet should be special. I also think build tooling should
>> be
>> >>> > pulled into its own codebase. GNU has had a long history of
>> developing
>> >>> open
>> >>> > source C/C++ projects that way and made projects like
>> >>> > autoconf/automake/make to support them. I don't think CI is a good
>> >>> > counter-example since there have been lots of successful open source
>> >>> > projects that have used nightly build systems that pinned versions of
>> >>> > dependent software.
>> >>> >
>> >>> > That being said, I think it makes sense to merge the codebases in the
>> >>> short
>> >>> > term with the express purpose of separating them in the near  term.
>> My
>> >>> > reasoning is as follows. By putting the codebases together, you can
>> more
>> >>> > easily delineate the boundaries between the API's with a single PR.
>> >>> Second,
>> >>> > it will force the build tooling to converge instead of diverge,
>> which has
>> >>> > already happened. Once the boundaries and tooling have been sorted
>> out,
>> >>> it
>> >>> > should be easy to separate them back into their own codebases.
>> >>> >
>> >>> > If the codebases are merged, I would ask that the C++ codebases for
>> arrow
>> >>> > be separated from other languages. Looking at it from the
>> perspective of
>> >>> a
>> >>> > parquet-cpp library user, having a dependency on Java is a large tax
>> to
>> >>> pay
>> >>> > if you don't need it. For example, there were 25 JIRA's in the 0.10.0
>> >>> > release of arrow, many of which were holding up the release. I hope
>> that
>> >>> > seems like a reasonable compromise, and I think it will help reduce
>> the
>> >>> > complexity of the build/release tooling.
>> >>> >
>> >>> >
>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>> >>> wrote:
>> >>> >>
>> >>> >> >
>> >>> >> > > The community will be less willing to accept large
>> >>> >> > > changes that require multiple rounds of patches for stability
>> and
>> >>> API
>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>> community
>> >>> took
>> >>> >> a
>> >>> >> > > significantly long time for the very same reason.
>> >>> >> >
>> >>> >> > Please don't use bad experiences from another open source
>> community as
>> >>> >> > leverage in this discussion. I'm sorry that things didn't go the
>> way
>> >>> >> > you wanted in Apache Hadoop but this is a distinct community which
>> >>> >> > happens to operate under a similar open governance model.
>> >>> >>
>> >>> >>
>> >>> >> There are some more radical and community building options as well.
>> Take
>> >>> >> the subversion project as a precedent. With subversion, any Apache
>> >>> >> committer can request and receive a commit bit on some large
>> fraction of
>> >>> >> subversion.
>> >>> >>
>> >>> >> So why not take this a bit further and give every parquet committer
>> a
>> >>> >> commit bit in Arrow? Or even make them be first class committers in
>> >>> Arrow?
>> >>> >> Possibly even make it policy that every Parquet committer who asks
>> will
>> >>> be
>> >>> >> given committer status in Arrow.
>> >>> >>
>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
>> >>> can't be
>> >>> >> worried at that point whether their patches will get merged; they
>> can
>> >>> just
>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>> Parquet
>> >>> >> committers. After all, Arrow already depends a lot on parquet so
>> why not
>> >>> >> invite them in?
>> >>> >>
>> >>>
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
Wes,

Unfortunately, I cannot show you any practical fact-based problems of a
non-existent Arrow-Parquet mono-repo.
Bringing in related Apache community experiences is more meaningful than
comparing with how mono-repos work at Google and other big organizations.
We solely depend on volunteers and cannot hire full-time developers.
You are very well aware of how difficult it has been to find more
contributors and maintainers for Arrow. parquet-cpp already has a low
contribution rate to its core components.

We should ensure that new volunteers who want to contribute
bug fixes/features spend the least amount of time figuring out
the project repo. We can never come up with an automated build system that
caters to every possible environment.
My only concern is whether the mono-repo will make it harder for new developers
to work on the parquet-cpp core just due to the additional code, build, and test
dependencies.
I am not saying that the Arrow community/committers will be less
co-operative.
I just don't think the mono-repo model will be sustainable in an
open source community unless there are long-term vested interests. We can't
predict that.

The current circular dependency problems between Arrow and Parquet are a
major problem for the community, and solving them is important.

The current Arrow adapter code for Parquet should live in the arrow repo.
That would remove a majority of the dependency issues.
Joshua's work would not have been blocked in parquet-cpp if that adapter
had been in the arrow repo. This would be similar to the ORC adapter.

The platform API code is pretty stable at this point. Minor future changes
to this code should not be the main reason to combine the arrow and
parquet repos.

"I question whether it's worth the community's time long term to wear
ourselves out defining custom "ports" / virtual interfaces in each
library to plug components together rather than utilizing common
platform APIs."

My answer to your question above would be "Yes". Modularity/separation is
very important in an open source community where priorities of contributors
are often short term.
Contributor retention is low, and therefore the acquisition costs should be
low as well. To me, this is the community-over-code approach. Minor code
duplication is not a deal breaker.
ORC, Parquet, Arrow, etc. are all different components in the big data
space serving their own functions.

If you still strongly feel that the only way forward is to clone the
parquet-cpp repo and part ways, I will withdraw my concern. Having two
parquet-cpp repos is in no way a better approach.




On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <we...@gmail.com> wrote:

> @Antoine
>
> > By the way, one concern with the monorepo approach: it would slightly
> increase Arrow CI times (which are already too large).
>
> A typical CI run in Arrow is taking about 45 minutes:
> https://travis-ci.org/apache/arrow/builds/410119750
>
> A Parquet run takes about 28 minutes:
> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>
> Inevitably we will need to create some kind of bot to run certain
> builds on-demand based on commit / PR metadata or on request.
>
> The slowest build in Arrow (the Arrow C++/Python one) could be
> made substantially shorter by moving some of the slower parts (like
> the Python ASV benchmarks) from being tested every-commit to nightly
> or on demand. Using ASAN instead of valgrind in Travis would also
> improve build times (valgrind build could be moved to a nightly
> exhaustive test run)
>
> - Wes
>
> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >> I would like to point out that arrow's use of orc is a great example of
> how it would be possible to manage parquet-cpp as a separate codebase. That
> gives me hope that the projects could be managed separately some day.
> >
> > Well, I don't know that ORC is the best example. The ORC C++ codebase
> > features several areas of duplicated logic which could be replaced by
> > components from the Arrow platform for better platform-wide
> > interoperability:
> >
> >
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >
> > ORC's use of symbols from Protocol Buffers was actually a cause of
> > bugs that we had to fix in Arrow's build system to prevent them from
> > leaking to third party linkers when statically linked (ORC is only
> > available for static linking at the moment AFAIK).
> >
> > I question whether it's worth the community's time long term to wear
> > ourselves out defining custom "ports" / virtual interfaces in each
> > library to plug components together rather than utilizing common
> > platform APIs.
> >
> > - Wes
> >
> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com>
> wrote:
> >> Your point about the constraints of the ASF release process is well
> >> taken and as a developer who's trying to work in the current
> environment I
> >> would be much happier if the codebases were merged. The main issues I
> worry
> >> about when you put codebases like these together are:
> >>
> >> 1. The delineation of API's become blurred and the code becomes too
> coupled
> >> 2. Release of artifacts that are lower in the dependency tree are
> delayed
> >> by artifacts higher in the dependency tree
> >>
> >> If the project/release management is structured well and someone keeps
> an
> >> eye on the coupling, then I don't have any concerns.
> >>
> >> I would like to point out that arrow's use of orc is a great example of
> how
> >> it would be possible to manage parquet-cpp as a separate codebase. That
> >> gives me hope that the projects could be managed separately some day.
> >>
> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >>> hi Josh,
> >>>
> >>> > I can imagine use cases for parquet that don't involve arrow and
> tying
> >>> them together seems like the wrong choice.
> >>>
> >>> Apache is "Community over Code"; right now it's the same people
> >>> building these projects -- my argument (which I think you agree with?)
> >>> is that we should work more closely together until the community grows
> >>> large enough to support larger-scope process than we have now. As
> >>> you've seen, our process isn't serving developers of these projects.
> >>>
> >>> > I also think build tooling should be pulled into its own codebase.
> >>>
> >>> I don't see how this can possibly be practical taking into
> >>> consideration the constraints imposed by the combination of the GitHub
> >>> platform and the ASF release process. I'm all for being idealistic,
> >>> but right now we need to be practical. Unless we can devise a
> >>> practical procedure that can accommodate at least 1 patch per day
> >>> which may touch both code and build system simultaneously without
> >>> being a hindrance to contributor or maintainer, I don't see how we can
> >>> move forward.
> >>>
> >>> > That being said, I think it makes sense to merge the codebases in the
> >>> short term with the express purpose of separating them in the near
> term.
> >>>
> >>> I would agree but only if separation can be demonstrated to be
> >>> practical and result in net improvements in productivity and community
> >>> growth. I think experience has clearly demonstrated that the current
> >>> separation is impractical, and is causing problems.
> >>>
> >>> Per Julian's and Ted's comments, I think we need to consider
> >>> development process and ASF releases separately. My argument is as
> >>> follows:
> >>>
> >>> * Monorepo for development (for practicality)
> >>> * Releases structured according to the desires of the PMCs
> >>>
> >>> - Wes
> >>>
> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuastorck@gmail.com
> >
> >>> wrote:
> >>> > I recently worked on an issue that had to be implemented in
> parquet-cpp
> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
> >>> > ARROW-2586). I found the circular dependencies confusing and hard to
> work
> >>> > with. For example, I still have a PR open in parquet-cpp (created on
> May
> >>> > 10) because of a PR that it depended on in arrow that was recently
> >>> merged.
> >>> > I couldn't even address any CI issues in the PR because the change in
> >>> arrow
> >>> > was not yet in master. In a separate PR, I changed the
> >>> run_clang_format.py
> >>> > script in the arrow project only to find out later that there was an
> >>> exact
> >>> > copy of it in parquet-cpp.
> >>> >
> >>> > However, I don't think merging the codebases makes sense in the long
> >>> term.
> >>> > I can imagine use cases for parquet that don't involve arrow and
> tying
> >>> them
> >>> > together seems like the wrong choice. There will be other formats
> that
> >>> > arrow needs to support that will be kept separate (e.g. - Orc), so I
> >>> don't
> >>> > see why parquet should be special. I also think build tooling should
> be
> >>> > pulled into its own codebase. GNU has had a long history of
> developing
> >>> open
> >>> > source C/C++ projects that way and made projects like
> >>> > autoconf/automake/make to support them. I don't think CI is a good
> >>> > counter-example since there have been lots of successful open source
> >>> > projects that have used nightly build systems that pinned versions of
> >>> > dependent software.
> >>> >
> >>> > That being said, I think it makes sense to merge the codebases in the
> >>> short
> >>> > term with the express purpose of separating them in the near  term.
> My
> >>> > reasoning is as follows. By putting the codebases together, you can
> more
> >>> > easily delineate the boundaries between the API's with a single PR.
> >>> Second,
> >>> > it will force the build tooling to converge instead of diverge,
> which has
> >>> > already happened. Once the boundaries and tooling have been sorted
> out,
> >>> it
> >>> > should be easy to separate them back into their own codebases.
> >>> >
> >>> > If the codebases are merged, I would ask that the C++ codebases for
> arrow
> >>> > be separated from other languages. Looking at it from the
> perspective of
> >>> a
> >>> > parquet-cpp library user, having a dependency on Java is a large tax
> to
> >>> pay
> >>> > if you don't need it. For example, there were 25 JIRA's in the 0.10.0
> >>> > release of arrow, many of which were holding up the release. I hope
> that
> >>> > seems like a reasonable compromise, and I think it will help reduce
> the
> >>> > complexity of the build/release tooling.
> >>> >
> >>> >
> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >> >
> >>> >> > > The community will be less willing to accept large
> >>> >> > > changes that require multiple rounds of patches for stability
> and
> >>> API
> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
> community
> >>> took
> >>> >> a
> >>> >> > > significantly long time for the very same reason.
> >>> >> >
> >>> >> > Please don't use bad experiences from another open source
> community as
> >>> >> > leverage in this discussion. I'm sorry that things didn't go the
> way
> >>> >> > you wanted in Apache Hadoop but this is a distinct community which
> >>> >> > happens to operate under a similar open governance model.
> >>> >>
> >>> >>
> >>> >> There are some more radical and community building options as well.
> Take
> >>> >> the subversion project as a precedent. With subversion, any Apache
> >>> >> committer can request and receive a commit bit on some large
> fraction of
> >>> >> subversion.
> >>> >>
> >>> >> So why not take this a bit further and give every parquet committer
> a
> >>> >> commit bit in Arrow? Or even make them be first class committers in
> >>> Arrow?
> >>> >> Possibly even make it policy that every Parquet committer who asks
> will
> >>> be
> >>> >> given committer status in Arrow.
> >>> >>
> >>> >> That relieves a lot of the social anxiety here. Parquet committers
> >>> can't be
> >>> >> worried at that point whether their patches will get merged; they
> can
> >>> just
> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
> Parquet
> >>> >> committers. After all, Arrow already depends a lot on parquet so
> why not
> >>> >> invite them in?
> >>> >>
> >>>
>


-- 
regards,
Deepak Majeti


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
@Antoine

> By the way, one concern with the monorepo approach: it would slightly increase Arrow CI times (which are already too large).

A typical CI run in Arrow is taking about 45 minutes:
https://travis-ci.org/apache/arrow/builds/410119750

A typical Parquet CI run takes about 28 minutes:
https://travis-ci.org/apache/parquet-cpp/builds/410147208

Inevitably we will need to create some kind of bot to run certain
builds on-demand based on commit / PR metadata or on request.

The slowest build in Arrow (the Arrow C++/Python one) could be made
substantially shorter by moving some of the slower parts (like the
Python ASV benchmarks) from being tested on every commit to nightly or
on-demand runs. Using ASAN instead of valgrind in Travis would also
improve build times (the valgrind build could be moved to a nightly
exhaustive test run).
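The every-commit vs. nightly/on-demand split could be prototyped with a
small guard in the CI scripts. A minimal sketch only, assuming a
Travis-like environment (`TRAVIS_EVENT_TYPE` is Travis's real trigger
variable); the `[benchmark]` commit-message marker and the
`should_run_benchmarks` helper are hypothetical, not an existing Arrow
convention:

```shell
# Run the expensive benchmark stage only when the latest commit message
# opts in with "[benchmark]", or when the build is a nightly cron run.
set -e

# Last commit message; empty if we are not inside a git checkout.
COMMIT_MSG=$(git log -1 --pretty=%B 2>/dev/null || echo "")

should_run_benchmarks() {
  if [ "$TRAVIS_EVENT_TYPE" = "cron" ]; then
    return 0                      # nightly build: always benchmark
  fi
  case "$COMMIT_MSG" in
    *"[benchmark]"*) return 0 ;;  # contributor explicitly asked
    *) return 1 ;;
  esac
}

if should_run_benchmarks; then
  echo "running ASV benchmarks"
  # asv run --quick ...
else
  echo "skipping benchmarks (add [benchmark] to the commit message to enable)"
fi
```

The same pattern generalizes to any slow stage (valgrind, exhaustive
integration tests): the default path stays fast, and the full matrix
runs nightly or on request.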

- Wes

On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <we...@gmail.com> wrote:
>> I would like to point out that arrow's use of orc is a great example of how it would be possible to manage parquet-cpp as a separate codebase. That gives me hope that the projects could be managed separately some day.
>
> Well, I don't know that ORC is the best example. The ORC C++ codebase
> features several areas of duplicated logic which could be replaced by
> components from the Arrow platform for better platform-wide
> interoperability:
>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>
> ORC's use of symbols from Protocol Buffers was actually a cause of
> bugs that we had to fix in Arrow's build system to prevent them from
> leaking to third party linkers when statically linked (ORC is only
> available for static linking at the moment AFAIK).
>
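A common remedy for the static-linking symbol leak described above is to
combine hidden default visibility with the GNU linker's
`--exclude-libs`. A hedged sketch only: the target name, source path,
and `protobuf::libprotobuf` usage are illustrative placeholders, not
Arrow's actual build configuration (requires CMake >= 3.13 for
`target_link_options`):

```cmake
# Sketch: keep statically linked third-party symbols (e.g. protobuf's)
# out of a shared library's exported ABI.
add_library(example_shared SHARED src/lib.cc)

# Compile with hidden visibility by default ...
set_target_properties(example_shared PROPERTIES
  CXX_VISIBILITY_PRESET hidden
  VISIBILITY_INLINES_HIDDEN ON)

# ... and, with GNU ld, exclude symbols of statically linked archives
# from the dynamic symbol table entirely.
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
  target_link_options(example_shared PRIVATE
    "LINKER:--exclude-libs,ALL")
endif()

target_link_libraries(example_shared PRIVATE protobuf::libprotobuf)
```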
> I question whether it's worth the community's time long term to wear
> ourselves out defining custom "ports" / virtual interfaces in each
> library to plug components together rather than utilizing common
> platform APIs.
>
> - Wes
>
> On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com> wrote:
>> Your point about the constraints of the ASF release process is well
>> taken, and as a developer who's trying to work in the current environment
>> I would be much happier if the codebases were merged. The main issues I
>> worry about when you put codebases like these together are:
>>
>> 1. The delineation of APIs becomes blurred and the code becomes too coupled
>> 2. Releases of artifacts that are lower in the dependency tree are delayed
>> by artifacts higher in the dependency tree
>>
>> If the project/release management is structured well and someone keeps an
>> eye on the coupling, then I don't have any concerns.
>>
>> I would like to point out that arrow's use of orc is a great example of how
>> it would be possible to manage parquet-cpp as a separate codebase. That
>> gives me hope that the projects could be managed separately some day.
>>
>> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com> wrote:
>>
>>> hi Josh,
>>>
>>> > I can imagine use cases for parquet that don't involve arrow and
>>> > tying them together seems like the wrong choice.
>>>
>>> Apache is "Community over Code"; right now it's the same people
>>> building these projects -- my argument (which I think you agree with?)
>>> is that we should work more closely together until the community grows
>>> large enough to support larger-scope process than we have now. As
>>> you've seen, our process isn't serving developers of these projects.
>>>
>>> > I also think build tooling should be pulled into its own codebase.
>>>
>>> I don't see how this can possibly be practical taking into
>>> consideration the constraints imposed by the combination of the GitHub
>>> platform and the ASF release process. I'm all for being idealistic,
>>> but right now we need to be practical. Unless we can devise a
>>> practical procedure that can accommodate at least 1 patch per day
>>> which may touch both code and build system simultaneously without
>>> being a hindrance to contributor or maintainer, I don't see how we can
>>> move forward.
>>>
>>> > That being said, I think it makes sense to merge the codebases in the
>>> > short term with the express purpose of separating them in the near term.
>>>
>>> I would agree but only if separation can be demonstrated to be
>>> practical and result in net improvements in productivity and community
>>> growth. I think experience has clearly demonstrated that the current
>>> separation is impractical, and is causing problems.
>>>
>>> Per Julian's and Ted's comments, I think we need to consider
>>> development process and ASF releases separately. My argument is as
>>> follows:
>>>
>>> * Monorepo for development (for practicality)
>>> * Releases structured according to the desires of the PMCs
>>>
>>> - Wes
>>>
>>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <jo...@gmail.com> wrote:
>>> > I recently worked on an issue that had to be implemented in parquet-cpp
>>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>>> > ARROW-2586). I found the circular dependencies confusing and hard to
>>> > work with. For example, I still have a PR open in parquet-cpp (created
>>> > on May 10) because of a PR that it depended on in arrow that was
>>> > recently merged. I couldn't even address any CI issues in the PR
>>> > because the change in arrow was not yet in master. In a separate PR, I
>>> > changed the run_clang_format.py script in the arrow project only to
>>> > find out later that there was an exact copy of it in parquet-cpp.
>>> >
>>> > However, I don't think merging the codebases makes sense in the long
>>> > term. I can imagine use cases for parquet that don't involve arrow and
>>> > tying them together seems like the wrong choice. There will be other
>>> > formats that arrow needs to support that will be kept separate (e.g. -
>>> > Orc), so I don't see why parquet should be special. I also think build
>>> > tooling should be pulled into its own codebase. GNU has had a long
>>> > history of developing open source C/C++ projects that way and made
>>> > projects like autoconf/automake/make to support them. I don't think CI
>>> > is a good counter-example since there have been lots of successful
>>> > open source projects that have used nightly build systems that pinned
>>> > versions of dependent software.
>>> >
>>> > That being said, I think it makes sense to merge the codebases in the
>>> > short term with the express purpose of separating them in the near
>>> > term. My reasoning is as follows. By putting the codebases together,
>>> > you can more easily delineate the boundaries between the API's with a
>>> > single PR. Second, it will force the build tooling to converge instead
>>> > of diverge, which has already happened. Once the boundaries and
>>> > tooling have been sorted out, it should be easy to separate them back
>>> > into their own codebases.
>>> >
>>> > If the codebases are merged, I would ask that the C++ codebases for
>>> > arrow be separated from other languages. Looking at it from the
>>> > perspective of a parquet-cpp library user, having a dependency on Java
>>> > is a large tax to pay if you don't need it. For example, there were 25
>>> > JIRA's in the 0.10.0 release of arrow, many of which were holding up
>>> > the release. I hope that seems like a reasonable compromise, and I
>>> > think it will help reduce the complexity of the build/release tooling.
>>> >
>>> >
>>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com> wrote:
>>> >
>>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:
>>> >>
>>> >> >
>>> >> > > The community will be less willing to accept large
>>> >> > > changes that require multiple rounds of patches for stability and
>>> >> > > API convergence. Our contributions to Libhdfs++ in the HDFS
>>> >> > > community took a significantly long time for the very same reason.
>>> >> >
>>> >> > Please don't use bad experiences from another open source community
>>> >> > as leverage in this discussion. I'm sorry that things didn't go the
>>> >> > way you wanted in Apache Hadoop but this is a distinct community
>>> >> > which happens to operate under a similar open governance model.
>>> >>
>>> >>
>>> >> There are some more radical and community building options as well.
>>> >> Take the subversion project as a precedent. With subversion, any
>>> >> Apache committer can request and receive a commit bit on some large
>>> >> fraction of subversion.
>>> >>
>>> >> So why not take this a bit further and give every parquet committer a
>>> >> commit bit in Arrow? Or even make them be first class committers in
>>> >> Arrow? Possibly even make it policy that every Parquet committer who
>>> >> asks will be given committer status in Arrow.
>>> >>
>>> >> That relieves a lot of the social anxiety here. Parquet committers
>>> >> can't be worried at that point whether their patches will get merged;
>>> >> they can just merge them. Arrow shouldn't worry much about inviting in
>>> >> the Parquet committers. After all, Arrow already depends a lot on
>>> >> parquet so why not invite them in?
>>> >>
>>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
> I would like to point out that arrow's use of orc is a great example of how it would be possible to manage parquet-cpp as a separate codebase. That gives me hope that the projects could be managed separately some day.

Well, I don't know that ORC is the best example. The ORC C++ codebase
features several areas of duplicated logic which could be replaced by
components from the Arrow platform for better platform-wide
interoperability:

https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh

ORC's use of symbols from Protocol Buffers was actually a cause of
bugs that we had to fix in Arrow's build system to prevent them from
leaking to third party linkers when statically linked (ORC is only
available for static linking at the moment AFAIK).

I question whether it's worth the community's time long term to wear
ourselves out defining custom "ports" / virtual interfaces in each
library to plug components together rather than utilizing common
platform APIs.

- Wes

On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com> wrote:
> You're point about the constraints of the ASF release process are well
> taken and as a developer who's trying to work in the current environment I
> would be much happier if the codebases were merged. The main issues I worry
> about when you put codebases like these together are:
>
> 1. The delineation of API's become blurred and the code becomes too coupled
> 2. Release of artifacts that are lower in the dependency tree are delayed
> by artifacts higher in the dependency tree
>
> If the project/release management is structured well and someone keeps an
> eye on the coupling, then I don't have any concerns.
>
> I would like to point out that arrow's use of orc is a great example of how
> it would be possible to manage parquet-cpp as a separate codebase. That
> gives me hope that the projects could be managed separately some day.
>
> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com> wrote:
>
>> hi Josh,
>>
>> > I can imagine use cases for parquet that don't involve arrow and tying
>> them together seems like the wrong choice.
>>
>> Apache is "Community over Code"; right now it's the same people
>> building these projects -- my argument (which I think you agree with?)
>> is that we should work more closely together until the community grows
>> large enough to support larger-scope process than we have now. As
>> you've seen, our process isn't serving developers of these projects.
>>
>> > I also think build tooling should be pulled into its own codebase.
>>
>> I don't see how this can possibly be practical taking into
>> consideration the constraints imposed by the combination of the GitHub
>> platform and the ASF release process. I'm all for being idealistic,
>> but right now we need to be practical. Unless we can devise a
>> practical procedure that can accommodate at least 1 patch per day
>> which may touch both code and build system simultaneously without
>> being a hindrance to contributor or maintainer, I don't see how we can
>> move forward.
>>
>> > That being said, I think it makes sense to merge the codebases in the
>> short term with the express purpose of separating them in the near  term.
>>
>> I would agree but only if separation can be demonstrated to be
>> practical and result in net improvements in productivity and community
>> growth. I think experience has clearly demonstrated that the current
>> separation is impractical, and is causing problems.
>>
>> Per Julian's and Ted's comments, I think we need to consider
>> development process and ASF releases separately. My argument is as
>> follows:
>>
>> * Monorepo for development (for practicality)
>> * Releases structured according to the desires of the PMCs
>>
>> - Wes
>>
>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <jo...@gmail.com>
>> wrote:
>> > I recently worked on an issue that had to be implemented in parquet-cpp
>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>> > ARROW-2586). I found the circular dependencies confusing and hard to work
>> > with. For example, I still have a PR open in parquet-cpp (created on May
>> > 10) because of a PR that it depended on in arrow that was recently
>> merged.
>> > I couldn't even address any CI issues in the PR because the change in
>> arrow
>> > was not yet in master. In a separate PR, I changed the
>> run_clang_format.py
>> > script in the arrow project only to find out later that there was an
>> exact
>> > copy of it in parquet-cpp.
>> >
>> > However, I don't think merging the codebases makes sense in the long
>> term.
>> > I can imagine use cases for parquet that don't involve arrow and tying
>> them
>> > together seems like the wrong choice. There will be other formats that
>> > arrow needs to support that will be kept separate (e.g. - Orc), so I
>> don't
>> > see why parquet should be special. I also think build tooling should be
>> > pulled into its own codebase. GNU has had a long history of developing
>> open
>> > source C/C++ projects that way and made projects like
>> > autoconf/automake/make to support them. I don't think CI is a good
>> > counter-example since there have been lots of successful open source
>> > projects that have used nightly build systems that pinned versions of
>> > dependent software.
>> >
>> > That being said, I think it makes sense to merge the codebases in the
>> short
>> > term with the express purpose of separating them in the near  term. My
>> > reasoning is as follows. By putting the codebases together, you can more
>> > easily delineate the boundaries between the API's with a single PR.
>> Second,
>> > it will force the build tooling to converge instead of diverge, which has
>> > already happened. Once the boundaries and tooling have been sorted out,
>> it
>> > should be easy to separate them back into their own codebases.
>> >
>> > If the codebases are merged, I would ask that the C++ codebases for arrow
>> > be separated from other languages. Looking at it from the perspective of
>> a
>> > parquet-cpp library user, having a dependency on Java is a large tax to
>> pay
>> > if you don't need it. For example, there were 25 JIRA's in the 0.10.0
>> > release of arrow, many of which were holding up the release. I hope that
>> > seems like a reasonable compromise, and I think it will help reduce the
>> > complexity of the build/release tooling.
>> >
>> >
>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>> wrote:
>> >
>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >> >
>> >> > > The community will be less willing to accept large
>> >> > > changes that require multiple rounds of patches for stability and
>> API
>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS community
>> took
>> >> a
>> >> > > significantly long time for the very same reason.
>> >> >
>> >> > Please don't use bad experiences from another open source community as
>> >> > leverage in this discussion. I'm sorry that things didn't go the way
>> >> > you wanted in Apache Hadoop but this is a distinct community which
>> >> > happens to operate under a similar open governance model.
>> >>
>> >>
>> >> There are some more radical and community building options as well. Take
>> >> the subversion project as a precedent. With subversion, any Apache
>> >> committer can request and receive a commit bit on some large fraction of
>> >> subversion.
>> >>
>> >> So why not take this a bit further and give every parquet committer a
>> >> commit bit in Arrow? Or even make them be first class committers in
>> Arrow?
>> >> Possibly even make it policy that every Parquet committer who asks will
>> be
>> >> given committer status in Arrow.
>> >>
>> >> That relieves a lot of the social anxiety here. Parquet committers
>> can't be
>> >> worried at that point whether their patches will get merged; they can
>> just
>> >> merge them.  Arrow shouldn't worry much about inviting in the Parquet
>> >> committers. After all, Arrow already depends a lot on parquet so why not
>> >> invite them in?
>> >>
>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
> I would like to point out that arrow's use of orc is a great example of how it would be possible to manage parquet-cpp as a separate codebase. That gives me hope that the projects could be managed separately some day.

Well, I don't know that ORC is the best example. The ORC C++ codebase
features several areas of duplicated logic which could be replaced by
components from the Arrow platform for better platform-wide
interoperability:

https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh

ORC's use of symbols from Protocol Buffers was actually a cause of
bugs that we had to fix in Arrow's build system to prevent them from
leaking to third party linkers when statically linked (ORC is only
available for static linking at the moment AFAIK).

I question whether it's worth the community's time long term to wear
ourselves out defining custom "ports" / virtual interfaces in each
library to plug components together rather than utilizing common
platform APIs.

- Wes

On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <jo...@gmail.com> wrote:
> You're point about the constraints of the ASF release process are well
> taken and as a developer who's trying to work in the current environment I
> would be much happier if the codebases were merged. The main issues I worry
> about when you put codebases like these together are:
>
> 1. The delineation of API's become blurred and the code becomes too coupled
> 2. Release of artifacts that are lower in the dependency tree are delayed
> by artifacts higher in the dependency tree
>
> If the project/release management is structured well and someone keeps an
> eye on the coupling, then I don't have any concerns.
>
> I would like to point out that arrow's use of orc is a great example of how
> it would be possible to manage parquet-cpp as a separate codebase. That
> gives me hope that the projects could be managed separately some day.
>
> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com> wrote:
>
>> hi Josh,
>>
>> > I can imagine use cases for parquet that don't involve arrow and tying
>> them together seems like the wrong choice.
>>
>> Apache is "Community over Code"; right now it's the same people
>> building these projects -- my argument (which I think you agree with?)
>> is that we should work more closely together until the community grows
>> large enough to support larger-scope process than we have now. As
>> you've seen, our process isn't serving developers of these projects.
>>
>> > I also think build tooling should be pulled into its own codebase.
>>
>> I don't see how this can possibly be practical taking into
>> consideration the constraints imposed by the combination of the GitHub
>> platform and the ASF release process. I'm all for being idealistic,
>> but right now we need to be practical. Unless we can devise a
>> practical procedure that can accommodate at least 1 patch per day
>> which may touch both code and build system simultaneously without
>> being a hindrance to contributor or maintainer, I don't see how we can
>> move forward.
>>
>> > That being said, I think it makes sense to merge the codebases in the
>> short term with the express purpose of separating them in the near term.
>>
>> I would agree but only if separation can be demonstrated to be
>> practical and result in net improvements in productivity and community
>> growth. I think experience has clearly demonstrated that the current
>> separation is impractical, and is causing problems.
>>
>> Per Julian's and Ted's comments, I think we need to consider
>> development process and ASF releases separately. My argument is as
>> follows:
>>
>> * Monorepo for development (for practicality)
>> * Releases structured according to the desires of the PMCs
>>
>> - Wes
>>
>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <jo...@gmail.com>
>> wrote:
>> > I recently worked on an issue that had to be implemented in parquet-cpp
>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>> > ARROW-2586). I found the circular dependencies confusing and hard to work
>> > with. For example, I still have a PR open in parquet-cpp (created on May
>> > 10) because of a PR that it depended on in arrow that was recently
>> merged.
>> > I couldn't even address any CI issues in the PR because the change in
>> arrow
>> > was not yet in master. In a separate PR, I changed the
>> run_clang_format.py
>> > script in the arrow project only to find out later that there was an
>> exact
>> > copy of it in parquet-cpp.
>> >
>> > However, I don't think merging the codebases makes sense in the long
>> term.
>> > I can imagine use cases for parquet that don't involve arrow and tying
>> them
>> > together seems like the wrong choice. There will be other formats that
>> > arrow needs to support that will be kept separate (e.g., ORC), so I
>> don't
>> > see why parquet should be special. I also think build tooling should be
>> > pulled into its own codebase. GNU has had a long history of developing
>> open
>> > source C/C++ projects that way and made projects like
>> > autoconf/automake/make to support them. I don't think CI is a good
>> > counter-example since there have been lots of successful open source
>> > projects that have used nightly build systems that pinned versions of
>> > dependent software.
>> >
>> > That being said, I think it makes sense to merge the codebases in the
>> short
>> term with the express purpose of separating them in the near term. My
>> > reasoning is as follows. By putting the codebases together, you can more
>> > easily delineate the boundaries between the APIs with a single PR.
>> Second,
>> > it will force the build tooling to converge instead of diverge, which has
>> > already happened. Once the boundaries and tooling have been sorted out,
>> it
>> > should be easy to separate them back into their own codebases.
>> >
>> > If the codebases are merged, I would ask that the C++ codebases for arrow
>> > be separated from other languages. Looking at it from the perspective of
>> a
>> > parquet-cpp library user, having a dependency on Java is a large tax to
>> pay
>> > if you don't need it. For example, there were 25 JIRAs in the 0.10.0
>> > release of arrow, many of which were holding up the release. I hope that
>> > seems like a reasonable compromise, and I think it will help reduce the
>> > complexity of the build/release tooling.
>> >
>> >
>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
>> wrote:
>> >
>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >> >
>> >> > > The community will be less willing to accept large
>> >> > > changes that require multiple rounds of patches for stability and
>> API
>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS community
>> took
>> >> a
>> >> > > significantly long time for the very same reason.
>> >> >
>> >> > Please don't use bad experiences from another open source community as
>> >> > leverage in this discussion. I'm sorry that things didn't go the way
>> >> > you wanted in Apache Hadoop but this is a distinct community which
>> >> > happens to operate under a similar open governance model.
>> >>
>> >>
>> >> There are some more radical and community building options as well. Take
>> >> the subversion project as a precedent. With subversion, any Apache
>> >> committer can request and receive a commit bit on some large fraction of
>> >> subversion.
>> >>
>> >> So why not take this a bit further and give every parquet committer a
>> >> commit bit in Arrow? Or even make them be first class committers in
>> Arrow?
>> >> Possibly even make it policy that every Parquet committer who asks will
>> be
>> >> given committer status in Arrow.
>> >>
>> >> That relieves a lot of the social anxiety here. Parquet committers
>> can't be
>> >> worried at that point whether their patches will get merged; they can
>> just
>> >> merge them.  Arrow shouldn't worry much about inviting in the Parquet
>> >> committers. After all, Arrow already depends a lot on parquet so why not
>> >> invite them in?
>> >>
>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Joshua Storck <jo...@gmail.com>.
Your point about the constraints of the ASF release process is well
taken and as a developer who's trying to work in the current environment I
would be much happier if the codebases were merged. The main issues I worry
about when you put codebases like these together are:

1. The delineation of APIs becomes blurred and the code becomes too coupled
2. Release of artifacts that are lower in the dependency tree are delayed
by artifacts higher in the dependency tree

If the project/release management is structured well and someone keeps an
eye on the coupling, then I don't have any concerns.

I would like to point out that arrow's use of orc is a great example of how
it would be possible to manage parquet-cpp as a separate codebase. That
gives me hope that the projects could be managed separately some day.

On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <we...@gmail.com> wrote:

> hi Josh,
>
> > I can imagine use cases for parquet that don't involve arrow and tying
> them together seems like the wrong choice.
>
> Apache is "Community over Code"; right now it's the same people
> building these projects -- my argument (which I think you agree with?)
> is that we should work more closely together until the community grows
> large enough to support larger-scope process than we have now. As
> you've seen, our process isn't serving developers of these projects.
>
> > I also think build tooling should be pulled into its own codebase.
>
> I don't see how this can possibly be practical taking into
> consideration the constraints imposed by the combination of the GitHub
> platform and the ASF release process. I'm all for being idealistic,
> but right now we need to be practical. Unless we can devise a
> practical procedure that can accommodate at least 1 patch per day
> which may touch both code and build system simultaneously without
> being a hindrance to contributor or maintainer, I don't see how we can
> move forward.
>
> > That being said, I think it makes sense to merge the codebases in the
> short term with the express purpose of separating them in the near term.
>
> I would agree but only if separation can be demonstrated to be
> practical and result in net improvements in productivity and community
> growth. I think experience has clearly demonstrated that the current
> separation is impractical, and is causing problems.
>
> Per Julian's and Ted's comments, I think we need to consider
> development process and ASF releases separately. My argument is as
> follows:
>
> * Monorepo for development (for practicality)
> * Releases structured according to the desires of the PMCs
>
> - Wes
>
> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <jo...@gmail.com>
> wrote:
> > I recently worked on an issue that had to be implemented in parquet-cpp
> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
> > ARROW-2586). I found the circular dependencies confusing and hard to work
> > with. For example, I still have a PR open in parquet-cpp (created on May
> > 10) because of a PR that it depended on in arrow that was recently
> merged.
> > I couldn't even address any CI issues in the PR because the change in
> arrow
> > was not yet in master. In a separate PR, I changed the
> run_clang_format.py
> > script in the arrow project only to find out later that there was an
> exact
> > copy of it in parquet-cpp.
> >
> > However, I don't think merging the codebases makes sense in the long
> term.
> > I can imagine use cases for parquet that don't involve arrow and tying
> them
> > together seems like the wrong choice. There will be other formats that
> > arrow needs to support that will be kept separate (e.g., ORC), so I
> don't
> > see why parquet should be special. I also think build tooling should be
> > pulled into its own codebase. GNU has had a long history of developing
> open
> > source C/C++ projects that way and made projects like
> > autoconf/automake/make to support them. I don't think CI is a good
> > counter-example since there have been lots of successful open source
> > projects that have used nightly build systems that pinned versions of
> > dependent software.
> >
> > That being said, I think it makes sense to merge the codebases in the
> short
> > term with the express purpose of separating them in the near term. My
> > reasoning is as follows. By putting the codebases together, you can more
> > easily delineate the boundaries between the APIs with a single PR.
> Second,
> > it will force the build tooling to converge instead of diverge, which has
> > already happened. Once the boundaries and tooling have been sorted out,
> it
> > should be easy to separate them back into their own codebases.
> >
> > If the codebases are merged, I would ask that the C++ codebases for arrow
> > be separated from other languages. Looking at it from the perspective of
> a
> > parquet-cpp library user, having a dependency on Java is a large tax to
> pay
> > if you don't need it. For example, there were 25 JIRAs in the 0.10.0
> > release of arrow, many of which were holding up the release. I hope that
> > seems like a reasonable compromise, and I think it will help reduce the
> > complexity of the build/release tooling.
> >
> >
> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> >
> >> > > The community will be less willing to accept large
> >> > > changes that require multiple rounds of patches for stability and
> API
> >> > > convergence. Our contributions to Libhdfs++ in the HDFS community
> took
> >> a
> >> > > significantly long time for the very same reason.
> >> >
> >> > Please don't use bad experiences from another open source community as
> >> > leverage in this discussion. I'm sorry that things didn't go the way
> >> > you wanted in Apache Hadoop but this is a distinct community which
> >> > happens to operate under a similar open governance model.
> >>
> >>
> >> There are some more radical and community building options as well. Take
> >> the subversion project as a precedent. With subversion, any Apache
> >> committer can request and receive a commit bit on some large fraction of
> >> subversion.
> >>
> >> So why not take this a bit further and give every parquet committer a
> >> commit bit in Arrow? Or even make them be first class committers in
> Arrow?
> >> Possibly even make it policy that every Parquet committer who asks will
> be
> >> given committer status in Arrow.
> >>
> >> That relieves a lot of the social anxiety here. Parquet committers
> can't be
> >> worried at that point whether their patches will get merged; they can
> just
> >> merge them.  Arrow shouldn't worry much about inviting in the Parquet
> >> committers. After all, Arrow already depends a lot on parquet so why not
> >> invite them in?
> >>
>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Josh,

> I can imagine use cases for parquet that don't involve arrow and tying them together seems like the wrong choice.

Apache is "Community over Code"; right now it's the same people
building these projects -- my argument (which I think you agree with?)
is that we should work more closely together until the community grows
large enough to support larger-scope process than we have now. As
you've seen, our process isn't serving developers of these projects.

> I also think build tooling should be pulled into its own codebase.

I don't see how this can possibly be practical taking into
consideration the constraints imposed by the combination of the GitHub
platform and the ASF release process. I'm all for being idealistic,
but right now we need to be practical. Unless we can devise a
practical procedure that can accommodate at least 1 patch per day
which may touch both code and build system simultaneously without
being a hindrance to contributor or maintainer, I don't see how we can
move forward.

> That being said, I think it makes sense to merge the codebases in the short term with the express purpose of separating them in the near term.

I would agree but only if separation can be demonstrated to be
practical and result in net improvements in productivity and community
growth. I think experience has clearly demonstrated that the current
separation is impractical, and is causing problems.

Per Julian's and Ted's comments, I think we need to consider
development process and ASF releases separately. My argument is as
follows:

* Monorepo for development (for practicality)
* Releases structured according to the desires of the PMCs

- Wes

On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <jo...@gmail.com> wrote:
> I recently worked on an issue that had to be implemented in parquet-cpp
> (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
> ARROW-2586). I found the circular dependencies confusing and hard to work
> with. For example, I still have a PR open in parquet-cpp (created on May
> 10) because of a PR that it depended on in arrow that was recently merged.
> I couldn't even address any CI issues in the PR because the change in arrow
> was not yet in master. In a separate PR, I changed the run_clang_format.py
> script in the arrow project only to find out later that there was an exact
> copy of it in parquet-cpp.
>
> However, I don't think merging the codebases makes sense in the long term.
> I can imagine use cases for parquet that don't involve arrow and tying them
> together seems like the wrong choice. There will be other formats that
> arrow needs to support that will be kept separate (e.g., ORC), so I don't
> see why parquet should be special. I also think build tooling should be
> pulled into its own codebase. GNU has had a long history of developing open
> source C/C++ projects that way and made projects like
> autoconf/automake/make to support them. I don't think CI is a good
> counter-example since there have been lots of successful open source
> projects that have used nightly build systems that pinned versions of
> dependent software.
>
> That being said, I think it makes sense to merge the codebases in the short
> term with the express purpose of separating them in the near term. My
> reasoning is as follows. By putting the codebases together, you can more
> easily delineate the boundaries between the APIs with a single PR. Second,
> it will force the build tooling to converge instead of diverge, which has
> already happened. Once the boundaries and tooling have been sorted out, it
> should be easy to separate them back into their own codebases.
>
> If the codebases are merged, I would ask that the C++ codebases for arrow
> be separated from other languages. Looking at it from the perspective of a
> parquet-cpp library user, having a dependency on Java is a large tax to pay
> if you don't need it. For example, there were 25 JIRAs in the 0.10.0
> release of arrow, many of which were holding up the release. I hope that
> seems like a reasonable compromise, and I think it will help reduce the
> complexity of the build/release tooling.
>
>
> On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com> wrote:
>
>> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> >
>> > > The community will be less willing to accept large
>> > > changes that require multiple rounds of patches for stability and API
>> > > convergence. Our contributions to Libhdfs++ in the HDFS community took
>> a
>> > > significantly long time for the very same reason.
>> >
>> > Please don't use bad experiences from another open source community as
>> > leverage in this discussion. I'm sorry that things didn't go the way
>> > you wanted in Apache Hadoop but this is a distinct community which
>> > happens to operate under a similar open governance model.
>>
>>
>> There are some more radical and community building options as well. Take
>> the subversion project as a precedent. With subversion, any Apache
>> committer can request and receive a commit bit on some large fraction of
>> subversion.
>>
>> So why not take this a bit further and give every parquet committer a
>> commit bit in Arrow? Or even make them be first class committers in Arrow?
>> Possibly even make it policy that every Parquet committer who asks will be
>> given committer status in Arrow.
>>
>> That relieves a lot of the social anxiety here. Parquet committers can't be
>> worried at that point whether their patches will get merged; they can just
>> merge them.  Arrow shouldn't worry much about inviting in the Parquet
>> committers. After all, Arrow already depends a lot on parquet so why not
>> invite them in?
>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Joshua Storck <jo...@gmail.com>.
I recently worked on an issue that had to be implemented in parquet-cpp
(ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
ARROW-2586). I found the circular dependencies confusing and hard to work
with. For example, I still have a PR open in parquet-cpp (created on May
10) because of a PR that it depended on in arrow that was recently merged.
I couldn't even address any CI issues in the PR because the change in arrow
was not yet in master. In a separate PR, I changed the run_clang_format.py
script in the arrow project only to find out later that there was an exact
copy of it in parquet-cpp.

However, I don't think merging the codebases makes sense in the long term.
I can imagine use cases for parquet that don't involve arrow and tying them
together seems like the wrong choice. There will be other formats that
arrow needs to support that will be kept separate (e.g., ORC), so I don't
see why parquet should be special. I also think build tooling should be
pulled into its own codebase. GNU has had a long history of developing open
source C/C++ projects that way and made projects like
autoconf/automake/make to support them. I don't think CI is a good
counter-example since there have been lots of successful open source
projects that have used nightly build systems that pinned versions of
dependent software.

That being said, I think it makes sense to merge the codebases in the
short term, with the express purpose of separating them again later. My
reasoning is as follows. By putting the codebases together, you can more
easily delineate the boundaries between the APIs with a single PR. Second,
it will force the build tooling to converge instead of diverge, which has
already happened. Once the boundaries and tooling have been sorted out, it
should be easy to separate them back into their own codebases.

If the codebases are merged, I would ask that the C++ codebase for arrow
be separated from the other languages. Looking at it from the perspective
of a parquet-cpp library user, having a dependency on Java is a large tax
to pay if you don't need it. For example, there were 25 JIRAs in the 0.10.0
release of arrow, many of which were holding up the release. I hope that
seems like a reasonable compromise, and I think it will help reduce the
complexity of the build/release tooling.


On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <te...@gmail.com> wrote:

> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:
>
> >
> > > The community will be less willing to accept large
> > > changes that require multiple rounds of patches for stability and API
> > > convergence. Our contributions to Libhdfs++ in the HDFS community took
> a
> > > significantly long time for the very same reason.
> >
> > Please don't use bad experiences from another open source community as
> > leverage in this discussion. I'm sorry that things didn't go the way
> > you wanted in Apache Hadoop but this is a distinct community which
> > happens to operate under a similar open governance model.
>
>
> There are some more radical and community building options as well. Take
> the subversion project as a precedent. With subversion, any Apache
> committer can request and receive a commit bit on some large fraction of
> subversion.
>
> So why not take this a bit further and give every parquet committer a
> commit bit in Arrow? Or even make them be first class committers in Arrow?
> Possibly even make it policy that every Parquet committer who asks will be
> given committer status in Arrow.
>
> That relieves a lot of the social anxiety here. Parquet committers can't be
> worried at that point whether their patches will get merged; they can just
> merge them.  Arrow shouldn't worry much about inviting in the Parquet
> committers. After all, Arrow already depends a lot on parquet so why not
> invite them in?
>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jul 30, 2018 at 8:50 PM, Ted Dunning <te...@gmail.com> wrote:
> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:
>
>>
>> > The community will be less willing to accept large
>> > changes that require multiple rounds of patches for stability and API
>> > convergence. Our contributions to Libhdfs++ in the HDFS community took a
>> > significantly long time for the very same reason.
>>
>> Please don't use bad experiences from another open source community as
>> leverage in this discussion. I'm sorry that things didn't go the way
>> you wanted in Apache Hadoop but this is a distinct community which
>> happens to operate under a similar open governance model.
>
>
> There are some more radical and community building options as well. Take
> the subversion project as a precedent. With subversion, any Apache
> committer can request and receive a commit bit on some large fraction of
> subversion.
>
> So why not take this a bit further and give every parquet committer a
> commit bit in Arrow? Or even make them be first class committers in Arrow?
> Possibly even make it policy that every Parquet committer who asks will be
> given committer status in Arrow.
>
> That relieves a lot of the social anxiety here. Parquet committers can't be
> worried at that point whether their patches will get merged; they can just
> merge them.  Arrow shouldn't worry much about inviting in the Parquet
> committers. After all, Arrow already depends a lot on parquet so why not
> invite them in?

hi Ted,

I for one am with you on this idea, and don't see it as all that
radical. The Arrow and Parquet communities are working toward the same
goals: open standards for storage and in-memory analytics. This is
part of why there is so much overlap already among the committers
and PMC members.

We are stronger working together than fragmented.

- Wes

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:

>
> > The community will be less willing to accept large
> > changes that require multiple rounds of patches for stability and API
> > convergence. Our contributions to Libhdfs++ in the HDFS community took a
> > significantly long time for the very same reason.
>
> Please don't use bad experiences from another open source community as
> leverage in this discussion. I'm sorry that things didn't go the way
> you wanted in Apache Hadoop but this is a distinct community which
> happens to operate under a similar open governance model.


There are some more radical and community building options as well. Take
the subversion project as a precedent. With subversion, any Apache
committer can request and receive a commit bit on some large fraction of
subversion.

So why not take this a bit further and give every parquet committer a
commit bit in Arrow? Or even make them be first class committers in Arrow?
Possibly even make it policy that every Parquet committer who asks will be
given committer status in Arrow.

That relieves a lot of the social anxiety here. Parquet committers can't be
worried at that point whether their patches will get merged; they can just
merge them.  Arrow shouldn't worry much about inviting in the Parquet
committers. After all, Arrow already depends a lot on parquet so why not
invite them in?

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi,

On Mon, Jul 30, 2018 at 6:52 PM, Deepak Majeti <ma...@gmail.com> wrote:
> Wes,
>
> I definitely appreciate and do see the impact of contributions made by
> everyone. I made this statement not to rate any contributions but solely to
> support my concern.
> The contribution barrier is higher simply because of the increased code,
> build, and test dependencies. If the community has less interest in a
> certain component (parquet-cpp core in this case), it becomes very hard to
> make big changes.

This is a FUD-based argument rather than a fact-based one. If there
are committers in Arrow (via Parquet) who approve changes, why would
they not be merged? The community will be incentivized to make sure
that developers are productive and able to work efficiently on the
part of the project that is relevant to them. parquet-cpp developers
are already building most of Arrow's C++ codebase en route to
development; with a single build system, development environments
will be simpler to manage in general.

On the subject of code velocity: Arrow has a diverse community and
nearly 200 unique contributors at this point. Parquet has about 30.
The Arrow codebase history starts on February 5, 2016. Since then:

* Arrow has had 2055 patches
* parquet-cpp has had 425

So the patch volume is about 5x as high on average. This does not look
like a project that is struggling to merge patches. We are all
invested in the success of Parquet and changing the structure of the
code and helping the community to work more productively would not
change that.
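
The patch totals quoted above can be tallied with plain git. The following is an illustrative sketch against a throwaway repository (the repo and commit messages are hypothetical), not the exact commands behind the numbers in this message:

```shell
#!/bin/sh
# Sketch: counting patches with git rev-list, the kind of tally behind
# the "2055 vs. 425" comparison above. Throwaway repo for illustration.
set -e
cd "$(mktemp -d)"
git init -q .
git -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "first patch"
git -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "second patch"
# Count all commits reachable from HEAD; rev-list also accepts
# --since="2016-02-05" to bound the count by date, as in the text above.
git rev-list --count HEAD
```

Against a real clone, something like `git rev-list --count --since="2016-02-05" master` would give a since-date total.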

> The community will be less willing to accept large
> changes that require multiple rounds of patches for stability and API
> convergence. Our contributions to Libhdfs++ in the HDFS community took a
> significantly long time for the very same reason.

Please don't use bad experiences from another open source community as
leverage in this discussion. I'm sorry that things didn't go the way
you wanted in Apache Hadoop but this is a distinct community which
happens to operate under a similar open governance model.

After significant time thinking about it, I think unfortunately that
the next-best option after a monorepo structure would be for the Arrow
community to _fork_ the parquet-cpp codebase and go our separate ways.
Beyond these two options I fail to see a pragmatic solution to the
problems we've been having.

- Wes

>
>
>
> On Mon, Jul 30, 2018 at 6:05 PM Wes McKinney <we...@gmail.com> wrote:
>
>> hi Deepak
>>
>> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> > @Wes
>> > My observation is that most of the parquet-cpp contributors you listed
>> > that overlap with the Arrow community mainly contribute to the Arrow
>> > bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
>> > repo. Very few of them review/contribute patches to the parquet-cpp core.
>> >
>>
>> So, what are you saying exactly, that some contributions or
>> contributors to Apache Parquet matter more than others? I don't
>> follow.
>>
>> As a result of these individuals' efforts, the parquet-cpp libraries
>> are being installed well over 100,000 times per month on a single
>> install path (Python) alone.
>>
>
>> > I believe improvements to the parquet-cpp core will be negatively
>> > impacted, since merging the parquet-cpp and arrow-cpp repos will
>> > increase the barrier of entry for new contributors interested in the
>> > parquet-cpp core. The current extensions to the parquet-cpp core
>> > related to bloom filters and column encryption are all being done by
>> > first-time contributors.
>>
>> I don't understand why this would "increase the barrier of entry".
>> Could you explain?
>>
>> It is true that there would be more code in the codebase, but the
>> build and test procedure would be no more complex. If anything,
>> community productivity will be improved by having a more cohesive /
>> centralized development platform (large amounts of code that Parquet
>> depends on are in Apache Arrow already).
>>
>> >
>> > If you believe there will be new interest in the parquet-cpp core with
>> > the mono-repo approach, I am all up for it.
>>
>> Yes, I believe that this change will result in more and higher quality
>> code review to Parquet core changes and general improvements to
>> developer productivity across the board. Developer productivity is
>> what this is all about.
>>
>> - Wes
>>
>> >
>> >
>> > On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com>
>> wrote:
>> >
>> >> I do not claim to have insight into parquet-cpp development. However,
>> >> from our experience developing Ray, I can say that the monorepo
>> >> approach (for Ray) has improved things a lot. Before, we tried various
>> >> schemes to split the project into multiple repos, but the build system
>> >> and test infrastructure duplication and the overhead from synchronizing
>> >> changes slowed development down significantly (and fixing bugs that
>> >> touch the subrepos and the main repo is inconvenient).
>> >>
>> >> Also, the decision to put arrow and parquet-cpp into a common repo is
>> >> independent of how tightly coupled the two projects are (and there
>> >> could be a matrix entry in Travis which tests that PRs keep them
>> >> decoupled, or rather that they both just depend on a small common
>> >> "base"). Google and Facebook demonstrate such independence by having
>> >> many, many projects in the same repo, of course. It would be great if
>> >> the open source community moved more in this direction too, I think.
>> >>
>> >> Best,
>> >> Philipp.
>> >>
>> >> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >> > hi Donald,
>> >> >
>> >> > This would make things worse, not better. Code changes routinely
>> >> > involve changes to the build system, and so you could be talking about
>> >> > having to make changes to 2 or 3 git repositories as the result of a
>> >> > single new feature or bug fix. There isn't really a cross-repo CI
>> >> > solution available.
>> >> >
>> >> > I've seen some approaches to the monorepo problem using multiple git
>> >> > repositories, such as
>> >> >
>> >> > https://github.com/twosigma/git-meta
>> >> >
>> >> > Until something like this has first class support by the GitHub
>> >> > platform and its CI services (Travis CI, Appveyor), I don't think it
>> >> > will work for us.
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <
>> donald.foss@gmail.com>
>> >> > wrote:
>> >> > > Could this work as each module gets configured as a sub-git repo?
>> >> > > Top-level build tools go into each sub-repo and pick the correct
>> >> > > release version to test. Tests in Python are dependent on the cpp
>> >> > > sub-repo to ensure the API tests still pass.
>> >> > >
>> >> > > This should be the best of both worlds, if sub-repos are a
>> >> > > supported option.
>> >> > >
>> >> > > --Donald E. Foss
>> >> > >
>> >> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <
>> majeti.deepak@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > >> I dislike the current build system complications as well.
>> >> > >>
>> >> > >> However, in my opinion, combining the code bases will severely
>> >> > >> impact the progress of the parquet-cpp project and implicitly the
>> >> > >> progress of the entire parquet project.
>> >> > >> Combining would have made much more sense if parquet-cpp were a
>> >> > >> mature project and codebase. But parquet-cpp (and the entire
>> >> > >> parquet project) is evolving continuously, with new features being
>> >> > >> added including bloom filters, column encryption, and indexes.
>> >> > >>
>> >> > >> If the two code bases merged, it would be much more difficult to
>> >> > >> contribute to the parquet-cpp project, since now Arrow bindings
>> >> > >> would have to be supported as well. Please correct me if I am
>> >> > >> wrong here.
>> >> > >>
>> >> > >> Of the two evils, I think handling the build system and packaging
>> >> > >> duplication is much more manageable, since they are quite stable
>> >> > >> at this point.
>> >> > >>
>> >> > >> Regarding "* API changes cause awkward release coordination issues
>> >> > >> between Arrow and Parquet": can we make minor releases for
>> >> > >> parquet-cpp (with the API changes needed) as and when Arrow is
>> >> > >> released?
>> >> > >>
>> >> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> >> > >> converting between the Arrow columnar memory format and Parquet":
>> >> > >> can this be moved to the Arrow project, exposing the more stable
>> >> > >> low-level APIs in parquet-cpp?
>> >> > >>
>> >> > >> I am also curious whether the Arrow and Parquet Java
>> >> > >> implementations have similar API compatibility issues.
>> >> > >>
>> >> > >>
>> >> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
>> >> > wrote:
>> >> > >>
>> >> > >> > hi folks,
>> >> > >> >
>> >> > >> > We've been struggling for quite some time with the development
>> >> > >> > workflow between the Arrow and Parquet C++ (and Python)
>> >> > >> > codebases.
>> >> > >> >
>> >> > >> > To explain the root issues:
>> >> > >> >
>> >> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> >> > >> > includes file interfaces, memory management, miscellaneous
>> >> > >> > algorithms (e.g. dictionary encoding), etc. Note that before
>> >> > >> > this "platform" dependency was introduced, there was significant
>> >> > >> > duplicated code between these codebases and incompatible
>> >> > >> > abstract interfaces for things like files
>> >> > >> >
>> >> > >> > * we maintain Arrow conversion code in parquet-cpp for
>> >> > >> > converting between the Arrow columnar memory format and Parquet
>> >> > >> >
>> >> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> >> > >> > Apache Arrow. This introduces a circular dependency into our CI.
>> >> > >> >
>> >> > >> > * Substantial portions of our CMake build system and related
>> >> > >> > tooling are duplicated between the Arrow and Parquet repos
>> >> > >> >
>> >> > >> > * API changes cause awkward release coordination issues between
>> >> > >> > Arrow and Parquet
>> >> > >> >
>> >> > >> > I believe the best way to remedy the situation is to adopt a
>> >> > >> > "Community over Code" approach and find a way for the Parquet and
>> >> > >> > Arrow C++ development communities to operate out of the same code
>> >> > >> > repository, i.e. the apache/arrow git repository.
>> >> > >> >
>> >> > >> > This would bring major benefits:
>> >> > >> >
>> >> > >> > * Shared CMake build infrastructure, developer tools, and CI
>> >> > >> > infrastructure (Parquet is already being built as a dependency in
>> >> > >> > Arrow's CI systems)
>> >> > >> >
>> >> > >> > * Share packaging and release management infrastructure
>> >> > >> >
>> >> > >> > * Reduce / eliminate problems due to API changes (where we
>> >> > >> > currently introduce breakage into our CI workflow when there is
>> >> > >> > a breaking / incompatible change)
>> >> > >> >
>> >> > >> > * Arrow releases would include a coordinated snapshot of the
>> >> > >> > Parquet implementation as it stands
>> >> > >> >
>> >> > >> > Continuing with the status quo has become unsatisfactory to me
>> >> > >> > and as a result I've become less motivated to work on the
>> >> > >> > parquet-cpp codebase.
>> >> > >> >
>> >> > >> > The only Parquet C++ committer who is not an Arrow committer is
>> >> > >> > Deepak Majeti. I think the issue of commit privileges could be
>> >> > >> > resolved without too much difficulty or time.
>> >> > >> >
>> >> > >> > I also think that, if truly necessary, the Apache Parquet
>> >> > >> > community could create release scripts to cut a minimal
>> >> > >> > versioned Apache Parquet C++ release.
>> >> > >> >
>> >> > >> > I know that some people are wary of monorepos and megaprojects,
>> >> > >> > but as an example TensorFlow is at least 10 times as large a
>> >> > >> > project in terms of LOC and number of different platform
>> >> > >> > components, and it seems to be getting along just fine. I think
>> >> > >> > we should be able to work together as a community to function
>> >> > >> > just as well.
>> >> > >> >
>> >> > >> > Interested in the opinions of others, and any other ideas for
>> >> > >> > practical solutions to the above problems.
>> >> > >> >
>> >> > >> > Thanks,
>> >> > >> > Wes
>> >> > >> >
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> regards,
>> >> > >> Deepak Majeti
>> >> > >>
>> >> >
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>
>
>
> --
> regards,
> Deepak Majeti
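
As a concrete illustration of the "sub-git repos" arrangement proposed in the quoted thread (and of the pinning it implies), here is a minimal local sketch using git submodules; all repository names are stand-ins for illustration, not the actual Apache repos:

```shell
#!/bin/sh
# Sketch of the multi-repo layout: a top-level repo pins a component repo
# (a stand-in for parquet-cpp) at a specific commit via a git submodule.
set -e
work="$(mktemp -d)" && cd "$work"

# Stand-in component repo with one release commit:
git init -q parquet-cpp
git -C parquet-cpp -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "component release"

# Top-level repo embeds it, pinned by commit SHA; newer git requires
# protocol.file.allow for local-path submodules:
git init -q arrow && cd arrow
git -c protocol.file.allow=always submodule add -q ../parquet-cpp parquet-cpp
git -c user.email=dev@example.com -c user.name=dev \
    commit -q -m "pin parquet-cpp submodule"

# .gitmodules records the mapping; `git submodule status` shows the pin:
git submodule status
```

The pinned-SHA behavior is exactly what makes cross-repo changes awkward without first-class tooling support, as Wes notes above: a change spanning both repos needs coordinated commits plus a pin update.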

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi,

On Mon, Jul 30, 2018 at 6:52 PM, Deepak Majeti <ma...@gmail.com> wrote:
> Wes,
>
> I definitely appreciate and do see the impact of contributions made by
> everyone. I made this statement not to rate any contributions but solely to
> support my concern.
> The contribution barrier is higher simply because of the increased code,
> build, and test dependencies. If the community has lesser interest on a
> certain component (parquet-cpp core in this case), it becomes very hard to
> make big changes.

This is a FUD-based argument rather than a fact-based one. If there
are committers in Arrow (via Parquet) who approve changes, why would
they not be merged? The community will be incentivized to make sure
that developers are productive and able to work efficiently on the
part of the project that is relevant to them. parquet-cpp developers
are already building most of Arrow's C++ codebase en route to
development; by building with a single build system development
environments will be simpler to manage in general.

On the subject of code velocity: Arrow has a diverse community and
nearly 200 unique contributors at this point. Parquet has about 30.
The Arrow codebase history starts on February 5, 2016. Since then:

* Arrow has has 2055 patches
* parquet-cpp has had 425

So the patch volume is about 5x as high on average. This does not look
like a project that is struggling to merge patches. We are all
invested in the success of Parquet and changing the structure of the
code and helping the community to work more productively would not
change that.

> The community will be less willing to accept large
> changes that require multiple rounds of patches for stability and API
> convergence. Our contributions to Libhdfs++ in the HDFS community took a
> significantly long time for the very same reason.

Please don't use bad experiences from another open source community as
leverage in this discussion. I'm sorry that things didn't go the way
you wanted in Apache Hadoop but this is a distinct community which
happens to operate under a similar open governance model.

After significant time thinking about it, I think unfortunately that
the next-best option after a monorepo structure would be for the Arrow
community to _fork_ the parquet-cpp codebase and go our separate ways.
Beyond these two options I fail to see a pragmatic solution to the
problems we've been having.

- Wes

>
>
>
> On Mon, Jul 30, 2018 at 6:05 PM Wes McKinney <we...@gmail.com> wrote:
>
>> hi Deepak
>>
>> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com>
>> wrote:
>> > @Wes
>> > My observation is that most of the parquet-cpp contributors you listed
>> that
>> > overlap with the Arrow community mainly contribute to the Arrow
>> > bindings(parquet::arrow layer)/platform API changes in the parquet-cpp
>> > repo. Very few of them review/contribute patches to the parquet-cpp core.
>> >
>>
>> So, what are you saying exactly, that some contributions or
>> contributors to Apache Parquet matter more than others? I don't
>> follow.
>>
>> As a result of these individual's efforts, the parquet-cpp libraries
>> are being installed well over 100,000 times per month on a single
>> install path (Python) alone.
>>
>
>> > I believe improvements to the parquet-cpp core will be negatively
>> impacted
>> > since merging the parquet-cpp and arrow-cpp repos will increase the
>> barrier
>> > of entry to new contributors interested in the parquet-cpp core. The
>> > current extensions to the parquet-cpp core related to bloom-filters, and
>> > column encryption are all being done by first-time contributors.
>>
>> I don't understand why this would "increase the barrier of entry".
>> Could you explain?
>>
> It is true that there would be more code in the codebase, but the
>> build and test procedure would be no more complex. If anything,
>> community productivity will be improved by having a more cohesive /
>> centralized development platform (large amounts of code that Parquet
>> depends on are in Apache Arrow already).
>>
>> >
>> > If you believe there will be new interest in the parquet-cpp core with
>> the
>> > mono-repo approach, I am all up for it.
>>
>> Yes, I believe that this change will result in more and higher quality
>> code review to Parquet core changes and general improvements to
>> developer productivity across the board. Developer productivity is
>> what this is all about.
>>
>> - Wes
>>
>> >
>> >
>> > On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com>
>> wrote:
>> >
>> >> I do not claim to have insight into parquet-cpp development. However,
>> from
>> >> our experience developing Ray, I can say that the monorepo approach (for
>> >> Ray) has improved things a lot. Before we tried various schemes to split
>> >> the project into multiple repos, but the build system and test
>> >> infrastructure duplications and overhead from synchronizing changes
>> slowed
>> >> development down significantly (and fixing bugs that touch the subrepos
>> and
>> >> the main repo is inconvenient).
>> >>
>> >> Also the decision to put arrow and parquet-cpp into a common repo is
>> >> independent of how tightly coupled the two projects are (and there
>> could be
>> >> a matrix entry in travis which tests that PRs keep them decoupled, or
>> >> rather that they both just depend on a small common "base"). Google and
>> >> Facebook demonstrate such independence by having many many projects in
>> the
>> >> same repo of course. It would be great if the open source community
>> would
>> >> move more into this direction too I think.
>> >>
>> >> Best,
>> >> Philipp.
>> >>
>> >> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >>
>> >> > hi Donald,
>> >> >
>> >> > This would make things worse, not better. Code changes routinely
>> >> > involve changes to the build system, and so you could be talking about
>> >> > having to making changes to 2 or 3 git repositories as the result of a
>> >> > single new feature or bug fix. There isn't really a cross-repo CI
>> >> > solution available
>> >> >
>> >> > I've seen some approaches to the monorepo problem using multiple git
>> >> > repositories, such as
>> >> >
>> >> > https://github.com/twosigma/git-meta
>> >> >
>> >> > Until something like this has first class support by the GitHub
>> >> > platform and its CI services (Travis CI, Appveyor), I don't think it
>> >> > will work for us.
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <
>> donald.foss@gmail.com>
>> >> > wrote:
>> >> > > Could this work as each module gets configured as sub-git repots.
>> Top
>> >> > level
>> >> > > build tool go into each sub-repo, pick the correct release version
>> to
>> >> > test.
>> >> > > Tests in Python is dependent on cpp sub-repo to ensure the API still
>> >> > pass.
>> >> > >
>> >> > > This should be the best of both worlds, if sub-repo are supposed
>> >> option.
>> >> > >
>> >> > > --Donald E. Foss
>> >> > >
>> >> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <
>> majeti.deepak@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > >> I dislike the current build system complications as well.
>> >> > >>
>> >> > >> However, in my opinion, combining the code bases will severely
>> impact
>> >> > the
>> >> > >> progress of the parquet-cpp project and implicitly the progress of
>> the
>> >> > >> entire parquet project.
>> >> > >> Combining would have made much more sense if parquet-cpp were a
>> mature
>> >> > >> project and codebase.  But parquet-cpp (and the entire parquet
>> >> project)
>> >> > is
>> >> > >> evolving continuously with new features being added including bloom
>> >> > >> filters,  column encryption, and indexes.
>> >> > >>
>> >> > >> If the two code bases merged, it will be much more difficult to
>> >> > contribute
>> >> > >> to the parquet-cpp project since now Arrow bindings have to be
>> >> > supported as
>> >> > >> well. Please correct me if I am wrong here.
>> >> > >>
>> >> > >> Out of the two evils, I think handling the build system and packaging
>> >> > >> duplication is much more manageable since they are quite stable at
>> >> this
>> >> > >> point.
>> >> > >>
>> >> > >> Regarding "* API changes cause awkward release coordination issues
>> >> > between
>> >> > >> Arrow and Parquet". Can we make minor releases for parquet-cpp
>> (with
>> >> API
>> >> > >> changes needed) as and when Arrow is released?
>> >> > >>
>> >> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> >> > >> converting between Arrow columnar memory format and Parquet". Can
>> this
>> >> > be
>> >> > >> moved to the Arrow project and expose the more stable low-level
>> APIs
>> >> in
>> >> > >> parquet-cpp?
>> >> > >>
>> >> > >> I am also curious if the Arrow and Parquet Java implementations
>> have
>> >> > >> similar API compatibility issues.
>> >> > >>
>> >> > >>
>> >> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
>> >> > wrote:
>> >> > >>
>> >> > >> > hi folks,
>> >> > >> >
>> >> > >> > We've been struggling for quite some time with the development
>> >> > >> > workflow between the Arrow and Parquet C++ (and Python)
>> codebases.
>> >> > >> >
>> >> > >> > To explain the root issues:
>> >> > >> >
>> >> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> >> > >> > includes file interfaces, memory management, miscellaneous
>> >> algorithms
>> >> > >> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> >> > >> > dependency was introduced, there was significant duplicated code
>> >> > >> > between these codebases and incompatible abstract interfaces for
>> >> > >> > things like files
>> >> > >> >
>> >> > >> > * we maintain Arrow conversion code in parquet-cpp for
>> converting
>> >> > >> > between Arrow columnar memory format and Parquet
>> >> > >> >
>> >> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> >> > >> > Apache Arrow. This introduces a circular dependency into our CI.
>> >> > >> >
>> >> > >> > * Substantial portions of our CMake build system and related
>> tooling
>> >> > >> > are duplicated between the Arrow and Parquet repos
>> >> > >> >
>> >> > >> > * API changes cause awkward release coordination issues between
>> >> Arrow
>> >> > >> > and Parquet
>> >> > >> >
>> >> > >> > I believe the best way to remedy the situation is to adopt a
>> >> > >> > "Community over Code" approach and find a way for the Parquet and
>> >> > >> > Arrow C++ development communities to operate out of the same code
>> >> > >> > repository, i.e. the apache/arrow git repository.
>> >> > >> >
>> >> > >> > This would bring major benefits:
>> >> > >> >
>> >> > >> > * Shared CMake build infrastructure, developer tools, and CI
>> >> > >> > infrastructure (Parquet is already being built as a dependency in
>> >> > >> > Arrow's CI systems)
>> >> > >> >
>> >> > >> > * Share packaging and release management infrastructure
>> >> > >> >
>> >> > >> > * Reduce / eliminate problems due to API changes (where we
>> currently
>> >> > >> > introduce breakage into our CI workflow when there is a breaking
>> /
>> >> > >> > incompatible change)
>> >> > >> >
>> >> > >> > * Arrow releases would include a coordinated snapshot of the
>> Parquet
>> >> > >> > implementation as it stands
>> >> > >> >
>> >> > >> > Continuing with the status quo has become unsatisfactory to me
>> and
>> >> as
>> >> > >> > a result I've become less motivated to work on the parquet-cpp
>> >> > >> > codebase.
>> >> > >> >
>> >> > >> > The only Parquet C++ committer who is not an Arrow committer is
>> >> Deepak
>> >> > >> > Majeti. I think the issue of commit privileges could be resolved
>> >> > >> > without too much difficulty or time.
>> >> > >> >
>> >> > >> > I also think that, if deemed truly necessary, the Apache Parquet
>> >> > >> > community could create release scripts to cut a minimal versioned
>> >> > >> > Apache Parquet C++ release.
>> >> > >> >
>> >> > >> > I know that some people are wary of monorepos and megaprojects,
>> but
>> >> as
>> >> > >> > an example, TensorFlow is at least 10 times as large a project in
>> >> > >> > terms of LOC and number of different platform components, and it
>> >> > >> > seems to be getting along just fine. I think we should be able to
>> >> work
>> >> > >> > together as a community to function just as well.
>> >> > >> >
>> >> > >> > Interested in the opinions of others, and any other ideas for
>> >> > >> > practical solutions to the above problems.
>> >> > >> >
>> >> > >> > Thanks,
>> >> > >> > Wes
>> >> > >> >
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> regards,
>> >> > >> Deepak Majeti
>> >> > >>
>> >> >
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
Wes,

I definitely appreciate and do see the impact of contributions made by
everyone. I made this statement not to rate any contributions but solely to
support my concern.
The contribution barrier is higher simply because of the increased code,
build, and test dependencies. If the community has less interest in a
certain component (the parquet-cpp core in this case), it becomes very hard to
make big changes. The community will be less willing to accept large
changes that require multiple rounds of patches for stability and API
convergence. Our contributions to Libhdfs++ in the HDFS community took a
significantly long time for the very same reason.



On Mon, Jul 30, 2018 at 6:05 PM Wes McKinney <we...@gmail.com> wrote:

> hi Deepak
>
> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com>
> wrote:
> > @Wes
> > My observation is that most of the parquet-cpp contributors you listed
> that
> > overlap with the Arrow community mainly contribute to the Arrow
> > bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
> > repo. Very few of them review/contribute patches to the parquet-cpp core.
> >
>
> So, what are you saying exactly, that some contributions or
> contributors to Apache Parquet matter more than others? I don't
> follow.
>
> As a result of these individuals' efforts, the parquet-cpp libraries
> are being installed well over 100,000 times per month on a single
> install path (Python) alone.
>

> > I believe improvements to the parquet-cpp core will be negatively
> impacted
> > since merging the parquet-cpp and arrow-cpp repos will increase the
> barrier
> > of entry to new contributors interested in the parquet-cpp core. The
> > current extensions to the parquet-cpp core related to bloom-filters, and
> > column encryption are all being done by first-time contributors.
>
> I don't understand why this would "increase the barrier of entry".
> Could you explain?
>
> It is true that there would be more code in the codebase, but the
> build and test procedure would be no more complex. If anything,
> community productivity will be improved by having a more cohesive /
> centralized development platform (large amounts of code that Parquet
> depends on are in Apache Arrow already).
>
> >
> > If you believe there will be new interest in the parquet-cpp core with
> the
> > mono-repo approach, I am all up for it.
>
> Yes, I believe that this change will result in more and higher quality
> code review to Parquet core changes and general improvements to
> developer productivity across the board. Developer productivity is
> what this is all about.
>
> - Wes
>
> >
> >
> > On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com>
> wrote:
> >
> >> I do not claim to have insight into parquet-cpp development. However,
> from
> >> our experience developing Ray, I can say that the monorepo approach (for
> >> Ray) has improved things a lot. Before we tried various schemes to split
> >> the project into multiple repos, but the build system and test
> >> infrastructure duplications and overhead from synchronizing changes
> slowed
> >> development down significantly (and fixing bugs that touch the subrepos
> and
> >> the main repo is inconvenient).
> >>
> >> Also the decision to put arrow and parquet-cpp into a common repo is
> >> independent of how tightly coupled the two projects are (and there
> could be
> >> a matrix entry in travis which tests that PRs keep them decoupled, or
> >> rather that they both just depend on a small common "base"). Google and
> >> Facebook demonstrate such independence by having many many projects in
> the
> >> same repo of course. It would be great if the open source community
> would
> >> move more into this direction too I think.
> >>
> >> Best,
> >> Philipp.
> >>
> >> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> > hi Donald,
> >> >
> >> > This would make things worse, not better. Code changes routinely
> >> > involve changes to the build system, and so you could be talking about
> >> > having to make changes to 2 or 3 git repositories as the result of a
> >> > single new feature or bug fix. There isn't really a cross-repo CI
> >> > solution available
> >> >
> >> > I've seen some approaches to the monorepo problem using multiple git
> >> > repositories, such as
> >> >
> >> > https://github.com/twosigma/git-meta
> >> >
> >> > Until something like this has first class support by the GitHub
> >> > platform and its CI services (Travis CI, Appveyor), I don't think it
> >> > will work for us.
> >> >
> >> > - Wes
> >> >
> >> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <
> donald.foss@gmail.com>
> >> > wrote:
> >> > > Could this work if each module gets configured as a sub-git repo? A
> >> > > top-level build tool would go into each sub-repo and pick the correct
> >> > > release version to test. The Python tests depend on the cpp sub-repo
> >> > > to ensure the API still passes.
> >> > >
> >> > > This should be the best of both worlds, if sub-repos are a supported
> >> > > option.
> >> > >
> >> > > --Donald E. Foss
> >> > >
> >> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <
> majeti.deepak@gmail.com>
> >> > > wrote:
> >> > >
> >> > >> I dislike the current build system complications as well.
> >> > >>
> >> > >> However, in my opinion, combining the code bases will severely
> impact
> >> > the
> >> > >> progress of the parquet-cpp project and implicitly the progress of
> the
> >> > >> entire parquet project.
> >> > >> Combining would have made much more sense if parquet-cpp were a
> mature
> >> > >> project and codebase.  But parquet-cpp (and the entire parquet
> >> project)
> >> > is
> >> > >> evolving continuously with new features being added including bloom
> >> > >> filters,  column encryption, and indexes.
> >> > >>
> >> > >> If the two code bases merged, it will be much more difficult to
> >> > contribute
> >> > >> to the parquet-cpp project since now Arrow bindings have to be
> >> > supported as
> >> > >> well. Please correct me if I am wrong here.
> >> > >>
> >> > >> Out of the two evils, I think handling the build system and packaging
> >> > >> duplication is much more manageable since they are quite stable at
> >> this
> >> > >> point.
> >> > >>
> >> > >> Regarding "* API changes cause awkward release coordination issues
> >> > between
> >> > >> Arrow and Parquet". Can we make minor releases for parquet-cpp
> (with
> >> API
> >> > >> changes needed) as and when Arrow is released?
> >> > >>
> >> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
> >> > >> converting between Arrow columnar memory format and Parquet". Can
> this
> >> > be
> >> > >> moved to the Arrow project and expose the more stable low-level
> APIs
> >> in
> >> > >> parquet-cpp?
> >> > >>
> >> > >> I am also curious if the Arrow and Parquet Java implementations
> have
> >> > >> similar API compatibility issues.
> >> > >>
> >> > >>
> >> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
> >> > wrote:
> >> > >>
> >> > >> > hi folks,
> >> > >> >
> >> > >> > We've been struggling for quite some time with the development
> >> > >> > workflow between the Arrow and Parquet C++ (and Python)
> codebases.
> >> > >> >
> >> > >> > To explain the root issues:
> >> > >> >
> >> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> >> > >> > includes file interfaces, memory management, miscellaneous
> >> algorithms
> >> > >> > (e.g. dictionary encoding), etc. Note that before this "platform"
> >> > >> > dependency was introduced, there was significant duplicated code
> >> > >> > between these codebases and incompatible abstract interfaces for
> >> > >> > things like files
> >> > >> >
> >> > >> > * we maintain Arrow conversion code in parquet-cpp for
> converting
> >> > >> > between Arrow columnar memory format and Parquet
> >> > >> >
> >> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> >> > >> > Apache Arrow. This introduces a circular dependency into our CI.
> >> > >> >
> >> > >> > * Substantial portions of our CMake build system and related
> tooling
> >> > >> > are duplicated between the Arrow and Parquet repos
> >> > >> >
> >> > >> > * API changes cause awkward release coordination issues between
> >> Arrow
> >> > >> > and Parquet
> >> > >> >
> >> > >> > I believe the best way to remedy the situation is to adopt a
> >> > >> > "Community over Code" approach and find a way for the Parquet and
> >> > >> > Arrow C++ development communities to operate out of the same code
> >> > >> > repository, i.e. the apache/arrow git repository.
> >> > >> >
> >> > >> > This would bring major benefits:
> >> > >> >
> >> > >> > * Shared CMake build infrastructure, developer tools, and CI
> >> > >> > infrastructure (Parquet is already being built as a dependency in
> >> > >> > Arrow's CI systems)
> >> > >> >
> >> > >> > * Share packaging and release management infrastructure
> >> > >> >
> >> > >> > * Reduce / eliminate problems due to API changes (where we
> currently
> >> > >> > introduce breakage into our CI workflow when there is a breaking
> /
> >> > >> > incompatible change)
> >> > >> >
> >> > >> > * Arrow releases would include a coordinated snapshot of the
> Parquet
> >> > >> > implementation as it stands
> >> > >> >
> >> > >> > Continuing with the status quo has become unsatisfactory to me
> and
> >> as
> >> > >> > a result I've become less motivated to work on the parquet-cpp
> >> > >> > codebase.
> >> > >> >
> >> > >> > The only Parquet C++ committer who is not an Arrow committer is
> >> Deepak
> >> > >> > Majeti. I think the issue of commit privileges could be resolved
> >> > >> > without too much difficulty or time.
> >> > >> >
> >> > >> > I also think that, if deemed truly necessary, the Apache Parquet
> >> > >> > community could create release scripts to cut a minimal versioned
> >> > >> > Apache Parquet C++ release.
> >> > >> >
> >> > >> > I know that some people are wary of monorepos and megaprojects,
> but
> >> as
> >> > >> > an example, TensorFlow is at least 10 times as large a project in
> >> > >> > terms of LOC and number of different platform components, and it
> >> > >> > seems to be getting along just fine. I think we should be able to
> >> work
> >> > >> > together as a community to function just as well.
> >> > >> >
> >> > >> > Interested in the opinions of others, and any other ideas for
> >> > >> > practical solutions to the above problems.
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Wes
> >> > >> >
> >> > >>
> >> > >>
> >> > >> --
> >> > >> regards,
> >> > >> Deepak Majeti
> >> > >>
> >> >
> >>
> >
> >
> > --
> > regards,
> > Deepak Majeti
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Julian Hyde <jh...@apache.org>.
I'm not going to comment on the design of the parquet-cpp module and whether it is “closer” to parquet or arrow.

But I do think Wes’s proposal is consistent with Apache policy. PMCs make releases and govern communities; they don’t exist to manage code bases, except as a means to the end of creating releases of known provenance. The Parquet PMC can continue to make parquet-cpp releases, and to end-users those releases will look the same as they do today, even if the code for those releases were to move to a different git repo in the ASF.

Julian



> On Jul 30, 2018, at 3:05 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> hi Deepak
> 
> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com> wrote:
>> @Wes
>> My observation is that most of the parquet-cpp contributors you listed that
>> overlap with the Arrow community mainly contribute to the Arrow
>> bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
>> repo. Very few of them review/contribute patches to the parquet-cpp core.
>> 
> 
> So, what are you saying exactly, that some contributions or
> contributors to Apache Parquet matter more than others? I don't
> follow.
> 
> As a result of these individuals' efforts, the parquet-cpp libraries
> are being installed well over 100,000 times per month on a single
> install path (Python) alone.
> 
>> I believe improvements to the parquet-cpp core will be negatively impacted
>> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
>> of entry to new contributors interested in the parquet-cpp core. The
>> current extensions to the parquet-cpp core related to bloom-filters, and
>> column encryption are all being done by first-time contributors.
> 
> I don't understand why this would "increase the barrier of entry".
> Could you explain?
> 
> It is true that there would be more code in the codebase, but the
> build and test procedure would be no more complex. If anything,
> community productivity will be improved by having a more cohesive /
> centralized development platform (large amounts of code that Parquet
> depends on are in Apache Arrow already).
> 
>> 
>> If you believe there will be new interest in the parquet-cpp core with the
>> mono-repo approach, I am all up for it.
> 
> Yes, I believe that this change will result in more and higher quality
> code review to Parquet core changes and general improvements to
> developer productivity across the board. Developer productivity is
> what this is all about.
> 
> - Wes
> 
>> 
>> 
>> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com> wrote:
>> 
>>> I do not claim to have insight into parquet-cpp development. However, from
>>> our experience developing Ray, I can say that the monorepo approach (for
>>> Ray) has improved things a lot. Before we tried various schemes to split
>>> the project into multiple repos, but the build system and test
>>> infrastructure duplications and overhead from synchronizing changes slowed
>>> development down significantly (and fixing bugs that touch the subrepos and
>>> the main repo is inconvenient).
>>> 
>>> Also the decision to put arrow and parquet-cpp into a common repo is
>>> independent of how tightly coupled the two projects are (and there could be
>>> a matrix entry in travis which tests that PRs keep them decoupled, or
>>> rather that they both just depend on a small common "base"). Google and
>>> Facebook demonstrate such independence by having many many projects in the
>>> same repo of course. It would be great if the open source community would
>>> move more into this direction too I think.
>>> 
>>> Best,
>>> Philipp.
>>> 
>>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:
>>> 
>>>> hi Donald,
>>>> 
>>>> This would make things worse, not better. Code changes routinely
>>>> involve changes to the build system, and so you could be talking about
>>>> having to make changes to 2 or 3 git repositories as the result of a
>>>> single new feature or bug fix. There isn't really a cross-repo CI
>>>> solution available
>>>> 
>>>> I've seen some approaches to the monorepo problem using multiple git
>>>> repositories, such as
>>>> 
>>>> https://github.com/twosigma/git-meta
>>>> 
>>>> Until something like this has first class support by the GitHub
>>>> platform and its CI services (Travis CI, Appveyor), I don't think it
>>>> will work for us.
>>>> 
>>>> - Wes
>>>> 
>>>> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
>>>> wrote:
>>>>> Could this work if each module gets configured as a sub-git repo? A
>>>>> top-level build tool would go into each sub-repo and pick the correct
>>>>> release version to test. The Python tests depend on the cpp sub-repo
>>>>> to ensure the API still passes.
>>>>> 
>>>>> This should be the best of both worlds, if sub-repos are a supported
>>>>> option.
>>>>> 
>>>>> --Donald E. Foss
>>>>> 
>>>>> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> I dislike the current build system complications as well.
>>>>>> 
>>>>>> However, in my opinion, combining the code bases will severely impact
>>>> the
>>>>>> progress of the parquet-cpp project and implicitly the progress of the
>>>>>> entire parquet project.
>>>>>> Combining would have made much more sense if parquet-cpp were a mature
>>>>>> project and codebase.  But parquet-cpp (and the entire parquet
>>> project)
>>>> is
>>>>>> evolving continuously with new features being added including bloom
>>>>>> filters,  column encryption, and indexes.
>>>>>> 
>>>>>> If the two code bases merged, it will be much more difficult to
>>>> contribute
>>>>>> to the parquet-cpp project since now Arrow bindings have to be
>>>> supported as
>>>>>> well. Please correct me if I am wrong here.
>>>>>> 
>>>>>> Out of the two evils, I think handling the build system and packaging
>>>>>> duplication is much more manageable since they are quite stable at
>>> this
>>>>>> point.
>>>>>> 
>>>>>> Regarding "* API changes cause awkward release coordination issues
>>>> between
>>>>>> Arrow and Parquet". Can we make minor releases for parquet-cpp (with
>>> API
>>>>>> changes needed) as and when Arrow is released?
>>>>>> 
>>>>>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>>>>>> converting between Arrow columnar memory format and Parquet". Can this
>>>> be
>>>>>> moved to the Arrow project and expose the more stable low-level APIs
>>> in
>>>>>> parquet-cpp?
>>>>>> 
>>>>>> I am also curious if the Arrow and Parquet Java implementations have
>>>>>> similar API compatibility issues.
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> hi folks,
>>>>>>> 
>>>>>>> We've been struggling for quite some time with the development
>>>>>>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>>>>>> 
>>>>>>> To explain the root issues:
>>>>>>> 
>>>>>>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>>>>>>> includes file interfaces, memory management, miscellaneous
>>> algorithms
>>>>>>> (e.g. dictionary encoding), etc. Note that before this "platform"
>>>>>>> dependency was introduced, there was significant duplicated code
>>>>>>> between these codebases and incompatible abstract interfaces for
>>>>>>> things like files
>>>>>>> 
>>>>>>> * we maintain Arrow conversion code in parquet-cpp for converting
>>>>>>> between Arrow columnar memory format and Parquet
>>>>>>> 
>>>>>>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>>>>>>> Apache Arrow. This introduces a circular dependency into our CI.
>>>>>>> 
>>>>>>> * Substantial portions of our CMake build system and related tooling
>>>>>>> are duplicated between the Arrow and Parquet repos
>>>>>>> 
>>>>>>> * API changes cause awkward release coordination issues between
>>> Arrow
>>>>>>> and Parquet
>>>>>>> 
>>>>>>> I believe the best way to remedy the situation is to adopt a
>>>>>>> "Community over Code" approach and find a way for the Parquet and
>>>>>>> Arrow C++ development communities to operate out of the same code
>>>>>>> repository, i.e. the apache/arrow git repository.
>>>>>>> 
>>>>>>> This would bring major benefits:
>>>>>>> 
>>>>>>> * Shared CMake build infrastructure, developer tools, and CI
>>>>>>> infrastructure (Parquet is already being built as a dependency in
>>>>>>> Arrow's CI systems)
>>>>>>> 
>>>>>>> * Share packaging and release management infrastructure
>>>>>>> 
>>>>>>> * Reduce / eliminate problems due to API changes (where we currently
>>>>>>> introduce breakage into our CI workflow when there is a breaking /
>>>>>>> incompatible change)
>>>>>>> 
>>>>>>> * Arrow releases would include a coordinated snapshot of the Parquet
>>>>>>> implementation as it stands
>>>>>>> 
>>>>>>> Continuing with the status quo has become unsatisfactory to me and
>>> as
>>>>>>> a result I've become less motivated to work on the parquet-cpp
>>>>>>> codebase.
>>>>>>> 
>>>>>>> The only Parquet C++ committer who is not an Arrow committer is
>>> Deepak
>>>>>>> Majeti. I think the issue of commit privileges could be resolved
>>>>>>> without too much difficulty or time.
>>>>>>> 
>>>>>>> I also think that, if deemed truly necessary, the Apache Parquet
>>>>>>> community could create release scripts to cut a minimal versioned
>>>>>>> Apache Parquet C++ release.
>>>>>>> 
>>>>>>> I know that some people are wary of monorepos and megaprojects, but
>>> as
>>>>>>> an example, TensorFlow is at least 10 times as large a project in
>>>>>>> terms of LOC and number of different platform components, and it
>>>>>>> seems to be getting along just fine. I think we should be able to
>>> work
>>>>>>> together as a community to function just as well.
>>>>>>> 
>>>>>>> Interested in the opinions of others, and any other ideas for
>>>>>>> practical solutions to the above problems.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Wes
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> regards,
>>>>>> Deepak Majeti
>>>>>> 
>>>> 
>>> 
>> 
>> 
>> --
>> regards,
>> Deepak Majeti


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
Wes,

I definitely appreciate and do see the impact of contributions made by
everyone. I made this statement not to rate any contributions but solely to
support my concern.
The contribution barrier is higher simply because of the increased code,
build, and test dependencies. If the community has lesser interest on a
certain component (parquet-cpp core in this case), it becomes very hard to
make big changes. The community will be less willing to accept large
changes that require multiple rounds of patches for stability and API
convergence. Our contributions to Libhdfs++ in the HDFS community took a
significantly long time for the very same reason.



On Mon, Jul 30, 2018 at 6:05 PM Wes McKinney <we...@gmail.com> wrote:

> hi Deepak
>
> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com>
> wrote:
> > @Wes
> > My observation is that most of the parquet-cpp contributors you listed
> that
> > overlap with the Arrow community mainly contribute to the Arrow
> > bindings(parquet::arrow layer)/platform API changes in the parquet-cpp
> > repo. Very few of them review/contribute patches to the parquet-cpp core.
> >
>
> So, what are you saying exactly, that some contributions or
> contributors to Apache Parquet matter more than others? I don't
> follow.
>
> As a result of these individual's efforts, the parquet-cpp libraries
> are being installed well over 100,000 times per month on a single
> install path (Python) alone.
>

> > I believe improvements to the parquet-cpp core will be negatively
> impacted
> > since merging the parquet-cpp and arrow-cpp repos will increase the
> barrier
> > of entry to new contributors interested in the parquet-cpp core. The
> > current extensions to the parquet-cpp core related to bloom-filters, and
> > column encryption are all being done by first-time contributors.
>
> I don't understand why this would "increase the barrier of entry".
> Could you explain?
>
It is true that there would be more code in the codebase, but the
> build and test procedure would be no more complex. If anything,
> community productivity will be improved by having a more cohesive /
> centralized development platform (large amounts of code that Parquet
> depends on are in Apache Arrow already).
>
> >
> > If you believe there will be new interest in the parquet-cpp core with
> the
> > mono-repo approach, I am all up for it.
>
> Yes, I believe that this change will result in more and higher quality
> code review to Parquet core changes and general improvements to
> developer productivity across the board. Developer productivity is
> what this is all about.
>
> - Wes
>
> >
> >
> > On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com>
> wrote:
> >
> >> I do not claim to have insight into parquet-cpp development. However,
> from
> >> our experience developing Ray, I can say that the monorepo approach (for
> >> Ray) has improved things a lot. Before we tried various schemes to split
> >> the project into multiple repos, but the build system and test
> >> infrastructure duplications and overhead from synchronizing changes
> slowed
> >> development down significantly (and fixing bugs that touch the subrepos
> and
> >> the main repo is inconvenient).
> >>
> >> Also the decision to put arrow and parquet-cpp into a common repo is
> >> independent of how tightly coupled the two projects are (and there
> could be
> >> a matrix entry in travis which tests that PRs keep them decoupled, or
> >> rather that they both just depend on a small common "base"). Google and
> >> Facebook demonstrate such independence by having many many projects in
> the
> >> same repo of course. It would be great if the open source community
> would
> >> move more into this direction too I think.
> >>
> >> Best,
> >> Philipp.
> >>
> >> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> > hi Donald,
> >> >
> >> > This would make things worse, not better. Code changes routinely
> >> > involve changes to the build system, and so you could be talking about
> >> > having to making changes to 2 or 3 git repositories as the result of a
> >> > single new feature or bug fix. There isn't really a cross-repo CI
> >> > solution available
> >> >
> >> > I've seen some approaches to the monorepo problem using multiple git
> >> > repositories, such as
> >> >
> >> > https://github.com/twosigma/git-meta
> >> >
> >> > Until something like this has first class support by the GitHub
> >> > platform and its CI services (Travis CI, Appveyor), I don't think it
> >> > will work for us.
> >> >
> >> > - Wes
> >> >
> >> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <
> donald.foss@gmail.com>
> >> > wrote:
> >> > > Could this work as each module gets configured as sub-git repots.
> Top
> >> > level
> >> > > build tool go into each sub-repo, pick the correct release version
> to
> >> > test.
> >> > > Tests in Python is dependent on cpp sub-repo to ensure the API still
> >> > pass.
> >> > >
> >> > > This should be the best of both worlds, if sub-repo are supposed
> >> option.
> >> > >
> >> > > --Donald E. Foss
> >> > >
> >> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <
> majeti.deepak@gmail.com>
> >> > > wrote:
> >> > >
> >> > >> I dislike the current build system complications as well.
> >> > >>
> >> > >> However, in my opinion, combining the code bases will severely
> impact
> >> > the
> >> > >> progress of the parquet-cpp project and implicitly the progress of
> the
> >> > >> entire parquet project.
> >> > >> Combining would have made much more sense if parquet-cpp were a
> >> > >> mature project and codebase. But parquet-cpp (and the entire parquet
> >> project)
> >> > is
> >> > >> evolving continuously with new features being added including bloom
> >> > >> filters,  column encryption, and indexes.
> >> > >>
> >> > >> If the two code bases merged, it will be much more difficult to
> >> > contribute
> >> > >> to the parquet-cpp project since now Arrow bindings have to be
> >> > supported as
> >> > >> well. Please correct me if I am wrong here.
> >> > >>
> >> > >> Out of the two evils, I think handling the build system, packaging
> >> > >> duplication is much more manageable since they are quite stable at
> >> this
> >> > >> point.
> >> > >>
> >> > >> Regarding "* API changes cause awkward release coordination issues
> >> > between
> >> > >> Arrow and Parquet". Can we make minor releases for parquet-cpp
> (with
> >> API
> >> > >> changes needed) as and when Arrow is released?
> >> > >>
> >> > >> Regarding "we maintain a Arrow conversion code in parquet-cpp for
> >> > >> converting between Arrow columnar memory format and Parquet". Can
> this
> >> > be
> >> > >> moved to the Arrow project and expose the more stable low-level
> APIs
> >> in
> >> > >> parquet-cpp?
> >> > >>
> >> > >> I am also curious if the Arrow and Parquet Java implementations
> have
> >> > >> similar API compatibility issues.
> >> > >>
> >> > >>
> >> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
> >> > wrote:
> >> > >>
> >> > >> > hi folks,
> >> > >> >
> >> > >> > We've been struggling for quite some time with the development
> >> > >> > workflow between the Arrow and Parquet C++ (and Python)
> codebases.
> >> > >> >
> >> > >> > To explain the root issues:
> >> > >> >
> >> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> >> > >> > includes file interfaces, memory management, miscellaneous
> >> algorithms
> >> > >> > (e.g. dictionary encoding), etc. Note that before this "platform"
> >> > >> > dependency was introduced, there was significant duplicated code
> >> > >> > between these codebases and incompatible abstract interfaces for
> >> > >> > things like files
> >> > >> >
> >> > >> > * we maintain a Arrow conversion code in parquet-cpp for
> converting
> >> > >> > between Arrow columnar memory format and Parquet
> >> > >> >
> >> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> >> > >> > Apache Arrow. This introduces a circular dependency into our CI.
> >> > >> >
> >> > >> > * Substantial portions of our CMake build system and related
> tooling
> >> > >> > are duplicated between the Arrow and Parquet repos
> >> > >> >
> >> > >> > * API changes cause awkward release coordination issues between
> >> Arrow
> >> > >> > and Parquet
> >> > >> >
> >> > >> > I believe the best way to remedy the situation is to adopt a
> >> > >> > "Community over Code" approach and find a way for the Parquet and
> >> > >> > Arrow C++ development communities to operate out of the same code
> >> > >> > repository, i.e. the apache/arrow git repository.
> >> > >> >
> >> > >> > This would bring major benefits:
> >> > >> >
> >> > >> > * Shared CMake build infrastructure, developer tools, and CI
> >> > >> > infrastructure (Parquet is already being built as a dependency in
> >> > >> > Arrow's CI systems)
> >> > >> >
> >> > >> > * Share packaging and release management infrastructure
> >> > >> >
> >> > >> > * Reduce / eliminate problems due to API changes (where we
> currently
> >> > >> > introduce breakage into our CI workflow when there is a breaking
> /
> >> > >> > incompatible change)
> >> > >> >
> >> > >> > * Arrow releases would include a coordinated snapshot of the
> Parquet
> >> > >> > implementation as it stands
> >> > >> >
> >> > >> > Continuing with the status quo has become unsatisfactory to me
> and
> >> as
> >> > >> > a result I've become less motivated to work on the parquet-cpp
> >> > >> > codebase.
> >> > >> >
> >> > >> > The only Parquet C++ committer who is not an Arrow committer is
> >> Deepak
> >> > >> > Majeti. I think the issue of commit privileges could be resolved
> >> > >> > without too much difficulty or time.
> >> > >> >
> >> > >> > I also think if it is truly necessary that the Apache Parquet
> >> > >> > community could create release scripts to cut a minimal versioned
> >> > >> > Apache Parquet C++ release if that is deemed truly necessary.
> >> > >> >
> >> > >> > I know that some people are wary of monorepos and megaprojects,
> but
> >> as
> >> > >> > an example TensorFlow is at least 10 times as large a project in
> >> > >> > terms of LOCs and number of different platform components, and it
> >> > >> > seems to be getting along just fine. I think we should be able to
> >> work
> >> > >> > together as a community to function just as well.
> >> > >> >
> >> > >> > Interested in the opinions of others, and any other ideas for
> >> > >> > practical solutions to the above problems.
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Wes
> >> > >> >
> >> > >>
> >> > >>
> >> > >> --
> >> > >> regards,
> >> > >> Deepak Majeti
> >> > >>
> >> >
> >>
> >
> >
> > --
> > regards,
> > Deepak Majeti
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Julian Hyde <jh...@apache.org>.
I'm not going to comment on the design of the parquet-cpp module and whether it is “closer” to parquet or arrow.

But I do think Wes’s proposal is consistent with Apache policy. PMCs make releases and govern communities; they don’t exist to manage code bases, except as a means to the end of creating releases of known provenance. The Parquet PMC can continue to make parquet-cpp releases, and to end-users those releases will look the same as they do today, even if the code for those releases were to move to a different git repo in the ASF.

Julian



> On Jul 30, 2018, at 3:05 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> hi Deepak
> 
> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com> wrote:
>> @Wes
>> My observation is that most of the parquet-cpp contributors you listed that
>> overlap with the Arrow community mainly contribute to the Arrow
>> bindings(parquet::arrow layer)/platform API changes in the parquet-cpp
>> repo. Very few of them review/contribute patches to the parquet-cpp core.
>> 
> 
> So, what are you saying exactly, that some contributions or
> contributors to Apache Parquet matter more than others? I don't
> follow.
> 
> As a result of these individuals' efforts, the parquet-cpp libraries
> are being installed well over 100,000 times per month on a single
> install path (Python) alone.
> 
>> I believe improvements to the parquet-cpp core will be negatively impacted
>> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
>> of entry to new contributors interested in the parquet-cpp core. The
>> current extensions to the parquet-cpp core related to bloom-filters, and
>> column encryption are all being done by first-time contributors.
> 
> I don't understand why this would "increase the barrier of entry".
> Could you explain?
> 
> It is true that there would be more code in the codebase, but the
> build and test procedure would be no more complex. If anything,
> community productivity will be improved by having a more cohesive /
> centralized development platform (large amounts of code that Parquet
> depends on are in Apache Arrow already).
> 
>> 
>> If you believe there will be new interest in the parquet-cpp core with the
>> mono-repo approach, I am all up for it.
> 
> Yes, I believe that this change will result in more and higher quality
> code review to Parquet core changes and general improvements to
> developer productivity across the board. Developer productivity is
> what this is all about.
> 
> - Wes
> 
>> 
>> 
>> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com> wrote:
>> 
>>> I do not claim to have insight into parquet-cpp development. However, from
>>> our experience developing Ray, I can say that the monorepo approach (for
>>> Ray) has improved things a lot. Before we tried various schemes to split
>>> the project into multiple repos, but the build system and test
>>> infrastructure duplications and overhead from synchronizing changes slowed
>>> development down significantly (and fixing bugs that touch the subrepos and
>>> the main repo is inconvenient).
>>> 
>>> Also the decision to put arrow and parquet-cpp into a common repo is
>>> independent of how tightly coupled the two projects are (and there could be
>>> a matrix entry in travis which tests that PRs keep them decoupled, or
>>> rather that they both just depend on a small common "base"). Google and
>>> Facebook demonstrate such independence by having many many projects in the
>>> same repo of course. It would be great if the open source community would
>>> move more into this direction too I think.
>>> 
>>> Best,
>>> Philipp.
>>> 
>>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:
>>> 
>>>> hi Donald,
>>>> 
>>>> This would make things worse, not better. Code changes routinely
>>>> involve changes to the build system, and so you could be talking about
>>>> having to make changes to 2 or 3 git repositories as the result of a
>>>> single new feature or bug fix. There isn't really a cross-repo CI
>>>> solution available
>>>> 
>>>> I've seen some approaches to the monorepo problem using multiple git
>>>> repositories, such as
>>>> 
>>>> https://github.com/twosigma/git-meta
>>>> 
>>>> Until something like this has first class support by the GitHub
>>>> platform and its CI services (Travis CI, Appveyor), I don't think it
>>>> will work for us.
>>>> 
>>>> - Wes
>>>> 
>>>> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
>>>> wrote:
>>>>> Could this work if each module gets configured as a sub-git repo? A
>>>>> top-level build tool goes into each sub-repo and picks the correct
>>>>> release version to test. Tests in Python depend on the cpp sub-repo
>>>>> to ensure the API still passes.
>>>>>
>>>>> This should be the best of both worlds, if sub-repos are a supported
>>>>> option.
>>>>> 
>>>>> --Donald E. Foss
>>>>> 
>>>>> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> I dislike the current build system complications as well.
>>>>>> 
>>>>>> However, in my opinion, combining the code bases will severely impact
>>>> the
>>>>>> progress of the parquet-cpp project and implicitly the progress of the
>>>>>> entire parquet project.
>>>>>> Combining would have made much more sense if parquet-cpp were a mature
>>>>>> project and codebase. But parquet-cpp (and the entire parquet
>>> project)
>>>> is
>>>>>> evolving continuously with new features being added including bloom
>>>>>> filters,  column encryption, and indexes.
>>>>>> 
>>>>>> If the two code bases merged, it will be much more difficult to
>>>> contribute
>>>>>> to the parquet-cpp project since now Arrow bindings have to be
>>>> supported as
>>>>>> well. Please correct me if I am wrong here.
>>>>>> 
>>>>>> Out of the two evils, I think handling the build system, packaging
>>>>>> duplication is much more manageable since they are quite stable at
>>> this
>>>>>> point.
>>>>>> 
>>>>>> Regarding "* API changes cause awkward release coordination issues
>>>> between
>>>>>> Arrow and Parquet". Can we make minor releases for parquet-cpp (with
>>> API
>>>>>> changes needed) as and when Arrow is released?
>>>>>> 
>>>>>> Regarding "we maintain a Arrow conversion code in parquet-cpp for
>>>>>> converting between Arrow columnar memory format and Parquet". Can this
>>>> be
>>>>>> moved to the Arrow project and expose the more stable low-level APIs
>>> in
>>>>>> parquet-cpp?
>>>>>> 
>>>>>> I am also curious if the Arrow and Parquet Java implementations have
>>>>>> similar API compatibility issues.
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> hi folks,
>>>>>>> 
>>>>>>> We've been struggling for quite some time with the development
>>>>>>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>>>>>> 
>>>>>>> To explain the root issues:
>>>>>>> 
>>>>>>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>>>>>>> includes file interfaces, memory management, miscellaneous
>>> algorithms
>>>>>>> (e.g. dictionary encoding), etc. Note that before this "platform"
>>>>>>> dependency was introduced, there was significant duplicated code
>>>>>>> between these codebases and incompatible abstract interfaces for
>>>>>>> things like files
>>>>>>> 
>>>>>>> * we maintain a Arrow conversion code in parquet-cpp for converting
>>>>>>> between Arrow columnar memory format and Parquet
>>>>>>> 
>>>>>>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>>>>>>> Apache Arrow. This introduces a circular dependency into our CI.
>>>>>>> 
>>>>>>> * Substantial portions of our CMake build system and related tooling
>>>>>>> are duplicated between the Arrow and Parquet repos
>>>>>>> 
>>>>>>> * API changes cause awkward release coordination issues between
>>> Arrow
>>>>>>> and Parquet
>>>>>>> 
>>>>>>> I believe the best way to remedy the situation is to adopt a
>>>>>>> "Community over Code" approach and find a way for the Parquet and
>>>>>>> Arrow C++ development communities to operate out of the same code
>>>>>>> repository, i.e. the apache/arrow git repository.
>>>>>>> 
>>>>>>> This would bring major benefits:
>>>>>>> 
>>>>>>> * Shared CMake build infrastructure, developer tools, and CI
>>>>>>> infrastructure (Parquet is already being built as a dependency in
>>>>>>> Arrow's CI systems)
>>>>>>> 
>>>>>>> * Share packaging and release management infrastructure
>>>>>>> 
>>>>>>> * Reduce / eliminate problems due to API changes (where we currently
>>>>>>> introduce breakage into our CI workflow when there is a breaking /
>>>>>>> incompatible change)
>>>>>>> 
>>>>>>> * Arrow releases would include a coordinated snapshot of the Parquet
>>>>>>> implementation as it stands
>>>>>>> 
>>>>>>> Continuing with the status quo has become unsatisfactory to me and
>>> as
>>>>>>> a result I've become less motivated to work on the parquet-cpp
>>>>>>> codebase.
>>>>>>> 
>>>>>>> The only Parquet C++ committer who is not an Arrow committer is
>>> Deepak
>>>>>>> Majeti. I think the issue of commit privileges could be resolved
>>>>>>> without too much difficulty or time.
>>>>>>> 
>>>>>>> I also think if it is truly necessary that the Apache Parquet
>>>>>>> community could create release scripts to cut a minimal versioned
>>>>>>> Apache Parquet C++ release if that is deemed truly necessary.
>>>>>>> 
>>>>>>> I know that some people are wary of monorepos and megaprojects, but
>>> as
>>>>>>> an example TensorFlow is at least 10 times as large a project in
>>>>>>> terms of LOCs and number of different platform components, and it
>>>>>>> seems to be getting along just fine. I think we should be able to
>>> work
>>>>>>> together as a community to function just as well.
>>>>>>> 
>>>>>>> Interested in the opinions of others, and any other ideas for
>>>>>>> practical solutions to the above problems.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Wes
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> regards,
>>>>>> Deepak Majeti
>>>>>> 
>>>> 
>>> 
>> 
>> 
>> --
>> regards,
>> Deepak Majeti
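
[Editor's note: Philipp's idea of a CI matrix entry that verifies the two
components stay decoupled, quoted in the message above, might be sketched
roughly as follows. This is a hypothetical .travis.yml fragment; the
environment variables and build script name are illustrative, not the
projects' actual CI configuration.]

```yaml
# Hypothetical Travis matrix sketch: one entry builds only the shared
# "base"/platform layer, so a PR that couples the components beyond that
# base fails the decoupling job. All names here are illustrative.
matrix:
  include:
    - env: BUILD_TARGETS="arrow"            # Arrow alone
    - env: BUILD_TARGETS="arrow parquet"    # combined monorepo build
    - env: BUILD_TARGETS="platform-only"    # decoupling check
script:
  - ./ci/build.sh $BUILD_TARGETS
```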


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Deepak

On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com> wrote:
> @Wes
> My observation is that most of the parquet-cpp contributors you listed that
> overlap with the Arrow community mainly contribute to the Arrow
> bindings(parquet::arrow layer)/platform API changes in the parquet-cpp
> repo. Very few of them review/contribute patches to the parquet-cpp core.
>

So, what are you saying exactly, that some contributions or
contributors to Apache Parquet matter more than others? I don't
follow.

As a result of these individuals' efforts, the parquet-cpp libraries
are being installed well over 100,000 times per month on a single
install path (Python) alone.

> I believe improvements to the parquet-cpp core will be negatively impacted
> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
> of entry to new contributors interested in the parquet-cpp core. The
> current extensions to the parquet-cpp core related to bloom-filters, and
> column encryption are all being done by first-time contributors.

I don't understand why this would "increase the barrier of entry".
Could you explain?

It is true that there would be more code in the codebase, but the
build and test procedure would be no more complex. If anything,
community productivity will be improved by having a more cohesive /
centralized development platform (large amounts of code that Parquet
depends on are in Apache Arrow already).

>
> If you believe there will be new interest in the parquet-cpp core with the
> mono-repo approach, I am all up for it.

Yes, I believe that this change will result in more and higher quality
code review to Parquet core changes and general improvements to
developer productivity across the board. Developer productivity is
what this is all about.

- Wes

>
>
> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com> wrote:
>
>> I do not claim to have insight into parquet-cpp development. However, from
>> our experience developing Ray, I can say that the monorepo approach (for
>> Ray) has improved things a lot. Before we tried various schemes to split
>> the project into multiple repos, but the build system and test
>> infrastructure duplications and overhead from synchronizing changes slowed
>> development down significantly (and fixing bugs that touch the subrepos and
>> the main repo is inconvenient).
>>
>> Also the decision to put arrow and parquet-cpp into a common repo is
>> independent of how tightly coupled the two projects are (and there could be
>> a matrix entry in travis which tests that PRs keep them decoupled, or
>> rather that they both just depend on a small common "base"). Google and
>> Facebook demonstrate such independence by having many many projects in the
>> same repo of course. It would be great if the open source community would
>> move more into this direction too I think.
>>
>> Best,
>> Philipp.
>>
>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>> > hi Donald,
>> >
>> > This would make things worse, not better. Code changes routinely
>> > involve changes to the build system, and so you could be talking about
>> > having to make changes to 2 or 3 git repositories as the result of a
>> > single new feature or bug fix. There isn't really a cross-repo CI
>> > solution available
>> >
>> > I've seen some approaches to the monorepo problem using multiple git
>> > repositories, such as
>> >
>> > https://github.com/twosigma/git-meta
>> >
>> > Until something like this has first class support by the GitHub
>> > platform and its CI services (Travis CI, Appveyor), I don't think it
>> > will work for us.
>> >
>> > - Wes
>> >
>> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
>> > wrote:
>> > > Could this work if each module gets configured as a sub-git repo? A
>> > > top-level build tool goes into each sub-repo and picks the correct
>> > > release version to test. Tests in Python depend on the cpp sub-repo
>> > > to ensure the API still passes.
>> > >
>> > > This should be the best of both worlds, if sub-repos are a supported
>> > > option.
>> > >
>> > > --Donald E. Foss
>> > >
>> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
>> > > wrote:
>> > >
>> > >> I dislike the current build system complications as well.
>> > >>
>> > >> However, in my opinion, combining the code bases will severely impact
>> > the
>> > >> progress of the parquet-cpp project and implicitly the progress of the
>> > >> entire parquet project.
>> > >> Combining would have made much more sense if parquet-cpp were a mature
>> > >> project and codebase. But parquet-cpp (and the entire parquet
>> project)
>> > is
>> > >> evolving continuously with new features being added including bloom
>> > >> filters,  column encryption, and indexes.
>> > >>
>> > >> If the two code bases merged, it will be much more difficult to
>> > contribute
>> > >> to the parquet-cpp project since now Arrow bindings have to be
>> > supported as
>> > >> well. Please correct me if I am wrong here.
>> > >>
>> > >> Out of the two evils, I think handling the build system, packaging
>> > >> duplication is much more manageable since they are quite stable at
>> this
>> > >> point.
>> > >>
>> > >> Regarding "* API changes cause awkward release coordination issues
>> > between
>> > >> Arrow and Parquet". Can we make minor releases for parquet-cpp (with
>> API
>> > >> changes needed) as and when Arrow is released?
>> > >>
>> > >> Regarding "we maintain a Arrow conversion code in parquet-cpp for
>> > >> converting between Arrow columnar memory format and Parquet". Can this
>> > be
>> > >> moved to the Arrow project and expose the more stable low-level APIs
>> in
>> > >> parquet-cpp?
>> > >>
>> > >> I am also curious if the Arrow and Parquet Java implementations have
>> > >> similar API compatibility issues.
>> > >>
>> > >>
>> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
>> > wrote:
>> > >>
>> > >> > hi folks,
>> > >> >
>> > >> > We've been struggling for quite some time with the development
>> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> > >> >
>> > >> > To explain the root issues:
>> > >> >
>> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > >> > includes file interfaces, memory management, miscellaneous
>> algorithms
>> > >> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> > >> > dependency was introduced, there was significant duplicated code
>> > >> > between these codebases and incompatible abstract interfaces for
>> > >> > things like files
>> > >> >
>> > >> > * we maintain a Arrow conversion code in parquet-cpp for converting
>> > >> > between Arrow columnar memory format and Parquet
>> > >> >
>> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > >> > Apache Arrow. This introduces a circular dependency into our CI.
>> > >> >
>> > >> > * Substantial portions of our CMake build system and related tooling
>> > >> > are duplicated between the Arrow and Parquet repos
>> > >> >
>> > >> > * API changes cause awkward release coordination issues between
>> Arrow
>> > >> > and Parquet
>> > >> >
>> > >> > I believe the best way to remedy the situation is to adopt a
>> > >> > "Community over Code" approach and find a way for the Parquet and
>> > >> > Arrow C++ development communities to operate out of the same code
>> > >> > repository, i.e. the apache/arrow git repository.
>> > >> >
>> > >> > This would bring major benefits:
>> > >> >
>> > >> > * Shared CMake build infrastructure, developer tools, and CI
>> > >> > infrastructure (Parquet is already being built as a dependency in
>> > >> > Arrow's CI systems)
>> > >> >
>> > >> > * Share packaging and release management infrastructure
>> > >> >
>> > >> > * Reduce / eliminate problems due to API changes (where we currently
>> > >> > introduce breakage into our CI workflow when there is a breaking /
>> > >> > incompatible change)
>> > >> >
>> > >> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > >> > implementation as it stands
>> > >> >
>> > >> > Continuing with the status quo has become unsatisfactory to me and
>> as
>> > >> > a result I've become less motivated to work on the parquet-cpp
>> > >> > codebase.
>> > >> >
>> > >> > The only Parquet C++ committer who is not an Arrow committer is
>> Deepak
>> > >> > Majeti. I think the issue of commit privileges could be resolved
>> > >> > without too much difficulty or time.
>> > >> >
>> > >> > I also think if it is truly necessary that the Apache Parquet
>> > >> > community could create release scripts to cut a minimal versioned
>> > >> > Apache Parquet C++ release if that is deemed truly necessary.
>> > >> >
>> > >> > I know that some people are wary of monorepos and megaprojects, but
>> as
>> > >> > an example TensorFlow is at least 10 times as large a project in
>> > >> > terms of LOCs and number of different platform components, and it
>> > >> > seems to be getting along just fine. I think we should be able to
>> work
>> > >> > together as a community to function just as well.
>> > >> >
>> > >> > Interested in the opinions of others, and any other ideas for
>> > >> > practical solutions to the above problems.
>> > >> >
>> > >> > Thanks,
>> > >> > Wes
>> > >> >
>> > >>
>> > >>
>> > >> --
>> > >> regards,
>> > >> Deepak Majeti
>> > >>
>> >
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Deepak

On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <ma...@gmail.com> wrote:
> @Wes
> My observation is that most of the parquet-cpp contributors you listed that
> overlap with the Arrow community mainly contribute to the Arrow
> bindings(parquet::arrow layer)/platform API changes in the parquet-cpp
> repo. Very few of them review/contribute patches to the parquet-cpp core.
>

So, what are you saying exactly, that some contributions or
contributors to Apache Parquet matter more than others? I don't
follow.

As a result of these individual's efforts, the parquet-cpp libraries
are being installed well over 100,000 times per month on a single
install path (Python) alone.

> I believe improvements to the parquet-cpp core will be negatively impacted
> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
> of entry to new contributors interested in the parquet-cpp core. The
> current extensions to the parquet-cpp core related to bloom-filters, and
> column encryption are all being done by first-time contributors.

I don't understand why this would "increase the barrier of entry".
Could you explain?

It is true that there would be more code in the codebase, but the
build and test procedure would be no more complex. If anything,
community productivity will be improved by having a more cohesive /
centralized development platform (large amounts of code that Parquet
depends on are in Apache Arrow already).

>
> If you believe there will be new interest in the parquet-cpp core with the
> mono-repo approach, I am all up for it.

Yes, I believe that this change will result in more and higher quality
code review to Parquet core changes and general improvements to
developer productivity across the board. Developer productivity is
what this is all about.

- Wes

>
>
> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com> wrote:
>
>> I do not claim to have insight into parquet-cpp development. However, from
>> our experience developing Ray, I can say that the monorepo approach (for
>> Ray) has improved things a lot. Before we tried various schemes to split
>> the project into multiple repos, but the build system and test
>> infrastructure duplications and overhead from synchronizing changes slowed
>> development down significantly (and fixing bugs that touch the subrepos and
>> the main repo is inconvenient).
>>
>> Also the decision to put arrow and parquet-cpp into a common repo is
>> independent of how tightly coupled the two projects are (and there could be
>> a matrix entry in travis which tests that PRs keep them decoupled, or
>> rather that they both just depend on a small common "base"). Google and
>> Facebook demonstrate such independence by having many many projects in the
>> same repo of course. It would be great if the open source community would
>> move more into this direction too I think.
>>
>> Best,
>> Philipp.
>>
>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>> > hi Donald,
>> >
>> > This would make things worse, not better. Code changes routinely
>> > involve changes to the build system, and so you could be talking about
>> > having to making changes to 2 or 3 git repositories as the result of a
>> > single new feature or bug fix. There isn't really a cross-repo CI
>> > solution available
>> >
>> > I've seen some approaches to the monorepo problem using multiple git
>> > repositories, such as
>> >
>> > https://github.com/twosigma/git-meta
>> >
>> > Until something like this has first class support by the GitHub
>> > platform and its CI services (Travis CI, Appveyor), I don't think it
>> > will work for us.
>> >
>> > - Wes
>> >
>> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
>> > wrote:
>> > > Could this work as each module gets configured as sub-git repots. Top
>> > level
>> > > build tool go into each sub-repo, pick the correct release version to
>> > test.
>> > > Tests in Python is dependent on cpp sub-repo to ensure the API still
>> > pass.
>> > >
>> > > This should be the best of both worlds, if sub-repo are supposed
>> option.
>> > >
>> > > --Donald E. Foss
>> > >
>> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
>> > > wrote:
>> > >
>> > >> I dislike the current build system complications as well.
>> > >>
>> > >> However, in my opinion, combining the code bases will severely impact
>> > the
>> > >> progress of the parquet-cpp project and implicitly the progress of the
>> > >> entire parquet project.
>> > >> Combining would have made much more sense if parquet-cpp is a mature
>> > >> project and codebase.  But parquet-cpp (and the entire parquet
>> project)
>> > is
>> > >> evolving continuously with new features being added including bloom
>> > >> filters,  column encryption, and indexes.
>> > >>
>> > >> If the two code bases merged, it will be much more difficult to
>> > contribute
>> > >> to the parquet-cpp project since now Arrow bindings have to be
>> > supported as
>> > >> well. Please correct me if I am wrong here.
>> > >>
>> > >> Out of the two evils, I think handling the build system, packaging
>> > >> duplication is much more manageable since they are quite stable at
>> this
>> > >> point.
>> > >>
>> > >> Regarding "* API changes cause awkward release coordination issues
>> > between
>> > >> Arrow and Parquet". Can we make minor releases for parquet-cpp (with
>> API
>> > >> changes needed) as and when Arrow is released?
>> > >>
>> > >> Regarding "we maintain a Arrow conversion code in parquet-cpp for
>> > >> converting between Arrow columnar memory format and Parquet". Can this
>> > be
>> > >> moved to the Arrow project and expose the more stable low-level APIs
>> in
>> > >> parquet-cpp?
>> > >>
>> > >> I am also curious if the Arrow and Parquet Java implementations have
>> > >> similar API compatibility issues.
>> > >>
>> > >>
>> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
>> > wrote:
>> > >>
>> > >> > hi folks,
>> > >> >
>> > >> > We've been struggling for quite some time with the development
>> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> > >> >
>> > >> > To explain the root issues:
>> > >> >
>> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > >> > includes file interfaces, memory management, miscellaneous
>> algorithms
>> > >> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> > >> > dependency was introduced, there was significant duplicated code
>> > >> > between these codebases and incompatible abstract interfaces for
>> > >> > things like files
>> > >> >
>> > >> > * we maintain a Arrow conversion code in parquet-cpp for converting
>> > >> > between Arrow columnar memory format and Parquet
>> > >> >
>> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > >> > Apache Arrow. This introduces a circular dependency into our CI.
>> > >> >
>> > >> > * Substantial portions of our CMake build system and related tooling
>> > >> > are duplicated between the Arrow and Parquet repos
>> > >> >
>> > >> > * API changes cause awkward release coordination issues between
>> Arrow
>> > >> > and Parquet
>> > >> >
>> > >> > I believe the best way to remedy the situation is to adopt a
>> > >> > "Community over Code" approach and find a way for the Parquet and
>> > >> > Arrow C++ development communities to operate out of the same code
>> > >> > repository, i.e. the apache/arrow git repository.
>> > >> >
>> > >> > This would bring major benefits:
>> > >> >
>> > >> > * Shared CMake build infrastructure, developer tools, and CI
>> > >> > infrastructure (Parquet is already being built as a dependency in
>> > >> > Arrow's CI systems)
>> > >> >
>> > >> > * Shared packaging and release management infrastructure
>> > >> >
>> > >> > * Reduce / eliminate problems due to API changes (where we currently
>> > >> > introduce breakage into our CI workflow when there is a breaking /
>> > >> > incompatible change)
>> > >> >
>> > >> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > >> > implementation as it stands
>> > >> >
>> > >> > Continuing with the status quo has become unsatisfactory to me and
>> as
>> > >> > a result I've become less motivated to work on the parquet-cpp
>> > >> > codebase.
>> > >> >
>> > >> > The only Parquet C++ committer who is not an Arrow committer is
>> Deepak
>> > >> > Majeti. I think the issue of commit privileges could be resolved
>> > >> > without too much difficulty or time.
>> > >> >
>> > >> > I also think that, if it is deemed truly necessary, the Apache
>> > >> > Parquet community could create release scripts to cut a minimal
>> > >> > versioned Apache Parquet C++ release.
>> > >> >
>> > >> > I know that some people are wary of monorepos and megaprojects,
>> > >> > but as an example, TensorFlow is at least 10 times as large a
>> > >> > project in terms of LOC and number of different platform
>> > >> > components, and it seems to be getting along just fine. I think
>> > >> > we should be able to work together as a community to function
>> > >> > just as well.
>> > >> >
>> > >> > Interested in the opinions of others, and any other ideas for
>> > >> > practical solutions to the above problems.
>> > >> >
>> > >> > Thanks,
>> > >> > Wes
>> > >> >
>> > >>
>> > >>
>> > >> --
>> > >> regards,
>> > >> Deepak Majeti
>> > >>
>> >
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
@Wes
My observation is that most of the parquet-cpp contributors you listed who
overlap with the Arrow community mainly contribute to the Arrow bindings
(the parquet::arrow layer) and platform API changes in the parquet-cpp
repo. Very few of them review or contribute patches to the parquet-cpp core.

I believe improvements to the parquet-cpp core will be negatively impacted,
since merging the parquet-cpp and arrow-cpp repos will raise the barrier to
entry for new contributors interested in the parquet-cpp core. The current
extensions to the parquet-cpp core, related to bloom filters and column
encryption, are all being done by first-time contributors.

If you believe there will be new interest in the parquet-cpp core with the
mono-repo approach, I am all for it.


On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pc...@gmail.com> wrote:

> I do not claim to have insight into parquet-cpp development. However, from
> our experience developing Ray, I can say that the monorepo approach (for
> Ray) has improved things a lot. Before that, we tried various schemes to split
> the project into multiple repos, but the build system and test
> infrastructure duplications and overhead from synchronizing changes slowed
> development down significantly (and fixing bugs that touch the subrepos and
> the main repo is inconvenient).
>
> Also the decision to put arrow and parquet-cpp into a common repo is
> independent of how tightly coupled the two projects are (and there could be
> a matrix entry in travis which tests that PRs keep them decoupled, or
> rather that they both just depend on a small common "base"). Google and
> Facebook demonstrate such independence by having many many projects in the
> same repo of course. It would be great if the open source community would
> move more into this direction too I think.
>
> Best,
> Philipp.
>
> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:
>
> > hi Donald,
> >
> > This would make things worse, not better. Code changes routinely
> > involve changes to the build system, and so you could be talking about
> > having to make changes to 2 or 3 git repositories as the result of a
> > single new feature or bug fix. There isn't really a cross-repo CI
> > solution available
> >
> > I've seen some approaches to the monorepo problem using multiple git
> > repositories, such as
> >
> > https://github.com/twosigma/git-meta
> >
> > Until something like this has first-class support from the GitHub
> > platform and its CI services (Travis CI, Appveyor), I don't think it
> > will work for us.
> >
> > - Wes
> >
> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
> > wrote:
> > > Could this work with each module configured as a sub-git repo? The
> > > top-level build tool goes into each sub-repo and picks the correct
> > > release version to test. Tests in Python depend on the cpp sub-repo
> > > to ensure the API still passes.
> > >
> > > This should be the best of both worlds, if sub-repos are a supported
> > > option.
> > >
> > > --Donald E. Foss
> > >
> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
> > > wrote:
> > >
> > >> I dislike the current build system complications as well.
> > >>
> > >> However, in my opinion, combining the code bases will severely impact
> > the
> > >> progress of the parquet-cpp project and implicitly the progress of the
> > >> entire parquet project.
> > >> Combining would have made much more sense if parquet-cpp were a mature
> > >> project and codebase.  But parquet-cpp (and the entire parquet
> project)
> > is
> > >> evolving continuously with new features being added including bloom
> > >> filters,  column encryption, and indexes.
> > >>
> > >> If the two code bases merged, it would be much more difficult to
> > contribute
> > >> to the parquet-cpp project since now Arrow bindings have to be
> > supported as
> > >> well. Please correct me if I am wrong here.
> > >>
> > >> Out of the two evils, I think handling the build system, packaging
> > >> duplication is much more manageable since they are quite stable at
> this
> > >> point.
> > >>
> > >> Regarding "* API changes cause awkward release coordination issues
> > between
> > >> Arrow and Parquet". Can we make minor releases for parquet-cpp (with
> API
> > >> changes needed) as and when Arrow is released?
> > >>
> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
> > >> converting between Arrow columnar memory format and Parquet". Can this
> > be
> > >> moved to the Arrow project and expose the more stable low-level APIs
> in
> > >> parquet-cpp?
> > >>
> > >> I am also curious if the Arrow and Parquet Java implementations have
> > >> similar API compatibility issues.
> > >>
> > >>
> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >>
> > >> > hi folks,
> > >> >
> > >> > We've been struggling for quite some time with the development
> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> > >> >
> > >> > To explain the root issues:
> > >> >
> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> > >> > includes file interfaces, memory management, miscellaneous
> algorithms
> > >> > (e.g. dictionary encoding), etc. Note that before this "platform"
> > >> > dependency was introduced, there was significant duplicated code
> > >> > between these codebases and incompatible abstract interfaces for
> > >> > things like files
> > >> >
> > >> > * we maintain Arrow conversion code in parquet-cpp for converting
> > >> > between Arrow columnar memory format and Parquet
> > >> >
> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> > >> > Apache Arrow. This introduces a circular dependency into our CI.
> > >> >
> > >> > * Substantial portions of our CMake build system and related tooling
> > >> > are duplicated between the Arrow and Parquet repos
> > >> >
> > >> > * API changes cause awkward release coordination issues between
> Arrow
> > >> > and Parquet
> > >> >
> > >> > I believe the best way to remedy the situation is to adopt a
> > >> > "Community over Code" approach and find a way for the Parquet and
> > >> > Arrow C++ development communities to operate out of the same code
> > >> > repository, i.e. the apache/arrow git repository.
> > >> >
> > >> > This would bring major benefits:
> > >> >
> > >> > * Shared CMake build infrastructure, developer tools, and CI
> > >> > infrastructure (Parquet is already being built as a dependency in
> > >> > Arrow's CI systems)
> > >> >
> > >> > * Shared packaging and release management infrastructure
> > >> >
> > >> > * Reduce / eliminate problems due to API changes (where we currently
> > >> > introduce breakage into our CI workflow when there is a breaking /
> > >> > incompatible change)
> > >> >
> > >> > * Arrow releases would include a coordinated snapshot of the Parquet
> > >> > implementation as it stands
> > >> >
> > >> > Continuing with the status quo has become unsatisfactory to me and
> as
> > >> > a result I've become less motivated to work on the parquet-cpp
> > >> > codebase.
> > >> >
> > >> > The only Parquet C++ committer who is not an Arrow committer is
> Deepak
> > >> > Majeti. I think the issue of commit privileges could be resolved
> > >> > without too much difficulty or time.
> > >> >
> > >> > I also think that, if it is deemed truly necessary, the Apache
> > >> > Parquet community could create release scripts to cut a minimal
> > >> > versioned Apache Parquet C++ release.
> > >> >
> > >> > I know that some people are wary of monorepos and megaprojects,
> > >> > but as an example, TensorFlow is at least 10 times as large a
> > >> > project in terms of LOC and number of different platform
> > >> > components, and it seems to be getting along just fine. I think we
> > >> > should be able to work together as a community to function just as
> > >> > well.
> > >> >
> > >> > Interested in the opinions of others, and any other ideas for
> > >> > practical solutions to the above problems.
> > >> >
> > >> > Thanks,
> > >> > Wes
> > >> >
> > >>
> > >>
> > >> --
> > >> regards,
> > >> Deepak Majeti
> > >>
> >
>


-- 
regards,
Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Philipp Moritz <pc...@gmail.com>.
I do not claim to have insight into parquet-cpp development. However, from
our experience developing Ray, I can say that the monorepo approach (for
Ray) has improved things a lot. Before that, we tried various schemes to
split the project into multiple repos, but the build system and test
infrastructure duplication and the overhead of synchronizing changes slowed
development down significantly (and fixing bugs that touch both the subrepos
and the main repo is inconvenient).

Also, the decision to put arrow and parquet-cpp into a common repo is
independent of how tightly coupled the two projects are (and there could be
a matrix entry in travis which tests that PRs keep them decoupled, or
rather that they both just depend on a small common "base"). Google and
Facebook demonstrate such independence by having many, many projects in the
same repo, of course. It would be great if the open source community moved
more in this direction too, I think.

Best,
Philipp.

On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Donald,
>
> This would make things worse, not better. Code changes routinely
> involve changes to the build system, and so you could be talking about
> having to make changes to 2 or 3 git repositories as the result of a
> single new feature or bug fix. There isn't really a cross-repo CI
> solution available
>
> I've seen some approaches to the monorepo problem using multiple git
> repositories, such as
>
> https://github.com/twosigma/git-meta
>
> Until something like this has first-class support from the GitHub
> platform and its CI services (Travis CI, Appveyor), I don't think it
> will work for us.
>
> - Wes
>
> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
> wrote:
> > Could this work with each module configured as a sub-git repo? The
> > top-level build tool goes into each sub-repo and picks the correct
> > release version to test. Tests in Python depend on the cpp sub-repo to
> > ensure the API still passes.
> >
> > This should be the best of both worlds, if sub-repos are a supported
> > option.
> >
> > --Donald E. Foss
> >
> > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
> > wrote:
> >
> >> I dislike the current build system complications as well.
> >>
> >> However, in my opinion, combining the code bases will severely impact
> the
> >> progress of the parquet-cpp project and implicitly the progress of the
> >> entire parquet project.
> >> Combining would have made much more sense if parquet-cpp were a mature
> >> project and codebase.  But parquet-cpp (and the entire parquet project)
> is
> >> evolving continuously with new features being added including bloom
> >> filters,  column encryption, and indexes.
> >>
> >> If the two code bases merged, it would be much more difficult to
> contribute
> >> to the parquet-cpp project since now Arrow bindings have to be
> supported as
> >> well. Please correct me if I am wrong here.
> >>
> >> Out of the two evils, I think handling the build system, packaging
> >> duplication is much more manageable since they are quite stable at this
> >> point.
> >>
> >> Regarding "* API changes cause awkward release coordination issues
> between
> >> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> >> changes needed) as and when Arrow is released?
> >>
> >> Regarding "we maintain Arrow conversion code in parquet-cpp for
> >> converting between Arrow columnar memory format and Parquet". Can this
> be
> >> moved to the Arrow project and expose the more stable low-level APIs in
> >> parquet-cpp?
> >>
> >> I am also curious if the Arrow and Parquet Java implementations have
> >> similar API compatibility issues.
> >>
> >>
> >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> > hi folks,
> >> >
> >> > We've been struggling for quite some time with the development
> >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> >> >
> >> > To explain the root issues:
> >> >
> >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> >> > includes file interfaces, memory management, miscellaneous algorithms
> >> > (e.g. dictionary encoding), etc. Note that before this "platform"
> >> > dependency was introduced, there was significant duplicated code
> >> > between these codebases and incompatible abstract interfaces for
> >> > things like files
> >> >
> >> > * we maintain Arrow conversion code in parquet-cpp for converting
> >> > between Arrow columnar memory format and Parquet
> >> >
> >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> >> > Apache Arrow. This introduces a circular dependency into our CI.
> >> >
> >> > * Substantial portions of our CMake build system and related tooling
> >> > are duplicated between the Arrow and Parquet repos
> >> >
> >> > * API changes cause awkward release coordination issues between Arrow
> >> > and Parquet
> >> >
> >> > I believe the best way to remedy the situation is to adopt a
> >> > "Community over Code" approach and find a way for the Parquet and
> >> > Arrow C++ development communities to operate out of the same code
> >> > repository, i.e. the apache/arrow git repository.
> >> >
> >> > This would bring major benefits:
> >> >
> >> > * Shared CMake build infrastructure, developer tools, and CI
> >> > infrastructure (Parquet is already being built as a dependency in
> >> > Arrow's CI systems)
> >> >
> >> > * Shared packaging and release management infrastructure
> >> >
> >> > * Reduce / eliminate problems due to API changes (where we currently
> >> > introduce breakage into our CI workflow when there is a breaking /
> >> > incompatible change)
> >> >
> >> > * Arrow releases would include a coordinated snapshot of the Parquet
> >> > implementation as it stands
> >> >
> >> > Continuing with the status quo has become unsatisfactory to me and as
> >> > a result I've become less motivated to work on the parquet-cpp
> >> > codebase.
> >> >
> >> > The only Parquet C++ committer who is not an Arrow committer is Deepak
> >> > Majeti. I think the issue of commit privileges could be resolved
> >> > without too much difficulty or time.
> >> >
> >> > I also think that, if it is deemed truly necessary, the Apache Parquet
> >> > community could create release scripts to cut a minimal versioned
> >> > Apache Parquet C++ release.
> >> >
> >> > I know that some people are wary of monorepos and megaprojects, but as
> >> > an example, TensorFlow is at least 10 times as large a project in
> >> > terms of LOC and number of different platform components, and it
> >> > seems to be getting along just fine. I think we should be able to work
> >> > together as a community to function just as well.
> >> >
> >> > Interested in the opinions of others, and any other ideas for
> >> > practical solutions to the above problems.
> >> >
> >> > Thanks,
> >> > Wes
> >> >
> >>
> >>
> >> --
> >> regards,
> >> Deepak Majeti
> >>
>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Philipp Moritz <pc...@gmail.com>.
I do not claim to have insight into parquet-cpp development. However, from
our experience developing Ray, I can say that the monorepo approach (for
Ray) has improved things a lot. Before we tried various schemes to split
the project into multiple repos, but the build system and test
infrastructure duplications and overhead from synchronizing changes slowed
development down significantly (and fixing bugs that touch the subrepos and
the main repo is inconvenient).

Also the decision to put arrow and parquet-cpp into a common repo is
independent of how tightly coupled the two projects are (and there could be
a matrix entry in travis which tests that PRs keep them decoupled, or
rather that they both just depend on a small common "base"). Google and
Facebook demonstrate such independence by having many many projects in the
same repo of course. It would be great if the open source community would
move more into this direction too I think.

Best,
Philipp.

On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Donald,
>
> This would make things worse, not better. Code changes routinely
> involve changes to the build system, and so you could be talking about
> having to making changes to 2 or 3 git repositories as the result of a
> single new feature or bug fix. There isn't really a cross-repo CI
> solution available
>
> I've seen some approaches to the monorepo problem using multiple git
> repositories, such as
>
> https://github.com/twosigma/git-meta
>
> Until something like this has first class support by the GitHub
> platform and its CI services (Travis CI, Appveyor), I don't think it
> will work for us.
>
> - Wes
>
> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com>
> wrote:
> > Could this work as each module gets configured as sub-git repots. Top
> level
> > build tool go into each sub-repo, pick the correct release version to
> test.
> > Tests in Python is dependent on cpp sub-repo to ensure the API still
> pass.
> >
> > This should be the best of both worlds, if sub-repo are supposed option.
> >
> > --Donald E. Foss
> >
> > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
> > wrote:
> >
> >> I dislike the current build system complications as well.
> >>
> >> However, in my opinion, combining the code bases will severely impact
> the
> >> progress of the parquet-cpp project and implicitly the progress of the
> >> entire parquet project.
> >> Combining would have made much more sense if parquet-cpp is a mature
> >> project and codebase.  But parquet-cpp (and the entire parquet project)
> is
> >> evolving continuously with new features being added including bloom
> >> filters,  column encryption, and indexes.
> >>
> >> If the two code bases merged, it will be much more difficult to
> contribute
> >> to the parquet-cpp project since now Arrow bindings have to be
> supported as
> >> well. Please correct me if I am wrong here.
> >>
> >> Out of the two evils, I think handling the build system, packaging
> >> duplication is much more manageable since they are quite stable at this
> >> point.
> >>
> >> Regarding "* API changes cause awkward release coordination issues
> between
> >> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> >> changes needed) as and when Arrow is released?
> >>
> >> Regarding "we maintain a Arrow conversion code in parquet-cpp for
> >> converting between Arrow columnar memory format and Parquet". Can this
> be
> >> moved to the Arrow project and expose the more stable low-level APIs in
> >> parquet-cpp?
> >>
> >> I am also curious if the Arrow and Parquet Java implementations have
> >> similar API compatibility issues.
> >>
> >>
> >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> > hi folks,
> >> >
> >> > We've been struggling for quite some time with the development
> >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> >> >
> >> > To explain the root issues:
> >> >
> >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> >> > includes file interfaces, memory management, miscellaneous algorithms
> >> > (e.g. dictionary encoding), etc. Note that before this "platform"
> >> > dependency was introduced, there was significant duplicated code
> >> > between these codebases and incompatible abstract interfaces for
> >> > things like files
> >> >
> >> > * we maintain a Arrow conversion code in parquet-cpp for converting
> >> > between Arrow columnar memory format and Parquet
> >> >
> >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> >> > Apache Arrow. This introduces a circular dependency into our CI.
> >> >
> >> > * Substantial portions of our CMake build system and related tooling
> >> > are duplicated between the Arrow and Parquet repos
> >> >
> >> > * API changes cause awkward release coordination issues between Arrow
> >> > and Parquet
> >> >
> >> > I believe the best way to remedy the situation is to adopt a
> >> > "Community over Code" approach and find a way for the Parquet and
> >> > Arrow C++ development communities to operate out of the same code
> >> > repository, i.e. the apache/arrow git repository.
> >> >
> >> > This would bring major benefits:
> >> >
> >> > * Shared CMake build infrastructure, developer tools, and CI
> >> > infrastructure (Parquet is already being built as a dependency in
> >> > Arrow's CI systems)
> >> >
> >> > * Share packaging and release management infrastructure
> >> >
> >> > * Reduce / eliminate problems due to API changes (where we currently
> >> > introduce breakage into our CI workflow when there is a breaking /
> >> > incompatible change)
> >> >
> >> > * Arrow releases would include a coordinated snapshot of the Parquet
> >> > implementation as it stands
> >> >
> >> > Continuing with the status quo has become unsatisfactory to me and as
> >> > a result I've become less motivated to work on the parquet-cpp
> >> > codebase.
> >> >
> >> > The only Parquet C++ committer who is not an Arrow committer is Deepak
> >> > Majeti. I think the issue of commit privileges could be resolved
> >> > without too much difficulty or time.
> >> >
> >> > I also think that, if it is deemed truly necessary, the Apache Parquet
> >> > community could create release scripts to cut a minimal versioned
> >> > Apache Parquet C++ release.
> >> >
> >> > I know that some people are wary of monorepos and megaprojects, but as
> >> > an example, TensorFlow is at least 10 times as large a project in
> >> > terms of LOCs and number of different platform components, and it
> >> > seems to be getting along just fine. I think we should be able to work
> >> > together as a community to function just as well.
> >> >
> >> > Interested in the opinions of others, and any other ideas for
> >> > practical solutions to the above problems.
> >> >
> >> > Thanks,
> >> > Wes
> >> >
> >>
> >>
> >> --
> >> regards,
> >> Deepak Majeti
> >>
>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Donald,

This would make things worse, not better. Code changes routinely
involve changes to the build system, and so you could be talking about
having to make changes to 2 or 3 git repositories as the result of a
single new feature or bug fix. There isn't really a cross-repo CI
solution available.

I've seen some approaches to the monorepo problem using multiple git
repositories, such as

https://github.com/twosigma/git-meta

Until something like this has first-class support from the GitHub
platform and its CI services (Travis CI, Appveyor), I don't think it
will work for us.

- Wes

On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <do...@gmail.com> wrote:
> Could this work if each module gets configured as a sub-git repo? A top-level
> build tool goes into each sub-repo and picks the correct release version to test.
> Tests in Python are dependent on the cpp sub-repo to ensure the APIs still pass.
>
> This should be the best of both worlds, if sub-repos are a supported option.
>
> --Donald E. Foss
>
> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
> wrote:
>
>> I dislike the current build system complications as well.
>>
>> However, in my opinion, combining the code bases will severely impact the
>> progress of the parquet-cpp project and implicitly the progress of the
>> entire parquet project.
>> Combining would have made much more sense if parquet-cpp were a mature
>> project and codebase. But parquet-cpp (and the entire parquet project) is
>> evolving continuously with new features being added including bloom
>> filters,  column encryption, and indexes.
>>
>> If the two code bases were merged, it would be much more difficult to contribute
>> to the parquet-cpp project, since Arrow bindings would now have to be supported as
>> well. Please correct me if I am wrong here.
>>
>> Out of the two evils, I think handling the build system and packaging
>> duplication is much more manageable, since they are quite stable at this
>> point.
>>
>> Regarding "* API changes cause awkward release coordination issues between
>> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
>> changes needed) as and when Arrow is released?
>>
>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>> converting between Arrow columnar memory format and Parquet". Can this be
>> moved to the Arrow project and expose the more stable low-level APIs in
>> parquet-cpp?
>>
>> I am also curious if the Arrow and Parquet Java implementations have
>> similar API compatibility issues.
>>
>>
>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> > hi folks,
>> >
>> > We've been struggling for quite some time with the development
>> > workflow between the Arrow and Parquet C++ (and Python) codebases.
>> >
>> > To explain the root issues:
>> >
>> > * parquet-cpp depends on "platform code" in Apache Arrow; this
>> > includes file interfaces, memory management, miscellaneous algorithms
>> > (e.g. dictionary encoding), etc. Note that before this "platform"
>> > dependency was introduced, there was significant duplicated code
>> > between these codebases and incompatible abstract interfaces for
>> > things like files
>> >
>> > * we maintain Arrow conversion code in parquet-cpp for converting
>> > between Arrow columnar memory format and Parquet
>> >
>> > * we maintain Python bindings for parquet-cpp + Arrow interop in
>> > Apache Arrow. This introduces a circular dependency into our CI.
>> >
>> > * Substantial portions of our CMake build system and related tooling
>> > are duplicated between the Arrow and Parquet repos
>> >
>> > * API changes cause awkward release coordination issues between Arrow
>> > and Parquet
>> >
>> > I believe the best way to remedy the situation is to adopt a
>> > "Community over Code" approach and find a way for the Parquet and
>> > Arrow C++ development communities to operate out of the same code
>> > repository, i.e. the apache/arrow git repository.
>> >
>> > This would bring major benefits:
>> >
>> > * Shared CMake build infrastructure, developer tools, and CI
>> > infrastructure (Parquet is already being built as a dependency in
>> > Arrow's CI systems)
>> >
>> > * Share packaging and release management infrastructure
>> >
>> > * Reduce / eliminate problems due to API changes (where we currently
>> > introduce breakage into our CI workflow when there is a breaking /
>> > incompatible change)
>> >
>> > * Arrow releases would include a coordinated snapshot of the Parquet
>> > implementation as it stands
>> >
>> > Continuing with the status quo has become unsatisfactory to me and as
>> > a result I've become less motivated to work on the parquet-cpp
>> > codebase.
>> >
>> > The only Parquet C++ committer who is not an Arrow committer is Deepak
>> > Majeti. I think the issue of commit privileges could be resolved
>> > without too much difficulty or time.
>> >
>> > I also think that, if it is deemed truly necessary, the Apache Parquet
>> > community could create release scripts to cut a minimal versioned
>> > Apache Parquet C++ release.
>> >
>> > I know that some people are wary of monorepos and megaprojects, but as
>> > an example, TensorFlow is at least 10 times as large a project in
>> > terms of LOCs and number of different platform components, and it
>> > seems to be getting along just fine. I think we should be able to work
>> > together as a community to function just as well.
>> >
>> > Interested in the opinions of others, and any other ideas for
>> > practical solutions to the above problems.
>> >
>> > Thanks,
>> > Wes
>> >
>>
>>
>> --
>> regards,
>> Deepak Majeti
>>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by "Donald E. Foss" <do...@gmail.com>.
Could this work if each module gets configured as a sub-git repo? A top-level
build tool goes into each sub-repo and picks the correct release version to test.
Tests in Python are dependent on the cpp sub-repo to ensure the APIs still pass.

This should be the best of both worlds, if sub-repos are a supported option.
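
The arrangement described above could be sketched with git submodules. The
script below is purely illustrative: throwaway local repositories and a `v1.0`
tag stand in for the real apache/arrow and parquet-cpp remotes and their
release tags.

```shell
#!/bin/sh
# Sketch of a top-level "meta" repo pinning each component sub-repo at a
# known release. Repo names, paths, and the v1.0 tag are all hypothetical.
set -e
work=$(mktemp -d) && cd "$work"

# Stand-ins for the component repositories, each with a tagged release.
for r in arrow-cpp parquet-cpp; do
  git init -q "$r"
  git -C "$r" -c user.email=dev@example.org -c user.name=dev \
      commit -q --allow-empty -m "initial commit"
  git -C "$r" tag v1.0
done

# Top-level repo that records which commit of each component to build.
git init -q meta && cd meta
git -c protocol.file.allow=always submodule --quiet add ../arrow-cpp cpp/arrow
git -c protocol.file.allow=always submodule --quiet add ../parquet-cpp cpp/parquet
git -C cpp/parquet checkout -q v1.0   # pin the release the build should test
git -c user.email=dev@example.org -c user.name=dev \
    commit -qm "Pin component versions"
git submodule status                  # shows the pinned commit for each module
```

A top-level build tool would then recurse into `cpp/arrow` and `cpp/parquet`
at the pinned commits, which is also where Wes's objection applies: a single
feature touching both components needs coordinated commits in three repos.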

--Donald E. Foss

On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <ma...@gmail.com>
wrote:

> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
> entire parquet project.
> Combining would have made much more sense if parquet-cpp were a mature
> project and codebase. But parquet-cpp (and the entire parquet project) is
> evolving continuously with new features being added including bloom
> filters,  column encryption, and indexes.
>
> If the two code bases were merged, it would be much more difficult to contribute
> to the parquet-cpp project, since Arrow bindings would now have to be supported as
> well. Please correct me if I am wrong here.
>
> Out of the two evils, I think handling the build system and packaging
> duplication is much more manageable, since they are quite stable at this
> point.
>
> Regarding "* API changes cause awkward release coordination issues between
> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> changes needed) as and when Arrow is released?
>
> Regarding "we maintain Arrow conversion code in parquet-cpp for
> converting between Arrow columnar memory format and Parquet". Can this be
> moved to the Arrow project and expose the more stable low-level APIs in
> parquet-cpp?
>
> I am also curious if the Arrow and Parquet Java implementations have
> similar API compatibility issues.
>
>
> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi folks,
> >
> > We've been struggling for quite some time with the development
> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> >
> > To explain the root issues:
> >
> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> > includes file interfaces, memory management, miscellaneous algorithms
> > (e.g. dictionary encoding), etc. Note that before this "platform"
> > dependency was introduced, there was significant duplicated code
> > between these codebases and incompatible abstract interfaces for
> > things like files
> >
> > * we maintain Arrow conversion code in parquet-cpp for converting
> > between Arrow columnar memory format and Parquet
> >
> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> > Apache Arrow. This introduces a circular dependency into our CI.
> >
> > * Substantial portions of our CMake build system and related tooling
> > are duplicated between the Arrow and Parquet repos
> >
> > * API changes cause awkward release coordination issues between Arrow
> > and Parquet
> >
> > I believe the best way to remedy the situation is to adopt a
> > "Community over Code" approach and find a way for the Parquet and
> > Arrow C++ development communities to operate out of the same code
> > repository, i.e. the apache/arrow git repository.
> >
> > This would bring major benefits:
> >
> > * Shared CMake build infrastructure, developer tools, and CI
> > infrastructure (Parquet is already being built as a dependency in
> > Arrow's CI systems)
> >
> > * Share packaging and release management infrastructure
> >
> > * Reduce / eliminate problems due to API changes (where we currently
> > introduce breakage into our CI workflow when there is a breaking /
> > incompatible change)
> >
> > * Arrow releases would include a coordinated snapshot of the Parquet
> > implementation as it stands
> >
> > Continuing with the status quo has become unsatisfactory to me and as
> > a result I've become less motivated to work on the parquet-cpp
> > codebase.
> >
> > The only Parquet C++ committer who is not an Arrow committer is Deepak
> > Majeti. I think the issue of commit privileges could be resolved
> > without too much difficulty or time.
> >
> > I also think that, if it is deemed truly necessary, the Apache Parquet
> > community could create release scripts to cut a minimal versioned
> > Apache Parquet C++ release.
> >
> > I know that some people are wary of monorepos and megaprojects, but as
> > an example, TensorFlow is at least 10 times as large a project in
> > terms of LOCs and number of different platform components, and it
> > seems to be getting along just fine. I think we should be able to work
> > together as a community to function just as well.
> >
> > Interested in the opinions of others, and any other ideas for
> > practical solutions to the above problems.
> >
> > Thanks,
> > Wes
> >
>
>
> --
> regards,
> Deepak Majeti
>

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Deepak,

responses inline

On Sun, Jul 29, 2018 at 10:44 PM, Deepak Majeti <ma...@gmail.com> wrote:
> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
> entire parquet project.
> Combining would have made much more sense if parquet-cpp were a mature
> project and codebase. But parquet-cpp (and the entire parquet project) is
> evolving continuously with new features being added including bloom
> filters,  column encryption, and indexes.
>

I don't see why parquet-cpp development would be impacted in a
negative way. In fact, I've argued exactly the opposite. Can you
explain in more detail why you think this would be the case? If
anything, parquet-cpp would benefit from more mature and better
maintained developer infrastructure.

Here's the project shortlog:

$ git shortlog -sn 08acdf6bfe3cd160ffe19b79bbded2bdc3f7bd62..master
   145  Wes McKinney
   109  Uwe L. Korn
    53  Deepak Majeti
    38  Korn, Uwe
    36  Nong Li
    12  Kouhei Sutou
    10  Max Risuhin
     9  Antoine Pitrou
     8  rip.nsk
     6  Phillip Cloud
     6  Xianjin YE
     5  Aliaksei Sandryhaila
     4  Thomas Sanchez
     3  Artem Tarasov
     3  Joshua Storck
     3  Lars Volker
     3  fscheibner
     3  revaliu
     2  Itai Incze
     2  Kalon Mills
     2  Marc Vertes
     2  Mike Trinkala
     2  Philipp Hoch
     1  Alec Posney
     1  Christopher C. Aycock
     1  Colin Nichols
     1  Dmitry Bushev
     1  Eric Daniel
     1  Fabrizio Fabbri
     1  Florian Scheibner
     1  Jaguar Xiong
     1  Julien Lafaye
     1  Julius Neuffer
     1  Kashif Rasul
     1  Rene Sugar
     1  Robert Gruener
     1  Toby Shaw
     1  William Forson
     1  Yue Chen
     1  thamht4190

Out of these, I know for a fact that at least the following
contributed to parquet-cpp as a result of their involvement with
Apache Arrow:

   145  Wes McKinney
   109  Uwe L. Korn
    38  Korn, Uwe
    12  Kouhei Sutou
    10  Max Risuhin
     9  Antoine Pitrou
     6  Phillip Cloud
     3  Joshua Storck
     1  Christopher C. Aycock
     1  Rene Sugar
     1  Robert Gruener

This is ~70% of commits
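
As a quick arithmetic sanity check of the ~70% figure, using the counts
copied from the `git shortlog -sn` output above:

```python
# Commit counts copied from the `git shortlog -sn` output above:
# 23 named contributors with >1 commit, plus 17 contributors with 1 commit.
all_counts = [145, 109, 53, 38, 36, 12, 10, 9, 8, 6, 6, 5, 4,
              3, 3, 3, 3, 3, 2, 2, 2, 2, 2] + [1] * 17
# Subset attributed to contributors who came to parquet-cpp via Apache Arrow.
arrow_counts = [145, 109, 38, 12, 10, 9, 6, 3, 1, 1, 1]

share = 100 * sum(arrow_counts) / sum(all_counts)
print(f"{sum(arrow_counts)}/{sum(all_counts)} commits = {share:.1f}%")
# -> 335/483 commits = 69.4%
```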

> If the two code bases were merged, it would be much more difficult to contribute
> to the parquet-cpp project, since Arrow bindings would now have to be supported as
> well. Please correct me if I am wrong here.

I don't see why this would be true. The people above are already
supporting these bindings (which are pretty isolated to the symbols in
the parquet::arrow namespace), and patches not having to do with the
Arrow columnar data structures would not be affected.

Because of the arguments I made in my first e-mail, it will be less
work for the developers working on both projects to maintain the
interfaces. Currently, it is necessary to make patches to multiple
projects to improve APIs and fix bugs in many cases.

>
> Out of the two evils, I think handling the build system and packaging
> duplication is much more manageable, since they are quite stable at this
> point.

We've been talking about this for a long time and no concrete and
actionable solution has come forward.

>
> Regarding "* API changes cause awkward release coordination issues between
> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> changes needed) as and when Arrow is released?

The central issue is that changes frequently require changes to
multiple codebases, and cross-repo CI to verify patches jointly is not
really possible.

>
> Regarding "we maintain Arrow conversion code in parquet-cpp for
> converting between Arrow columnar memory format and Parquet". Can this be
> moved to the Arrow project and expose the more stable low-level APIs in
> parquet-cpp?

The parts of Parquet that do not interact with the Arrow columnar
format still use Arrow platform APIs (IO, memory management,
compression, algorithms, etc.). We would therefore still have a
circular dependency, though some parts (e.g. changes in the
parquet::arrow layer) might be easier to manage.

- Wes

>
> I am also curious if the Arrow and Parquet Java implementations have
> similar API compatibility issues.

Parquet-Java is pretty different:

* On the plus side, the built-in Java platform solves some of the
problems we have addressed in the Arrow platform APIs. Note that Arrow
hasn't reinvented any wheels here or failed to use tools available in
the C++ standard library or Boost -- if you look at major Google
codebases like TensorFlow, they have developed nearly identical
platform APIs to solve the same problems

* Parquet-Java depends on Hadoop platform APIs, which has caused
problems for other Java projects which wish to read and write Parquet
files but do not use Hadoop (e.g. they store data in S3)

>
>
> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com> wrote:
>
>> hi folks,
>>
>> We've been struggling for quite some time with the development
>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>
>> To explain the root issues:
>>
>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>> includes file interfaces, memory management, miscellaneous algorithms
>> (e.g. dictionary encoding), etc. Note that before this "platform"
>> dependency was introduced, there was significant duplicated code
>> between these codebases and incompatible abstract interfaces for
>> things like files
>>
>> * we maintain Arrow conversion code in parquet-cpp for converting
>> between Arrow columnar memory format and Parquet
>>
>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>> Apache Arrow. This introduces a circular dependency into our CI.
>>
>> * Substantial portions of our CMake build system and related tooling
>> are duplicated between the Arrow and Parquet repos
>>
>> * API changes cause awkward release coordination issues between Arrow
>> and Parquet
>>
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C++ development communities to operate out of the same code
>> repository, i.e. the apache/arrow git repository.
>>
>> This would bring major benefits:
>>
>> * Shared CMake build infrastructure, developer tools, and CI
>> infrastructure (Parquet is already being built as a dependency in
>> Arrow's CI systems)
>>
>> * Share packaging and release management infrastructure
>>
>> * Reduce / eliminate problems due to API changes (where we currently
>> introduce breakage into our CI workflow when there is a breaking /
>> incompatible change)
>>
>> * Arrow releases would include a coordinated snapshot of the Parquet
>> implementation as it stands
>>
>> Continuing with the status quo has become unsatisfactory to me and as
>> a result I've become less motivated to work on the parquet-cpp
>> codebase.
>>
>> The only Parquet C++ committer who is not an Arrow committer is Deepak
>> Majeti. I think the issue of commit privileges could be resolved
>> without too much difficulty or time.
>>
>> I also think that, if it is deemed truly necessary, the Apache Parquet
>> community could create release scripts to cut a minimal versioned
>> Apache Parquet C++ release.
>>
>> I know that some people are wary of monorepos and megaprojects, but as
>> an example, TensorFlow is at least 10 times as large a project in
>> terms of LOCs and number of different platform components, and it
>> seems to be getting along just fine. I think we should be able to work
>> together as a community to function just as well.
>>
>> Interested in the opinions of others, and any other ideas for
>> practical solutions to the above problems.
>>
>> Thanks,
>> Wes
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Wes McKinney <we...@gmail.com>.
hi Deepak,

responses inline

On Sun, Jul 29, 2018 at 10:44 PM, Deepak Majeti <ma...@gmail.com> wrote:
> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
> entire parquet project.
> Combining would have made much more sense if parquet-cpp is a mature
> project and codebase.  But parquet-cpp (and the entire parquet project) is
> evolving continuously with new features being added including bloom
> filters,  column encryption, and indexes.
>

I don't see why parquet-cpp development would be impacted in a
negative way. In fact, I've argued exactly the opposite. Can you
explain in more detail why you think this would be the case? If
anything, parquet-cpp would benefit from more mature and better
maintained developer infrastructure.

Here's the project shortlog:

$ git shortlog -sn 08acdf6bfe3cd160ffe19b79bbded2bdc3f7bd62..master
   145  Wes McKinney
   109  Uwe L. Korn
    53  Deepak Majeti
    38  Korn, Uwe
    36  Nong Li
    12  Kouhei Sutou
    10  Max Risuhin
     9  Antoine Pitrou
     8  rip.nsk
     6  Phillip Cloud
     6  Xianjin YE
     5  Aliaksei Sandryhaila
     4  Thomas Sanchez
     3  Artem Tarasov
     3  Joshua Storck
     3  Lars Volker
     3  fscheibner
     3  revaliu
     2  Itai Incze
     2  Kalon Mills
     2  Marc Vertes
     2  Mike Trinkala
     2  Philipp Hoch
     1  Alec Posney
     1  Christopher C. Aycock
     1  Colin Nichols
     1  Dmitry Bushev
     1  Eric Daniel
     1  Fabrizio Fabbri
     1  Florian Scheibner
     1  Jaguar Xiong
     1  Julien Lafaye
     1  Julius Neuffer
     1  Kashif Rasul
     1  Rene Sugar
     1  Robert Gruener
     1  Toby Shaw
     1  William Forson
     1  Yue Chen
     1  thamht4190

Out of these, I know for a fact that at least the following
contributed to parquet-cpp as a result of their involvement with
Apache Arrow:

   145  Wes McKinney
   109  Uwe L. Korn
    38  Korn, Uwe
    12  Kouhei Sutou
    10  Max Risuhin
     9  Antoine Pitrou
     6  Phillip Cloud
     3  Joshua Storck
     1  Christopher C. Aycock
     1  Rene Sugar
     1  Robert Gruener

This is ~70% of commits.
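
As a quick sanity check on that figure, here is the arithmetic, with the
commit counts transcribed from the two shortlogs above:

```python
# Sanity check on the "~70%" figure: commits by contributors also active
# on Arrow (second list above) versus the full shortlog (first list above).
overlap = 145 + 109 + 38 + 12 + 10 + 9 + 6 + 3 + 1 + 1 + 1
total = (145 + 109 + 53 + 38 + 36 + 12 + 10 + 9 + 8 + 6 + 6 + 5 + 4
         + 3 * 5 + 2 * 5 + 1 * 17)
print(f"{overlap} of {total} commits ({100 * overlap / total:.0f}%)")
# prints: 335 of 483 commits (69%)
```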

> If the two code bases were merged, it would be much more difficult to
> contribute to the parquet-cpp project, since Arrow bindings would now have
> to be supported as well. Please correct me if I am wrong here.

I don't see why this would be true. The people above are already
supporting these bindings (which are pretty isolated to the symbols in
the parquet::arrow namespace), and patches not having to do with the
Arrow columnar data structures would not be affected.

Because of the arguments I made in my first e-mail, it will be less
work for the developers working on both projects to maintain the
interfaces. Currently, it is necessary to make patches to multiple
projects to improve APIs and fix bugs in many cases.

>
> Out of the two evils, I think handling the build system, packaging
> duplication is much more manageable since they are quite stable at this
> point.

We've been talking about this for a long time and no concrete and
actionable solution has come forward.

>
> Regarding "* API changes cause awkward release coordination issues between
> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> changes needed) as and when Arrow is released?

The central issue is that changes frequently require coordinated changes
to multiple codebases, and cross-repo CI to verify patches jointly is not
really possible.

>
> Regarding "we maintain Arrow conversion code in parquet-cpp for
> converting between Arrow columnar memory format and Parquet". Can this be
> moved to the Arrow project and expose the more stable low-level APIs in
> parquet-cpp?

The parts of Parquet that do not interact with the Arrow columnar
format still use Arrow platform APIs (IO, memory management,
compression, algorithms, etc.). We would therefore still have a
circular dependency, though some parts (e.g. changes in the
parquet::arrow layer) might become easier.

- Wes

>
> I am also curious if the Arrow and Parquet Java implementations have
> similar API compatibility issues.

Parquet-Java is pretty different:

* On the plus side, the built-in Java platform solves some of the
problems we have addressed in the Arrow platform APIs. Note that Arrow
hasn't reinvented any wheels here or failed to use tools available in
the C++ standard library or Boost -- major Google codebases like
TensorFlow have developed nearly identical platform APIs to solve the
same problems.

* Parquet-Java depends on Hadoop platform APIs, which has caused
problems for other Java projects that wish to read and write Parquet
files but do not use Hadoop (e.g. because they store data in S3).

>
>
> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com> wrote:
>
>> hi folks,
>>
>> We've been struggling for quite some time with the development
>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>
>> To explain the root issues:
>>
>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>> includes file interfaces, memory management, miscellaneous algorithms
>> (e.g. dictionary encoding), etc. Note that before this "platform"
>> dependency was introduced, there was significant duplicated code
>> between these codebases and incompatible abstract interfaces for
>> things like files
>>
>> * we maintain Arrow conversion code in parquet-cpp for converting
>> between Arrow columnar memory format and Parquet
>>
>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>> Apache Arrow. This introduces a circular dependency into our CI.
>>
>> * Substantial portions of our CMake build system and related tooling
>> are duplicated between the Arrow and Parquet repos
>>
>> * API changes cause awkward release coordination issues between Arrow
>> and Parquet
>>
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C++ development communities to operate out of the same code
>> repository, i.e. the apache/arrow git repository.
>>
>> This would bring major benefits:
>>
>> * Shared CMake build infrastructure, developer tools, and CI
>> infrastructure (Parquet is already being built as a dependency in
>> Arrow's CI systems)
>>
>> * Shared packaging and release management infrastructure
>>
>> * Reduce / eliminate problems due to API changes (where we currently
>> introduce breakage into our CI workflow when there is a breaking /
>> incompatible change)
>>
>> * Arrow releases would include a coordinated snapshot of the Parquet
>> implementation as it stands
>>
>> Continuing with the status quo has become unsatisfactory to me and as
>> a result I've become less motivated to work on the parquet-cpp
>> codebase.
>>
>> The only Parquet C++ committer who is not an Arrow committer is Deepak
>> Majeti. I think the issue of commit privileges could be resolved
>> without too much difficulty or time.
>>
>> I also think that, if deemed truly necessary, the Apache Parquet
>> community could create release scripts to cut a minimal versioned
>> Apache Parquet C++ release.
>>
>> I know that some people are wary of monorepos and megaprojects, but as
>> an example TensorFlow is at least 10 times as large a project in
>> terms of LOC and number of different platform components, and it
>> seems to be getting along just fine. I think we should be able to work
>> together as a community to function just as well.
>>
>> Interested in the opinions of others, and any other ideas for
>> practical solutions to the above problems.
>>
>> Thanks,
>> Wes
>>
>
>
> --
> regards,
> Deepak Majeti

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Posted by Deepak Majeti <ma...@gmail.com>.
I dislike the current build system complications as well.

However, in my opinion, combining the code bases will severely impact the
progress of the parquet-cpp project and, implicitly, the progress of the
entire parquet project.
Combining would have made much more sense if parquet-cpp were a mature
project and codebase. But parquet-cpp (and the entire parquet project) is
evolving continuously, with new features being added including bloom
filters, column encryption, and indexes.

If the two code bases were merged, it would be much more difficult to
contribute to the parquet-cpp project, since Arrow bindings would now have
to be supported as well. Please correct me if I am wrong here.

Out of the two evils, I think handling the build system, packaging
duplication is much more manageable since they are quite stable at this
point.

Regarding "* API changes cause awkward release coordination issues between
Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
changes needed) as and when Arrow is released?

Regarding "we maintain Arrow conversion code in parquet-cpp for
converting between Arrow columnar memory format and Parquet". Can this be
moved to the Arrow project and expose the more stable low-level APIs in
parquet-cpp?

I am also curious if the Arrow and Parquet Java implementations have
similar API compatibility issues.


On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <we...@gmail.com> wrote:

> hi folks,
>
> We've been struggling for quite some time with the development
> workflow between the Arrow and Parquet C++ (and Python) codebases.
>
> To explain the root issues:
>
> * parquet-cpp depends on "platform code" in Apache Arrow; this
> includes file interfaces, memory management, miscellaneous algorithms
> (e.g. dictionary encoding), etc. Note that before this "platform"
> dependency was introduced, there was significant duplicated code
> between these codebases and incompatible abstract interfaces for
> things like files
>
> * we maintain Arrow conversion code in parquet-cpp for converting
> between Arrow columnar memory format and Parquet
>
> * we maintain Python bindings for parquet-cpp + Arrow interop in
> Apache Arrow. This introduces a circular dependency into our CI.
>
> * Substantial portions of our CMake build system and related tooling
> are duplicated between the Arrow and Parquet repos
>
> * API changes cause awkward release coordination issues between Arrow
> and Parquet
>
> I believe the best way to remedy the situation is to adopt a
> "Community over Code" approach and find a way for the Parquet and
> Arrow C++ development communities to operate out of the same code
> repository, i.e. the apache/arrow git repository.
>
> This would bring major benefits:
>
> * Shared CMake build infrastructure, developer tools, and CI
> infrastructure (Parquet is already being built as a dependency in
> Arrow's CI systems)
>
> * Shared packaging and release management infrastructure
>
> * Reduce / eliminate problems due to API changes (where we currently
> introduce breakage into our CI workflow when there is a breaking /
> incompatible change)
>
> * Arrow releases would include a coordinated snapshot of the Parquet
> implementation as it stands
>
> Continuing with the status quo has become unsatisfactory to me and as
> a result I've become less motivated to work on the parquet-cpp
> codebase.
>
> The only Parquet C++ committer who is not an Arrow committer is Deepak
> Majeti. I think the issue of commit privileges could be resolved
> without too much difficulty or time.
>
> I also think that, if deemed truly necessary, the Apache Parquet
> community could create release scripts to cut a minimal versioned
> Apache Parquet C++ release.
>
> I know that some people are wary of monorepos and megaprojects, but as
> an example TensorFlow is at least 10 times as large a project in
> terms of LOC and number of different platform components, and it
> seems to be getting along just fine. I think we should be able to work
> together as a community to function just as well.
>
> Interested in the opinions of others, and any other ideas for
> practical solutions to the above problems.
>
> Thanks,
> Wes
>


-- 
regards,
Deepak Majeti
