Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2018/10/16 17:02:42 UTC

[Discuss] Monorepo vs. independent repositories for independent implementations

Hello,

We are quickly growing the number of Arrow implementations.  Soon we'll
have:
- C++: the most mature, reference, and historical implementation
- Python: linked with Arrow C++
- C/GLib: linked with Arrow C++
- Ruby: linked with Arrow C++ (indirectly through C/GLib)
- R: linked with Arrow C++
- Matlab: linked with Arrow C++
- Java: independent implementation
- Rust: independent implementation
- Go: independent implementation
- JavaScript: independent implementation
- .NET (C#): independent implementation

This creates various kinds of issues.  Technical issues such as CI
matrices growing ever larger and more complex.  Social issues such as
different implementations having different development speeds and
maturity, and the fact that development teams are effectively disjoint
(for example, whoever develops on the C++ codebase usually doesn't
develop on the Rust codebase, and vice-versa).

I'm not proposing anything concrete here, but would like to ask what
people think of moving independent implementations (those that don't
depend on Arrow C++) into independent repositories.  This would let them
define their own workflow, permissions, teams, CI configurations and
whatnot.  This would also allow growing the CI matrix for the main repo
without reaching humongous sizes.  The implementations would still be
under the umbrella of the Apache Arrow project, but they would exist as
independent GitHub projects (this is a bit how Parquet implementations
are handled, AFAIK).

To start with, Wes expressed opposition to the idea:
"""
I am against breaking up the monorepo -- I think that we should scale
our process using tools that we develop rather than conforming to the
objectively crude affordances of Travis CI and Appveyor. Implementations
that are independent now may not be so in the future by the nature of
the project -- any implementation could integrate with Gandiva, for
example, and that would become much more difficult to develop if the
code is fragmented in multiple repositories.
"""

(https://github.com/apache/arrow/pull/2765#issuecomment-430224701)

Regards

Antoine.

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

Posted by Wes McKinney <we...@gmail.com>.

I see. This isn't a supported use case for the project -- we expect
third parties to use released source or binary artifacts.

On Wed, Oct 17, 2018 at 1:24 PM Francois Saint-Jacques
<fs...@networkdump.com> wrote:
>
> Not the nesting, but pulling a lot of unused files.
>
> On Wed, Oct 17, 2018 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:
>
> > Why would one level of directory nesting cause awkwardness (curious)?
> >
> > On Wed, Oct 17, 2018, 12:28 PM Francois Saint-Jacques <
> > fsaintjacques@networkdump.com> wrote:
> >
> >> One point toward separate repositories: vendoring Arrow for a C++ project
> >> with git submodules becomes awkward if it's a multi-lang monorepo.
> >>
> >> On Tue, Oct 16, 2018 at 9:22 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> > I would also add -- Krisztian's recent work Dockerizing the project is
> >> > setting us up to be able to decouple ourselves from Travis CI. We need
> >> > build hosts where we can use Docker to be able to do this, though.
> >> > Preferably the build hosts would have NVIDIA GPUs so we can use
> >> > nvidia-docker to test our GPU functionality.
> >> >
> >> > On Tue, Oct 16, 2018 at 9:09 PM Wes McKinney <we...@gmail.com> wrote:
> >> > >
> >> > > hi Antoine,
> >> > >
> >> > > Some small critiques to the listing of implementations:
> >> > >
> >> > > * The Java library predates the C++ library (it originated in Apache
> >> > > Drill)
> >> > > * Python and C++ both interact with the Java library in different
> >> > > ways. There's JNI for Gandiva and Plasma, and Python uses Java via
> >> > > JPype in unit tests
> >> > >
> >> > > There are some critical questions to answer here:
> >> > >
> >> > > 1. Is there such a thing as an "independent implementation"?
> >> > > 2. What's the best way to manage changesets / patches?
> >> > > 3. What is the best way to manage the burgeoning complexity of testing
> >> > > and verification of the entire project?
> >> > > 4. How much longer will public CI services be adequate for our needs?
> >> > >
> >> > > This may be a bit long-winded, so bear with me.
> >> > >
> >> > > 1. Is there such a thing as an "independent implementation"?
> >> > >
> >> > > My answer to this is actually "not really". The reasons are as follows:
> >> > >
> >> > > * The integration tests are one of the most important parts of the
> >> > > project. While C++, Java, and JavaScript are the only participants
> >> > > today, we eventually need Rust, Go, and C# to be in the matrix. This
> >> > > will include integration testing for RPC / Flight in addition to the
> >> > > current IPC tests.
> >> > > * By the nature of Arrow, any implementation may build in-memory or
> >> > > RPC-based bindings to computational libraries that are in C++ or use
> >> > > LLVM, such as Gandiva and Plasma. This is already the case in Java,
> >> > > and may expand beyond Java. I could see Go or Rust or C# using Gandiva
> >> > > or Plasma. The scope of what kinds of shared infrastructure might be
> >> > > used in multiple languages will only expand over time
> >> > >
> >> > > 2. What's the best way to manage changesets / patches?
> >> > >
> >> > > * Because no two implementations can be guaranteed to be independent,
> >> > > in a non-monorepo setup, changes may require multiple patches.
> >> > > Verifying "joint patches" is likely to require manual / human
> >> > > intervention in ways that are a non-issue for a monorepo
> >> > > * Splitting development up into multiple repositories will decrease
> >> > > visibility into the patch queues in the less active subprojects. I'm
> >> > > strongly in support not only of a single codebase but a single patch
> >> > > queue. I admit that seeing ~70 open pull requests on Arrow stresses me
> >> > > out a bit, but having 70 patches spread across 5 repos would be more
> >> > > stressful for me at least
> >> > > * Broken builds in any part of the project should be a concern to the
> >> > > entire community -- we should not have broken builds. I'd be concerned
> >> > > about having any part of the project becoming a "ghetto" if the
> >> > > plurality of developers are working elsewhere with an "out of sight,
> >> > > out of mind" mindset
> >> > >
> >> > > To play devil's advocate, some web applications could be developed to
> >> > > create the appearance of a unified patch queue across many repos.
> >> > >
> >> > > That being said, our patch queue pales in comparison to some larger /
> >> > > more mature ASF projects:
> >> > >
> >> > > * Spark has 523 open PRs: https://github.com/apache/spark/pulls
> >> > > * Airflow has 218 open PRs: https://github.com/apache/incubator-airflow/pulls
> >> > > * Hadoop has 195 open PRs: https://github.com/apache/hadoop/pulls
> >> > >
> >> > > 3. What is the best way to manage the burgeoning complexity of testing
> >> > > and verification of the entire project?
> >> > > 4. How much longer will public CI services be adequate for our needs?
> >> > >
> >> > > I think we are already reaching the limits of what we can reasonably
> >> > > accomplish with public CI services. Apache Arrow is a project with
> >> > > sophistication and scope that is destined to outgrow what Travis CI
> >> > > can provide within the scope of a single implementation, i.e.
> >> > > C++/Python. For example, we're going to be past the 50 minute time
> >> > > limit before too long. I think that continuing to constrain ourselves
> >> > > by the 50 minute time limit will also limit the scope of what kinds of
> >> > > automated testing we can employ, to our long term detriment. We also
> >> > > have things (like GPU support) that we cannot test there.
> >> > >
> >> > > Consider the more mature data projects in the ASF that I'm familiar
> >> > > with (Kudu, Impala, Spark): none of these projects use Travis CI. Their
> >> > > testing uses Jenkins build slaves and runs much longer than our CI
> >> > > jobs. If we used beefier build slaves, our builds would also run much
> >> > > faster.
> >> > >
> >> > > So, what should we do? Well, part of why I have recently created an
> >> > > organization (https://ursalabs.org/) dedicated to Arrow development is
> >> > > to have the financial means and the engineering resources to actually
> >> > > do something about problems like these. I would propose to make an
> >> > > investment of hardware and engineering time to augment our ability to
> >> > > test the repository to make sure we can manage 5-10x the current test
> >> > > runtime that we have now. If I have to personally halt feature
> >> > > development and focus on build and development tooling for a while, so
> >> > > be it. We've already spent many months this year on packaging
> >> > > automation but we are still coming up short in development tooling. If
> >> > > anyone reading has funds to invest in hardware resources, please let
> >> > > me know.
> >> > >
> >> > > As Clint Eastwood's character said in "The Good, The Bad, and The
> >> > > Ugly", "$200,000 is a lot of money. We're gonna have to earn it."
> >> > >
> >> > > FWIW: I am not sure Parquet is a good example of a better way to be.
> >> > > Parquet lacks automated integration tests (terrifying to me) and
> >> > > failed to grow a community outside of the Java world until 2016 when a
> >> > > few of us started building out the C++ library.
> >> > >
> >> > > - Wes

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

Posted by Francois Saint-Jacques <fs...@networkdump.com>.

Not the nesting, but pulling a lot of unused files.

On Wed, Oct 17, 2018 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:

> Why would one level of directory nesting cause awkwardness (curious)?

-- 
Sent from my jetpack.

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

Posted by Wes McKinney <we...@gmail.com>.

Why would one level of directory nesting cause awkwardness (curious)?

On Wed, Oct 17, 2018, 12:28 PM Francois Saint-Jacques <
fsaintjacques@networkdump.com> wrote:

> One point toward separate repositories: vendoring Arrow for a C++ project
> with git submodules becomes awkward if it's a multi-lang monorepo.

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

Posted by Francois Saint-Jacques <fs...@networkdump.com>.

One point toward separate repositories: vendoring Arrow for a C++ project
with git submodules becomes awkward if it's a multi-lang monorepo.

On Tue, Oct 16, 2018 at 9:22 PM Wes McKinney <we...@gmail.com> wrote:

> I would also add -- Krisztian's recent work Dockerizing the project is
> setting us up to be able to decouple ourselves from Travis CI. We need
> build hosts where we can use Docker to be able to do this, though.
> Preferably the build hosts would have NVIDIA GPUs so we can use
> nvidia-docker to test our GPU functionality
> On Tue, Oct 16, 2018 at 9:09 PM Wes McKinney <we...@gmail.com> wrote:


-- 
Sent from my jetpack.

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

Posted by Wes McKinney <we...@gmail.com>.

I would also add -- Krisztian's recent work Dockerizing the project is
setting us up to be able to decouple ourselves from Travis CI. We need
build hosts where we can use Docker to be able to do this, though.
Preferably the build hosts would have NVIDIA GPUs so we can use
nvidia-docker to test our GPU functionality.

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

Posted by Wes McKinney <we...@gmail.com>.

hi Antoine,

Some small critiques of the listing of implementations:

* The Java library predates the C++ library (it originated in Apache Drill)
* Python and C++ both interact with the Java library in different
ways. There's JNI for Gandiva and Plasma, and Python uses Java via
JPype in unit tests

There are some critical questions to answer here:

1. Is there such a thing as an "independent implementation"?
2. What's the best way to manage changesets / patches?
3. What is the best way to manage the burgeoning complexity of testing
and verification of the entire project?
4. How much longer will public CI services be adequate for our needs?

This may be a bit long-winded, so bear with me.

1. Is there such a thing as an "independent implementation"?

My answer to this is actually "not really". The reasons are as follows:

* The integration tests are one of the most important parts of the
project. While C++, Java, and JavaScript are currently the only participants, we
eventually need Rust, Go, and C# to be in the matrix. This will
include integration testing for RPC / Flight in addition to the
current IPC tests.
* By the nature of Arrow, any implementation may build in-memory or
RPC-based bindings to computational libraries that are in C++ or use
LLVM, such as Gandiva and Plasma. This is already the case in Java,
and may expand beyond Java. I could see Go or Rust or C# using Gandiva
or Plasma. The scope of what kinds of shared infrastructure might be
used in multiple languages will only expand over time
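
To make the growth of the integration matrix concrete, here is a rough
sketch. It assumes (this is not stated above) that every implementation
acts as both a producer and a consumer of IPC data against every other
implementation; the function name is purely illustrative.

```python
from itertools import permutations


def integration_pairs(implementations):
    """Return every ordered (producer, consumer) pair of distinct implementations."""
    return list(permutations(implementations, 2))


# Participants named in the text above.
current = ["C++", "Java", "JavaScript"]
# The same list once Rust, Go, and C# join the matrix.
future = current + ["Rust", "Go", "C#"]

print(len(integration_pairs(current)))  # 6
print(len(integration_pairs(future)))   # 30
```

Under that assumption, doubling the number of participants multiplies the
number of producer/consumer combinations by five, before counting a second
protocol such as Flight.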

2. What's the best way to manage changesets / patches?

* Because no two implementations can be guaranteed to be independent,
in a non-monorepo setup, changes may require multiple patches.
Verifying "joint patches" is likely to require manual / human
intervention in ways that are a non-issue for a monorepo
* Splitting development up into multiple repositories will decrease
visibility into the patch queues in the less active subprojects. I'm
strongly in support not only of a single codebase but a single patch
queue. I admit that seeing ~70 open pull requests on Arrow stresses me
out a bit, but having 70 patches spread across 5 repos would be more
stressful for me at least
* Broken builds in any part of the project should be a concern to the
entire community -- we should not have broken builds. I'd be concerned
about having any part of the project becoming a "ghetto" if the
plurality of developers are working elsewhere with an "out of sight,
out of mind" mindset

To play devil's advocate, some web applications could be developed to
create the appearance of a unified patch queue across many repos.
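
As a minimal sketch of what such a tool might do at its core: merge the
open pull requests of several repositories into one list ordered by most
recent activity. The repository names and sample data below are made up,
and fetching is left out entirely; the field names `number`, `title`, and
`updated_at` are assumed to mirror what GitHub's pull request listing
returns.

```python
from datetime import datetime


def unified_patch_queue(pulls_by_repo):
    """Merge per-repository pull request lists into one queue,
    most recently updated first."""
    merged = []
    for repo, pulls in pulls_by_repo.items():
        for pr in pulls:
            merged.append({
                "repo": repo,
                "number": pr["number"],
                "title": pr["title"],
                # Timestamps are assumed to be ISO 8601 in UTC, e.g. 2018-10-16T17:02:42Z.
                "updated_at": datetime.strptime(pr["updated_at"], "%Y-%m-%dT%H:%M:%SZ"),
            })
    return sorted(merged, key=lambda pr: pr["updated_at"], reverse=True)


# Hypothetical sample data; these repositories and PRs do not exist.
sample = {
    "example/arrow-cpp": [
        {"number": 1, "title": "Fix build", "updated_at": "2018-10-16T17:02:42Z"},
    ],
    "example/arrow-rust": [
        {"number": 7, "title": "Add array type", "updated_at": "2018-10-17T09:00:00Z"},
        {"number": 3, "title": "Docs", "updated_at": "2018-10-15T12:00:00Z"},
    ],
}

for pr in unified_patch_queue(sample):
    print(f'{pr["repo"]}#{pr["number"]}: {pr["title"]}')
```

This only addresses visibility; it does nothing for verifying "joint
patches" that span repositories.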

That being said, our patch queue pales in comparison to some larger /
more mature ASF projects:

* Spark has 523 open PRs: https://github.com/apache/spark/pulls
* Airflow has 218 open PRs: https://github.com/apache/incubator-airflow/pulls
* Hadoop has 195 open PRs: https://github.com/apache/hadoop/pulls

3. What is the best way to manage the burgeoning complexity of testing
and verification of the entire project?
4. How much longer will public CI services be adequate for our needs?

I think we are already reaching the limits of what we can reasonably
accomplish with public CI services. Apache Arrow is a project with
sophistication and scope that is destined to outgrow what Travis CI
can provide within the scope of a single implementation, i.e.
C++/Python. For example, we're going to be past the 50 minute time
limit before too long. I think that continuing to constrain ourselves
by the 50 minute time limit will also limit the scope of what kinds of
automated testing we can employ, to our long term detriment. We also
have things (like GPU support) that we cannot test there.

Considering more mature data projects in the ASF that I'm familiar
with: Kudu, Impala, Spark: none of these projects use Travis CI. Their
testing uses Jenkins build slaves and runs much longer than our CI
jobs. If we used beefier build slaves, our builds would also run much
faster.

So, what should we do? Well, part of why I have recently created an
organization (https://ursalabs.org/) dedicated to Arrow development is
to have the financial means and the engineering resources to actually
do something about problems like these. I would propose to make an
investment of hardware and engineering time to augment our ability to
test the repository to make sure we can manage 5-10x the current test
runtime that we have now. If I have to personally halt feature
development and focus on build and development tooling for a while, so
be it. We've already spent many months this year on packaging
automation but we are still coming up short in development tooling. If
anyone reading has funds to invest in hardware resources, please let
me know.

As Clint Eastwood's character said in "The Good, The Bad, and The
Ugly", "$200,000 is a lot of money. We're gonna have to earn it."

FWIW: I am not sure Parquet is a good example of a better way to be.
Parquet lacks automated integration tests (terrifying to me) and
failed to grow a community outside of the Java world until 2016 when a
few of us started building out the C++ library.

- Wes