Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/10/10 10:02:59 UTC

[C++] The quest for zero-dependency builds

Hi all,

I'm a bit concerned that we're planning to add many additional build
options in the quest to have a core zero-dependency build in C++.
See for example https://issues.apache.org/jira/browse/ARROW-6633 or
https://issues.apache.org/jira/browse/ARROW-6612.

The problem is that this is creating many possible configurations and we
will only be testing a tiny subset of them.  Inevitably, users will try
other option combinations and they'll fail building for some random
reason.  It will not be a very good user experience.

Another related issue is user perception when doing a default build.
For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
build with jemalloc disabled by default.  Inevitably, people will be
doing benchmarks with this (publicly or not) and they'll conclude Arrow
is not as performant as it claims to be.

Perhaps we should look for another approach instead?

For example we could have a single ARROW_BARE_CORE (whatever the name)
option that when enabled (not by default) builds the tiniest minimal
subset of Arrow.  It's more inflexible, but at least it's something that
we can reasonably test.

Regards

Antoine.

Re: [C++] The quest for zero-dependency builds

Posted by Micah Kornfield <em...@gmail.com>.

I'll add that I don't think we will actually be switching anytime soon.  Bazel
does have some advantages, at least over our current CMake system, in terms
of developer productivity (users can target smaller components with unit
tests, which avoids relinking).  I've started on a prototype and hope to
have something to share in the next few days, so we can evaluate whether it is
reasonable to have the two live side-by-side in the short term.

On Wed, Oct 23, 2019 at 4:11 PM Wes McKinney <we...@gmail.com> wrote:

> On Sun, Oct 20, 2019 at 12:22 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
> >
> > Dev's
> >
> > I would request to be as conservative as possible in choosing (keeping) a build system.
> >
> > For developers, packagers and even end-users for some languages the build system is just
> > another dependency. Even if cmake is not ideal, it has become quite ubiquitous which is a huge plus.
> >
> > Maybe it is possible to come up with a way of expressing the dependency relations in cmake in
> > a way that makes maintaining them easier. Otherwise it is maybe possible to generate them from
> > a (simple) description file?
>
> There do seem to be parts of our CMake build system that contain
> boilerplate (particularly some of the platform-specific export
> defines) that might be better auto-generated in some way, so this is
> something it would be worth looking more at.
>
> FWIW, some Google projects I have seen offer CMake as a build option
> but the CMake files are mostly auto-generated from another build
> configuration.

Re: [C++] The quest for zero-dependency builds

Posted by Wes McKinney <we...@gmail.com>.

On Sun, Oct 20, 2019 at 12:22 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>
> Dev's
>
> I would request to be as conservative as possible in choosing (keeping) a build system.
>
> For developers, packagers and even end-users for some languages the build system is just
> another dependency. Even if cmake is not ideal, it has become quite ubiquitous which is a huge plus.
>
> Maybe it is possible to come up with a way of expressing the dependency relations in cmake in
> a way that makes maintaining them easier. Otherwise it is maybe possible to generate them from
> a (simple) description file?

There do seem to be parts of our CMake build system that contain
boilerplate (particularly some of the platform-specific export
defines) that might be better auto-generated in some way, so this is
something it would be worth looking more at.

FWIW, some Google projects I have seen offer CMake as a build option
but the CMake files are mostly auto-generated from another build
configuration.

>
> Cheers,
> Maarten.
>
>
> > On Oct 19, 2019, at 11:22 PM, Micah Kornfield <em...@gmail.com> wrote:
> >
> >>
> >> Perhaps meson is also worth exploring?
> >
> >
> > It could be. If someone else wants to take a look, we can compare what
> > things look like in each. Recently, Bazel build rules seem like they would be
> > useful for some work projects I've been dealing with, so I plan on focusing
> > my exploration there.
> >
> > On Wed, Oct 16, 2019 at 6:27 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >>
> >> Perhaps meson is also worth exploring?
> >>
> >>
> >> Le 15/10/2019 à 23:06, Micah Kornfield a écrit :
> >>> Hi Wes,
> >>> I agree on both counts: it won't be done in the short term, and it
> >>> makes sense to tackle it incrementally.  Like I said, I don't have much
> >>> bandwidth at the moment but might be able to re-arrange a few things on my
> >>> plate.  I think some people have asked on the mailing list how they might
> >>> be able to help; this might be one area that doesn't require a lot of
> >>> in-depth knowledge of C++, at least for a proof of concept.  I'll try to
> >>> open up some JIRAs soon.
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <we...@gmail.com>
> >> wrote:
> >>>
> >>>> hi Micah,
> >>>>
> >>>> Definitely Bazel is worth exploring, but we must be realistic about
> >>>> the amount of energy (several hundred hours or more) that's been
> >>>> invested in the build system we have now. So a new build system will
> >>>> be a large endeavor, but hopefully can make things simpler.
> >>>>
> >>>> Aside from the requirements gathering process, if it is felt that
> >>>> Bazel is a possible path forward in the future, it may be good to try
> >>>> to break up the work into more tractable pieces. For example, a first
> >>>> step would be to set up Bazel configurations to build the project's
> >>>> thirdparty toolchain. Since we're reliant on ExternalProject in CMake
> >>>> to do a lot of heavy lifting there for us, I imagine this (taking care
> >>>> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
> >>>> energy.
> >>>>
> >>>> - Wes
> >>>>
> >>>> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <em...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>> This might be taking the thread on more of a tangent, but maybe we should
> >>>>> start collecting requirements for the C++ build system in general and see
> >>>>> if there might be a better solution that can address some of these concerns?
> >>>>> In particular, Bazel at least on the surface seems like it might be a
> >>>>> better fit for some of the use cases discussed here.  I know this is a big
> >>>>> project (and I currently don't have much bandwidth for it) but I think if
> >>>>> CMake is lacking in these areas it might be worth at least exploring
> >>>>> instead of going down the path of building our own meta-build system on top
> >>>>> of CMake.
> >>>>>
> >>>>> Requirements that I think we are targeting:
> >>>>> 1.  Be able to provide an out of box build system that requires as close to
> >>>>> zero dependencies as possible beyond a standard C++ toolchain (e.g. "$BUILD
> >>>>> minimal" works on any C++ developer's desktop without additional requirements)
> >>>>> 2.  The build system should limit configuration knobs in favor of implied
> >>>>> dependencies (e.g. "$BUILD python" automatically builds "compute",
> >>>>> "filesystem", "ipc")
> >>>>> 3.  The build system should be configurable to use (and have the user
> >>>>> specify) one of "System packages", "Conda packages" or source packages for
> >>>>> providing dependencies (and fallback options between the three).
> >>>>> 4.  The build system should be able to treat some dependencies as optional
> >>>>> (e.g. different compression libraries or allocators).
> >>>>> 5.  Easily allow developers to limit building unnecessary code for their
> >>>>> particular task at hand.
> >>>>> 6.  The build system must work across the following toolchains/platforms:
> >>>>>     - Linux:  g++ and clang.  x86 and ARM
> >>>>>     - Mac
> >>>>>     - Windows (msys2 and MSVC)
> >>>>>
> >>>>> Thanks,
> >>>>> Micah
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <an...@python.org>
> >>>> wrote:
> >>>>>
> >>>>>>
> >>>>>> Yes, we could express dependencies in a Python script and have it
> >>>>>> generate a CMake module of if/else chains in cmake_modules (which we
> >>>>>> would check in git to avoid having people depend on a Python install,
> >>>>>> perhaps).
> >>>>>>
> >>>>>> Still, that is an additional maintenance burden.
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Antoine.
> >>>>>>
> >>>>>>
> >>>>>> Le 10/10/2019 à 14:50, Wes McKinney a écrit :
> >>>>>>> I guess one question we should first discuss is: who is the C++ build
> >>>>>>> system for?
> >>>>>>>
> >>>>>>> The users who are most sensitive to benchmark-driven decision making
> >>>>>>> will generally be consuming the project through pre-built binaries,
> >>>>>>> like our Python or R packages. If C++ developers build the project
> >>>>>>> from source and don't do a minimal read of the documentation to see
> >>>>>>> what a "recommended configuration" looks like, I would say that is
> >>>>>>> more their fault than ours. In the case of the ARROW_JEMALLOC option,
> >>>>>>> I think it's important for C++ system integrators to be aware of the
> >>>>>>> impact of the choice of memory allocator.
> >>>>>>>
> >>>>>>> The concern I have with the current "out of the box" experience is
> >>>>>>> that people are getting the impression that "I have to build $X, $Y,
> >>>>>>> and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> >>>>>>> They can, of course, read the documentation and learn that those
> >>>>>>> things can be toggled off, but I think the user that reaches for a
> >>>>>>> self-built source install is much different in general than someone
> >>>>>>> who uses the project through the Linux binary packages, for example.
> >>>>>>>
> >>>>>>> On the subject of managing intraproject dependencies and
> >>>>>>> relationships, I think we should develop a better way to express
> >>>>>>> relationships between components than we have now.
> >>>>>>>
> >>>>>>> As an example, building the Python library assumes that various
> >>>>>>> components are enabled
> >>>>>>>
> >>>>>>> - ARROW_COMPUTE=ON
> >>>>>>> - ARROW_FILESYSTEM=ON
> >>>>>>> - ARROW_IPC=ON
> >>>>>>>
> >>>>>>> Somewhere in the code we might have some code like
> >>>>>>>
> >>>>>>> if (ARROW_PYTHON)
> >>>>>>>   set(ARROW_COMPUTE ON)
> >>>>>>>   ...
> >>>>>>> endif()
> >>>>>>>
> >>>>>>> This doesn't strike me as that scalable. I would rather see a
> >>>>>>> dependency file like
> >>>>>>>
> >>>>>>> component_dependencies = {
> >>>>>>>     ...
> >>>>>>>     'python': ['compute', 'filesystem', 'ipc'],
> >>>>>>>     ...
> >>>>>>> }
> >>>>>>>
> >>>>>>> A helper Python script as part of the build could be used to give
> >>>>>>> CMake (because CMake is a bit poor as a programming language) the list
> >>>>>>> of required components based on what the user has indicated to CMake.
> >>>>>>>
> >>>>>>> On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
> >>>>>>> <fs...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> There's always the route of vendoring some libraries and not exposing
> >>>>>>>> external CMake options. This would achieve the goal of
> >>>>>>>> compile-out-of-the-box and enable important features in the basic
> >>>>>>>> build. We would also simplify dependency requirements (which benefits CI
> >>>>>>>> and developers). The downsides are having to follow security patches and
> >>>>>>>> grumpy reactions from package maintainers. I think we should explore this
> >>>>>>>> route for dependencies that match the following criteria:
> >>>>>>>>
> >>>>>>>> - libarrow*.so doesn't export any of the symbols of the dependency, and
> >>>>>>>> the dependency is not referenced in any public headers
> >>>>>>>> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
> >>>>>>>> thrift, protobuf
> >>>>>>>> - dependency is not ubiquitous on major platforms and has a stable
> >>>>>>>> API, e.g. excludes libz and openssl
> >>>>>>>>
> >>>>>>>> A small list of candidates:
> >>>>>>>> - RapidJSON (enables JSON)
> >>>>>>>> - DoubleConversion (enables CSV)
> >>>>>>>>
> >>>>>>>> There's a precedent, arrow already vendors small C++ libraries
> >>>>>>>> (datetime, utf8cpp, variant, xxhash).
> >>>>>>>>
> >>>>>>>> François
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <an...@python.org>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> I'm a bit concerned that we're planning to add many additional build
> >>>>>>>>> options in the quest to have a core zero-dependency build in C++.
> >>>>>>>>> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> >>>>>>>>> https://issues.apache.org/jira/browse/ARROW-6612.
> >>>>>>>>>
> >>>>>>>>> The problem is that this is creating many possible configurations and we
> >>>>>>>>> will only be testing a tiny subset of them.  Inevitably, users will try
> >>>>>>>>> other option combinations and they'll fail building for some random
> >>>>>>>>> reason.  It will not be a very good user experience.
> >>>>>>>>>
> >>>>>>>>> Another related issue is user perception when doing a default build.
> >>>>>>>>> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> >>>>>>>>> build with jemalloc disabled by default.  Inevitably, people will be
> >>>>>>>>> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> >>>>>>>>> is not as performant as it claims to be.
> >>>>>>>>>
> >>>>>>>>> Perhaps we should look for another approach instead?
> >>>>>>>>>
> >>>>>>>>> For example we could have a single ARROW_BARE_CORE (whatever the name)
> >>>>>>>>> option that when enabled (not by default) builds the tiniest minimal
> >>>>>>>>> subset of Arrow.  It's more inflexible, but at least it's something that
> >>>>>>>>> we can reasonably test.
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>>
> >>>>>>>>> Antoine.
> >>>>>>
> >>>>
> >>>
> >>
>

Re: [C++] The quest for zero-dependency builds

Posted by Maarten Ballintijn <ma...@xs4all.nl>.

Dev's

I would request to be as conservative as possible in choosing (keeping) a build system.

For developers, packagers and even end-users for some languages the build system is just
another dependency. Even if cmake is not ideal, it has become quite ubiquitous which is a huge plus.

Maybe it is possible to come up with a way of expressing the dependency relations in cmake in
a way that makes maintaining them easier. Otherwise it is maybe possible to generate them from
a (simple) description file?
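
A minimal sketch of what generating the CMake side from a simple description
could look like. Only the 'python' entry (compute, filesystem, ipc) comes from
this thread; the other entries, the function names and the ARROW_<NAME> output
convention are assumptions made for illustration, not an existing Arrow script.

```python
# Sketch: expand a component dependency map and emit CMake "set(... ON)" lines.
# Only the 'python' entry is taken from the thread; everything else is hypothetical.
COMPONENT_DEPENDENCIES = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def resolve(requested, deps=COMPONENT_DEPENDENCIES):
    """Return the requested components plus all transitive dependencies, sorted."""
    enabled = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in enabled:
            continue  # already handled; also guards against cycles
        if name not in deps:
            raise KeyError(f"unknown component: {name}")
        enabled.add(name)
        stack.extend(deps[name])
    return sorted(enabled)


def to_cmake(components):
    """Render one CMake set() line per enabled component."""
    return "\n".join(f"set(ARROW_{c.upper()} ON)" for c in components)


if __name__ == "__main__":
    print(to_cmake(resolve(["python"])))
```

The generated text could be checked into git, so that a plain CMake build would
not need Python at configure time.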

Cheers,
Maarten.


> On Oct 19, 2019, at 11:22 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
>> 
>> Perhaps meson is also worth exploring?
> 
> 
> It could be. If someone else wants to take a look, we can compare what
> things look like in each. Recently, Bazel build rules seem like they would be
> useful for some work projects I've been dealing with, so I plan on focusing
> my exploration there.
> 
> On Wed, Oct 16, 2019 at 6:27 AM Antoine Pitrou <an...@python.org> wrote:
> 
>> 
>> Perhaps meson is also worth exploring?
>> 
>> 
>> Le 15/10/2019 à 23:06, Micah Kornfield a écrit :
>>> Hi Wes,
>>> I agree on both counts: it won't be done in the short term, and it
>>> makes sense to tackle it incrementally.  Like I said, I don't have much
>>> bandwidth at the moment but might be able to re-arrange a few things on my
>>> plate.  I think some people have asked on the mailing list how they might
>>> be able to help; this might be one area that doesn't require a lot of
>>> in-depth knowledge of C++, at least for a proof of concept.  I'll try to
>>> open up some JIRAs soon.
>>> 
>>> Thanks,
>>> Micah
>>> 
>>> On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <we...@gmail.com>
>> wrote:
>>> 
>>>> hi Micah,
>>>> 
>>>> Definitely Bazel is worth exploring, but we must be realistic about
>>>> the amount of energy (several hundred hours or more) that's been
>>>> invested in the build system we have now. So a new build system will
>>>> be a large endeavor, but hopefully can make things simpler.
>>>> 
>>>> Aside from the requirements gathering process, if it is felt that
>>>> Bazel is a possible path forward in the future, it may be good to try
>>>> to break up the work into more tractable pieces. For example, a first
>>>> step would be to set up Bazel configurations to build the project's
>>>> thirdparty toolchain. Since we're reliant on ExternalProject in CMake
>>>> to do a lot of heavy lifting there for us, I imagine this (taking care
>>>> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
>>>> energy.
>>>> 
>>>> - Wes
>>>> 
>>>> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>> This might be taking the thread on more of a tangent, but maybe we should
>>>>> start collecting requirements for the C++ build system in general and see
>>>>> if there might be a better solution that can address some of these concerns?
>>>>> In particular, Bazel at least on the surface seems like it might be a
>>>>> better fit for some of the use cases discussed here.  I know this is a big
>>>>> project (and I currently don't have much bandwidth for it) but I think if
>>>>> CMake is lacking in these areas it might be worth at least exploring
>>>>> instead of going down the path of building our own meta-build system on top
>>>>> of CMake.
>>>>> 
>>>>> Requirements that I think we are targeting:
>>>>> 1.  Be able to provide an out of box build system that requires as close to
>>>>> zero dependencies as possible beyond a standard C++ toolchain (e.g. "$BUILD
>>>>> minimal" works on any C++ developer's desktop without additional requirements)
>>>>> 2.  The build system should limit configuration knobs in favor of implied
>>>>> dependencies (e.g. "$BUILD python" automatically builds "compute",
>>>>> "filesystem", "ipc")
>>>>> 3.  The build system should be configurable to use (and have the user
>>>>> specify) one of "System packages", "Conda packages" or source packages for
>>>>> providing dependencies (and fallback options between the three).
>>>>> 4.  The build system should be able to treat some dependencies as optional
>>>>> (e.g. different compression libraries or allocators).
>>>>> 5.  Easily allow developers to limit building unnecessary code for their
>>>>> particular task at hand.
>>>>> 6.  The build system must work across the following toolchains/platforms:
>>>>>     - Linux:  g++ and clang.  x86 and ARM
>>>>>     - Mac
>>>>>     - Windows (msys2 and MSVC)
>>>>> 
>>>>> Thanks,
>>>>> Micah
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <an...@python.org>
>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> Yes, we could express dependencies in a Python script and have it
>>>>>> generate a CMake module of if/else chains in cmake_modules (which we
>>>>>> would check in git to avoid having people depend on a Python install,
>>>>>> perhaps).
>>>>>> 
>>>>>> Still, that is an additional maintenance burden.
>>>>>> 
>>>>>> Regards
>>>>>> 
>>>>>> Antoine.
>>>>>> 
>>>>>> 
>>>>>> Le 10/10/2019 à 14:50, Wes McKinney a écrit :
>>>>>>> I guess one question we should first discuss is: who is the C++ build
>>>>>>> system for?
>>>>>>> 
>>>>>>> The users who are most sensitive to benchmark-driven decision making
>>>>>>> will generally be consuming the project through pre-built binaries,
>>>>>>> like our Python or R packages. If C++ developers build the project
>>>>>>> from source and don't do a minimal read of the documentation to see
>>>>>>> what a "recommended configuration" looks like, I would say that is
>>>>>>> more their fault than ours. In the case of the ARROW_JEMALLOC option,
>>>>>>> I think it's important for C++ system integrators to be aware of the
>>>>>>> impact of the choice of memory allocator.
>>>>>>> 
>>>>>>> The concern I have with the current "out of the box" experience is
>>>>>>> that people are getting the impression that "I have to build $X, $Y,
>>>>>>> and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
>>>>>>> They can, of course, read the documentation and learn that those
>>>>>>> things can be toggled off, but I think the user that reaches for a
>>>>>>> self-built source install is much different in general than someone
>>>>>>> who uses the project through the Linux binary packages, for example.
>>>>>>> 
>>>>>>> On the subject of managing intraproject dependencies and
>>>>>>> relationships, I think we should develop a better way to express
>>>>>>> relationships between components than we have now.
>>>>>>> 
>>>>>>> As an example, building the Python library assumes that various
>>>>>>> components are enabled
>>>>>>> 
>>>>>>> - ARROW_COMPUTE=ON
>>>>>>> - ARROW_FILESYSTEM=ON
>>>>>>> - ARROW_IPC=ON
>>>>>>> 
>>>>>>> Somewhere in the code we might have some code like
>>>>>>> 
>>>>>>> if (ARROW_PYTHON)
>>>>>>>   set(ARROW_COMPUTE ON)
>>>>>>>   ...
>>>>>>> endif()
>>>>>>> 
>>>>>>> This doesn't strike me as that scalable. I would rather see a
>>>>>>> dependency file like
>>>>>>> 
>>>>>>> component_dependencies = {
>>>>>>>     ...
>>>>>>>     'python': ['compute', 'filesystem', 'ipc'],
>>>>>>>     ...
>>>>>>> }
>>>>>>> 
>>>>>>> A helper Python script as part of the build could be used to give
>>>>>>> CMake (because CMake is a bit poor as a programming language) the
>>>> list
>>>>>>> of required components based on what the user has indicated to CMake.
>>>>>>> 
>>>>>>> On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
>>>>>>> <fs...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> There's always the route of vendoring some library and not exposing
>>>>>>>> external CMake options. This would achieve the goal of
>>>>>>>> compile-out-of-the-box and enable important feature in the basic
>>>>>>>> build. We also simplify dependencies requirements (benefits CI or
>>>>>>>> developer). The downside is following security patches and grumpy
>>>>>>>> reaction from package maintainers. I think we should explore this
>>>>>>>> route for dependencies that match the following criteria:
>>>>>>>> 
>>>>>>>> - libarrow*.so don't export any of the symbols of the dependency and
>>>>>>>> not referenced in any public headers
>>>>>>>> - dependency is lightweight, e.g. excludes boost, openssl, grpc,
>>>> llvm,
>>>>>>>> thrift, protobuf
>>>>>>>> - dependency is not-ubiquitous on major platform and have a stable
>>>>>>>> API, e.g. excludes libz and openssl
>>>>>>>> 
>>>>>>>> A small list of candidates:
>>>>>>>> - RapidJSON (enables JSON)
>>>>>>>> - DoubleConversion (enables CSV)
>>>>>>>> 
>>>>>>>> There's a precedent, arrow already vendors small C++ libraries
>>>>>>>> (datetime, utf8cpp, variant, xxhash).
>>>>>>>> 
>>>>>>>> François
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <an...@python.org>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I'm a bit concerned that we're planning to add many additional
>>>> build
>>>>>>>>> options in the quest to have a core zero-dependency build in C++.
>>>>>>>>> See for example https://issues.apache.org/jira/browse/ARROW-6633
>>>> or
>>>>>>>>> https://issues.apache.org/jira/browse/ARROW-6612.
>>>>>>>>> 
>>>>>>>>> The problem is that this is creating many possible configurations
>>>> and
>>>>>> we
>>>>>>>>> will only be testing a tiny subset of them.  Inevitably, users
>>>> will try
>>>>>>>>> other option combinations and they'll fail building for some random
>>>>>>>>> reason.  It will not be a very good user experience.
>>>>>>>>> 
>>>>>>>>> Another related issue is user perception when doing a default
>>>> build.
>>>>>>>>> For example https://issues.apache.org/jira/browse/ARROW-6638
>>>> proposes
>>>>>> to
>>>>>>>>> build with jemalloc disabled by default.  Inevitably, people will
>>>> be
>>>>>>>>> doing benchmarks with this (publicly or not) and they'll conclude
>>>> Arrow
>>>>>>>>> is not as performant as it claims to be.
>>>>>>>>> 
>>>>>>>>> Perhaps we should look for another approach instead?
>>>>>>>>> 
>>>>>>>>> For example we could have a single ARROW_BARE_CORE (whatever the
>>>> name)
>>>>>>>>> option that when enabled (not by default) builds the tiniest
>>>> minimal
>>>>>>>>> subset of Arrow.  It's more inflexible, but at least it's something
>>>>>> that
>>>>>>>>> we can reasonably test.
>>>>>>>>> 
>>>>>>>>> Regards
>>>>>>>>> 
>>>>>>>>> Antoine.
>>>>>> 
>>>> 
>>> 
>>
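
The component-dependency idea quoted above (a mapping such as
'python': ['compute', 'filesystem', 'ipc'], expanded by a helper script
into CMake settings) could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the project's actual build
tooling: the names COMPONENT_DEPENDENCIES, resolve and to_cmake are
hypothetical, only the 'python' entry comes from the thread, and the
ARROW_<NAME> variable naming is extrapolated from the ARROW_COMPUTE,
ARROW_FILESYSTEM and ARROW_IPC options mentioned in the discussion.

```python
"""Sketch: expand component dependencies and emit CMake set() lines."""

# Hypothetical mapping modelled on the example quoted in the thread.
# Only the 'python' entry comes from the discussion; the empty entries
# are placeholders so that every component name is known.
COMPONENT_DEPENDENCIES = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def resolve(requested, deps=COMPONENT_DEPENDENCIES):
    """Return the requested components plus all transitive dependencies."""
    resolved = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue  # already handled; this also guards against cycles
        if name not in deps:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        stack.extend(deps[name])
    return sorted(resolved)


def to_cmake(components):
    """Render one set(ARROW_<NAME> ON) line per component."""
    return "\n".join(f"set(ARROW_{name.upper()} ON)" for name in components)


if __name__ == "__main__":
    # Prints the CMake lines for everything that 'python' implies.
    print(to_cmake(resolve(["python"])))
```

The generated text could be written to a file under cmake_modules and
checked in, as suggested in the thread, so that a normal build would not
need a Python install.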

Re: [C++] The quest for zero-dependency builds

Posted by Micah Kornfield <em...@gmail.com>.

>
> Perhaps meson is also worth exploring?


It could be. If someone else wants to take a look, we can compare what
things look like in each. Recently, Bazel build rules seem like they would be
useful for some work projects I've been dealing with, so I plan on focusing
my exploration there.

On Wed, Oct 16, 2019 at 6:27 AM Antoine Pitrou <an...@python.org> wrote:

>
> Perhaps meson is also worth exploring?
>
>
> Le 15/10/2019 à 23:06, Micah Kornfield a écrit :
> > Hi Wes,
> > I agree on both accounts that it won't be a done in the short term, and
> it
> > makes sense to tackle in incrementally.  Like I said I don't have much
> > bandwidth at the moment but might be able to re-arrange a few things on
> my
> > plate.  I think some people have asked on the mailing list how they might
> > be able to help, this might be one area that doesn't require a lot of
> > in-depth knowledge of C++ at least for a proof of concept.  I'll try to
> > open up some JIRAs soon.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> hi Micah,
> >>
> >> Definitely Bazel is worth exploring, but we must be realistic about
> >> the amount of energy (several hundred hours or more) that's been
> >> invested in the build system we have now. So a new build system will
> >> be a large endeavor, but hopefully can make things simpler.
> >>
> >> Aside from the requirements gathering process, if it is felt that
> >> Bazel is a possible path forward in the future, it may be good to try
> >> to break up the work into more tractable pieces. For example, a first
> >> step would be to set up Bazel configurations to build the project's
> >> thirdparty toolchain. Since we're reliant in ExternalProject in CMake
> >> to do a lot of heavy lifting there for us, I imagine this (taking care
> >> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
> >> energy
> >>
> >> - Wes
> >>
> >> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <em...@gmail.com>
> >> wrote:
> >>>
> >>>>
> >>>>
> >>>> This might be taking the thread on more of a tangent, but maybe we
> >> should
> >>> start collecting requirements for the C++ build system in general and
> see
> >>> if there might be better solution that can address some of these
> >> concerns?
> >>> In particular, Bazel at least on the surface seems like it might be a
> >>> better fit for some of the use cases discussed here.  I know this is a
> >> big
> >>> project (and I currently don't have much bandwidth for it) but I think
> if
> >>> CMake is lacking in these areas it might be worth at least exploring
> >>> instead of going down the path of building our own meta-build system on
> >> top
> >>> of CMake.
> >>>
> >>> Requirements that I think we are targeting:
> >>> 1.  Be able to provide an out of box build system that requires as
> close
> >> to
> >>> zero dependencies beyond a standard C++ toolchain (e.g. "$BUILD
> minimal"
> >>> works on any C++ developers desktop without additional requirements)
> >>> 2.  The build system should limit configuration knobs in favor of
> implied
> >>> dependencies (e.g. "$BUILD python" automatically builds "compute",
> >>> "filesystem", "ipc")
> >>> 3.  The build system should be configurable to use (and have the user
> >>> specify) one of "System packages", "Conda packages" or source packages
> >> for
> >>> providing dependencies (and fallback options between the three).
> >>> 4.  The build system should be able to treat some dependencies as
> >> optional
> >>> (e.g. different compression libraries or allocators).
> >>> 5.  Easily allow developers to limit building unnecessary code for
> their
> >>> particular task at hand.
> >>> 6.  The build system must work across the following
> toolchains/platforms:
> >>>      - Linux:  g++ and clang.  x86 and ARM
> >>>      - Mac
> >>>      - Windows (msys2 and MSVC)
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>>
> >>>
> >>> On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <an...@python.org>
> >> wrote:
> >>>
> >>>>
> >>>> Yes, we could express dependencies in a Python script and have it
> >>>> generate a CMake module of if/else chains in cmake_modules (which we
> >>>> would check in git to avoid having people depend on a Python install,
> >>>> perhaps).
> >>>>
> >>>> Still, that is an additional maintenance burden.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> Le 10/10/2019 à 14:50, Wes McKinney a écrit :
> >>>>> I guess one question we should first discuss is: who is the C++ build
> >>>>> system for?
> >>>>>
> >>>>> The users who are most sensitive to benchmark-driven decision making
> >>>>> will generally be consuming the project through pre-built binaries,
> >>>>> like our Python or R packages. If C++ developers build the project
> >>>>> from source and don't do a minimal read of the documentation to see
> >>>>> what a "recommended configuration" looks like, I would say that is
> >>>>> more their fault than ours. In the case of the ARROW_JEMALLOC option,
> >>>>> I think it's important for C++ system integrators to be aware of the
> >>>>> impact of the choice of memory allocator.
> >>>>>
> >>>>> The concern I have with the current "out of the box" experience is
> >>>>> that people are getting the impression that "I have to build $X, $Y,
> >>>>> and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> >>>>> They can, of course, read the documentation and learn that those
> >>>>> things can be toggled off, but I think the user that reaches for a
> >>>>> self-built source install is much different in general than someone
> >>>>> who uses the project through the Linux binary packages, for example.
> >>>>>
> >>>>> On the subject of managing intraproject dependencies and
> >>>>> relationships, I think we should develop a better way to express
> >>>>> relationships between components than we have now.
> >>>>>
> >>>>> As an example, building the Python library assumes that various
> >>>>> components are enabled
> >>>>>
> >>>>> - ARROW_COMPUTE=ON
> >>>>> - ARROW_FILESYSTEM=ON
> >>>>> - ARROW_IPC=ON
> >>>>>
> >>>>> Somewhere in the code we might have some code like
> >>>>>
> >>>>> if (ARROW_PYTHON)
> >>>>>    set(ARROW_COMPUTE ON)
> >>>>>    ...
> >>>>> endif()
> >>>>>
> >>>>> This doesn't strike me as that scalable. I would rather see a
> >>>>> dependency file like
> >>>>>
> >>>>> component_dependencies = {
> >>>>>      ...
> >>>>>      'python': ['compute', 'filesystem', 'ipc'],
> >>>>>      ...
> >>>>> }
> >>>>>
> >>>>> A helper Python script as part of the build could be used to give
> >>>>> CMake (because CMake is a bit poor as a programming language) the
> >> list
> >>>>> of required components based on what the user has indicated to CMake.
> >>>>>
> >>>>> On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
> >>>>> <fs...@gmail.com> wrote:
> >>>>>>
> >>>>>> There's always the route of vendoring some library and not exposing
> >>>>>> external CMake options. This would achieve the goal of
> >>>>>> compile-out-of-the-box and enable important feature in the basic
> >>>>>> build. We also simplify dependencies requirements (benefits CI or
> >>>>>> developer). The downside is following security patches and grumpy
> >>>>>> reaction from package maintainers. I think we should explore this
> >>>>>> route for dependencies that match the following criteria:
> >>>>>>
> >>>>>> - libarrow*.so don't export any of the symbols of the dependency and
> >>>>>> not referenced in any public headers
> >>>>>> - dependency is lightweight, e.g. excludes boost, openssl, grpc,
> >> llvm,
> >>>>>> thrift, protobuf
> >>>>>> - dependency is not-ubiquitous on major platform and have a stable
> >>>>>> API, e.g. excludes libz and openssl
> >>>>>>
> >>>>>> A small list of candidates:
> >>>>>> - RapidJSON (enables JSON)
> >>>>>> - DoubleConversion (enables CSV)
> >>>>>>
> >>>>>> There's a precedent, arrow already vendors small C++ libraries
> >>>>>> (datetime, utf8cpp, variant, xxhash).
> >>>>>>
> >>>>>> François
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <an...@python.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I'm a bit concerned that we're planning to add many additional
> >> build
> >>>>>>> options in the quest to have a core zero-dependency build in C++.
> >>>>>>> See for example https://issues.apache.org/jira/browse/ARROW-6633
> >> or
> >>>>>>> https://issues.apache.org/jira/browse/ARROW-6612.
> >>>>>>>
> >>>>>>> The problem is that this is creating many possible configurations
> >> and
> >>>> we
> >>>>>>> will only be testing a tiny subset of them.  Inevitably, users
> >> will try
> >>>>>>> other option combinations and they'll fail building for some random
> >>>>>>> reason.  It will not be a very good user experience.
> >>>>>>>
> >>>>>>> Another related issue is user perception when doing a default
> >> build.
> >>>>>>> For example https://issues.apache.org/jira/browse/ARROW-6638
> >> proposes
> >>>> to
> >>>>>>> build with jemalloc disabled by default.  Inevitably, people will
> >> be
> >>>>>>> doing benchmarks with this (publicly or not) and they'll conclude
> >> Arrow
> >>>>>>> is not as performant as it claims to be.
> >>>>>>>
> >>>>>>> Perhaps we should look for another approach instead?
> >>>>>>>
> >>>>>>> For example we could have a single ARROW_BARE_CORE (whatever the
> >> name)
> >>>>>>> option that when enabled (not by default) builds the tiniest
> >> minimal
> >>>>>>> subset of Arrow.  It's more inflexible, but at least it's something
> >>>> that
> >>>>>>> we can reasonably test.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>
> >>
> >
>

Re: [C++] The quest for zero-dependency builds

Posted by Antoine Pitrou <an...@python.org>.

Perhaps meson is also worth exploring?


Le 15/10/2019 à 23:06, Micah Kornfield a écrit :
> Hi Wes,
> I agree on both accounts that it won't be a done in the short term, and it
> makes sense to tackle in incrementally.  Like I said I don't have much
> bandwidth at the moment but might be able to re-arrange a few things on my
> plate.  I think some people have asked on the mailing list how they might
> be able to help, this might be one area that doesn't require a lot of
> in-depth knowledge of C++ at least for a proof of concept.  I'll try to
> open up some JIRAs soon.
> 
> Thanks,
> Micah
> 
> On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <we...@gmail.com> wrote:
> 
>> hi Micah,
>>
>> Definitely Bazel is worth exploring, but we must be realistic about
>> the amount of energy (several hundred hours or more) that's been
>> invested in the build system we have now. So a new build system will
>> be a large endeavor, but hopefully can make things simpler.
>>
>> Aside from the requirements gathering process, if it is felt that
>> Bazel is a possible path forward in the future, it may be good to try
>> to break up the work into more tractable pieces. For example, a first
>> step would be to set up Bazel configurations to build the project's
>> thirdparty toolchain. Since we're reliant in ExternalProject in CMake
>> to do a lot of heavy lifting there for us, I imagine this (taking care
>> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
>> energy
>>
>> - Wes
>>
>> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>>
>>>>
>>>>
>>>> This might be taking the thread on more of a tangent, but maybe we
>> should
>>> start collecting requirements for the C++ build system in general and see
>>> if there might be better solution that can address some of these
>> concerns?
>>> In particular, Bazel at least on the surface seems like it might be a
>>> better fit for some of the use cases discussed here.  I know this is a
>> big
>>> project (and I currently don't have much bandwidth for it) but I think if
>>> CMake is lacking in these areas it might be worth at least exploring
>>> instead of going down the path of building our own meta-build system on
>> top
>>> of CMake.
>>>
>>> Requirements that I think we are targeting:
>>> 1.  Be able to provide an out of box build system that requires as close
>> to
>>> zero dependencies beyond a standard C++ toolchain (e.g. "$BUILD minimal"
>>> works on any C++ developers desktop without additional requirements)
>>> 2.  The build system should limit configuration knobs in favor of implied
>>> dependencies (e.g. "$BUILD python" automatically builds "compute",
>>> "filesystem", "ipc")
>>> 3.  The build system should be configurable to use (and have the user
>>> specify) one of "System packages", "Conda packages" or source packages
>> for
>>> providing dependencies (and fallback options between the three).
>>> 4.  The build system should be able to treat some dependencies as
>> optional
>>> (e.g. different compression libraries or allocators).
>>> 5.  Easily allow developers to limit building unnecessary code for their
>>> particular task at hand.
>>> 6.  The build system must work across the following toolchains/platforms:
>>>      - Linux:  g++ and clang.  x86 and ARM
>>>      - Mac
>>>      - Windows (msys2 and MSVC)
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>>
>>> On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <an...@python.org>
>> wrote:
>>>
>>>>
>>>> Yes, we could express dependencies in a Python script and have it
>>>> generate a CMake module of if/else chains in cmake_modules (which we
>>>> would check in git to avoid having people depend on a Python install,
>>>> perhaps).
>>>>
>>>> Still, that is an additional maintenance burden.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> Le 10/10/2019 à 14:50, Wes McKinney a écrit :
>>>>> I guess one question we should first discuss is: who is the C++ build
>>>>> system for?
>>>>>
>>>>> The users who are most sensitive to benchmark-driven decision making
>>>>> will generally be consuming the project through pre-built binaries,
>>>>> like our Python or R packages. If C++ developers build the project
>>>>> from source and don't do a minimal read of the documentation to see
>>>>> what a "recommended configuration" looks like, I would say that is
>>>>> more their fault than ours. In the case of the ARROW_JEMALLOC option,
>>>>> I think it's important for C++ system integrators to be aware of the
>>>>> impact of the choice of memory allocator.
>>>>>
>>>>> The concern I have with the current "out of the box" experience is
>>>>> that people are getting the impression that "I have to build $X, $Y,
>>>>> and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
>>>>> They can, of course, read the documentation and learn that those
>>>>> things can be toggled off, but I think the user that reaches for a
>>>>> self-built source install is much different in general than someone
>>>>> who uses the project through the Linux binary packages, for example.
>>>>>
>>>>> On the subject of managing intraproject dependencies and
>>>>> relationships, I think we should develop a better way to express
>>>>> relationships between components than we have now.
>>>>>
>>>>> As an example, building the Python library assumes that various
>>>>> components are enabled
>>>>>
>>>>> - ARROW_COMPUTE=ON
>>>>> - ARROW_FILESYSTEM=ON
>>>>> - ARROW_IPC=ON
>>>>>
>>>>> Somewhere in the code we might have some code like
>>>>>
>>>>> if (ARROW_PYTHON)
>>>>>    set(ARROW_COMPUTE ON)
>>>>>    ...
>>>>> endif()
>>>>>
>>>>> This doesn't strike me as that scalable. I would rather see a
>>>>> dependency file like
>>>>>
>>>>> component_dependencies = {
>>>>>      ...
>>>>>      'python': ['compute', 'filesystem', 'ipc'],
>>>>>      ...
>>>>> }
>>>>>
>>>>> A helper Python script as part of the build could be used to give
>>>>> CMake (because CMake is a bit poor as a programming language) the
>> list
>>>>> of required components based on what the user has indicated to CMake.
>>>>>
>>>>> On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
>>>>> <fs...@gmail.com> wrote:
>>>>>>
>>>>>> There's always the route of vendoring some library and not exposing
>>>>>> external CMake options. This would achieve the goal of
>>>>>> compile-out-of-the-box and enable important feature in the basic
>>>>>> build. We also simplify dependencies requirements (benefits CI or
>>>>>> developer). The downside is following security patches and grumpy
>>>>>> reaction from package maintainers. I think we should explore this
>>>>>> route for dependencies that match the following criteria:
>>>>>>
>>>>>> - libarrow*.so don't export any of the symbols of the dependency and
>>>>>> not referenced in any public headers
>>>>>> - dependency is lightweight, e.g. excludes boost, openssl, grpc,
>> llvm,
>>>>>> thrift, protobuf
>>>>>> - dependency is not-ubiquitous on major platform and have a stable
>>>>>> API, e.g. excludes libz and openssl
>>>>>>
>>>>>> A small list of candidates:
>>>>>> - RapidJSON (enables JSON)
>>>>>> - DoubleConversion (enables CSV)
>>>>>>
>>>>>> There's a precedent, arrow already vendors small C++ libraries
>>>>>> (datetime, utf8cpp, variant, xxhash).
>>>>>>
>>>>>> François
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <an...@python.org>
>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm a bit concerned that we're planning to add many additional
>> build
>>>>>>> options in the quest to have a core zero-dependency build in C++.
>>>>>>> See for example https://issues.apache.org/jira/browse/ARROW-6633
>> or
>>>>>>> https://issues.apache.org/jira/browse/ARROW-6612.
>>>>>>>
>>>>>>> The problem is that this is creating many possible configurations
>> and
>>>> we
>>>>>>> will only be testing a tiny subset of them.  Inevitably, users
>> will try
>>>>>>> other option combinations and they'll fail building for some random
>>>>>>> reason.  It will not be a very good user experience.
>>>>>>>
>>>>>>> Another related issue is user perception when doing a default
>> build.
>>>>>>> For example https://issues.apache.org/jira/browse/ARROW-6638
>> proposes
>>>> to
>>>>>>> build with jemalloc disabled by default.  Inevitably, people will
>> be
>>>>>>> doing benchmarks with this (publicly or not) and they'll conclude
>> Arrow
>>>>>>> is not as performant as it claims to be.
>>>>>>>
>>>>>>> Perhaps we should look for another approach instead?
>>>>>>>
>>>>>>> For example we could have a single ARROW_BARE_CORE (whatever the
>> name)
>>>>>>> option that when enabled (not by default) builds the tiniest
>> minimal
>>>>>>> subset of Arrow.  It's more inflexible, but at least it's something
>>>> that
>>>>>>> we can reasonably test.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>
>>
>
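
Requirement 3 in the quoted list (choose among system packages, Conda
packages or source builds, with fallback between the three) can be
illustrated with a small selection routine. This is a hedged sketch only:
the function pick_source, the SOURCES tuple and the `available` mapping
are hypothetical names for illustration, and the fallback order used here
is an assumption rather than anything decided in the thread.

```python
"""Sketch: choose where a third-party dependency comes from, with fallback."""

# The three providers named in requirement 3, in an assumed fallback order.
SOURCES = ("system", "conda", "source")


def pick_source(dependency, preferred, available):
    """Return the first source that provides `dependency`.

    The user's `preferred` source is tried first, then the remaining
    sources in SOURCES order. `available` maps a source name to the set
    of dependency names that source can provide (a stand-in for real
    probing of the system, a Conda environment, or bundled sources).
    """
    if preferred not in SOURCES:
        raise ValueError(f"unknown source: {preferred}")
    order = [preferred] + [s for s in SOURCES if s != preferred]
    for source in order:
        if dependency in available.get(source, set()):
            return source
    raise LookupError(f"no source provides {dependency}")


if __name__ == "__main__":
    available = {
        "system": {"zlib"},
        "source": {"zlib", "rapidjson"},
    }
    # The preferred source (conda) has neither, so each call falls back.
    print(pick_source("zlib", "conda", available))
    print(pick_source("rapidjson", "conda", available))
```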

Re: [C++] The quest for zero-dependency builds

Posted by Micah Kornfield <em...@gmail.com>.

Hi Wes,
I agree on both counts: it won't be done in the short term, and it
makes sense to tackle it incrementally.  Like I said, I don't have much
bandwidth at the moment but might be able to re-arrange a few things on my
plate.  I think some people have asked on the mailing list how they might
be able to help; this might be one area that doesn't require a lot of
in-depth knowledge of C++, at least for a proof of concept.  I'll try to
open up some JIRAs soon.

Thanks,
Micah

On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <we...@gmail.com> wrote:

> hi Micah,
>
> Definitely Bazel is worth exploring, but we must be realistic about
> the amount of energy (several hundred hours or more) that's been
> invested in the build system we have now. So a new build system will
> be a large endeavor, but hopefully can make things simpler.
>
> Aside from the requirements gathering process, if it is felt that
> Bazel is a possible path forward in the future, it may be good to try
> to break up the work into more tractable pieces. For example, a first
> step would be to set up Bazel configurations to build the project's
> thirdparty toolchain. Since we're reliant in ExternalProject in CMake
> to do a lot of heavy lifting there for us, I imagine this (taking care
> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
> energy
>
> - Wes
>
> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > >
> > >
> > > This might be taking the thread on more of a tangent, but maybe we
> should
> > start collecting requirements for the C++ build system in general and see
> > if there might be better solution that can address some of these
> concerns?
> > In particular, Bazel at least on the surface seems like it might be a
> > better fit for some of the use cases discussed here.  I know this is a
> big
> > project (and I currently don't have much bandwidth for it) but I think if
> > CMake is lacking in these areas it might be worth at least exploring
> > instead of going down the path of building our own meta-build system on
> top
> > of CMake.
> >
> > Requirements that I think we are targeting:
> > 1.  Be able to provide an out of box build system that requires as close
> to
> > zero dependencies beyond a standard C++ toolchain (e.g. "$BUILD minimal"
> > works on any C++ developers desktop without additional requirements)
> > 2.  The build system should limit configuration knobs in favor of implied
> > dependencies (e.g. "$BUILD python" automatically builds "compute",
> > "filesystem", "ipc")
> > 3.  The build system should be configurable to use (and have the user
> > specify) one of "System packages", "Conda packages" or source packages
> for
> > providing dependencies (and fallback options between the three).
> > 4.  The build system should be able to treat some dependencies as
> optional
> > (e.g. different compression libraries or allocators).
> > 5.  Easily allow developers to limit building unnecessary code for their
> > particular task at hand.
> > 6.  The build system must work across the following toolchains/platforms:
> >     - Linux:  g++ and clang.  x86 and ARM
> >     - Mac
> >     - Windows (msys2 and MSVC)
> >
> > Thanks,
> > Micah
> >
> >
> >
> > On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <an...@python.org>
> wrote:
> >
> > >
> > > Yes, we could express dependencies in a Python script and have it
> > > generate a CMake module of if/else chains in cmake_modules (which we
> > > would check in git to avoid having people depend on a Python install,
> > > perhaps).
> > >
> > > Still, that is an additional maintenance burden.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 10/10/2019 à 14:50, Wes McKinney a écrit :
> > > > I guess one question we should first discuss is: who is the C++ build
> > > > system for?
> > > >
> > > > The users who are most sensitive to benchmark-driven decision making
> > > > will generally be consuming the project through pre-built binaries,
> > > > like our Python or R packages. If C++ developers build the project
> > > > from source and don't do a minimal read of the documentation to see
> > > > what a "recommended configuration" looks like, I would say that is
> > > > more their fault than ours. In the case of the ARROW_JEMALLOC option,
> > > > I think it's important for C++ system integrators to be aware of the
> > > > impact of the choice of memory allocator.
> > > >
> > > > The concern I have with the current "out of the box" experience is
> > > > that people are getting the impression that "I have to build $X, $Y,
> > > > and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> > > > They can, of course, read the documentation and learn that those
> > > > things can be toggled off, but I think the user that reaches for a
> > > > self-built source install is much different in general than someone
> > > > who uses the project through the Linux binary packages, for example.
> > > >
> > > > On the subject of managing intraproject dependencies and
> > > > relationships, I think we should develop a better way to express
> > > > relationships between components than we have now.
> > > >
> > > > As an example, building the Python library assumes that various
> > > > components are enabled
> > > >
> > > > - ARROW_COMPUTE=ON
> > > > - ARROW_FILESYSTEM=ON
> > > > - ARROW_IPC=ON
> > > >
> > > > Somewhere in the code we might have some code like
> > > >
> > > > if (ARROW_PYTHON)
> > > >   set(ARROW_COMPUTE ON)
> > > >   ...
> > > > endif()
> > > >
> > > > This doesn't strike me as that scalable. I would rather see a
> > > > dependency file like
> > > >
> > > > component_dependencies = {
> > > >     ...
> > > >     'python': ['compute', 'filesystem', 'ipc'],
> > > >     ...
> > > > }
> > > >
> > > > A helper Python script as part of the build could be used to give
> > > > CMake (because CMake is a bit poor as a programming language) the
> list
> > > > of required components based on what the user has indicated to CMake.
> > > >
> > > > On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
> > > > <fs...@gmail.com> wrote:
> > > >>
> > > >> There's always the route of vendoring some library and not exposing
> > > >> external CMake options. This would achieve the goal of
> > > >> compile-out-of-the-box and enable important feature in the basic
> > > >> build. We also simplify dependencies requirements (benefits CI or
> > > >> developer). The downside is following security patches and grumpy
> > > >> reaction from package maintainers. I think we should explore this
> > > >> route for dependencies that match the following criteria:
> > > >>
> > > >> - libarrow*.so don't export any of the symbols of the dependency and
> > > >> not referenced in any public headers
> > > >> - dependency is lightweight, e.g. excludes boost, openssl, grpc,
> llvm,
> > > >> thrift, protobuf
> > > >> - dependency is not-ubiquitous on major platform and have a stable
> > > >> API, e.g. excludes libz and openssl
> > > >>
> > > >> A small list of candidates:
> > > >> - RapidJSON (enables JSON)
> > > >> - DoubleConversion (enables CSV)
> > > >>
> > > >> There's a precedent, arrow already vendors small C++ libraries
> > > >> (datetime, utf8cpp, variant, xxhash).
> > > >>
> > > >> François
> > > >>
> > > >>
> > > >> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > > >>>
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I'm a bit concerned that we're planning to add many additional
> build
> > > >>> options in the quest to have a core zero-dependency build in C++.
> > > >>> See for example https://issues.apache.org/jira/browse/ARROW-6633
> or
> > > >>> https://issues.apache.org/jira/browse/ARROW-6612.
> > > >>>
> > > >>> The problem is that this is creating many possible configurations
> and
> > > we
> > > >>> will only be testing a tiny subset of them.  Inevitably, users
> will try
> > > >>> other option combinations and they'll fail building for some random
> > > >>> reason.  It will not be a very good user experience.
> > > >>>
> > > >>> Another related issue is user perception when doing a default
> build.
> > > >>> For example https://issues.apache.org/jira/browse/ARROW-6638
> proposes
> > > to
> > > >>> build with jemalloc disabled by default.  Inevitably, people will
> be
> > > >>> doing benchmarks with this (publicly or not) and they'll conclude
> Arrow
> > > >>> is not as performant as it claims to be.
> > > >>>
> > > >>> Perhaps we should look for another approach instead?
> > > >>>
> > > >>> For example we could have a single ARROW_BARE_CORE (whatever the
> name)
> > > >>> option that when enabled (not by default) builds the tiniest
> minimal
> > > >>> subset of Arrow.  It's more inflexible, but at least it's something
> > > that
> > > >>> we can reasonably test.
> > > >>>
> > > >>> Regards
> > > >>>
> > > >>> Antoine.
> > >
>

Re: [C++] The quest for zero-dependency builds

Posted by Wes McKinney <we...@gmail.com>.

hi Micah,

Bazel is definitely worth exploring, but we must be realistic about
the amount of energy (several hundred hours or more) that's been
invested in the build system we have now. So a new build system will
be a large endeavor, but hopefully it can make things simpler.

Aside from the requirements gathering process, if it is felt that
Bazel is a possible path forward in the future, it may be good to try
to break up the work into more tractable pieces. For example, a first
step would be to set up Bazel configurations to build the project's
thirdparty toolchain. Since we're reliant on ExternalProject in CMake
to do a lot of heavy lifting there for us, I imagine this (taking care
of what ThirdpartyToolchain.cmake does not) will take up a lot of the
energy.

- Wes

On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <em...@gmail.com> wrote:
>
> This might be taking the thread on more of a tangent, but maybe we should
> start collecting requirements for the C++ build system in general and see
> if there might be better solution that can address some of these concerns?
> In particular, Bazel at least on the surface seems like it might be a
> better fit for some of the use cases discussed here.  I know this is a big
> project (and I currently don't have much bandwidth for it) but I think if
> CMake is lacking in these areas it might be worth at least exploring
> instead of going down the path of building our own meta-build system on top
> of CMake.
>
> Requirements that I think we are targeting:
> 1.  Be able to provide an out of box build system that requires as close to
> zero dependencies beyond a standard C++ toolchain (e.g. "$BUILD minimal"
> works on any C++ developers desktop without additional requirements)
> 2.  The build system should limit configuration knobs in favor of implied
> dependencies (e.g. "$BUILD python" automatically builds "compute",
> "filesystem", "ipc")
> 3.  The build system should be configurable to use (and have the user
> specify) one of "System packages", "Conda packages" or source packages for
> providing dependencies (and fallback options between the three).
> 4.  The build system should be able to treat some dependencies as optional
> (e.g. different compression libraries or allocators).
> 5.  Easily allow developers to limit building unnecessary code for their
> particular task at hand.
> 6.  The build system must work across the following toolchains/platforms:
>     - Linux:  g++ and clang.  x86 and ARM
>     - Mac
>     - Windows (msys2 and MSVC)
>
> Thanks,
> Micah

Re: [C++] The quest for zero-dependency builds

Posted by Micah Kornfield <em...@gmail.com>.

This might be taking the thread on more of a tangent, but maybe we should
start collecting requirements for the C++ build system in general and see
if there might be a better solution that can address some of these concerns?
In particular, Bazel at least on the surface seems like it might be a
better fit for some of the use cases discussed here.  I know this is a big
project (and I currently don't have much bandwidth for it) but I think if
CMake is lacking in these areas it might be worth at least exploring
instead of going down the path of building our own meta-build system on top
of CMake.

Requirements that I think we are targeting:
1.  Be able to provide an out-of-the-box build that requires as close to
zero dependencies as possible beyond a standard C++ toolchain (e.g.
"$BUILD minimal" works on any C++ developer's desktop without additional
requirements)
2.  The build system should limit configuration knobs in favor of implied
dependencies (e.g. "$BUILD python" automatically builds "compute",
"filesystem", "ipc")
3.  The build system should be configurable to use (and have the user
specify) one of "System packages", "Conda packages" or source packages for
providing dependencies (and fallback options between the three).
4.  The build system should be able to treat some dependencies as optional
(e.g. different compression libraries or allocators).
5.  Easily allow developers to limit building unnecessary code for their
particular task at hand.
6.  The build system must work across the following toolchains/platforms:
    - Linux:  g++ and clang.  x86 and ARM
    - Mac
    - Windows (msys2 and MSVC)
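
Requirements 3 and 4 could be sketched roughly as follows. This is only
an illustration in Python, not a proposal for an actual implementation:
the provider names, the resolve_provider helper and the sample
availability table are all hypothetical.

```python
from typing import Dict, Optional, Sequence, Set

# Hypothetical provider names mirroring requirement 3.
SYSTEM, CONDA, SOURCE = "system", "conda", "source"


def resolve_provider(
    dependency: str,
    preferred: str,
    available: Dict[str, Set[str]],
    fallback_order: Sequence[str] = (SYSTEM, CONDA, SOURCE),
) -> Optional[str]:
    """Return the first provider able to supply `dependency`.

    The user's preferred provider is tried first, then the remaining
    providers in `fallback_order`. None means no provider has it, which
    a caller can treat as "skip" for an optional dependency
    (requirement 4) or as an error for a mandatory one.
    """
    order = [preferred] + [p for p in fallback_order if p != preferred]
    for provider in order:
        if dependency in available.get(provider, set()):
            return provider
    return None


# Made-up availability table, for illustration only.
available = {
    SYSTEM: {"zlib"},
    CONDA: {"zlib", "snappy"},
    SOURCE: {"zlib", "snappy", "rapidjson"},
}

print(resolve_provider("snappy", SYSTEM, available))
print(resolve_provider("boost", SYSTEM, available))
```

With this table, asking for snappy from system packages falls back to
conda, and a dependency nobody provides comes back as None.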

Thanks,
Micah



On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <an...@python.org> wrote:

>
> Yes, we could express dependencies in a Python script and have it
> generate a CMake module of if/else chains in cmake_modules (which we
> would check in git to avoid having people depend on a Python install,
> perhaps).
>
> Still, that is an additional maintenance burden.
>
> Regards
>
> Antoine.

Re: [C++] The quest for zero-dependency builds

Posted by Antoine Pitrou <an...@python.org>.

Yes, we could express dependencies in a Python script and have it
generate a CMake module of if/else chains in cmake_modules (which we
would check in git to avoid having people depend on a Python install,
perhaps).
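
To make that concrete, here is a minimal sketch of such a generator. It
assumes the dependency table from Wes's example and an ARROW_<NAME>
option naming scheme; the function names are hypothetical, and it only
emits direct dependencies (transitive ones would need to be expanded
before generating, or the blocks ordered carefully).

```python
from typing import Dict, List

# Dependency table; the 'python' entry is taken from the example in
# this thread, anything else added here would be an assumption.
COMPONENT_DEPENDENCIES: Dict[str, List[str]] = {
    "python": ["compute", "filesystem", "ipc"],
}


def option_name(component: str) -> str:
    """Map a component name to its CMake option, e.g. 'ipc' -> 'ARROW_IPC'."""
    return "ARROW_" + component.upper()


def generate_cmake(dependencies: Dict[str, List[str]]) -> str:
    """Render one if()/endif() block per component with dependencies."""
    lines = ["# Generated file; do not edit by hand."]
    for component in sorted(dependencies):
        lines.append(f"if({option_name(component)})")
        for required in dependencies[component]:
            lines.append(f"  set({option_name(required)} ON)")
        lines.append("endif()")
    return "\n".join(lines) + "\n"


generated = generate_cmake(COMPONENT_DEPENDENCIES)
print(generated, end="")
```

The printed text is what would be checked in under cmake_modules, so
that a plain CMake build never needs to run the script.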

Still, that is an additional maintenance burden.

Regards

Antoine.


On 10/10/2019 at 14:50, Wes McKinney wrote:
> I guess one question we should first discuss is: who is the C++ build
> system for?
> 
> The users who are most sensitive to benchmark-driven decision making
> will generally be consuming the project through pre-built binaries,
> like our Python or R packages. If C++ developers build the project
> from source and don't do a minimal read of the documentation to see
> what a "recommended configuration" looks like, I would say that is
> more their fault than ours. In the case of the ARROW_JEMALLOC option,
> I think it's important for C++ system integrators to be aware of the
> impact of the choice of memory allocator.
> 
> The concern I have with the current "out of the box" experience is
> that people are getting the impression that "I have to build $X, $Y,
> and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> They can, of course, read the documentation and learn that those
> things can be toggled off, but I think the user that reaches for a
> self-built source install is much different in general than someone
> who uses the project through the Linux binary packages, for example.
> 
> On the subject of managing intraproject dependencies and
> relationships, I think we should develop a better way to express
> relationships between components than we have now.
> 
> As an example, building the Python library assumes that various
> components are enabled
> 
> - ARROW_COMPUTE=ON
> - ARROW_FILESYSTEM=ON
> - ARROW_IPC=ON
> 
> Somewhere in the code we might have some code like
> 
> if (ARROW_PYTHON)
>   set(ARROW_COMPUTE ON)
>   ...
> endif()
> 
> This doesn't strike me as that scalable. I would rather see a
> dependency file like
> 
> component_dependencies = {
>     ...
>     'python': ['compute', 'filesystem', 'ipc'],
>     ...
> }
> 
> A helper Python script as part of the build could be used to give
> CMake (because CMake is a bit poor as a programming language) the list
> of required components based on what the user has indicated to CMake.

Re: [C++] The quest for zero-dependency builds

Posted by Wes McKinney <we...@gmail.com>.

I guess one question we should first discuss is: who is the C++ build
system for?

The users who are most sensitive to benchmark-driven decision making
will generally be consuming the project through pre-built binaries,
like our Python or R packages. If C++ developers build the project
from source and don't do a minimal read of the documentation to see
what a "recommended configuration" looks like, I would say that is
more their fault than ours. In the case of the ARROW_JEMALLOC option,
I think it's important for C++ system integrators to be aware of the
impact of the choice of memory allocator.

The concern I have with the current "out of the box" experience is
that people are getting the impression that "I have to build $X, $Y,
and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
They can, of course, read the documentation and learn that those
things can be toggled off, but I think the user that reaches for a
self-built source install is much different in general than someone
who uses the project through the Linux binary packages, for example.

On the subject of managing intraproject dependencies and
relationships, I think we should develop a better way to express
relationships between components than we have now.

As an example, building the Python library assumes that various
components are enabled

- ARROW_COMPUTE=ON
- ARROW_FILESYSTEM=ON
- ARROW_IPC=ON

Somewhere in the code we might have some code like

if (ARROW_PYTHON)
  set(ARROW_COMPUTE ON)
  ...
endif()

This doesn't strike me as that scalable. I would rather see a
dependency file like

component_dependencies = {
    ...
    'python': ['compute', 'filesystem', 'ipc'],
    ...
}

A helper Python script as part of the build could be used to give
CMake (because CMake is a bit poor as a programming language) the list
of required components based on what the user has indicated to CMake.
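
As a rough sketch of what that helper could look like (the function
names are hypothetical, and the 'flight' entry is made up purely to
show a second level of dependencies):

```python
from typing import Dict, Iterable, List

component_dependencies: Dict[str, List[str]] = {
    "python": ["compute", "filesystem", "ipc"],
    # Hypothetical entry, only here to exercise transitive resolution.
    "flight": ["ipc"],
}


def required_components(
    requested: Iterable[str], dependencies: Dict[str, List[str]]
) -> List[str]:
    """Return the requested components plus everything they depend on,
    directly or transitively, as a sorted list."""
    resolved = set()
    pending = list(requested)
    while pending:
        component = pending.pop()
        if component in resolved:
            continue  # already handled; this also guards against cycles
        resolved.add(component)
        pending.extend(dependencies.get(component, []))
    return sorted(resolved)


def to_cmake_list(components: Iterable[str]) -> str:
    """CMake lists are semicolon-separated strings."""
    return ";".join(components)


print(to_cmake_list(required_components(["python"], component_dependencies)))
```

CMake would then only have to turn the returned list into the matching
ARROW_* options instead of carrying the if/set chains itself.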

On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
<fs...@gmail.com> wrote:
>
> There's always the route of vendoring some library and not exposing
> external CMake options. This would achieve the goal of
> compile-out-of-the-box and enable important feature in the basic
> build. We also simplify dependencies requirements (benefits CI or
> developer). The downside is following security patches and grumpy
> reaction from package maintainers. I think we should explore this
> route for dependencies that match the following criteria:
>
> - libarrow*.so don't export any of the symbols of the dependency and
> not referenced in any public headers
> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
> thrift, protobuf
> - dependency is not-ubiquitous on major platform and have a stable
> API, e.g. excludes libz and openssl
>
> A small list of candidates:
> - RapidJSON (enables JSON)
> - DoubleConversion (enables CSV)
>
> There's a precedent, arrow already vendors small C++ libraries
> (datetime, utf8cpp, variant, xxhash).
>
> François

Re: [C++] The quest for zero-dependency builds

Posted by Francois Saint-Jacques <fs...@gmail.com>.

There's always the route of vendoring some libraries and not exposing
external CMake options. This would achieve the goal of
compile-out-of-the-box and enable important features in the basic
build. It would also simplify dependency requirements (which benefits CI
and developers). The downsides are having to follow security patches and
grumpy reactions from package maintainers. I think we should explore this
route for dependencies that match the following criteria:

- libarrow*.so doesn't export any of the dependency's symbols, and the
dependency is not referenced in any public headers
- dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
thrift, protobuf
- dependency is not ubiquitous on major platforms and has a stable
API, e.g. excludes libz and openssl

A small list of candidates:
- RapidJSON (enables JSON)
- DoubleConversion (enables CSV)

There's a precedent: Arrow already vendors small C++ libraries
(datetime, utf8cpp, variant, xxhash).

François


On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi all,
>
> I'm a bit concerned that we're planning to add many additional build
> options in the quest to have a core zero-dependency build in C++.
> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> https://issues.apache.org/jira/browse/ARROW-6612.
>
> The problem is that this is creating many possible configurations and we
> will only be testing a tiny subset of them.  Inevitably, users will try
> other option combinations and they'll fail building for some random
> reason.  It will not be a very good user experience.
>
> Another related issue is user perception when doing a default build.
> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> build with jemalloc disabled by default.  Inevitably, people will be
> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> is not as performant as it claims to be.
>
> Perhaps we should look for another approach instead?
>
> For example we could have a single ARROW_BARE_CORE (whatever the name)
> option that when enabled (not by default) builds the tiniest minimal
> subset of Arrow.  It's more inflexible, but at least it's something that
> we can reasonably test.
>
> Regards
>
> Antoine.

Re: [C++] The quest for zero-dependency builds

Posted by Tim Paine <t....@gmail.com>.

FWIW, for Perspective we ended up just using our own CMake file to build Arrow. We needed a minimal subset of functionality on a tight size budget, and it was easier to do that than to configure all the flags.

https://github.com/finos/perspective/blob/master/cmake/arrow/CMakeLists.txt



Tim Paine
tim.paine.nyc
908-721-1185

> On Oct 10, 2019, at 06:02, Antoine Pitrou <an...@python.org> wrote:
> 
> 
> Hi all,
> 
> I'm a bit concerned that we're planning to add many additional build
> options in the quest to have a core zero-dependency build in C++.
> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> https://issues.apache.org/jira/browse/ARROW-6612.
> 
> The problem is that this is creating many possible configurations and we
> will only be testing a tiny subset of them.  Inevitably, users will try
> other option combinations and they'll fail building for some random
> reason.  It will not be a very good user experience.
> 
> Another related issue is user perception when doing a default build.
> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> build with jemalloc disabled by default.  Inevitably, people will be
> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> is not as performant as it claims to be.
> 
> Perhaps we should look for another approach instead?
> 
> For example we could have a single ARROW_BARE_CORE (whatever the name)
> option that when enabled (not by default) builds the tiniest minimal
> subset of Arrow.  It's more inflexible, but at least it's something that
> we can reasonably test.
> 
> Regards
> 
> Antoine.