Posted to dev@arrow.apache.org by Richard Bachmann <ri...@cern.ch> on 2019/11/07 13:11:02 UTC

Apache Arrow build with needed dependencies only

Hello,
I'm contacting you on behalf of the LCG Releases team at CERN. We 
provide a common software stack for LHCb, ATLAS and others to be used at 
CERN and the worldwide computing grid.

Right now we're looking into optimizing the way we're building Apache 
Arrow (C++ & Python) and its dependencies. Ideally we'd like to build 
Arrow using only the minimum of necessary dependencies to run it, and to 
use packages already installed in the stack to fulfill these 
dependencies. The former would keep the stack clean; the 
latter would help us avoid duplication and builds failing because 
mirrors go offline.

Our builds currently run with the ARROW_DEPENDENCY_SOURCE=AUTO 
<https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst> 
setting, which results in duplicate and non-essential packages being 
downloaded by Arrow, as well as a dependency on external mirrors. 
Setting it to SYSTEM allows us to avoid the downloads, but then the 
build fails because dependencies we do not actually use are missing 
from the system.

Do you know if there is a recommended way to achieve this? The problem 
seems to stem from the fact that all listed dependencies are downloaded, 
whether they are needed or not. We have considered patching out the 
non-essential dependencies ('double-conversion', 'GTEST', etc.) from the 
dependency list, as well as formally adding the unneeded dependencies to 
the stack in order to run with the SYSTEM setting. However, if there is 
a proper way to do it we would of course prefer to follow that course of 
action.


Any help would be much appreciated.
Kind regards:

     - Richard Bachmann


Re: Apache Arrow build with needed dependencies only

Posted by Richard Bachmann <ri...@cern.ch>.
Hello Wes and Sebastien,
First off, a correction to my earlier message: I misread the 
documentation and thought that 'thirdparty/download_dependencies.sh' 
would download all dependencies no matter what, which isn't the case. 
Apologies.

We were _originally_ building Arrow with the following command:

${long_path}/bin/cmake ${another_long_path}/arrow-0.14.1/src/arrow/0.14.1/cpp \
     -DARROW_USE_SSE=ON \
     -DARROW_PYTHON=ON \
     -DCMAKE_INSTALL_PREFIX=${path_to_install_dir} \
     -DCMAKE_CXX_COMPILER=${long_path}/bin/g++ \
     -DCMAKE_CXX_STANDARD=17 \
     -DARROW_WITH_ZSTD=OFF \
     -DARROW_BUILD_TESTS=OFF \
     -DARROW_BUILD_BENCHMARKS=OFF \
     -DARROW_PARQUET=ON \
     "-DCXX_COMMON_FLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
     "-DARROW_CXXFLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
     -DBoost_NO_BOOST_CMAKE=ON \
     -DBoost_ADDITIONAL_VERSIONS=1.70

This produced the following in our build logs:
     [  7%] Performing download step (download, verify and extract) for 'rapidjson_ep'
     [  8%] Performing download step (download, verify and extract) for 'double-conversion_ep'
     [  8%] Performing download step (download, verify and extract) for 'snappy_ep'
     [  8%] Performing download step (download, verify and extract) for 'lz4_ep'
     [  8%] Performing download step (download, verify and extract) for 'jemalloc_ep'
     [  8%] Performing download step (download, verify and extract) for 'gflags_ep'
     [  9%] Performing download step (download, verify and extract) for 'thrift_ep'
     [  9%] Performing download step (download, verify and extract) for 'brotli_ep'
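
A log like the one above can be condensed into a plain list of
downloaded packages. The following is a minimal sketch, not part of
the Arrow tooling: the function name list_downloaded_deps is
hypothetical, and it assumes that each log entry sits on a single
line and that external project names end in '_ep' as they do above.

```shell
# List the external projects that a build log reports as downloaded.
#   $1  path to the build log
# Assumes one log entry per line, for example:
#   [  8%] Performing download step (download, verify and extract) for 'snappy_ep'
list_downloaded_deps() {
    grep "Performing download step" "$1" \
        | sed -n "s/.*for '\([^']*\)_ep'.*/\1/p" \
        | sort -u
}
```

Calling list_downloaded_deps on the saved build output prints one
package name per line, which is easier to compare against the
packages already present in a stack.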


Thank you for opening the Jira issue. I agree, the difficulty in telling 
why some of these packages are downloaded is a core part of the issue. 
In the example above I had some difficulty when trying to figure out why 
Snappy, for instance, was downloaded. The build's 
`projects/arrow-0.14.1/src/arrow/0.14.1/cpp/CMakeLists.txt` suggests 
that the ARROW_ORC setting is the likely cause. Similarly, it 
was unclear why jemalloc, which already exists in our stack, was not 
taken from the system. I now understand that this is done in order to 
use a specific version which you can reliably patch, but it would be 
nice to have some clearer labeling.
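
One way to narrow down which option pulls in a dependency is to look
at what is actually enabled in the configured build directory. This is
a minimal sketch under stated assumptions: the function name
list_enabled_arrow_options is hypothetical, the options of interest
are assumed to be stored as BOOL entries in CMakeCache.txt, and only
the literal value ON is matched.

```shell
# List the ARROW_* boolean options that are switched on in a CMake cache.
#   $1  path to CMakeCache.txt in the build directory
# Cache entries have the form NAME:TYPE=VALUE; only BOOL entries whose
# value is exactly ON are printed.
list_enabled_arrow_options() {
    sed -n 's/^\(ARROW_[A-Za-z0-9_]*\):BOOL=ON$/\1/p' "$1" | sort
}
```

The resulting list still has to be matched by hand against the
conditions in CMakeLists.txt, but it shows which settings were on by
default without having been requested on the command line.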

To keep offline mirrors from interrupting builds, we have taken the 
following steps:
The packages downloaded above have now been added properly to the stack 
and listed as dependencies of Arrow. Arrow is now built like so:

ENVIRONMENT FLATBUFFERS_HOME=${flatbuffers_home} ARROW_JEMALLOC_URL=${local_jemalloc_tar.bz2}
${long_path}/bin/cmake ${another_long_path}/arrow-0.14.1/src/arrow/0.14.1/cpp \
     -DARROW_PYTHON=ON \
     -DCMAKE_INSTALL_PREFIX=${path_to_install_dir} \
     -DCMAKE_CXX_COMPILER=${long_path}/bin/g++ \
     -DCMAKE_CXX_STANDARD=17 \
     -DARROW_WITH_ZSTD=OFF \
     -DARROW_BUILD_TESTS=OFF \
     -DARROW_BUILD_BENCHMARKS=OFF \
     -DARROW_PARQUET=ON \
     -DRapidJSON_ROOT=${rapidjson_home} \
     -DRAPIDJSON_INCLUDE_DIR=${rapidjson_home}/include \
     "-DCXX_COMMON_FLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
     "-DARROW_CXXFLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
     -DBoost_NO_BOOST_CMAKE=ON \
     -DBoost_ADDITIONAL_VERSIONS=1.70

The dependencies are now detected rather than downloaded, except for 
jemalloc, where the find function has been disabled. As a workaround, 
ARROW_JEMALLOC_URL is supplied to take the tarball from local storage.
Thrift is now built with CMake, identically to how Arrow would do it 
internally, with the addition of the -fPIC flag. We will look into what 
features can be safely disabled for Arrow and Thrift in the future. 
Thank you Sebastien for the pointer to the ALICE build script.
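
The jemalloc workaround described above can be wrapped in a small
guard so that a missing tarball is noticed before the build starts.
This is only a sketch: the helper name use_local_tarball is
hypothetical, and the only variable taken from this thread is
ARROW_JEMALLOC_URL.

```shell
# Export a dependency URL variable only if the local tarball exists,
# so that a missing file is reported before the build is configured.
#   $1  variable name, e.g. ARROW_JEMALLOC_URL
#   $2  path to the tarball on local storage
use_local_tarball() {
    if [ ! -f "$2" ]; then
        echo "missing tarball for $1: $2" >&2
        return 1
    fi
    export "$1=$2"
}
```

A build recipe would call it once per locally stored tarball, for
instance with ARROW_JEMALLOC_URL and the path to the jemalloc archive,
and stop if the call returns a non-zero status.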

We ended up not going for the full 'offline builds' solution of 
specifying all URLs, as this would introduce additional complexities in 
the form of a 'special' set of packages which are not version controlled 
like the others.

Thank you for the advice.
Kind regards:

     - Richard

On 11/7/19 5:10 PM, Wes McKinney wrote:
> I just opened https://issues.apache.org/jira/browse/ARROW-7089 about
> increasing transparency around what options are causing thirdparty
> dependencies to be required

Re: Apache Arrow build with needed dependencies only

Posted by Wes McKinney <we...@gmail.com>.
I just opened https://issues.apache.org/jira/browse/ARROW-7089 about
increasing transparency around what options are causing thirdparty
dependencies to be required

On Thu, Nov 7, 2019 at 10:05 AM Wes McKinney <we...@gmail.com> wrote:
>
> hi Richard,
>
> On Thu, Nov 7, 2019 at 9:59 AM Richard Bachmann
> <ri...@cern.ch> wrote:
> >
> > Hello,
> > I'm contacting you on behalf of the LCG Releases team at CERN. We
> > provide a common software stack for LHCb, ATLAS and others to be used at
> > CERN and the worldwide computing grid.
> >
> > Right now we're looking into optimizing the way we're building Apache
> > Arrow (C++ & Python) and its dependencies. Ideally we'd like to build
> > Arrow using only the minimum of necessary dependencies to run it, and to
> > use packages already installed in the stack to fulfill these
> > dependencies. The former would be nice to keep the stack clean, the
> > latter would help us avoid duplication and failing builds due to mirrors
> > going offline.
> >
> > Our builds currently run with the ARROW_DEPENDENCY_SOURCE=AUTO
> > <https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst>
> > setting, which results in duplicate and non-essential packages being
> > downloaded by Arrow, as well as dependency on external mirrors. Setting
> > it to SYSTEM allows us to avoid the downloads, but then the build
> > process fails due to missing unused dependencies.
>
> I'm surprised to hear this based on what I know about the build system
> and from extensive local development.
>
> Can you show the exact CMake invocation you are using and indicate
> which unused dependencies are being downloaded?
>
> This Docker minimal build shows (unless something has been recently
> broken) that the project can be built with only a small number of
> third-party dependencies:
>
> https://github.com/apache/arrow/tree/master/cpp/examples/minimal_build
>
> Note that we support a fully "offline" build to allow thirdparty
> dependencies to be built in an air-gapped environment
>
> https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#offline-builds
>
> > Do you know if there is a recommended way to achieve this? The problem
> > seems to stem from the fact that all listed dependencies are downloaded,
> > whether they are needed or not. We have considered patching out the
> > non-essential dependencies ('double-conversion', 'GTEST', etc.) from the
> > dependency list, as well as formally adding the unneeded dependencies to
> > the stack in order to run with the SYSTEM setting. However, if there is
> > a proper way to do it we would of course prefer to follow that course of
> > action.
>
> We'll be able to know more based on how you're calling CMake and with
> what options, but the build system should not be downloading any
> dependencies that are not needed.
>
> >
> > Any help would be very appreciated.
> > Kind regards:
> >
> >      - Richard Bachmann
> >

Re: Apache Arrow build with needed dependencies only

Posted by Wes McKinney <we...@gmail.com>.
hi Richard,

On Thu, Nov 7, 2019 at 9:59 AM Richard Bachmann
<ri...@cern.ch> wrote:
>
> Hello,
> I'm contacting you on behalf of the LCG Releases team at CERN. We
> provide a common software stack for LHCb, ATLAS and others to be used at
> CERN and the worldwide computing grid.
>
> Right now we're looking into optimizing the way we're building Apache
> Arrow (C++ & Python) and its dependencies. Ideally we'd like to build
> Arrow using only the minimum of necessary dependencies to run it, and to
> use packages already installed in the stack to fulfill these
> dependencies. The former would be nice to keep the stack clean, the
> latter would help us avoid duplication and failing builds due to mirrors
> going offline.
>
> Our builds currently run with the ARROW_DEPENDENCY_SOURCE=AUTO
> <https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst>
> setting, which results in duplicate and non-essential packages being
> downloaded by Arrow, as well as dependency on external mirrors. Setting
> it to SYSTEM allows us to avoid the downloads, but then the build
> process fails due to missing unused dependencies.

I'm surprised to hear this based on what I know about the build system
and from extensive local development.

Can you show the exact CMake invocation you are using and indicate
which unused dependencies are being downloaded?

This Docker minimal build shows (unless something has been recently
broken) that the project can be built with only a small number of
third-party dependencies:

https://github.com/apache/arrow/tree/master/cpp/examples/minimal_build

Note that we support a fully "offline" build to allow thirdparty
dependencies to be built in an air-gapped environment

https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#offline-builds

> Do you know if there is a recommended way to achieve this? The problem
> seems to stem from the fact that all listed dependencies are downloaded,
> whether they are needed or not. We have considered patching out the
> non-essential dependencies ('double-conversion', 'GTEST', etc.) from the
> dependency list, as well as formally adding the unneeded dependencies to
> the stack in order to run with the SYSTEM setting. However, if there is
> a proper way to do it we would of course prefer to follow that course of
> action.

We'll be able to know more based on how you're calling CMake and with
what options, but the build system should not be downloading any
dependencies that are not needed.
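
As a rough illustration of how the dependency source setting discussed
in this thread is passed to CMake, here is a minimal sketch. The
function name arrow_dependency_flags is hypothetical.
ARROW_DEPENDENCY_SOURCE comes from the thread; the per-dependency
NAME_SOURCE form is an assumption based on the Arrow C++ developer
documentation, and the exact spelling of each NAME should be checked
against the Arrow version being built.

```shell
# Print the -D flags that select where Arrow's CMake build looks for
# third-party dependencies, one flag per line.
#   $1     default source, e.g. SYSTEM, BUNDLED or AUTO
#   $2...  optional per-dependency overrides written as NAME=SOURCE
# The NAME_SOURCE form is an assumption; verify it for your Arrow version.
arrow_dependency_flags() {
    default_source="$1"
    shift
    printf '%s\n' "-DARROW_DEPENDENCY_SOURCE=${default_source}"
    for override in "$@"; do
        # Split NAME=SOURCE at the first '='.
        printf '%s\n' "-D${override%%=*}_SOURCE=${override#*=}"
    done
}
```

For example, calling arrow_dependency_flags SYSTEM Snappy=BUNDLED
prints a SYSTEM default plus a single override, and the output can be
appended to the cmake command line so that only explicitly named
packages are built from downloaded sources.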

>
> Any help would be very appreciated.
> Kind regards:
>
>      - Richard Bachmann
>

Re: Apache Arrow build with needed dependencies only

Posted by Sebastien Binet <bi...@cern.ch>.
hi Richard,

On Thu, Nov 7, 2019 at 5:00 PM Richard Bachmann <ri...@cern.ch>
wrote:

> Hello,
> I'm contacting you on behalf of the LCG Releases team at CERN. We
> provide a common software stack for LHCb, ATLAS and others to be used at
> CERN and the worldwide computing grid.
>

You may want to reach out to the ALICE community.
They routinely build Arrow as part of their software stack.
Here is the recipe:
- https://github.com/alisw/alidist/blob/master/arrow.sh

-s

PS: IIRC, lxplus has Go installed (a rather old version, though: 1.8), so
one could install Go-Arrow in one go:
$> go get github.com/apache/arrow/go/arrow/...