Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/04/28 14:22:09 UTC

[C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

hi folks,

I would like to highlight some outstanding problems with our packages:

1. Our Arrow C++ static libraries are generally unusable.

Whenever -DARROW_JEMALLOC=ON is set or any dependency is built in
BUNDLED mode, libarrow.a (or other static libraries) cannot be used for
linking. That's because the static library has a dependency on the
bundled static libraries, which are _not_ packaged with the Arrow static
libraries.

The preferred solution seems to be ARROW-7605. I demonstrated how this works in

https://github.com/apache/arrow/pull/6220

but I need someone to help with the PR to deal with other BUNDLED
dependencies. I likely won't be able to complete the PR myself in time
for the next release.
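
To make the idea of merging bundled dependencies into libarrow.a
concrete, here is a minimal sketch of one possible approach: generating
a GNU ar "MRI script" that combines several static archives into one.
This is an assumption for illustration only and not necessarily what the
PR above does. The helper name make_mri_script and all file names are
hypothetical, and MRI scripts are a GNU ar feature, so other platforms
would need a different tool.

```python
def make_mri_script(output, archives):
    """Return a GNU ar MRI script that merges `archives` into `output`.

    The script uses the standard MRI commands: `create` names the new
    archive, `addlib` copies every member of an existing archive into
    it, and `save` writes the result to disk.
    """
    lines = [f"create {output}"]
    lines += [f"addlib {archive}" for archive in archives]
    lines += ["save", "end"]
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only.
script = make_mri_script(
    "libarrow_merged.a",
    ["libarrow.a", "libjemalloc_pic.a"],
)
print(script)
```

The printed script could then be written to a file and passed to GNU ar
on standard input (ar -M < merge.mri). A consumer would link against the
single merged archive instead of having to locate each bundled
dependency separately.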

2. Our Python packages are unacceptably large.

On Linux, wheels are now 64MB and take up 218MB after installation.
There are two problems here: an immediate, serious one that has gone
unresolved but is easier to fix, and a separate structural one that is
more difficult to fix. See the directory listing

https://gist.github.com/wesm/57bd99798a2fa23ef3cb5e4b18b5a248

We're duplicating all of the shared libraries inside the wheel and on
disk. It's unfortunate that we've allowed this problem to persist for a
year or more:

https://issues.apache.org/jira/browse/ARROW-5082
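
As a rough way to check for this kind of duplication, the sketch below
groups the files under a directory by content hash and reports how many
bytes the extra copies occupy. It is an illustration only, not a tool
from the Arrow repository; the function names find_duplicate_files and
wasted_bytes are hypothetical, and the directory to scan (for example an
installed pyarrow package directory) is up to the caller.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Return lists of paths under `root` whose contents are identical."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a symlink does not store a second copy
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the groups that contain more than one file.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


def wasted_bytes(groups):
    """Bytes used by every copy beyond the first in each duplicate group."""
    return sum(os.path.getsize(g[0]) * (len(g) - 1) for g in groups)
```

Calling find_duplicate_files on an unpacked wheel or an installed
package directory, and then wasted_bytes on the result, gives an
estimate of how much space identical copies of the same shared library
take up.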

I also recently opened

https://issues.apache.org/jira/browse/ARROW-8518

which describes a proposal to create some tools to assist with
building "parent" and "child" Python packages. This would enable us to
ship components like Flight and Gandiva as separate wheels. This is a
large project but one that will ultimately be necessary for the
long-term scalability and sustainability of the project.

I am not able to personally work on either of these projects in the
current release cycle, but I hope that some progress can be made on
these since they have lingered on for a long time, and it would be
good for us to "put our best foot forward" with the 1.0.0 release.

Thanks,
Wes

Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

Posted by Wes McKinney <we...@gmail.com>.
Would anyone have some bandwidth in the next couple of months to help
with this?

On Thu, Apr 30, 2020 at 9:10 AM Wes McKinney <we...@gmail.com> wrote:
>
> The proposal is for any BUNDLED dependency to be merged into
> libarrow.a (or into another one of the static libraries if the
> dependency is only used in, e.g., one subcomponent), so this also
> applies to the AWS SDK.

Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

Posted by Wes McKinney <we...@gmail.com>.
The proposal is for any BUNDLED dependency to be merged into
libarrow.a (or into another one of the static libraries if the
dependency is only used in, e.g., one subcomponent), so this also
applies to the AWS SDK.

On Thu, Apr 30, 2020 at 3:02 AM Rémi Dettai <rd...@gmail.com> wrote:
>
> Hi!
>
> Does your point 1 also apply to the AWS SDK dependency? Currently it seems
> that it cannot be built in BUNDLED mode. As stated in
> https://issues.apache.org/jira/browse/ARROW-8565, I struggled a lot to make
> a static build with the S3 dependency enabled! I would really like to
> help with this because it is very important for my use case that we can
> assemble compact builds of Arrow, but I'm still very uncomfortable with
> CMake :-(
>
> Thanks for your amazing work!
>
> Remi

Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

Posted by Rémi Dettai <rd...@gmail.com>.
Hi!

Does your point 1 also apply to the AWS SDK dependency? Currently it seems
that it cannot be built in BUNDLED mode. As stated in
https://issues.apache.org/jira/browse/ARROW-8565, I struggled a lot to make
a static build with the S3 dependency enabled! I would really like to
help with this because it is very important for my use case that we can
assemble compact builds of Arrow, but I'm still very uncomfortable with
CMake :-(

Thanks for your amazing work!

Remi
