Posted to dev@arrow.apache.org by Zhuo Peng <br...@gmail.com> on 2019/06/20 22:47:49 UTC

How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Dear Arrow maintainers,

I work on several TFX (TensorFlow eXtended) [1] projects (e.g. TensorFlow
Data Validation [2]) and am trying to use Arrow in them. These projects are
mostly written in Python but have C++ code as Python extension modules,
therefore we use both Arrow’s C++ and Python APIs. Our projects are
distributed through PyPI as binary packages.

The Python extension modules are compiled against the headers shipped within the
pyarrow PyPI binary package and are linked with libarrow.so and
libarrow_python.so in the same package. So far we’ve seen two major
problems:

* There are STL container definitions in public headers.

This causes problems because the binary code for template classes is
generated at compilation time, and the definition of those template classes
might differ from compiler to compiler. A mismatch can arise simply because
we use a different GCC version than the one that compiled pyarrow (for
example, the layout of std::unordered_map<> changed in GCC 5.2 [3], and
arrow::Schema used to contain a std::unordered_map<> member [4]).
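To make the hazard concrete, here is a minimal hypothetical sketch (modeled
on the old arrow::Schema):

// schema.h -- a header we compile against, while its implementation
// lives in the prebuilt libarrow.so inside the pyarrow wheel.
#include <string>
#include <unordered_map>

class Schema {
 public:
  int GetFieldIndex(const std::string& name) const;  // defined in the .so
 private:
  // The layout of this member is decided by whichever libstdc++ headers
  // compile this translation unit. If our GCC lays out unordered_map
  // differently than the GCC that built libarrow.so, the two binaries
  // disagree about sizeof(Schema) and the member's offset, and every
  // access through the library touches the wrong bytes.
  std::unordered_map<std::string, int> name_to_index_;
};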

One might argue that everyone releasing manylinux1 packages should use
exactly the same compiler, as provided by the pypa docker image; however,
the standard only specifies the maximum versions of the corresponding
fundamental libraries [5]. Newer GCC versions can be backported to work
with older libraries [6].

A recent change in Arrow [7] has removed most (but not all [8]) of the STL
members in publicly accessible class declarations and will resolve our
immediate problem, but I wonder whether there is, or should be, an explicit
policy on ABI compatibility, especially regarding the use of template
functions and classes in public interfaces.
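One common mitigation, and I believe the spirit of [7], is the pimpl
pattern: move such members behind an opaque pointer so that only one
translation unit ever instantiates the STL types. A minimal hypothetical
sketch:

// schema.h -- pimpl variant: the header no longer names any STL
// container, so its layout is compiled only inside libarrow.so.
#include <memory>
#include <string>

class Schema {
 public:
  Schema();
  ~Schema();  // must be defined in schema.cc, where Impl is complete
  int GetFieldIndex(const std::string& name) const;
 private:
  struct Impl;                  // holds the map; defined in schema.cc
  std::unique_ptr<Impl> impl_;  // with the default deleter this is a
                                // single raw pointer on mainstream ABIs,
                                // though strictly it is still a stdlib
                                // type in a public header
};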

* Our wheel cannot pass “auditwheel repair”

I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
our wheel and have the user’s Python load both our libarrow.so and pyarrow’s,
but that’s what “auditwheel repair” attempts to do. Yet if we don’t allow
auditwheel to do so, it refuses to stamp our wheel because it has
“external” dependencies.

This does not seem to be an Arrow problem, but I wonder whether others in
the community have had to deal with similar issues and what the resolution
is. Our current workaround is to manually stamp the wheel.


Thanks,
Zhuo


References:

[1] https://github.com/tensorflow/tfx
[2] https://github.com/tensorflow/data-validation
[3] https://github.com/gcc-mirror/gcc/commit/54b755d349d17bb197511529746cd7cf8ea761c1#diff-f82d3b9fa19961eed132b10c9a73903e
[4] https://github.com/apache/arrow/blob/b22848952f09d6f9487feaff80ee358ca41b1562/cpp/src/arrow/type.h#L532
[5] https://www.python.org/dev/peps/pep-0513/#id40
[6] https://github.com/pypa/auditwheel/issues/125#issuecomment-438513357
[7] https://github.com/apache/arrow/commit/7a5562174cffb21b16f990f64d114c1a94a30556
[8] https://github.com/apache/arrow/blob/a0e1fbb9ef51d05a3f28e221cf8c5d4031a50c93/cpp/src/arrow/ipc/dictionary.h#L91

Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Philipp Moritz <pc...@gmail.com>.
Dear all,

I agree with Wes and Antoine: the way things are currently handled is not
sustainable. If we are using wheels, it can only work if everybody is using
the same toolchain. In the past, Arrow contributors have tried to "fix"
TensorFlow's non-compliance with the manylinux1 standard from the Arrow
side, with little or no success (the most recent attempt, which also doesn't
solve the problem, is [1], and there were many before it). See also the long
thread in [2]. Hopefully we can find a solution that works in the longer
term:

- switch everybody to conda or
- make TensorFlow compatible with the current manylinux standard or
- define a new manylinux standard that everybody uses (see also [3])

More help with this from the TensorFlow side would be greatly appreciated.

Best wishes,
Philipp.

[1] https://github.com/apache/arrow/pull/4232
[2] https://groups.google.com/a/tensorflow.org/forum/m/#!topic/build/WgtWKA4t_bs
[3] https://github.com/apache/arrow/pull/4391

On Fri, Jun 21, 2019 at 10:29 AM Antoine Pitrou <an...@python.org> wrote:

>
> I'm not only thinking about Google though :-)
>
> More generally, our woes with the wheel compliance process (especially
> on Linux, but even on Windows and macOS we must be careful to bundle
> absolutely everything) make it a very costly workaround for our shyness
> to tell users to "just use conda".
>
> That's more of a rant at this point, though.
>
> Regards
>
> Antoine.
>
>
>
> Le 21/06/2019 à 18:54, Wes McKinney a écrit :
> > I agree that wheels are a nuisance for package maintainers and Google
> > would be doing everyone a favor if they would either stop using them
> > or conform to the standard (or an evolved version thereof).
> >
> > On Fri, Jun 21, 2019 at 11:35 AM Antoine Pitrou <so...@pitrou.net>
> wrote:
> >>
> >>
> >> Side note: it's not only STL containers, it's also any non-trivial
> >> stdlib type that appears in headers.  Such as std::shared_ptr<>.
> >>
> >> So I'm not sure the endeavour makes sense at all.  You'll have to
> >> try and follow the libstdc++ ABI spec:
> >> https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On Fri, 21 Jun 2019 18:12:02 +0200
> >> Antoine Pitrou <so...@pitrou.net> wrote:
> >>> On Thu, 20 Jun 2019 15:47:49 -0700
> >>> Zhuo Peng <br...@gmail.com> wrote:
> >>>>
> >>>> One might argue that everyone releasing manylinux1 packages should use
> >>>> exactly the same compiler, as provided by the pypa docker image,
> however
> >>>> the standard only specifies the maximum versions of corresponding
> >>>> fundamental libraries [5]. Newer GCC versions could be backported to
> work
> >>>> with older libraries [6].
> >>>>
> >>>> A recent change in Arrow [7] has removed most (but not all [8]) of
> the STL
> >>>> members in publicly accessible class declarations and will resolve our
> >>>> immediate problem, but I wonder if there is, or there should be an
> explicit
> >>>> policy on the ABI compatibility, especially regarding the usage of
> template
> >>>> functions / classes in public interfaces?
> >>>
> >>> IMHO, the only reasonable policy for now is that there is no ABI
> >>> compatibility.  If you'd like to benefit from the PyArrow binary
> >>> packages, including the C++ API, then you need to use the same
> toolchain
> >>> (or an ABI-compatible toolchain, but I'm afraid there's no clear
> >>> specification of ABI compatibility in g++ / libstdc++ land).
> >>>
> >>>> * Our wheel cannot pass “auditwheel repair”
> >>>>
> >>>> I don’t think it’s correct to pull libarrow.so and libarrow_python.so
> into
> >>>> our wheel and have user’s Python load both our libarrow.so and
> pyarrow’s,
> >>>> but that’s what “auditwheel repair” attempts to do. But if we don’t
> allow
> >>>> auditwheel to do so, it refuses to stamp on our wheel because it has
> >>>> “external” dependencies.
> >>>
> >>> You know, I wish the scientific communities would stop producing wheels
> >>> and instead encourage users to switch to conda.  The wheel paradigm is
> >>> conceptually antiquated and is really a nuisance to package developers
> >>> and maintainers.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>
> >>
> >>
>

Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Antoine Pitrou <an...@python.org>.
I'm not only thinking about Google though :-)

More generally, our woes with the wheel compliance process (especially
on Linux, but even on Windows and macOS we must be careful to bundle
absolutely everything) make it a very costly workaround for our shyness
to tell users to "just use conda".

That's more of a rant at this point, though.

Regards

Antoine.



Le 21/06/2019 à 18:54, Wes McKinney a écrit :
> I agree that wheels are a nuisance for package maintainers and Google
> would be doing everyone a favor if they would either stop using them
> or conform to the standard (or an evolved version thereof).
> 
> On Fri, Jun 21, 2019 at 11:35 AM Antoine Pitrou <so...@pitrou.net> wrote:
>>
>>
>> Side note: it's not only STL containers, it's also any non-trivial
>> stdlib type that appears in headers.  Such as std::shared_ptr<>.
>>
>> So I'm not sure the endeavour makes sense at all.  You'll have to
>> try and follow the libstdc++ ABI spec:
>> https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Fri, 21 Jun 2019 18:12:02 +0200
>> Antoine Pitrou <so...@pitrou.net> wrote:
>>> On Thu, 20 Jun 2019 15:47:49 -0700
>>> Zhuo Peng <br...@gmail.com> wrote:
>>>>
>>>> One might argue that everyone releasing manylinux1 packages should use
>>>> exactly the same compiler, as provided by the pypa docker image, however
>>>> the standard only specifies the maximum versions of corresponding
>>>> fundamental libraries [5]. Newer GCC versions could be backported to work
>>>> with older libraries [6].
>>>>
>>>> A recent change in Arrow [7] has removed most (but not all [8]) of the STL
>>>> members in publicly accessible class declarations and will resolve our
>>>> immediate problem, but I wonder if there is, or there should be an explicit
>>>> policy on the ABI compatibility, especially regarding the usage of template
>>>> functions / classes in public interfaces?
>>>
>>> IMHO, the only reasonable policy for now is that there is no ABI
>>> compatibility.  If you'd like to benefit from the PyArrow binary
>>> packages, including the C++ API, then you need to use the same toolchain
>>> (or an ABI-compatible toolchain, but I'm afraid there's no clear
>>> specification of ABI compatibility in g++ / libstdc++ land).
>>>
>>>> * Our wheel cannot pass “auditwheel repair”
>>>>
>>>> I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
>>>> our wheel and have user’s Python load both our libarrow.so and pyarrow’s,
>>>> but that’s what “auditwheel repair” attempts to do. But if we don’t allow
>>>> auditwheel to do so, it refuses to stamp on our wheel because it has
>>>> “external” dependencies.
>>>
>>> You know, I wish the scientific communities would stop producing wheels
>>> and instead encourage users to switch to conda.  The wheel paradigm is
>>> conceptually antiquated and is really a nuisance to package developers
>>> and maintainers.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
>>
>>
>>

Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Wes McKinney <we...@gmail.com>.
I agree that wheels are a nuisance for package maintainers and Google
would be doing everyone a favor if they would either stop using them
or conform to the standard (or an evolved version thereof).

On Fri, Jun 21, 2019 at 11:35 AM Antoine Pitrou <so...@pitrou.net> wrote:
>
>
> Side note: it's not only STL containers, it's also any non-trivial
> stdlib type that appears in headers.  Such as std::shared_ptr<>.
>
> So I'm not sure the endeavour makes sense at all.  You'll have to
> try and follow the libstdc++ ABI spec:
> https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
>
> Regards
>
> Antoine.
>
>
> On Fri, 21 Jun 2019 18:12:02 +0200
> Antoine Pitrou <so...@pitrou.net> wrote:
> > On Thu, 20 Jun 2019 15:47:49 -0700
> > Zhuo Peng <br...@gmail.com> wrote:
> > >
> > > One might argue that everyone releasing manylinux1 packages should use
> > > exactly the same compiler, as provided by the pypa docker image, however
> > > the standard only specifies the maximum versions of corresponding
> > > fundamental libraries [5]. Newer GCC versions could be backported to work
> > > with older libraries [6].
> > >
> > > A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> > > members in publicly accessible class declarations and will resolve our
> > > immediate problem, but I wonder if there is, or there should be an explicit
> > > policy on the ABI compatibility, especially regarding the usage of template
> > > functions / classes in public interfaces?
> >
> > IMHO, the only reasonable policy for now is that there is no ABI
> > compatibility.  If you'd like to benefit from the PyArrow binary
> > packages, including the C++ API, then you need to use the same toolchain
> > (or an ABI-compatible toolchain, but I'm afraid there's no clear
> > specification of ABI compatibility in g++ / libstdc++ land).
> >
> > > * Our wheel cannot pass “auditwheel repair”
> > >
> > > I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> > > our wheel and have user’s Python load both our libarrow.so and pyarrow’s,
> > > but that’s what “auditwheel repair” attempts to do. But if we don’t allow
> > > auditwheel to do so, it refuses to stamp on our wheel because it has
> > > “external” dependencies.
> >
> > You know, I wish the scientific communities would stop producing wheels
> > and instead encourage users to switch to conda.  The wheel paradigm is
> > conceptually antiquated and is really a nuisance to package developers
> > and maintainers.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
>
>

Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Antoine Pitrou <so...@pitrou.net>.
Side note: it's not only STL containers, it's also any non-trivial
stdlib type that appears in headers.  Such as std::shared_ptr<>.
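
For instance, a hypothetical declaration like:

#include <memory>
#include <string>

class Table;  // hypothetical
// The caller's compiler instantiates shared_ptr<Table>'s control block
// and refcounting code from its *own* libstdc++ headers; that code must
// agree exactly with the one the library was built against.
std::shared_ptr<Table> ReadTable(const std::string& path);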

So I'm not sure the endeavour makes sense at all.  You'll have to
try and follow the libstdc++ ABI spec:
https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html

Regards

Antoine.


On Fri, 21 Jun 2019 18:12:02 +0200
Antoine Pitrou <so...@pitrou.net> wrote:
> On Thu, 20 Jun 2019 15:47:49 -0700
> Zhuo Peng <br...@gmail.com> wrote:
> > 
> > One might argue that everyone releasing manylinux1 packages should use
> > exactly the same compiler, as provided by the pypa docker image, however
> > the standard only specifies the maximum versions of corresponding
> > fundamental libraries [5]. Newer GCC versions could be backported to work
> > with older libraries [6].
> > 
> > A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> > members in publicly accessible class declarations and will resolve our
> > immediate problem, but I wonder if there is, or there should be an explicit
> > policy on the ABI compatibility, especially regarding the usage of template
> > functions / classes in public interfaces?  
> 
> IMHO, the only reasonable policy for now is that there is no ABI
> compatibility.  If you'd like to benefit from the PyArrow binary
> packages, including the C++ API, then you need to use the same toolchain
> (or an ABI-compatible toolchain, but I'm afraid there's no clear
> specification of ABI compatibility in g++ / libstdc++ land).
> 
> > * Our wheel cannot pass “auditwheel repair”
> > 
> > I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> > our wheel and have user’s Python load both our libarrow.so and pyarrow’s,
> > but that’s what “auditwheel repair” attempts to do. But if we don’t allow
> > auditwheel to do so, it refuses to stamp on our wheel because it has
> > “external” dependencies.  
> 
> You know, I wish the scientific communities would stop producing wheels
> and instead encourage users to switch to conda.  The wheel paradigm is
> conceptually antiquated and is really a nuisance to package developers
> and maintainers.
> 
> Regards
> 
> Antoine.
> 
> 
> 




Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Antoine Pitrou <so...@pitrou.net>.
On Sat, 22 Jun 2019 09:54:14 -0400
Antonio Cavallo <an...@gmail.com> wrote:
> 
> You know, I wish the scientific communities would stop producing wheels
> > and instead encourage users to switch to conda.  The wheel paradigm is
> > conceptually antiquated and is really a nuisance to package developers
> > and maintainers.  
> 
> I cannot agree more, and it is not restricted only to the scientific
> community (I would consider myself to be working in a hybrid environment).
> 
> Maybe we should embrace it.
> 
> We could agree on a common "package" format (bin/lib/include/data) between
> pip and conda to begin with (pretty much like rpm, which is based on cpio
> instead of zip).
> While there might be disagreements about the surrounding tooling (the build
> and the dependency resolver), at least we could pin down a single format
> and the installer (with bare minimal logic in it).

I'm not sure I understand your proposal here.  You mean invent another
package format that's not some already existing de-facto standard?

Personally, I think it would be more reasonable to encourage conda as a
de-facto standard everywhere people cannot use system-provided packages
(for example because they are outdated).

> Seminal projects in this space are IMHO:
> 
>   https://build.opensuse.org/ (or the CI/CD system before it became
> "fashionable")
>     Basically, it allows creating a package for each "binary" platform
> instead of one package that rules them all, automatically (embrace it if
> you cannot beat it)
> 
>   https://github.com/QuantStack/mamba (it's a step in the right direction
> to split the install part from the dependency resolution)
>     (https://medium.com/@wolfv/making-conda-fast-again-4da4debfb3b7)
> 
> Please let me know if that sounds interesting.

I don't know.  It's certainly interesting in an abstract, conceptual
way.  Concretely for Arrow?  I'm not sure :-)  Perhaps other Arrow core
developers will have a more elaborate opinion.

Regards

Antoine.



Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Antonio Cavallo <an...@gmail.com>.
>
>
> Zhuo Peng <br...@gmail.com> wrote:
> > fundamental libraries [5]. Newer GCC versions could be backported to work
> > with older libraries [6].


That would be great, but it would require a design agreement among the major
compiler vendors (GCC, Intel, LLVM, Portland, etc.).


>  then you need to use the same toolchain
> (or an ABI-compatible toolchain, but I'm afraid there's no clear
> specification of ABI compatibility in g++ / libstdc++ land).
>

That is a structural deficiency (and flaw) of Unix, where the SDK is the
live system. RH did some work in that area by providing packages, but they
guaranteed compatibility only across a limited number of versions, and only
for RH.

You know, I wish the scientific communities would stop producing wheels
> and instead encourage users to switch to conda.  The wheel paradigm is
> conceptually antiquated and is really a nuisance to package developers
> and maintainers.


I cannot agree more, and it is not restricted only to the scientific
community (I would consider myself to be working in a hybrid environment).

Maybe we should embrace it.

We could agree on a common "package" format (bin/lib/include/data) between
pip and conda to begin with (pretty much like rpm, which is based on cpio
instead of zip).
While there might be disagreements about the surrounding tooling (the build
and the dependency resolver), at least we could pin down a single format and
the installer (with bare minimal logic in it).

Seminal projects in this space are IMHO:

  https://build.opensuse.org/ (or the CI/CD system before it became
"fashionable")
    Basically, it allows creating a package for each "binary" platform
instead of one package that rules them all, automatically (embrace it if
you cannot beat it)

  https://github.com/QuantStack/mamba (it's a step in the right direction
to split the install part from the dependency resolution)
    (https://medium.com/@wolfv/making-conda-fast-again-4da4debfb3b7)

Please let me know if that sounds interesting.

PS> rpm settled on a main package (bin+libs), a -devel package (for headers
and static libraries), and a -debuginfo package for debugging symbols.

Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Antoine Pitrou <so...@pitrou.net>.
On Thu, 20 Jun 2019 15:47:49 -0700
Zhuo Peng <br...@gmail.com> wrote:
> 
> One might argue that everyone releasing manylinux1 packages should use
> exactly the same compiler, as provided by the pypa docker image, however
> the standard only specifies the maximum versions of corresponding
> fundamental libraries [5]. Newer GCC versions could be backported to work
> with older libraries [6].
> 
> A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> members in publicly accessible class declarations and will resolve our
> immediate problem, but I wonder if there is, or there should be an explicit
> policy on the ABI compatibility, especially regarding the usage of template
> functions / classes in public interfaces?

IMHO, the only reasonable policy for now is that there is no ABI
compatibility.  If you'd like to benefit from the PyArrow binary
packages, including the C++ API, then you need to use the same toolchain
(or an ABI-compatible toolchain, but I'm afraid there's no clear
specification of ABI compatibility in g++ / libstdc++ land).

> * Our wheel cannot pass “auditwheel repair”
> 
> I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> our wheel and have user’s Python load both our libarrow.so and pyarrow’s,
> but that’s what “auditwheel repair” attempts to do. But if we don’t allow
> auditwheel to do so, it refuses to stamp on our wheel because it has
> “external” dependencies.

You know, I wish the scientific communities would stop producing wheels
and instead encourage users to switch to conda.  The wheel paradigm is
conceptually antiquated and is really a nuisance to package developers
and maintainers.

Regards

Antoine.



Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Antoine Pitrou <so...@pitrou.net>.
On Fri, 28 Jun 2019 09:43:07 -0700
Zhuo Peng <br...@gmail.com> wrote:
> 
> Or maybe we could disallow STL classes in arrow's public headers. This
> might not be feasible, because std::shared_ptr and std::vector are used
> everywhere.

Indeed I'm not sure how that would be possible.

Also I suppose *any* piece of software that refers to STL classes in
its public headers would be affected, so I'm not sure how they deal
with the issue, or whether they simply do not care.

> Or maybe we only allow some "safe" STL classes in the public headers. But
> there is no guarantee for them to be safe. It's purely empirical.

Right.

Regards

Antoine.



Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Zhuo Peng <br...@gmail.com>.
Thanks everyone. I think there are two issues being discussed here and I'd
like to keep them separate:

1. the ABI compatibility of Arrow's pip binary release.
It's true that there is no ABI standard and the topic is messy, but
as Antoine pointed out:

> If you'd like to benefit from the PyArrow binary
> packages, including the C++ API, then you need to use the same toolchain
> (or an ABI-compatible toolchain, but I'm afraid there's no clear
> specification of ABI compatibility in g++ / libstdc++ land).

we should be safe. And I think manylinux (which says everyone should use
GCC/libstdc++ and should not use a GNU ABI version newer than X) together
with the GNU ABI Policy and Guidelines [1] (which say that binaries with
equivalent DT_SONAMEs are forward-compatible, and IIUC the SONAME has been
libstdc++.so.6 for quite a while, since GCC 3.4) give us that guarantee.

2. the ODR (one definition rule) violation caused by template classes,
specifically STL classes.

Strictly speaking, this is not about ABI compatibility, and sticking to
manylinux does not prevent this problem. The problem arises essentially
because the STL headers shipped with GCC change across versions: there is no
guarantee that those STL classes will keep the same layout forever, and the
layout has changed without notice (see the example in my original post).

Again, note that manylinux does not specify which toolchain everyone should
use. It merely specifies the maximum version of those fundamental
libraries. And with manylinux2010, people might have more choices in
compiler versions. For example, devtoolset-6 and devtoolset-7 both qualify.

I guess I was asking for a policy or guideline regarding how to correctly
build things that depend on Arrow's pip release. Even if the guideline says
"you need to build your library in this docker image", it's still an
improvement over the current situation. It might greatly limit developers'
choices, though, if they also want to depend on some other library, or want
to use a newer or older GCC version.

Or maybe we could disallow STL classes in arrow's public headers. This
might not be feasible, because std::shared_ptr and std::vector are used
everywhere.

Or maybe we only allow some "safe" STL classes in the public headers. But
there is no guarantee for them to be safe. It's purely empirical.
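
At best, such empirical assumptions could be pinned at build time, so that a
mismatched toolchain fails to compile instead of corrupting memory at
runtime. A sketch (the sizes are illustrative for one particular x86-64
libstdc++ configuration, not portable constants):

#include <string>
#include <unordered_map>

// Turn the "safe layout" assumption into a compile-time check.
static_assert(sizeof(std::string) == 32,
              "std::string layout mismatch; rebuild with a matching toolchain");
static_assert(sizeof(std::unordered_map<std::string, int>) == 56,
              "std::unordered_map layout mismatch; rebuild with a matching toolchain");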

On Thu, Jun 20, 2019 at 3:47 PM Zhuo Peng <br...@gmail.com> wrote:

> Dear Arrow maintainers,
>
> I work on several TFX (TensorFlow eXtended) [1] projects (e.g. TensorFlow
> Data Validation [2]) and am trying to use Arrow in them. These projects are
> mostly written in Python but have C++ code as Python extension modules,
> therefore we use both Arrow’s C++ and Python APIs. Our projects are
> distributed through PyPI as binary packages.
>
> The python extension modules are compiled with the headers shipped within
> pyarrow PyPI binary package and are linked with libarrow.so and
> libarrow_python.so in the same package. So far we’ve seen two major
> problems:
>
> * There are STL container definitions in public headers.
>
> It causes problems because the binary for template classes is generated at
> compilation time. And the definition of those template classes might differ
> from compiler to compiler. This might happen even if we use a different GCC
>  version than the one that compiled pyarrow (for example, the layout of
> std::unordered_map<> has changed in GCC 5.2 [3], and arrow::Schema used to
> contain an std::unordered_map<> member [4].)
>
> One might argue that everyone releasing manylinux1 packages should use
> exactly the same compiler, as provided by the pypa docker image, however
> the standard only specifies the maximum versions of corresponding
> fundamental libraries [5]. Newer GCC versions could be backported to work
> with older libraries [6].
>
> A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> members in publicly accessible class declarations and will resolve our
> immediate problem, but I wonder if there is, or there should be an explicit
> policy on the ABI compatibility, especially regarding the usage of template
> functions / classes in public interfaces?
>
> * Our wheel cannot pass “auditwheel repair”
>
> I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> our wheel and have user’s Python load both our libarrow.so and pyarrow’s,
> but that’s what “auditwheel repair” attempts to do. But if we don’t allow
> auditwheel to do so, it refuses to stamp on our wheel because it has
> “external” dependencies.
>
> This does not seem to be an Arrow problem, but I wonder whether others in the community
> have had to deal with similar issues and what the resolution is. Our
> current workaround is to manually stamp the wheel.
>
>
> Thanks,
> Zhuo
>
>
> References:
>
> [1] https://github.com/tensorflow/tfx
> [2] https://github.com/tensorflow/data-validation
> [3]
> https://github.com/gcc-mirror/gcc/commit/54b755d349d17bb197511529746cd7cf8ea761c1#diff-f82d3b9fa19961eed132b10c9a73903e
> [4]
> https://github.com/apache/arrow/blob/b22848952f09d6f9487feaff80ee358ca41b1562/cpp/src/arrow/type.h#L532
> [5] https://www.python.org/dev/peps/pep-0513/#id40
> [6] https://github.com/pypa/auditwheel/issues/125#issuecomment-438513357
> [7]
> https://github.com/apache/arrow/commit/7a5562174cffb21b16f990f64d114c1a94a30556
> [8]
> https://github.com/apache/arrow/blob/a0e1fbb9ef51d05a3f28e221cf8c5d4031a50c93/cpp/src/arrow/ipc/dictionary.h#L91
>

Re: How should a Python/C++ project depend on Arrow (issues with ABI and wheel)?

Posted by Wes McKinney <we...@gmail.com>.
hi Zhuo,

On Thu, Jun 20, 2019 at 5:48 PM Zhuo Peng <br...@gmail.com> wrote:
>
> Dear Arrow maintainers,
>
> I work on several TFX (TensorFlow eXtended) [1] projects (e.g. TensorFlow
> Data Validation [2]) and am trying to use Arrow in them. These projects are
> mostly written in Python but have C++ code as Python extension modules,
> therefore we use both Arrow’s C++ and Python APIs. Our projects are
> distributed through PyPI as binary packages.
>
> The python extension modules are compiled with the headers shipped within
> pyarrow PyPI binary package and are linked with libarrow.so and
> libarrow_python.so in the same package. So far we’ve seen two major
> problems:
>
> * There are STL container definitions in public headers.
>

I think this should be regarded as a bug (exporting compiled STL
symbols). It seems like you agree, but we have let some symbols leak, in
large part because the scope of the project is large and we need more
contributors (who understand the issue and the solutions) to help look
after these issues.
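
For anyone who wants to help with that: the usual discipline looks roughly
like the following sketch (not Arrow's actual macros), assuming the library
is built with -fvisibility=hidden and -fvisibility-inlines-hidden:

// visibility.h (hypothetical)
#if defined(_WIN32)
#define MYLIB_EXPORT __declspec(dllexport)
#else
#define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

// Only annotated entities land in the .so's dynamic symbol table; STL
// instantiations pulled in from <unordered_map> and friends stay hidden,
// so they can neither collide with nor be resolved against copies living
// in other extension modules at load time.
class MYLIB_EXPORT Schema { /* ... */ };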

> It causes problems because the binary for template classes is generated at
> compilation time. And the definition of those template classes might differ
> from compiler to compiler. This might happen even if we use a different GCC
>  version than the one that compiled pyarrow (for example, the layout of
> std::unordered_map<> has changed in GCC 5.2 [3], and arrow::Schema used to
> contain an std::unordered_map<> member [4].)
>
> One might argue that everyone releasing manylinux1 packages should use
> exactly the same compiler, as provided by the pypa docker image, however
> the standard only specifies the maximum versions of corresponding
> fundamental libraries [5]. Newer GCC versions could be backported to work
> with older libraries [6].
>
> A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> members in publicly accessible class declarations and will resolve our
> immediate problem, but I wonder if there is, or there should be an explicit
> policy on the ABI compatibility, especially regarding the usage of template
> functions / classes in public interfaces?
>
> * Our wheel cannot pass “auditwheel repair”
>
> I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> our wheel and have user’s Python load both our libarrow.so and pyarrow’s,
> but that’s what “auditwheel repair” attempts to do. But if we don’t allow
> auditwheel to do so, it refuses to stamp on our wheel because it has
> “external” dependencies.
>
> This does not seem to be an Arrow problem, but I wonder whether others in the community
> have had to deal with similar issues and what the resolution is. Our
> current workaround is to manually stamp the wheel.
>

You aren't vendoring libarrow, right (if so, that's a bigger issue)?
I'm not an expert on how to appease auditwheel but this seems like
something we should sort out so that other projects' wheels can depend
on the pyarrow wheels. For the record, the whole wheel infrastructure
is poorly adapted for this scenario, which conda handles much more
gracefully.

>
> Thanks,
> Zhuo
>
>
> References:
>
> [1] https://github.com/tensorflow/tfx
> [2] https://github.com/tensorflow/data-validation
> [3]
> https://github.com/gcc-mirror/gcc/commit/54b755d349d17bb197511529746cd7cf8ea761c1#diff-f82d3b9fa19961eed132b10c9a73903e
> [4]
> https://github.com/apache/arrow/blob/b22848952f09d6f9487feaff80ee358ca41b1562/cpp/src/arrow/type.h#L532
> [5] https://www.python.org/dev/peps/pep-0513/#id40
> [6] https://github.com/pypa/auditwheel/issues/125#issuecomment-438513357
> [7]
> https://github.com/apache/arrow/commit/7a5562174cffb21b16f990f64d114c1a94a30556
> [8]
> https://github.com/apache/arrow/blob/a0e1fbb9ef51d05a3f28e221cf8c5d4031a50c93/cpp/src/arrow/ipc/dictionary.h#L91