Posted to dev@arrow.apache.org by Alex Hagerman <al...@unexpectedeof.net> on 2018/04/19 11:17:34 UTC

Sync Call Notes

Notes from yesterday's sync call:

Uwe suggested adding C++ ABI checks to detect breaking changes. The
group discussed running them in a daily CI build job.

Wes asked if certain C++ symbols could be marked experimental when 
performing the C++ ABI checks.
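
One way such an exemption might work is to filter the exported symbol list before comparing two builds. The sketch below is only an illustration: the namespace `arrow::experimental` and the function names are hypothetical, and the matching relies on Itanium C++ ABI name mangling (GCC, Clang), where that namespace is spelled `5arrow12experimental`. It is a coarse heuristic, not a description of any existing ABI tool.

```cpp
#include <string>
#include <vector>

// Hypothetical namespace for experimental APIs. Under the Itanium C++ ABI,
// "arrow::experimental" is mangled as "5arrow12experimental".
static const char kExperimentalFragment[] = "5arrow12experimental";

// Coarse heuristic: treat a symbol as experimental if its mangled name
// contains the namespace fragment.
bool IsExperimentalSymbol(const std::string& mangled) {
  return mangled.find(kExperimentalFragment) != std::string::npos;
}

// Returns the exported symbols that an ABI comparison should still enforce,
// i.e. everything that is not marked experimental.
std::vector<std::string> SymbolsToCheck(
    const std::vector<std::string>& exported) {
  std::vector<std::string> checked;
  for (const std::string& symbol : exported) {
    if (!IsExperimentalSymbol(symbol)) checked.push_back(symbol);
  }
  return checked;
}
```

A daily CI job could then diff only the filtered list between the last release and HEAD, so that changes to experimental symbols do not fail the build.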

Uwe also mentioned the potential of using PIMPLs to hide pointers and 
implementation to prevent future C++ ABI breakage. He mentioned Parquet 
C++ has a similar setup.
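
For readers unfamiliar with the idiom, a minimal PIMPL sketch follows. The class `Table` and its members are hypothetical stand-ins, not Arrow's or Parquet's actual code. The point is that the public class holds only an opaque pointer, so private members can be added or reordered without changing the size or layout that downstream binaries were compiled against.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// ---- What would live in the public header (hypothetical names) ----
class Table {
 public:
  explicit Table(std::vector<std::string> column_names);
  ~Table();  // defined below, where Impl is a complete type
  Table(Table&&) noexcept;
  Table& operator=(Table&&) noexcept;

  int64_t num_columns() const;
  const std::string& column_name(int64_t i) const;

 private:
  class Impl;                   // forward declaration only
  std::unique_ptr<Impl> impl_;  // the only data member of the public class
};

// ---- What would live in the .cc file ----
class Table::Impl {
 public:
  explicit Impl(std::vector<std::string> names) : names_(std::move(names)) {}
  // Members can be added here without changing the layout of Table.
  std::vector<std::string> names_;
};

Table::Table(std::vector<std::string> column_names)
    : impl_(new Impl(std::move(column_names))) {}
Table::~Table() = default;
Table::Table(Table&&) noexcept = default;
Table& Table::operator=(Table&&) noexcept = default;

int64_t Table::num_columns() const {
  return static_cast<int64_t>(impl_->names_.size());
}

const std::string& Table::column_name(int64_t i) const {
  return impl_->names_.at(static_cast<std::size_t>(i));
}
```

The trade-off is that every accessor becomes an out-of-line call plus a pointer dereference, since nothing about `Impl` is visible in the header.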


Discussion around having all build artifacts and testing ready for 
automation in relation to the 0.10.0 release.


Wes mentioned that the C++ and Java implementation threads around Unions 
have gone stale, but should be completed before the 1.0 release.


It was mentioned that integration tests should cover schema and field 
metadata.


Discussion around the upcoming Man AHL hackathon and PyCon sprints.

  * JIRA tickets have been updated by Uwe with extra details and notes
  * Some tickets have been marked with beginner tags
  * Suggested using conda to make the environment setup/build process
    easier for new contributors
  * Pandas extension series mentioned as a potentially good
    hackathon/sprint topic


Mention of revisiting Arrow roadmap discussions to potentially break the 
roadmap into smaller JIRA tickets for contributors.


Re: Sync Call Notes

Posted by Antoine Pitrou <an...@python.org>.
Done, see https://issues.apache.org/jira/browse/ARROW-2522



On 28/04/2018 at 02:01, Wes McKinney wrote:
> Yes, I'd say let's definitely bump the SO version with each major
> release. Is there a JIRA for this already? If not, let's create one.
> 
> On Tue, Apr 24, 2018 at 2:02 PM, Antoine Pitrou <an...@python.org> wrote:
>>
>> On 24/04/2018 at 19:58, Wes McKinney wrote:
>>>
>>> In summary, until the Arrow developer group grows significantly
>>> larger, I think we should expect the users of these libraries to "live
>>> at HEAD". I do think we should make ABI changes transparent and
>>> well-documented so the pain is minimized. For the moment, we still
>>> have a lot of development work to do for more people to "care" about
>>> Apache Arrow and invest in its success long term.
>>
>> Perhaps we can also version the installed SO files and bump their
>> version at each feature release?  Right now it's just "libarrow.so.0"
>> (pointing to "libarrow.so.0.0.0").
>>
>> Regards
>>
>> Antoine.

Re: Sync Call Notes

Posted by Wes McKinney <we...@gmail.com>.
Yes, I'd say let's definitely bump the SO version with each major
release. Is there a JIRA for this already? If not, let's create one.

On Tue, Apr 24, 2018 at 2:02 PM, Antoine Pitrou <an...@python.org> wrote:
>
> On 24/04/2018 at 19:58, Wes McKinney wrote:
>>
>> In summary, until the Arrow developer group grows significantly
>> larger, I think we should expect the users of these libraries to "live
>> at HEAD". I do think we should make ABI changes transparent and
>> well-documented so the pain is minimized. For the moment, we still
>> have a lot of development work to do for more people to "care" about
>> Apache Arrow and invest in its success long term.
>
> Perhaps we can also version the installed SO files and bump their
> version at each feature release?  Right now it's just "libarrow.so.0"
> (pointing to "libarrow.so.0.0.0").
>
> Regards
>
> Antoine.

Re: Sync Call Notes

Posted by Antoine Pitrou <an...@python.org>.
On 24/04/2018 at 19:58, Wes McKinney wrote:
> 
> In summary, until the Arrow developer group grows significantly
> larger, I think we should expect the users of these libraries to "live
> at HEAD". I do think we should make ABI changes transparent and
> well-documented so the pain is minimized. For the moment, we still
> have a lot of development work to do for more people to "care" about
> Apache Arrow and invest in its success long term.

Perhaps we can also version the installed SO files and bump their
version at each feature release?  Right now it's just "libarrow.so.0"
(pointing to "libarrow.so.0.0.0").

Regards

Antoine.

Re: Sync Call Notes

Posted by Wes McKinney <we...@gmail.com>.
Personally, I am not really in favor of ABI stability in the short
term, for a few reasons:

* We don't have enough maintainers as is to keep up with the
development flow in the project.

* It may harm forward progress in the project's design. Because
the development team is so small and there are so few maintainers,
there has not been a great deal of feedback on the general factoring
of the C++ code. When the size of the development team grows, it would
be valuable to be able to revisit design decisions based on feedback
from new contributors yet to join the project.

Basically, many ABI decisions have been made hurriedly and I think we
need the flexibility to fix our mistakes while the project is growing.

I think it would be more valuable to develop shared / reusable build
infrastructure to better accommodate an evolving ABI so that
rebuilding packages is not too onerous for downstream dependencies. In
large companies like Google that maintain monorepos, this problem is
solved by requiring all call sites associated with an ABI to be fixed
all at once. We probably won't be able to create a monorepo for all
projects that use Arrow, but we could make Turbodbc package rebuilds
easier, for example.

In summary, until the Arrow developer group grows significantly
larger, I think we should expect the users of these libraries to "live
at HEAD". I do think we should make ABI changes transparent and
well-documented so the pain is minimized. For the moment, we still
have a lot of development work to do for more people to "care" about
Apache Arrow and invest in its success long term.

- Wes

On Thu, Apr 19, 2018 at 1:38 PM, Antoine Pitrou <an...@python.org> wrote:
>
> Hi Uwe,
>
> On 19/04/2018 at 18:42, Uwe L. Korn wrote:
>>> 1) are we ok with paying the cost of pimpls? (mostly the indirection
>>> cost I guess, and the fact that we can't have inline methods/accessors
>>> anymore)
>>
>> I'm not sure how much of the cost we're ready to pay. There is certainly value in keeping a stable ABI (the NumPy people do this fantastically): you can do patch releases without consumers worrying about whether they need to rebuild their binaries.
>>
>> The indirection on paths that call expensive functions is certainly no problem. For example, if you have a table and select a column, this is an operation you don't do often, so I think the overhead is acceptable. On the other hand, accessing the null_count or the length of an array is definitely an operation that is performed quite often. These should be as fast as possible.
>>
>> I cannot give you a definite answer yet. Once I have the time, I'll try to implement and profile some of the possible approaches.
>>
>>> 2) what do we do about things like ArrayData, which seems publicly exposed
>>> by design?
>>
>> ArrayData is marked as internal and thus I would feel ok to break its ABI between non-major releases. If people really depend on its usage, then we should think of a clear way to make it public / non-internal.
>
> Perhaps we need a three-tiered approach?
>
> 1) a public and stable namespace ("arrow") with the goal to reach ABI
> stability post-1.0;
>
> 2) a public but still moving namespace ("arrow::unstable"?) where we
> generally try not to remove existing functionality and to honor API
> compatibility, but do not guarantee any sort of ABI stability;
> (this could have ArrayData, PrimitiveArray...)
>
> 3) an internal-use namespace ("arrow::internal"), which third-party
> projects can use at their own risk.
> (this should get all our internal helpers, including almost all CPython
> helpers)
>
> Regards
>
> Antoine.

Re: Sync Call Notes

Posted by Antoine Pitrou <an...@python.org>.
Hi Uwe,

On 19/04/2018 at 18:42, Uwe L. Korn wrote:
>> 1) are we ok with paying the cost of pimpls? (mostly the indirection
>> cost I guess, and the fact that we can't have inline methods/accessors
>> anymore)
> 
> I'm not sure how much of the cost we're ready to pay. There is certainly value in keeping a stable ABI (the NumPy people do this fantastically): you can do patch releases without consumers worrying about whether they need to rebuild their binaries.
> 
> The indirection on paths that call expensive functions is certainly no problem. For example, if you have a table and select a column, this is an operation you don't do often, so I think the overhead is acceptable. On the other hand, accessing the null_count or the length of an array is definitely an operation that is performed quite often. These should be as fast as possible.
> 
> I cannot give you a definite answer yet. Once I have the time, I'll try to implement and profile some of the possible approaches.
> 
>> 2) what do we do about things like ArrayData, which seems publicly exposed
>> by design?
> 
> ArrayData is marked as internal and thus I would feel ok to break its ABI between non-major releases. If people really depend on its usage, then we should think of a clear way to make it public / non-internal.

Perhaps we need a three-tiered approach?

1) a public and stable namespace ("arrow") with the goal to reach ABI
stability post-1.0;

2) a public but still moving namespace ("arrow::unstable"?) where we
generally try not to remove existing functionality and to honor API
compatibility, but do not guarantee any sort of ABI stability;
(this could have ArrayData, PrimitiveArray...)

3) an internal-use namespace ("arrow::internal"), which third-party
projects can use at their own risk.
(this should get all our internal helpers, including almost all CPython
helpers)
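
As a rough illustration of how the three tiers could look in a header, here is a small sketch. The namespace names come from the proposal above. The contents (`Buffer`, the fields of `ArrayData`, `PaddedSize`) are simplified, hypothetical stand-ins chosen only to show where things would go, not Arrow's real declarations.

```cpp
#include <cstdint>

namespace arrow {

// Tier 1: public and stable. The goal would be ABI stability post-1.0.
class Buffer {
 public:
  explicit Buffer(int64_t size);
  int64_t size() const;

 private:
  int64_t size_;
};

inline Buffer::Buffer(int64_t size) : size_(size) {}
inline int64_t Buffer::size() const { return size_; }

// Tier 2: public but still moving. API compatibility is honored where
// possible, but there is no ABI guarantee.
namespace unstable {

struct ArrayData {
  int64_t length;
  int64_t null_count;
};

}  // namespace unstable

// Tier 3: internal helpers. Third-party projects use these at their own risk.
namespace internal {

// Rounds a byte count up to the next multiple of 64 (hypothetical helper).
inline int64_t PaddedSize(int64_t nbytes) { return (nbytes + 63) / 64 * 64; }

}  // namespace internal

}  // namespace arrow
```

A side effect of this layout is that the tier is visible in every symbol name, so tooling and reviewers can tell from a call site alone which stability promise applies.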

Regards

Antoine.

Re: Sync Call Notes

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Antoine,

comments inline.

On Thu, Apr 19, 2018, at 2:21 PM, Antoine Pitrou wrote:
> On Thu, 19 Apr 2018 07:17:34 -0400
> Alex Hagerman <al...@unexpectedeof.net> wrote:
> > Notes from yesterday's sync call:
> > 
> > Uwe suggested adding in checks for the C++ ABI to detect breaking 
> > changes. Discussed adding this to a CI build job daily.
> > 
> > Wes asked if certain C++ symbols could be marked experimental when 
> > performing the C++ ABI checks.
> > 
> > Uwe also mentioned the potential of using PIMPLs to hide pointers and 
> > implementation to prevent future C++ ABI breakage. He mentioned Parquet 
> > C++ has a similar setup.  
> 
> Some questions:
> 
> 1) are we ok with paying the cost of pimpls? (mostly the indirection
> cost I guess, and the fact that we can't have inline methods/accessors
> anymore)

I'm not sure how much of the cost we're ready to pay. There is certainly value in keeping a stable ABI (the NumPy people do this fantastically): you can do patch releases without consumers worrying about whether they need to rebuild their binaries.

The indirection on paths that call expensive functions is certainly no problem. For example, if you have a table and select a column, this is an operation you don't do often, so I think the overhead is acceptable. On the other hand, accessing the null_count or the length of an array is definitely an operation that is performed quite often. These should be as fast as possible.

I cannot give you a definite answer yet. Once I have the time, I'll try to implement and profile some of the possible approaches.
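
One possible approach, sketched here purely as an assumption about what such an experiment might look like, is a hybrid: keep the frequently read fields as inline accessors and move everything else behind an opaque pointer. The class below is hypothetical and is not Arrow's actual Array.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical class for illustration; this is not Arrow's actual Array.
class Array {
 public:
  Array(std::string type_name, int64_t length, int64_t null_count);
  ~Array();  // defined below, where Impl is a complete type

  // Hot accessors: inline, no indirection. The cost is that these two
  // fields and their order become part of the ABI.
  int64_t length() const { return length_; }
  int64_t null_count() const { return null_count_; }

  // Cold operation: routed through the opaque pointer.
  std::string Describe() const;

 private:
  int64_t length_;
  int64_t null_count_;
  class Impl;
  std::unique_ptr<Impl> impl_;
};

// In a real library everything below would live in the .cc file.
class Array::Impl {
 public:
  explicit Impl(std::string type_name) : type_name_(std::move(type_name)) {}
  std::string type_name_;
};

Array::Array(std::string type_name, int64_t length, int64_t null_count)
    : length_(length),
      null_count_(null_count),
      impl_(new Impl(std::move(type_name))) {}

Array::~Array() = default;

std::string Array::Describe() const {
  return impl_->type_name_ + "[length=" + std::to_string(length_) +
         ", nulls=" + std::to_string(null_count_) + "]";
}
```

With this split, only the two inline fields are frozen. Whether the saved indirection is measurable is exactly what profiling would need to show.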

> 2) what do we do about things like ArrayData, which seems publicly exposed
> by design?

ArrayData is marked as internal and thus I would feel ok to break its ABI between non-major releases. If people really depend on its usage, then we should think of a clear way to make it public / non-internal.

> More generally, is it wise to focus on ABI compatibility even before a
> 1.0 is released?

Probably not. I care about that already because of two things:

 * I want to have a stable ABI for the core features for 1.0
 * I maintain a set of binary packages that depend on Arrow C++; having a stable ABI makes their maintenance much simpler. Most notably, for each new Arrow patch release, we currently also need to do a patch release of turbodbc. This is not a burden I would want to place on everyone who depends on Arrow.

Re: Sync Call Notes

Posted by Li Jin <ic...@gmail.com>.
Thanks for the notes! Sorry, my calendar reminder didn't pop up, so I missed the sync.

On Thu, Apr 19, 2018 at 8:22 AM Antoine Pitrou <so...@pitrou.net> wrote:

> On Thu, 19 Apr 2018 07:17:34 -0400
> Alex Hagerman <al...@unexpectedeof.net> wrote:
> > Notes from yesterday's sync call:
> >
> > Uwe suggested adding in checks for the C++ ABI to detect breaking
> > changes. Discussed adding this to a CI build job daily.
> >
> > Wes asked if certain C++ symbols could be marked experimental when
> > performing the C++ ABI checks.
> >
> > Uwe also mentioned the potential of using PIMPLs to hide pointers and
> > implementation to prevent future C++ ABI breakage. He mentioned Parquet
> > C++ has a similar setup.
>
> Some questions:
>
> 1) are we ok with paying the cost of pimpls? (mostly the indirection
> cost I guess, and the fact that we can't have inline methods/accessors
> anymore)
>
> 2) what do we do about things like ArrayData, which seems publicly exposed
> by design?
>
> More generally, is it wise to focus on ABI compatibility even before a
> 1.0 is released?
>
> Regards
>
> Antoine.
>

Re: Sync Call Notes

Posted by Antoine Pitrou <so...@pitrou.net>.
On Thu, 19 Apr 2018 07:17:34 -0400
Alex Hagerman <al...@unexpectedeof.net> wrote:
> Notes from yesterday's sync call:
> 
> Uwe suggested adding in checks for the C++ ABI to detect breaking 
> changes. Discussed adding this to a CI build job daily.
> 
> Wes asked if certain C++ symbols could be marked experimental when 
> performing the C++ ABI checks.
> 
> Uwe also mentioned the potential of using PIMPLs to hide pointers and 
> implementation to prevent future C++ ABI breakage. He mentioned Parquet 
> C++ has a similar setup.  

Some questions:

1) are we ok with paying the cost of pimpls? (mostly the indirection
cost I guess, and the fact that we can't have inline methods/accessors
anymore)

2) what do we do about things like ArrayData, which seems publicly exposed
by design?

More generally, is it wise to focus on ABI compatibility even before a
1.0 is released?

Regards

Antoine.