Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2020/11/09 22:47:17 UTC

[PROPOSAL] Supporting multiple major pyarrow versions

Hi everyone,

The Python SDK has a dependency on pyarrow [1], currently used only by
ParquetIO for its parquet reader and writer. The Arrow project recently hit
a major milestone with their 1.0 release. They now make forward- and
backward-compatibility guarantees for the IPC format, which is very
exciting and useful! But they're not making similar guarantees for releases
of the Arrow libraries. They intend for regular library releases (targeting
a 3-month cadence) to be major version bumps, with possible breaking API
changes [2].

If we only support a single major version of pyarrow, as we do for other
Python dependencies, this could present quite a challenge for any Beam
users that also have their own pyarrow dependency. If Beam keeps up with
the latest Arrow release, they'd have to upgrade pyarrow in lockstep with
Beam. Worse, if Beam *doesn't* keep its dependency up-to-date, our users
might be locked out of new features in pyarrow.

In order to alleviate this, I think we should maintain support for multiple
major pyarrow versions and make an effort to keep up with new Arrow
releases.

I've verified that every major release from our current lower bound,
0.15.1, up to the latest 2.x release works with the current ParquetIO
code*. So this should just be a matter of:
1) Expanding the bounds in setup.py (sketched below)
2) Adding test suites to run ParquetIO tests with older versions to catch
any regressions (in an offline discussion, +Udi Meiri <eh...@google.com>
volunteered to help out with this).
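To illustrate (1), the requirement line in setup.py would loosen roughly as
follows. The version bounds here are illustrative only; the actual current
pin is at [1]:

    # sdks/python/setup.py (sketch; bounds are illustrative)
    # Before: a single supported major version, e.g.
    #     'pyarrow>=0.15.1,<0.18.0',
    # After: allow every major release verified against ParquetIO:
    'pyarrow>=0.15.1,<3.0.0',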

I went ahead and created BEAM-11211 to track this, but please let me know
if there are any objections or concerns.

Brian

* There's actually a small regression just in 1.x: it can't write with LZ4
compression. But this can be easily detected at pipeline construction time,
as in the sketch below.
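For the curious, such a construction-time guard could look something like
this. The helper name is hypothetical, not the actual Beam code:

    # Hypothetical sketch of a pipeline-construction-time guard; the
    # real ParquetIO check may differ. pyarrow 1.x cannot write
    # LZ4-compressed parquet files, so fail fast before running.
    import pyarrow as pa

    def _check_codec_supported(codec):
      if codec == 'LZ4' and pa.__version__.startswith('1.'):
        raise ValueError(
            'LZ4 compression is not supported for writing with pyarrow '
            '1.x (%s installed); use another codec or pyarrow version.' %
            pa.__version__)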

[1]
https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
[2] https://arrow.apache.org/docs/format/Versioning.html

Re: [PROPOSAL] Supporting multiple major pyarrow versions

Posted by Ahmet Altay <al...@google.com>.
On Mon, Nov 9, 2020 at 3:37 PM Brian Hulette <bh...@google.com> wrote:

>
>
> On Mon, Nov 9, 2020 at 2:55 PM Ahmet Altay <al...@google.com> wrote:
>
>> This sounds reasonable. A few questions:
>> - Do we need to expand the test matrix every 3 months or so to add
>> support for different versions of pyarrow?
>>
>
> Yes, I think so. If there's a concern that this will become excessive, we
> might consider testing just the newest and the oldest supported versions.
>

Sounds reasonable to me. And I suppose we can exclude certain major versions
if we catch issues with them similar to the LZ4 regression, as in the sketch
below.
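For example (an illustrative PEP 440 specifier, not a committed change), a
whole major line could be skipped with a wildcard exclusion in setup.py:

    # Sketch: keep the supported range but exclude all 1.x releases.
    'pyarrow>=0.15.1,!=1.*,<3.0.0',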


>
>> - Which pyarrow version will we ship in the default container?
>>
>
> I think we should just ship the latest supported one since that's what any
> users without their own pyarrow dependency will use.
>

+1


>
>> - Related to the LZ4 regression, how did we catch this? If this is a one
>> off that is probably fine. It would make this less maintainable over time if
>> we need to have branching code for different pyarrow versions.
>>
>
> It was caught by a unit test [1], but it's also documented in the release
> notes for Arrow 1.0.0 [2].
>

It is great that it was caught by a unit test. We could easily miss things
like this in the release notes.


>
> [1]
> https://github.com/apache/beam/blob/96610c9c0f56a21e4e06388bb83685131b3b1c55/sdks/python/apache_beam/io/parquetio_test.py#L335
> [2] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/

Re: [PROPOSAL] Supporting multiple major pyarrow versions

Posted by Brian Hulette <bh...@google.com>.
On Mon, Nov 9, 2020 at 2:55 PM Ahmet Altay <al...@google.com> wrote:

> This sounds reasonable. A few questions:
> - Do we need to expand the test matrix every 3 months or so to add support
> for different versions of pyarrow?
>

Yes, I think so. If there's a concern that this will become excessive, we
might consider testing just the newest and the oldest supported versions.
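As a rough sketch (the EXPECTED_PYARROW_MAJOR variable and the helper are
hypothetical, not existing Beam infrastructure), each matrix entry could
assert it is really exercising the intended version:

    # Hypothetical guard for a pyarrow test matrix: fail fast if the
    # environment doesn't have the major version this entry should cover.
    import os
    import pyarrow as pa

    def check_matrix_entry():
      expected = os.environ.get('EXPECTED_PYARROW_MAJOR')  # e.g. '0' or '2'
      if expected and pa.__version__.split('.')[0] != expected:
        raise RuntimeError(
            'Expected pyarrow %s.x, found %s' % (expected, pa.__version__))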

> - Which pyarrow version will we ship in the default container?
>

I think we should just ship the latest supported one since that's what any
users without their own pyarrow dependency will use.

> - Related to the LZ4 regression, how did we catch this? If this is a one
> off that is probably fine. It would make this less maintainable over time if
> we need to have branching code for different pyarrow versions.
>

It was caught by a unit test [1], but it's also documented in the release
notes for Arrow 1.0.0 [2].

[1]
https://github.com/apache/beam/blob/96610c9c0f56a21e4e06388bb83685131b3b1c55/sdks/python/apache_beam/io/parquetio_test.py#L335
[2] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
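For context, the kind of version-gated test involved looks roughly like this
simplified sketch (the real parquetio_test.py linked in [1] is more
involved):

    # Simplified sketch in the spirit of the linked test: skip the LZ4
    # write case on pyarrow 1.x, where writing LZ4 parquet is unsupported.
    import unittest
    import pyarrow as pa
    import pyarrow.parquet as pq

    class CompressionTest(unittest.TestCase):
      @unittest.skipIf(pa.__version__.startswith('1.'),
                       'pyarrow 1.x cannot write LZ4-compressed parquet')
      def test_write_lz4(self):
        table = pa.Table.from_pydict({'x': [1, 2, 3]})
        pq.write_table(table, '/tmp/lz4_test.parquet', compression='LZ4')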



Re: [PROPOSAL] Supporting multiple major pyarrow versions

Posted by Ahmet Altay <al...@google.com>.
This sounds reasonable. A few questions:
- Do we need to expand the test matrix every 3 months or so to add support
for different versions of pyarrow?
- Which pyarrow version will we ship in the default container?
- Related to the LZ4 regression, how did we catch this? If this is a one
off that is probably fine. It would make this less maintainable over time if
we need to have branching code for different pyarrow versions.
