Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2021/09/23 17:25:27 UTC

[PROPOSAL] Drop official support for pandas 1.0.x in the DataFrame API

Hi all,
I probably haven't discussed this topic enough on dev@, so I have a lot of
background to provide here:

My goal with the DataFrame API has been to provide an API that is
compatible with the version of pandas being used (both at pipeline
construction time, and on worker nodes), subject to the documented
"Differences from pandas" [1]. This is what we verify in frames_test.py [2]
and pandas_doctests_test.py [3].

We've already decided not to support pandas<1.0. 1.0 was a major milestone
that made several breaking changes, so it's quite challenging to maintain
support on both sides of that boundary. However, even maintaining support
across multiple minor versions (1.0, 1.1, ...) is challenging for
our unique usage, since these versions add new features that must be
conditionally supported, fix bugs that we previously worked around, etc. So
to make sure we maintain support, I've been looking at continuously
verifying the DataFrame API with multiple minor pandas versions [4].


So on to the issue at hand: While working on [4], I discovered that we
already have quite a few issues with pandas 1.0.x [5]. I looked into
working around them, but doing so requires quite a bit of logic that's
difficult to reason about and maintain. So I'm wondering if we should:
- officially drop support for pandas 1.0.x in the DataFrame API, and
- going forward have a policy to maintain support for the latest 3 minor
versions of pandas in the DataFrame API

This will reduce our maintenance burden, as well as the cost of verifying
supported versions. I collected some data about the pandas minor versions
to help inform this decision, detailed below. Note that pandas minor
versions are released roughly every 6 months, and 1.0.x represents about 9%
of pandas downloads over the last 30 days (8.76%, per the data below).
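
As an illustration of the proposed "latest 3 minor versions" policy, here is
a minimal sketch. The helper names (supported_minors, is_supported) are
hypothetical and not part of Beam; the list of released minors is taken from
the release table in this message.

```python
# Hypothetical helpers, not part of Beam. Versions are (major, minor) tuples.
RELEASED_MINORS = [(1, 0), (1, 1), (1, 2), (1, 3)]  # from the release table


def supported_minors(released, window=3):
    """Return the newest `window` minor versions, oldest first."""
    return sorted(released)[-window:]


def is_supported(version, released, window=3):
    """True if an 'X.Y.Z' version string falls inside the supported window."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) in supported_minors(released, window)


print(supported_minors(RELEASED_MINORS))
print(is_supported("1.0.5", RELEASED_MINORS))
print(is_supported("1.2.4", RELEASED_MINORS))
```

With the four minors released so far, the window covers 1.1, 1.2 and 1.3, so
1.0.x falls outside it.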

If anyone objects to either of the above proposals, please let me know.

Brian


pandas downloads by minor version, over the last 30 days [6]:
1.3.x: 16.88%
1.2.x: 9.73%
1.1.x: 17.64%
1.0.x: 8.76%
0.25.x: 14.36%

>=1.0: 53.01%
>=1.1: 44.25%
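
The two cumulative figures above can be reproduced from the per-version
shares. A small sketch, using Decimal so the sums are exact (share_at_least
is a hypothetical helper name):

```python
from decimal import Decimal

# Download shares (percent) by minor version, copied from the table above.
SHARES = {
    (1, 3): Decimal("16.88"),
    (1, 2): Decimal("9.73"),
    (1, 1): Decimal("17.64"),
    (1, 0): Decimal("8.76"),
    (0, 25): Decimal("14.36"),
}


def share_at_least(shares, minimum):
    """Sum the shares of all listed versions >= `minimum` (a (major, minor) tuple)."""
    return sum(
        (pct for minor, pct in shares.items() if minor >= minimum),
        Decimal("0"),
    )


print(share_at_least(SHARES, (1, 0)))  # 53.01
print(share_at_least(SHARES, (1, 1)))  # 44.25
```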

pandas minor version release dates from PyPI [7]:
1.0.0: 2020-01
1.1.0: 2020-07
1.2.0: 2020-12
1.3.0: 2021-07
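
For reference, the "roughly every 6 months" cadence follows from these dates.
A short sketch (the helper names are hypothetical):

```python
# Release months copied from the table above.
RELEASES = {
    "1.0.0": "2020-01",
    "1.1.0": "2020-07",
    "1.2.0": "2020-12",
    "1.3.0": "2021-07",
}


def month_index(year_month):
    """Convert 'YYYY-MM' into a running month count."""
    year, month = (int(part) for part in year_month.split("-"))
    return year * 12 + month


def gaps_in_months(releases):
    """Months between consecutive releases, in chronological order."""
    months = sorted(month_index(date) for date in releases.values())
    return [later - earlier for earlier, later in zip(months, months[1:])]


gaps = gaps_in_months(RELEASES)
print(gaps)                   # [6, 5, 7]
print(sum(gaps) / len(gaps))  # 6.0
```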

[1]
https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/frames_test.py
[3]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/pandas_doctests_test.py
[4] https://issues.apache.org/jira/browse/BEAM-12907
[5] https://issues.apache.org/jira/browse/BEAM-12945
[6]
https://docs.google.com/spreadsheets/d/1V-CfygItUmJaJfPbu-kVfPo1e822quYt52ewfOhbEBM/edit?usp=sharing&resourcekey=0-TQtWEo9vy7hElMOPa39-ag
[7] https://pypi.org/project/pandas/#history

Re: [PROPOSAL] Drop official support for pandas 1.0.x in the DataFrame API

Posted by Robert Bradshaw <ro...@google.com>.
On Thu, Sep 23, 2021 at 11:07 AM Brian Hulette <bh...@google.com> wrote:
>
> It's not that hard just to work around the implementation issues, we could easily conditionally add those missing operations. What gets messier is punching holes in our verification to work around problems, e.g. [1]. Unfortunately supporting 1.0.x is particularly bad as there are a lot of bad doctests that we'll need to skip.
>
> What we might do is at least fix the import issue with pandas 1.0, but then not explicitly verify it continuously. We could claim pandas versions outside of the verified range are best effort.

+1.

> [1] https://github.com/apache/beam/blob/9431cf5ff21d85d73a0acf69efa52c75364bcc97/sdks/python/apache_beam/dataframe/frames_test.py#L708
>
> On Thu, Sep 23, 2021 at 10:54 AM Robert Bradshaw <ro...@google.com> wrote:
>>
>> How hard would it be to have degraded support for old pandas versions?
>> E.g. for the issue mentioned, just have those operations missing if a
>> recent enough pandas is not installed?

Re: [PROPOSAL] Drop official support for pandas 1.0.x in the DataFrame API

Posted by Brian Hulette <bh...@google.com>.
It's not that hard to work around the implementation issues; we could
easily add those missing operations conditionally. What gets messier is
punching holes in our verification to work around problems, e.g. [1].
Unfortunately supporting 1.0.x is particularly bad as there are a lot of
bad doctests that we'll need to skip.
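
To make "punching holes in our verification" concrete, here is a hedged
sketch of a version-gated test using unittest.skipIf. The test class and the
hard-coded PD_VERSION are illustrative assumptions, not the actual contents
of frames_test.py.

```python
import unittest

# Assumption: real code would derive this from pandas.__version__; it is
# hard-coded here so the sketch runs without pandas installed.
PD_VERSION = (1, 0)


class FramesTestSketch(unittest.TestCase):
    def test_always_verified(self):
        self.assertEqual(1 + 1, 2)

    # Hypothetical hole: skipped whenever the installed pandas is too old.
    @unittest.skipIf(PD_VERSION < (1, 1), "behavior differs in pandas 1.0.x")
    def test_needs_newer_pandas(self):
        self.assertTrue(True)


def run_sketch():
    """Run the sketch test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FramesTestSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_sketch()
print(len(result.skipped), result.wasSuccessful())
```

Each such skip is a place where the suite no longer verifies behavior on the
older version, which is the maintenance cost described above.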

What we might do is at least fix the import issue with pandas 1.0, but then
not explicitly verify it continuously. We could claim pandas versions
outside of the verified range are best effort.
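
One way to picture the resulting tiers is sketched below. The function name
and the set of verified minors are assumptions for illustration (1.1 through
1.3 being the latest three at the time of writing), not an existing Beam API.

```python
# Assumption: the continuously verified range is the latest three minors.
VERIFIED_MINORS = {(1, 1), (1, 2), (1, 3)}


def support_level(version):
    """Classify an 'X.Y.Z' pandas version string into a support tier."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (1, 0):
        return "unsupported"  # pandas<1.0 was already ruled out
    if (major, minor) in VERIFIED_MINORS:
        return "verified"     # covered by continuous verification
    return "best effort"      # importable, but not continuously verified


for version in ("0.25.3", "1.0.5", "1.3.2"):
    print(version, support_level(version))
```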

Brian

[1]
https://github.com/apache/beam/blob/9431cf5ff21d85d73a0acf69efa52c75364bcc97/sdks/python/apache_beam/dataframe/frames_test.py#L708

On Thu, Sep 23, 2021 at 10:54 AM Robert Bradshaw <ro...@google.com>
wrote:

> How hard would it be to have degraded support for old pandas versions?
> E.g. for the issue mentioned, just have those operations missing if a
> recent enough pandas is not installed?

Re: [PROPOSAL] Drop official support for pandas 1.0.x in the DataFrame API

Posted by Robert Bradshaw <ro...@google.com>.
How hard would it be to have degraded support for old pandas versions?
E.g. for the issue mentioned, just have those operations missing if a
recent enough pandas is not installed?
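
A minimal sketch of what such degraded support could look like: operations
that depend on newer pandas behavior are simply not defined when the
installed version is too old. All names here (build_operations, "newer_op")
and the 1.1 threshold are hypothetical, chosen only to illustrate the idea.

```python
def parse_minor(version):
    """Return (major, minor) from an 'X.Y.Z' version string."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor)


def build_operations(pandas_version):
    """Return the operations available for the given pandas version."""
    installed = parse_minor(pandas_version)
    ops = {"sum": lambda values: sum(values)}
    # Hypothetical gate: pretend "newer_op" relies on behavior added in 1.1.
    if installed >= (1, 1):
        ops["newer_op"] = lambda values: sorted(values)
    return ops


print(sorted(build_operations("1.0.5")))
print(sorted(build_operations("1.3.0")))
```

In real code the version string would come from pandas.__version__; it is
passed in as an argument here so the sketch runs without pandas installed.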
