You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Andrey Anshin <an...@taragol.is> on 2023/10/03 22:03:10 UTC

Re: [DISCUSS] Future of Pendulum in Airflow

Good news, new pre-release of pendulum available:
https://pypi.org/project/pendulum/3.0.0b1/ ,
https://github.com/sdispater/pendulum/releases/tag/3.0.0b1

1. I checked and there is no memory leak on the unsuccessful parse.
2. Also check some timezone which previously doesn't work well, and as
expected it work now fine

I do not check on Airflow codebase yet



----
Best Wishes
*Andrey Anshin*



On Thu, 28 Sept 2023 at 18:03, Bolke de Bruin <bd...@gmail.com> wrote:

> FYI:
>
> I've just added:
>
> https://github.com/apache/airflow/pull/34667
>
> which documents how to use newer timezone information with Pendulum.
>
> Also work seems to be progressing (albeit slowly) on Pendulum 3:
>
> https://github.com/sdispater/pendulum/issues/600#issuecomment-1711299677
>
> Bolke
>
>
>
> On Thu, 28 Sept 2023 at 15:12, Bolke de Bruin <bd...@gmail.com> wrote:
>
> > for serialization I am not too worried about ZoneInfo. We do not use
> > pickling by default as we roll our own serialization format. We probably
> > just need the key (zoneinfo.key).
> >
> > I'm not sure what happened about this:
> >
> > https://github.com/sdispater/pendulum/issues/590
> >
> > Bolke
> >
> > On Thu, 28 Sept 2023 at 14:59, Andrey Anshin <an...@taragol.is>
> > wrote:
> >
> >> I agree with all problems that you mention about datetime tz-aware data.
> >> I lived for almost 30 years in a country which had in different periods
> of
> >> time up to 10 time zones, and on a regular basis changed it
> >> (merge/unmerge)
> >> , disable DST, temporarily enable DST. In addition I also worked in a
> >> different bank for about 10 years (legacy systems which don't update
> >> tzdata
> >> for ages) . I think I had most of the bad cases with time zones. And I
> >> think everyone somehow has a problem with different time zones:
> Calendars
> >> +
> >> events, flight booking systems which don't know about timezones and you
> >> might find that your connecting flight flew away an hour ago, etc.
> >>
> >> In addition the error might happen in different places, databases (not
> >> updated tzdata, or db doesn't work correctly), client libraries, OS,
> etc.
> >> The person who finally solves tz-aware data should be granted all awards
> >> in
> >> the World.
> >>
> >> > For example, we got recently bitten by datetime.tzname() (which is
> >> supposed
> >> to 'time zone name') returning short-hand notation timezones (e.g. PST)
> >> > instead of full timezone names (e.g. "Europe/Amsterdam") which makes
> >> deserialization non deterministic.
> >>
> >> Yeah, and even ZoneInfo doesn't solve the problem with `datetime.tzname`
> >> because final implementation depends on different factors, tzinfo
> >> implementation and internals of datetime.
> >>
> >> > moving to zoneinfo seems to make sense though and will also be in
> >> Pendulum 3
> >>
> >> I've have a look couple days ago about zoneinfo, it also have some
> >> "pitfalls", e.g. if timezone created from file it can't be easily
> >> serialized
> >> https://docs.python.org/3.9/library/zoneinfo.html#the-zoneinfo-class
> >>
> >> > Pendulum has proven us in the past, maybe we indeed should help the
> >> project if possible and if that isn't possible verify formal correctness
> >> of
> >> any other library
> >>
> >> I guess all other libraries might have a different kind of issue
> including
> >> compatibility with databases.
> >> More close replacement it is dateutil, but it also maintained by one
> >> person
> >> last release was 2 years ago and contains quite a few issues with
> >> timezones/DTS (no blame, that is just a fact)
> >>
> >>
> >> On Thu, 28 Sept 2023 at 15:39, Bolke de Bruin <bd...@gmail.com>
> wrote:
> >>
> >> > Thanks for starting the discussion Andrey.
> >> >
> >> > Some background on the choice for Pendulum at the time. In the early
> >> days
> >> > of Airflow it wasn't timezone aware. Originating from Airbnb which
> had a
> >> > reasonable mature data organization the view was everything needs to
> be
> >> in
> >> > UTC. According to Maxime the engineers would dream in UTC ;-).
> However,
> >> in
> >> > the real world which also needs to deal with legacy that didn't hold.
> >> Often
> >> > systems of record did not store timezone information but were
> localized
> >> > nevertheless. Cutoff times in banks happen in localized time and if
> you
> >> > want to meet those, Airflow needed to do better.
> >> >
> >> > Doing timezones and being timezone aware proved to be exceptionally
> >> hard.
> >> > Many libraries get it wrong [1] and fail silently (i.e. Arrow) or
> apply
> >> DST
> >> > transitions wrongly (pytz). When dealing with payments that stuff
> cannot
> >> > happen. To make things worse, in Python timezone support is pretty
> >> > convoluted, while some standardization happened in 3.9 by using IANA
> >> > provided timezone information from the local system, its API is messy.
> >> For
> >> > example, we got recently bitten by datetime.tzname() (which is
> >> > supposed to 'time
> >> > zone name') returning short-hand notation timezones (e.g. PST) instead
> >> of
> >> > full timezone names (e.g. "Europe/Amsterdam") which makes
> >> deserialization
> >> > non deterministic.
> >> >
> >> > So, what I am trying to say, is tread carefully when doing changes as
> >> > proposed in [2] (moving to zoneinfo seems to make sense though and
> will
> >> > also be in Pendulum 3). Make sure those changes are formally correct
> and
> >> > don't assume because they are now part of python itself (pytz was the
> >> > defacto standard for a long time). Pendulum has proven us in the past,
> >> > maybe we indeed should help the project if possible and if that isn't
> >> > possible verify formal correctness of any other library.
> >> >
> >> > Bolke
> >> >
> >> > [1] https://pendulum.eustace.io/faq/
> >> > [2] https://github.com/apache/airflow/issues/19450
> >> >
> >> > On Thu, 28 Sept 2023 at 11:03, Andrey Anshin <
> andrey.anshin@taragol.is>
> >> > wrote:
> >> >
> >> > > This discussion is more about the known problem of pendulum and how
> we
> >> > > could deal with it and maybe how we (as Community) might help autor.
> >> > >
> >> > > The library is mostly supported by a single author Sébastien
> Eustace (
> >> > > https://github.com/sdispater) and it seems like we bump into the
> >> > situation
> >> > > which is described in xkcd #2347 (
> >> > > https://imgs.xkcd.com/comics/dependency.png). To be honest it is
> not
> >> > > something new when library mainly supported by one author so there
> is
> >> > > always a risk that the library will no longer be supported /
> abandoned
> >> > > And if takes in account that pendulum provides core functionality in
> >> > > Airflow it could have dramatical impact in the future.
> >> > >
> >> > > Pendulum is a really nice library which helps a lot of developers to
> >> work
> >> > > with dates/datetimes. However there is one major problem, the last
> >> > release
> >> > > of this library happened more than 3 years ago (
> >> > > https://pypi.org/project/pendulum/#history) in the time when
> Airflow
> >> > > 1.10.11 was released
> >> > >
> >> > > Fortunately, the project is not abandoned and on a regular basis
> >> commits
> >> > > add into the master branch. However these commits are not included
> >> into
> >> > any
> >> > > final release and that's why some things related to datetime don't
> >> work
> >> > as
> >> > > expected in Airflow. There are list of known (for me) issues which
> are
> >> > > affect Airflow
> >> > >
> >> > > *Memory Leak on parse*:
> >> > > - https://github.com/sdispater/pendulum/issues/720, this one
> fixed  2
> >> > > years
> >> > > ago but not available yet (
> >> > https://github.com/sdispater/pendulum/pull/563
> >> > > ).
> >> > > Since we use parse dates in airflow codebase: datetime parameters
> and
> >> > > datetime in logs this one could be a reason for memory leakage in
> >> > Airflow:
> >> > > - https://github.com/apache/airflow/discussions/24694
> >> > > - https://github.com/apache/airflow/discussions/28597
> >> > >
> >> > > *Incorrect time zones*, known issues and should be already fixed in
> >> > master
> >> > > branch
> >> > > - https://github.com/sdispater/pendulum/issues/700, Mexico do not
> use
> >> > DST
> >> > > anymore
> >> > > - https://github.com/sdispater/pendulum/issues/706, Egypt reinstate
> >> DST
> >> > >
> >> > > We add clarification in
> https://github.com/apache/airflow/pull/30467,
> >> > > however it seems like there is no other way rather than patching
> >> Pendulum
> >> > > right now.
> >> > >
> >> > > All these issues should be solved as soon as pendulum 3 is released.
> >> The
> >> > > current announced estimation is end of september/ beginning of
> >> October:
> >> > >
> >>
> https://github.com/sdispater/pendulum/issues/600#issuecomment-1711299677
> >> > >
> >> > > So in theory we would have a fixed version of pendulum soon, and it
> >> might
> >> > > break something in Airflow but from my point of view it is better
> than
> >> > > current status.
> >> > >
> >> > > However there might be a situation where the release of the pendulum
> >> > would
> >> > > be postponed, so maybe better to have a backup plan. What could we
> do
> >> in
> >> > > this case?
> >> > >
> >> > > Maybe we should start to use zoneinfo.ZoneInfo instead of pendulum
> >> > > datetime? https://github.com/apache/airflow/issues/19450
> >> > > Pros:
> >> > > - stdlib (python 3.9+)
> >> > > - In pendulum 3.0 Timezone based on zoneinfo.Zoneinfo
> >> > >
> >> > > Cons:
> >> > > - Current serialization model can't deal with backport packages.
> E.g.
> >> > > timezone which are serialized in backport_zoneinfo can't be
> >> deserialized
> >> > in
> >> > > zoneinfo
> >> > >
> >> > > Maybe we should replace parse datetime with another solution. Does
> >> anyone
> >> > > know a good replacement?
> >> > >
> >> > > Maybe someone from Airflow Community could propose their help with
> >> > > maintenance of library:
> >> > > - https://github.com/sdispater/pendulum/issues/590
> >> > >
> >> > > Maybe we should get rid of the pendulum at all, as a last resort
> >> > solution.
> >> > > I can't imagine how we could do that, because a lot of stuff depends
> >> on
> >> > the
> >> > > pendulum and removing it would be a breaking change.
> >> > >
> >> > > ----
> >> > > Best Wishes
> >> > > *Andrey Anshin*
> >> > >
> >> >
> >> >
> >> > --
> >> >
> >> > --
> >> > Bolke de Bruin
> >> > bdbruin@gmail.com
> >> >
> >>
> >
> >
> > --
> >
> > --
> > Bolke de Bruin
> > bdbruin@gmail.com
> >
>
>
> --
>
> --
> Bolke de Bruin
> bdbruin@gmail.com
>