Posted to dev@flink.apache.org by "Tzu-Li (Gordon) Tai" <tz...@apache.org> on 2020/11/02 10:43:02 UTC

[DISCUSS] Releasing StateFun hotfix version 2.2.1

Hi,

We’re currently thinking about releasing StateFun 2.2.1, to address a
critical bug that causes restores from checkpoints / savepoints to fail
under certain circumstances [1].

To provide a bit more context, the full fix for this issue is two-fold:

   1. *Fix restoring from checkpoints / savepoints taken with the same
   StateFun version:* this has already been fixed in StateFun, with changes
   backported to `flink-statefun/release-2.2`.
   2. *Allow restoring from older savepoints taken with StateFun <= 2.2.0:*
   this requires a few fixes to Flink around restoring heap-based timers [2]
   and iterating through key groups in restored raw keyed state streams [3].
   These fixes will be included in Flink 1.11.3 [4], meaning that to fix this,
   StateFun will need to wait until Flink 1.11.3 is out and upgrade its Flink
   dependency.

The main discussion point here is whether or not it makes sense for
StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problem,
1) and 2), can be solved together in a single hotfix release.

The other option is to release StateFun 2.2.1 already with fixes for
problem 1) only, and have another follow-up hotfix release 2.2.2 after
Flink 1.11.3 is available.

I propose to keep a close eye on the progress of Flink 1.11.3 (you can
track progress on the 1.11.3 discussion thread [4]), and *make a decision
here mid-week on Wednesday, Nov. 4th*.
If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
because it could take a while, we can start with a StateFun 2.2.1 RC right
away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
wait for a few more days.

What do you think?

Cheers,
Gordon

[1] https://issues.apache.org/jira/browse/FLINK-19692
[2] https://github.com/apache/flink/pull/13761
[3] https://github.com/apache/flink/pull/13772
[4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html

Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Thanks everyone for the feedback.

I posted a status update for Flink 1.11.3 earlier, in its corresponding
discussion thread [1].

From the looks of it, it makes sense to proceed with StateFun
2.2.1 without waiting for Flink 1.11.3.
Since this is also the consensus we've reached here, I have proceeded to
create RC1 for StateFun 2.2.1 [2].

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-StateFun-hotfix-version-2-2-1-td46239.html

On Tue, Nov 3, 2020 at 10:42 PM Robert Metzger <rm...@apache.org> wrote:

> Hi Gordon,
> thanks a lot for this clarification.
>
> In this case I would vote for releasing StateFun 2.2.1 asap and not wait
> for 1.11.3.
>
> Thanks a lot for your efforts!

Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

Posted by Robert Metzger <rm...@apache.org>.
Hi Gordon,
thanks a lot for this clarification.

In this case I would vote for releasing StateFun 2.2.1 asap and not wait
for 1.11.3.

Thanks a lot for your efforts!


On Tue, Nov 3, 2020 at 3:38 PM Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> Hi Robert,
>
> So far we've only seen a single user report the issue, but the severity of
> FLINK-19692 is actually very high.
> TL;DR: Restoring from a checkpoint / savepoint that contains feedback events
> (which is normal under typical StateFun operation) would always fail.
>
> That's why we came up with the discussion to potentially release a
> "partial" solution with StateFun 2.2.1 already so that at least there is a
> StateFun release available that works properly with failure recoveries,
> and then after that release another follow-up StateFun hotfix release
> 2.2.2, which would include Flink 1.11.3, to address the remaining part of
> the problem.
>
> BR,
> Gordon

Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi Robert,

So far we've only seen a single user report the issue, but the severity of
FLINK-19692 is actually very high.
TL;DR: Restoring from a checkpoint / savepoint that contains feedback events
(which is normal under typical StateFun operation) would always fail.

That's why we came up with the discussion to potentially release a
"partial" solution with StateFun 2.2.1 already so that at least there is a
StateFun release available that works properly with failure recoveries,
and then after that release another follow-up StateFun hotfix release
2.2.2, which would include Flink 1.11.3, to address the remaining part of
the problem.

BR,
Gordon

On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger <rm...@apache.org> wrote:

> Thanks a lot for starting this thread.
> How many users are affected by the problem? Is it somebody else besides
> the initial issue reporter?
> If it is just one person, I would rather suggest helping to push the
> 1.11.3 release over the line, or working on more StateFun features ;)

Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

Posted by Robert Metzger <rm...@apache.org>.
Thanks a lot for starting this thread.
How many users are affected by the problem? Is it somebody else besides the
initial issue reporter?
If it is just one person, I would rather suggest helping to push the 1.11.3
release over the line, or working on more StateFun features ;)

On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman <ig...@ververica.com> wrote:

> Hi Gordon,
> Thanks for driving this discussion!
>
> I would go with the second suggestion - having two consecutive StateFun
> releases, 2.2.1 and 2.2.2, since the Flink 1.11.3 release might take a while,
> and this hotfix release is important enough to get out as early as possible.
>
> Cheers,
> Igal.

Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

Posted by Igal Shilman <ig...@ververica.com>.
Hi Gordon,
Thanks for driving this discussion!

I would go with the second suggestion - having two consecutive StateFun
releases, 2.2.1 and 2.2.2, since the Flink 1.11.3 release might take a while,
and this hotfix release is important enough to get out as early as possible.

Cheers,
Igal.




On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> Hi,
>
> We’re currently thinking about releasing StateFun 2.2.1, to address a
> critical bug that causes restores from checkpoints / savepoints to fail
> under certain circumstances [1].
>
> To provide a bit more context, the full fix for this issue is two-fold:
>
>    1. *Fix restoring from checkpoints / savepoints taken with the same
>    StateFun version:* this has already been fixed in StateFun, with
>    changes backported to `flink-statefun/release-2.2`.
>    2. *Allow restoring from older savepoints taken with StateFun <=
>    2.2.0:* this requires a few fixes to Flink around restoring heap-based
>    timers [2] and iterating through key groups in restored raw keyed state
>    streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning that
>    to fix this, StateFun will need to wait until Flink 1.11.3 is out and
>    upgrade its Flink dependency.
>
> The main discussion point here is whether or not it makes sense for
> StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problem,
> 1) and 2), can be solved together in a single hotfix release.
>
> The other option is to release StateFun 2.2.1 already with fixes for
> problem 1) only, and have another follow-up hotfix release 2.2.2 after
> Flink 1.11.3 is available.
>
> I propose to keep a close eye on the progress of Flink 1.11.3 (you can
> track progress on the 1.11.3 discussion thread [4]), and *make a decision
> here mid-week on Wednesday, Nov. 4th*.
> If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
> because it could take a while, we can start with a StateFun 2.2.1 RC right
> away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
> wait for a few more days.
>
> What do you think?
>
> Cheers,
> Gordon
>
> [1] https://issues.apache.org/jira/browse/FLINK-19692
> [2] https://github.com/apache/flink/pull/13761
> [3] https://github.com/apache/flink/pull/13772
> [4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>