You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Chesnay Schepler <ch...@apache.org> on 2023/06/01 15:25:37 UTC

Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

The version in the state is the serializer version, and applies to the 
entire state, independent of what it contains.
If you use Kryo2 for reading and Kryo5 for writing (which also implies 
writing the new serializer version into state), then I'd assume that a 
migration is an all-or-nothing kind of deal.
IOW, you'd have to load a savepoint and write out an entirely new 
savepoint with the new state.
Otherwise you may have only re-written part of the checkpoint, and now 
contains a mix of Kryo2/Kryo5 serialized classes, which should then fail 
_hard_ on any recovery attempt because we wouldn't use Kryo2 to read 
anything.

If I'm right, then as is this sounds like quite a trap for users to fall 
into because from what I gathered this is the default behavior in the PR 
(I could be wrong though since I haven't fully dug through the 8k lines 
PR yet...)

What we kind of want is this:
1) Kryo5 is used as the default for new jobs. (maybe not even that, 
making it an explicit opt-in)
2) Kryo2 is used for reading AND writing for existing* jobs by default.
3) Users can explicitly (and easily!) do a full migration of their jobs, 
after which 2) should no longer apply.



In the PR you mentioned running into issues on Java 17; to have have 
some error stacktraces and examples data/serializers still around?

On 30/05/2023 00:38, Kurt Ostfeld wrote:
>> I’d assumed that there wasn’t a good way to migrate state stored with an older version of Kryo to a newer version - if you’ve solved that, kudos.
> I hope I've solved this. The pull request is supposed to do exactly this. Please let me know if you can propose a scenario that would break this.
>
> The pull-request has both Kryo 2.x and 5.x dependencies. It looks at the state version number written to the state to determine which version of Kryo to use for deserialization. Kryo 2.x is not used to write new state.
>
> ------- Original Message -------
> On Monday, May 29th, 2023 at 5:17 PM, Ken Krugler <kk...@transpac.com> wrote:
>
>
>>
>> Hi Kurt,
>>
>> I personally think it’s a very nice improvement, and that the longer-term goal of removing built-in Kryo support for state serialization (while a good one) warrants a separate FLIP.
>>
>> Perhaps an intermediate approach would be to disable the use of Kryo for state serialization by default, and force a user to disregard warnings and explicitly enable it if they want to go down that path.
>>
>> I’d assumed that there wasn’t a good way to migrate state stored with an older version of Kryo to a newer version - if you’ve solved that, kudos.
>>
>> — Ken
>>
>>
>>> On May 29, 2023, at 2:21 PM, Kurt Ostfeld kurtostfeld@proton.me.INVALID wrote:
>>>
>>> Hi everyone. I would like to start the discussion thread for FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0 [1].
>>>
>>> There is a pull-request associated with this linked in the FLIP.
>>>
>>> I'd particularly like to hear about:
>>>
>>> - Chesnay Schepler's request to consider removing Kryo serializers from the execution config. Is this a reasonable task to add into this FLIP? Is there consensus on how to resolve that? Would that be better addressed in a separate future FLIP after the Kryo upgrade FLIP is completed?
>>>
>>> - Backwards compatibility. The automated CI tests have a lot of backwards compatibility tests that are passing. I also wrote a Flink application with keyed state using custom Kryo v2 serializers and then an upgraded version with both Kryo v2 and Kryo v5 serializers to stress test the upgrade process. I'd like to hear about additional scenarios that need to be tested.
>>>
>>> - Is this worth pursuing or is the Flink project looking to go in a different direction? I'd like to do some more work on the pull request if this is being seriously considered for adoption.
>>>
>>> I'm looking forward to hearing everyone's feedback and suggestions.
>>>
>>> Thank you,
>>> Kurt
>>>
>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-317%3A+Upgrade+Kryo+from+2.24.0+to+5.5.0
>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> Custom big data solutions
>> Flink, Pinot, Solr, Elasticsearch
>>


Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by Kurt Ostfeld <ku...@proton.me.INVALID>.
If the Flink project is planning to completely drop all stateful upgrade compatibility within the near year for a Flink 2.0 release, then providing a stateful migration pathway from Kryo 2.x to Kryo 5.x is probably unnecessary. Is that correct? Is the Flink project pretty confident that Flink 2.0 will not be compatible with Flink 1.x state?




------- Original Message -------
On Monday, June 5th, 2023 at 7:51 AM, Martijn Visser <ma...@apache.org> wrote:


> 
> 
> Hi ConradJam,
> 
> That assumes that it will be possible to upgrade statefully to Flink 2.0:
> given that it is a major breaking change, I wouldn't assume that will be
> possible.
> 
> Best regards,
> 
> Martijn
> 
> On Mon, Jun 5, 2023 at 2:37 PM ConradJam jam.gzczy@gmail.com wrote:
> 
> > Here I have a suggestion, because I mentioned Flink2.0 earlier, I am
> > wondering if there is a possibility: whether the user can perform the
> > migration of all states to Kryo5 when performing the first start-up
> > task of migrating to version 2.0 in the future, until we give up
> > maintaining Kryo2 later
> > 
> > Don't know if my idea coincides with Chesnay's
> > 
> > Chesnay Schepler chesnay@apache.org 于2023年6月1日周四 23:25写道:
> > 
> > > The version in the state is the serializer version, and applies to the
> > > entire state, independent of what it contains.
> > > If you use Kryo2 for reading and Kryo5 for writing (which also implies
> > > writing the new serializer version into state), then I'd assume that a
> > > migration is an all-or-nothing kind of deal.
> > > IOW, you'd have to load a savepoint and write out an entirely new
> > > savepoint with the new state.
> > > Otherwise you may have only re-written part of the checkpoint, and now
> > > contains a mix of Kryo2/Kryo5 serialized classes, which should then fail
> > > hard on any recovery attempt because we wouldn't use Kryo2 to read
> > > anything.
> > > 
> > > If I'm right, then as is this sounds like quite a trap for users to fall
> > > into because from what I gathered this is the default behavior in the PR
> > > (I could be wrong though since I haven't fully dug through the 8k lines
> > > PR yet...)
> > > 
> > > What we kind of want is this:
> > > 1) Kryo5 is used as the default for new jobs. (maybe not even that,
> > > making it an explicit opt-in)
> > > 2) Kryo2 is used for reading AND writing for existing* jobs by default.
> > > 3) Users can explicitly (and easily!) do a full migration of their jobs,
> > > after which 2) should no longer apply.
> > > 
> > > In the PR you mentioned running into issues on Java 17; to have have
> > > some error stacktraces and examples data/serializers still around?
> > > 
> > > On 30/05/2023 00:38, Kurt Ostfeld wrote:
> > > 
> > > > > I’d assumed that there wasn’t a good way to migrate state stored with
> > > > > an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > > > > I hope I've solved this. The pull request is supposed to do exactly
> > > > > this. Please let me know if you can propose a scenario that would break
> > > > > this.
> > > > 
> > > > The pull-request has both Kryo 2.x and 5.x dependencies. It looks at
> > > > the state version number written to the state to determine which version of
> > > > Kryo to use for deserialization. Kryo 2.x is not used to write new state.
> > > > 
> > > > ------- Original Message -------
> > > > On Monday, May 29th, 2023 at 5:17 PM, Ken Krugler <
> > > > kkrugler_lists@transpac.com> wrote:
> > > > 
> > > > > Hi Kurt,
> > > > > 
> > > > > I personally think it’s a very nice improvement, and that the
> > > > > longer-term goal of removing built-in Kryo support for state serialization
> > > > > (while a good one) warrants a separate FLIP.
> > > > > 
> > > > > Perhaps an intermediate approach would be to disable the use of Kryo
> > > > > for state serialization by default, and force a user to disregard warnings
> > > > > and explicitly enable it if they want to go down that path.
> > > > > 
> > > > > I’d assumed that there wasn’t a good way to migrate state stored with
> > > > > an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > > > > 
> > > > > — Ken
> > > > > 
> > > > > > On May 29, 2023, at 2:21 PM, Kurt Ostfeld
> > > > > > kurtostfeld@proton.me.INVALID wrote:
> > > > > > 
> > > > > > Hi everyone. I would like to start the discussion thread for
> > > > > > FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0 [1].
> > > > > > 
> > > > > > There is a pull-request associated with this linked in the FLIP.
> > > > > > 
> > > > > > I'd particularly like to hear about:
> > > > > > 
> > > > > > - Chesnay Schepler's request to consider removing Kryo serializers
> > > > > > from the execution config. Is this a reasonable task to add into this FLIP?
> > > > > > Is there consensus on how to resolve that? Would that be better addressed
> > > > > > in a separate future FLIP after the Kryo upgrade FLIP is completed?
> > > > > > 
> > > > > > - Backwards compatibility. The automated CI tests have a lot of
> > > > > > backwards compatibility tests that are passing. I also wrote a Flink
> > > > > > application with keyed state using custom Kryo v2 serializers and then an
> > > > > > upgraded version with both Kryo v2 and Kryo v5 serializers to stress test
> > > > > > the upgrade process. I'd like to hear about additional scenarios that need
> > > > > > to be tested.
> > > > > > 
> > > > > > - Is this worth pursuing or is the Flink project looking to go in a
> > > > > > different direction? I'd like to do some more work on the pull request if
> > > > > > this is being seriously considered for adoption.
> > > > > > 
> > > > > > I'm looking forward to hearing everyone's feedback and suggestions.
> > > > > > 
> > > > > > Thank you,
> > > > > > Kurt
> > > > > > 
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-317%3A+Upgrade+Kryo+from+2.24.0+to+5.5.0
> > > > > 
> > > > > --------------------------
> > > > > Ken Krugler
> > > > > http://www.scaleunlimited.com
> > > > > Custom big data solutions
> > > > > Flink, Pinot, Solr, Elasticsearch
> > 
> > --
> > Best
> > 
> > ConradJam

Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by Martijn Visser <ma...@apache.org>.
Hi ConradJam,

That assumes that it will be possible to upgrade statefully to Flink 2.0:
given that it is a major breaking change, I wouldn't assume that will be
possible.

Best regards,

Martijn

On Mon, Jun 5, 2023 at 2:37 PM ConradJam <ja...@gmail.com> wrote:

> Here I have a suggestion, because I mentioned Flink2.0 earlier, I am
> wondering if there is a possibility: whether the user can perform the
> migration of all states to Kryo5 when performing the first start-up
> task of migrating to version 2.0 in the future, until we give up
> maintaining Kryo2 later
>
> Don't know if my idea coincides with Chesnay's
>
> Chesnay Schepler <ch...@apache.org> 于2023年6月1日周四 23:25写道:
> >
> > The version in the state is the serializer version, and applies to the
> > entire state, independent of what it contains.
> > If you use Kryo2 for reading and Kryo5 for writing (which also implies
> > writing the new serializer version into state), then I'd assume that a
> > migration is an all-or-nothing kind of deal.
> > IOW, you'd have to load a savepoint and write out an entirely new
> > savepoint with the new state.
> > Otherwise you may have only re-written part of the checkpoint, and now
> > contains a mix of Kryo2/Kryo5 serialized classes, which should then fail
> > _hard_ on any recovery attempt because we wouldn't use Kryo2 to read
> > anything.
> >
> > If I'm right, then as is this sounds like quite a trap for users to fall
> > into because from what I gathered this is the default behavior in the PR
> > (I could be wrong though since I haven't fully dug through the 8k lines
> > PR yet...)
> >
> > What we kind of want is this:
> > 1) Kryo5 is used as the default for new jobs. (maybe not even that,
> > making it an explicit opt-in)
> > 2) Kryo2 is used for reading AND writing for existing* jobs by default.
> > 3) Users can explicitly (and easily!) do a full migration of their jobs,
> > after which 2) should no longer apply.
> >
> >
> >
> > In the PR you mentioned running into issues on Java 17; to have have
> > some error stacktraces and examples data/serializers still around?
> >
> > On 30/05/2023 00:38, Kurt Ostfeld wrote:
> > >> I’d assumed that there wasn’t a good way to migrate state stored with
> an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > > I hope I've solved this. The pull request is supposed to do exactly
> this. Please let me know if you can propose a scenario that would break
> this.
> > >
> > > The pull-request has both Kryo 2.x and 5.x dependencies. It looks at
> the state version number written to the state to determine which version of
> Kryo to use for deserialization. Kryo 2.x is not used to write new state.
> > >
> > > ------- Original Message -------
> > > On Monday, May 29th, 2023 at 5:17 PM, Ken Krugler <
> kkrugler_lists@transpac.com> wrote:
> > >
> > >
> > >>
> > >> Hi Kurt,
> > >>
> > >> I personally think it’s a very nice improvement, and that the
> longer-term goal of removing built-in Kryo support for state serialization
> (while a good one) warrants a separate FLIP.
> > >>
> > >> Perhaps an intermediate approach would be to disable the use of Kryo
> for state serialization by default, and force a user to disregard warnings
> and explicitly enable it if they want to go down that path.
> > >>
> > >> I’d assumed that there wasn’t a good way to migrate state stored with
> an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > >>
> > >> — Ken
> > >>
> > >>
> > >>> On May 29, 2023, at 2:21 PM, Kurt Ostfeld
> kurtostfeld@proton.me.INVALID wrote:
> > >>>
> > >>> Hi everyone. I would like to start the discussion thread for
> FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0 [1].
> > >>>
> > >>> There is a pull-request associated with this linked in the FLIP.
> > >>>
> > >>> I'd particularly like to hear about:
> > >>>
> > >>> - Chesnay Schepler's request to consider removing Kryo serializers
> from the execution config. Is this a reasonable task to add into this FLIP?
> Is there consensus on how to resolve that? Would that be better addressed
> in a separate future FLIP after the Kryo upgrade FLIP is completed?
> > >>>
> > >>> - Backwards compatibility. The automated CI tests have a lot of
> backwards compatibility tests that are passing. I also wrote a Flink
> application with keyed state using custom Kryo v2 serializers and then an
> upgraded version with both Kryo v2 and Kryo v5 serializers to stress test
> the upgrade process. I'd like to hear about additional scenarios that need
> to be tested.
> > >>>
> > >>> - Is this worth pursuing or is the Flink project looking to go in a
> different direction? I'd like to do some more work on the pull request if
> this is being seriously considered for adoption.
> > >>>
> > >>> I'm looking forward to hearing everyone's feedback and suggestions.
> > >>>
> > >>> Thank you,
> > >>> Kurt
> > >>>
> > >>> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-317%3A+Upgrade+Kryo+from+2.24.0+to+5.5.0
> > >>
> > >> --------------------------
> > >> Ken Krugler
> > >> http://www.scaleunlimited.com
> > >> Custom big data solutions
> > >> Flink, Pinot, Solr, Elasticsearch
> > >>
> >
>
>
> --
> Best
>
> ConradJam
>

Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by ConradJam <ja...@gmail.com>.
Here I have a suggestion, because I mentioned Flink2.0 earlier, I am
wondering if there is a possibility: whether the user can perform the
migration of all states to Kryo5 when performing the first start-up
task of migrating to version 2.0 in the future, until we give up
maintaining Kryo2 later

Don't know if my idea coincides with Chesnay's

Chesnay Schepler <ch...@apache.org> 于2023年6月1日周四 23:25写道:
>
> The version in the state is the serializer version, and applies to the
> entire state, independent of what it contains.
> If you use Kryo2 for reading and Kryo5 for writing (which also implies
> writing the new serializer version into state), then I'd assume that a
> migration is an all-or-nothing kind of deal.
> IOW, you'd have to load a savepoint and write out an entirely new
> savepoint with the new state.
> Otherwise you may have only re-written part of the checkpoint, and now
> contains a mix of Kryo2/Kryo5 serialized classes, which should then fail
> _hard_ on any recovery attempt because we wouldn't use Kryo2 to read
> anything.
>
> If I'm right, then as is this sounds like quite a trap for users to fall
> into because from what I gathered this is the default behavior in the PR
> (I could be wrong though since I haven't fully dug through the 8k lines
> PR yet...)
>
> What we kind of want is this:
> 1) Kryo5 is used as the default for new jobs. (maybe not even that,
> making it an explicit opt-in)
> 2) Kryo2 is used for reading AND writing for existing* jobs by default.
> 3) Users can explicitly (and easily!) do a full migration of their jobs,
> after which 2) should no longer apply.
>
>
>
> In the PR you mentioned running into issues on Java 17; to have have
> some error stacktraces and examples data/serializers still around?
>
> On 30/05/2023 00:38, Kurt Ostfeld wrote:
> >> I’d assumed that there wasn’t a good way to migrate state stored with an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > I hope I've solved this. The pull request is supposed to do exactly this. Please let me know if you can propose a scenario that would break this.
> >
> > The pull-request has both Kryo 2.x and 5.x dependencies. It looks at the state version number written to the state to determine which version of Kryo to use for deserialization. Kryo 2.x is not used to write new state.
> >
> > ------- Original Message -------
> > On Monday, May 29th, 2023 at 5:17 PM, Ken Krugler <kk...@transpac.com> wrote:
> >
> >
> >>
> >> Hi Kurt,
> >>
> >> I personally think it’s a very nice improvement, and that the longer-term goal of removing built-in Kryo support for state serialization (while a good one) warrants a separate FLIP.
> >>
> >> Perhaps an intermediate approach would be to disable the use of Kryo for state serialization by default, and force a user to disregard warnings and explicitly enable it if they want to go down that path.
> >>
> >> I’d assumed that there wasn’t a good way to migrate state stored with an older version of Kryo to a newer version - if you’ve solved that, kudos.
> >>
> >> — Ken
> >>
> >>
> >>> On May 29, 2023, at 2:21 PM, Kurt Ostfeld kurtostfeld@proton.me.INVALID wrote:
> >>>
> >>> Hi everyone. I would like to start the discussion thread for FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0 [1].
> >>>
> >>> There is a pull-request associated with this linked in the FLIP.
> >>>
> >>> I'd particularly like to hear about:
> >>>
> >>> - Chesnay Schepler's request to consider removing Kryo serializers from the execution config. Is this a reasonable task to add into this FLIP? Is there consensus on how to resolve that? Would that be better addressed in a separate future FLIP after the Kryo upgrade FLIP is completed?
> >>>
> >>> - Backwards compatibility. The automated CI tests have a lot of backwards compatibility tests that are passing. I also wrote a Flink application with keyed state using custom Kryo v2 serializers and then an upgraded version with both Kryo v2 and Kryo v5 serializers to stress test the upgrade process. I'd like to hear about additional scenarios that need to be tested.
> >>>
> >>> - Is this worth pursuing or is the Flink project looking to go in a different direction? I'd like to do some more work on the pull request if this is being seriously considered for adoption.
> >>>
> >>> I'm looking forward to hearing everyone's feedback and suggestions.
> >>>
> >>> Thank you,
> >>> Kurt
> >>>
> >>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-317%3A+Upgrade+Kryo+from+2.24.0+to+5.5.0
> >>
> >> --------------------------
> >> Ken Krugler
> >> http://www.scaleunlimited.com
> >> Custom big data solutions
> >> Flink, Pinot, Solr, Elasticsearch
> >>
>


-- 
Best

ConradJam

Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by Kurt Ostfeld <ku...@proton.me.INVALID>.
ok:
- I start a Flink 1.17.1 cluster, run the job, then run `flink stop` and generate a savepoint. This savepoint will have Kryo 2.x data from standard Flink 1.17.1.
- I start a Flink 1.18-SNAPSHOT cluster with the pull-request, run the job with resume from the savepoint from Flink 1.17, then I kill the cluster. I have a checkpoint with metadata. I believe this checkpoint is all using Kryo 5.x serialization.
- I restart the cluster, run the job resuming from the checkpoint, and everything runs successfully. The job picks up where it left off and there are no errors, all output data looks correct.

Am I following the scenario correctly? Why would a checkpoint created by the new pull-request code have Kryo 2.x serialized data?

Here is the code for my test app that I'm using. The checkpoint configuration settings are mostly from https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/fault-tolerance/checkpointing/

https://github.com/kurtostfeld/flink-kryo-upgrade-demo/blob/main/flink-kryo-upgraded/src/main/java/demo/app/Main.java


------- Original Message -------
On Thursday, June 8th, 2023 at 9:33 AM, Chesnay Schepler <ch...@apache.org> wrote:


> 
> 
> On 08/06/2023 16:06, Kurt Ostfeld wrote:
> 
> > If I understand correctly, the scenario is resuming from multiple checkpoint files or from a savepoint and checkpoint files which may be generated by different versions of Flink
> 
> 
> No; it's the same version of Flink, you just didn't do a full migration
> of the savepoint from the start.
> 
> So, load old savepoint -> create an incremental checkpoint (which writes
> 
> bit new state with Kryo5) -> jobs fails -> try recover job (which now
> 
> has to read state was written with either Kryo2 or Kryo5).
> 
> On 08/06/2023 16:06, Kurt Ostfeld wrote:
> 
> > This pull-request build supports Java records
> 
> 
> We'd have to see but of the top of my head I doubt we want to use Kryo
> for that, and rather extend our PojoSerializer. At least so far that was
> the plan.

Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by Chesnay Schepler <ch...@apache.org>.
On 08/06/2023 16:06, Kurt Ostfeld wrote:
>   If I understand correctly, the scenario is resuming from multiple checkpoint files or from a savepoint and checkpoint files which may be generated by different versions of Flink

No; it's the same version of Flink, you just didn't do a full migration 
of the savepoint from the start.

So, load old savepoint -> create an incremental checkpoint (which writes 
bit new state with Kryo5) -> jobs fails -> try recover job (which now 
has to read state was written with either Kryo2 or Kryo5).

On 08/06/2023 16:06, Kurt Ostfeld wrote:
> This pull-request build supports Java records

We'd have to see but of the top of my head I doubt we want to use Kryo 
for that, and rather extend our PojoSerializer. At least so far that was 
the plan.


Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by Kurt Ostfeld <ku...@proton.me.INVALID>.
Thank you very much for the feedback.

- With this pull-request build, Flink runs successfully with a JDK 17 runtime for applications without saved state or with applications with saved state from this pull-request build which is using Kryo 5.x. FYI, the Maven build is still run with JDK 8 or 11 but the Flink jobmanager and taskmanager can be run with a JDK 17 runtime.
- Kryo 2.x is still on the classpath for backwards compatibility purposes, and if you try to load a savepoint from Flink 1.17 or older which uses the Kryo 2.x serialization library with JDK 17+, that will fail with exceptions.
- A stateful upgrade pathway looks like this: Applications run a Flink cluster with this pull-request under JDK 8 or 11, load an existing savepoint with Kryo 2.x data, write out a new savepoint which automatically uses Kryo 5.x, restart the Flink cluster with a JDK 17 runtime, and resume from the new savepoint successfully.
- This pull-request build supports Java records (which obviously requires JDK17+ at runtime) with the Flink DataStream API. Kryo 5.x supports records so this works without any extra configuration. A simple demo is here: https://github.com/kurtostfeld/flink-kryo-upgrade-demo/blob/main/flink-record-demo/src/main/java/demo/app/Main.java. The app is built with JDK 17, Flink's Maven build still runs with JDK 8/11, but the Flink cluster uses JDK 17 at runtime.


I need to investigate the scenario you describe. If I understand correctly, the scenario is resuming from multiple checkpoint files or from a savepoint and checkpoint files which may be generated by different versions of Flink and therefore may be using different Kryo library versions. Is that accurate? We need to accommodate that scenario and I will investigate.




------- Original Message -------
On Thursday, June 1st, 2023 at 10:25 AM, Chesnay Schepler <ch...@apache.org> wrote:


>
>
> The version in the state is the serializer version, and applies to the
> entire state, independent of what it contains.
> If you use Kryo2 for reading and Kryo5 for writing (which also implies
> writing the new serializer version into state), then I'd assume that a
> migration is an all-or-nothing kind of deal.
> IOW, you'd have to load a savepoint and write out an entirely new
> savepoint with the new state.
> Otherwise you may have only re-written part of the checkpoint, and now
> contains a mix of Kryo2/Kryo5 serialized classes, which should then fail
> hard on any recovery attempt because we wouldn't use Kryo2 to read
> anything.
>
> If I'm right, then as is this sounds like quite a trap for users to fall
> into because from what I gathered this is the default behavior in the PR
> (I could be wrong though since I haven't fully dug through the 8k lines
> PR yet...)
>
> What we kind of want is this:
> 1) Kryo5 is used as the default for new jobs. (maybe not even that,
> making it an explicit opt-in)
> 2) Kryo2 is used for reading AND writing for existing* jobs by default.
> 3) Users can explicitly (and easily!) do a full migration of their jobs,
> after which 2) should no longer apply.
>
>
>
> In the PR you mentioned running into issues on Java 17; to have have
> some error stacktraces and examples data/serializers still around?
>
> On 30/05/2023 00:38, Kurt Ostfeld wrote:
>
> > > I’d assumed that there wasn’t a good way to migrate state stored with an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > > I hope I've solved this. The pull request is supposed to do exactly this. Please let me know if you can propose a scenario that would break this.
> >
> > The pull-request has both Kryo 2.x and 5.x dependencies. It looks at the state version number written to the state to determine which version of Kryo to use for deserialization. Kryo 2.x is not used to write new state.
> >
> > ------- Original Message -------
> > On Monday, May 29th, 2023 at 5:17 PM, Ken Krugler kkrugler_lists@transpac.com wrote:
> >
> > > Hi Kurt,
> > >
> > > I personally think it’s a very nice improvement, and that the longer-term goal of removing built-in Kryo support for state serialization (while a good one) warrants a separate FLIP.
> > >
> > > Perhaps an intermediate approach would be to disable the use of Kryo for state serialization by default, and force a user to disregard warnings and explicitly enable it if they want to go down that path.
> > >
> > > I’d assumed that there wasn’t a good way to migrate state stored with an older version of Kryo to a newer version - if you’ve solved that, kudos.
> > >
> > > — Ken
> > >
> > > > On May 29, 2023, at 2:21 PM, Kurt Ostfeld kurtostfeld@proton.me.INVALID wrote:
> > > >
> > > > Hi everyone. I would like to start the discussion thread for FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0 [1].
> > > >
> > > > There is a pull-request associated with this linked in the FLIP.
> > > >
> > > > I'd particularly like to hear about:
> > > >
> > > > - Chesnay Schepler's request to consider removing Kryo serializers from the execution config. Is this a reasonable task to add into this FLIP? Is there consensus on how to resolve that? Would that be better addressed in a separate future FLIP after the Kryo upgrade FLIP is completed?
> > > >
> > > > - Backwards compatibility. The automated CI tests have a lot of backwards compatibility tests that are passing. I also wrote a Flink application with keyed state using custom Kryo v2 serializers and then an upgraded version with both Kryo v2 and Kryo v5 serializers to stress test the upgrade process. I'd like to hear about additional scenarios that need to be tested.
> > > >
> > > > - Is this worth pursuing or is the Flink project looking to go in a different direction? I'd like to do some more work on the pull request if this is being seriously considered for adoption.
> > > >
> > > > I'm looking forward to hearing everyone's feedback and suggestions.
> > > >
> > > > Thank you,
> > > > Kurt
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-317%3A+Upgrade+Kryo+from+2.24.0+to+5.5.0
> > >
> > > --------------------------
> > > Ken Krugler
> > > http://www.scaleunlimited.com
> > > Custom big data solutions
> > > Flink, Pinot, Solr, Elasticsearch

Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

Posted by Kurt Ostfeld <ku...@proton.me.INVALID>.
Regarding this comment: "The version in the state is the serializer version, and applies to the entire state, independent of what it contains. If you use Kryo2 for reading and Kryo5 for writing (which also implies writing the new serializer version into state), then I'd assume that a migration is an all-or-nothing kind of deal."

Much of Flink uses the TypeSerializerSnapshot classes for serialization. With that, the fully qualified package+class name of a subclass of TypeSerializerSnapshot is written to the state as a string. The pull-request uses this class name to determine the correct version of Kryo to use. Flink up to and including 1.17.x uses `org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshot` for Kryo 2.x serialized data. The pull request uses `org.apache.flink.api.java.typeutils.runtime.kryo5.KryoSerializerSnapshot` for Kryo 5.x serialized data. Serialized state mixes different types of snapshots and if it has both Kryo 2.x and Kryo 5.x snapshot data, that works without problems and uses the correct version of Kryo to deserialize successfully.

The state version number is used to determine the serialized Kryo version at only one point in the source code where the Snapshot classes are not used:

https://github.com/kurtostfeld/flink/blob/e013e9e95096efb41d376f3b6584b5d3d78dc916/flink-runtime/src/main/java/org/apache/flink/runtime/state/heap/StateTableByKeyGroupReaders.java#L73

This seems to work from my testing. If I can find a scenario where this doesn't work I can come up with a revised solution.

I'd like to conclude that this pull-request demonstrates that a backward compatible Kryo upgrade is possible and is mostly done. More testing from a wider pool of people would be needed to proceed, but this demonstrates it is possible. However, whether this proceeds at all is up to the Flink project. If the plan of the Flink project is to drop all backward compatibility anyway for a 2.0 release as Martijn Visser suggested in this thread, then the Kryo upgrade can be done in a much much simpler fashion, and doing a more complex backward compatible upgrade seems unnecessary.