Posted to user@flink.apache.org by Robert Metzger <rm...@apache.org> on 2021/08/03 11:06:57 UTC

Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

Hi Weston,
I have never looked into the savepoint migration code paths myself, but I
know that savepoint migration across multiple versions is not supported
(1.9 can only migrate to 1.10, not 1.11). We have test coverage for these
migrations, and I would be surprised if this "Savepoint" class migration is
not covered in these tests.

Have you tried upgrading from 1.9 to 1.10, and then from 1.10 to 1.11?

On Fri, Jul 30, 2021 at 11:53 PM Weston Woods <WW...@spireon.com> wrote:

> I am unable to restore a 1.9 savepoint into a 1.11 runtime for the very
> interesting reason that the Savepoint class was renamed and repackaged
> between those two releases. Apparently a Kryo serializer has that class
> registered in the 1.9 runtime. I can’t think of a good reason for that
> class to be registered with Kryo; none of the job operators reference any
> such thing. Yet there it is, causing the following exception and
> preventing the upgrade to a new runtime.
>
>
>
> Caused by: java.lang.IllegalStateException: Missing value for the key
> 'org.apache.flink.runtime.checkpoint.savepoint.Savepoint'
> at
> org.apache.flink.util.LinkedOptionalMap.unwrapOptionals(LinkedOptionalMap.java:190)
> ~[flink-dist_2.11-1.11.3.jar:1.11.3]
> at
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshot.restoreSerializer(KryoSerializerSnapshot.java:86)
> ~[flink-dist_2.11-1.11.3.jar:1.11.3]
>
>
>
> There doesn’t seem to be any way to unregister a class from Kryo. And
> the mechanism for dealing with missing classes looks to me like it has
> never worked as advertised. Instead of registering a dummy class for a
> missing class name, a null gets registered, leading to the exception
> that prevents restoring the savepoint. The code that returns a null
> instead of a dummy is here:
> https://github.com/apache/flink/blob/e8cfe6701b9768d1f1fe4488640cba5f9b42d73f/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/KryoSerializerSnapshotData.java#L263
>
>
>
> Resulting in this log.
>
>
>
> 2021-07-27 18:38:11,703 WARN
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotData
> [] - Cannot find registered class
> org.apache.flink.runtime.checkpoint.savepoint.Savepoint for Kryo
> serialization in classpath; using a dummy class as a placeholder.
> java.lang.ClassNotFoundException:
> org.apache.flink.runtime.checkpoint.savepoint.Savepoint
>
>
>
> One way or another I need to be able to restore a 1.9 savepoint into
> 1.11. Perhaps the Kryo registration needs to be cleansed from wherever it
> is lurking in the 1.9 savepoint, or an effective dummy needs to be
> substituted when reading it into 1.11.
>
>
>
> Has anyone else encountered this problem, or have any advice to offer?

Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

Posted by Weston Woods <WW...@spireon.com>.
I am able to reproduce this failure by loading the production savepoint into a locally running Flink 1.11 job using the State Processor API. The same sequence of events occurs: the Kryo snapshot deserializer stores a null for the refactored Savepoint interface, which causes subsequent failures to restore operator state. The state backend is RocksDB.
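
The failure mode described here can be illustrated with a small, JDK-only model. This is a simplified sketch and not Flink's actual code: the class `KryoRegistrationModel` and its methods `resolve` and `unwrap` are hypothetical names, and the only detail taken from the thread is the exception message format. It assumes that registered class names are looked up on the current classpath and that an unresolvable name ends up stored as a null.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model (not Flink's implementation) of restoring class registrations by name. */
final class KryoRegistrationModel {

    /** Resolves each registered class name; names that cannot be loaded map to null. */
    static Map<String, Class<?>> resolve(Iterable<String> registeredNames) {
        Map<String, Class<?>> resolved = new LinkedHashMap<>();
        for (String name : registeredNames) {
            Class<?> clazz;
            try {
                clazz = Class.forName(name);
            } catch (ClassNotFoundException e) {
                clazz = null; // the class is not on this classpath
            }
            resolved.put(name, clazz);
        }
        return resolved;
    }

    /** Throws on the first unresolved entry, using the message format seen in the stack trace. */
    static Map<String, Class<?>> unwrap(Map<String, Class<?>> resolved) {
        for (Map.Entry<String, Class<?>> entry : resolved.entrySet()) {
            if (entry.getValue() == null) {
                throw new IllegalStateException(
                        "Missing value for the key '" + entry.getKey() + "'");
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Iterable<String> names = Arrays.asList(
                "java.lang.String",
                "org.apache.flink.runtime.checkpoint.savepoint.Savepoint");
        try {
            unwrap(resolve(names));
            System.out.println("All registered classes resolved.");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, a single registered name that no longer resolves is enough to fail the whole restore, even if no operator ever uses that class.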

Bodily copying the 1.9.0 source code for org.apache.flink.runtime.checkpoint.savepoint.Savepoint into my test job allows it to load the savepoint and restore the operator states. But that is a terrible workaround, and I am looking for a good solution.
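
When experimenting with a workaround of this kind, it may help to confirm that the old class name really is visible to the job before attempting the restore. The following is a minimal sketch using only the JDK. The class `LegacyClassProbe` and its method `isLoadable` are hypothetical, and using the thread's context class loader is an assumption about which loader matters in a given deployment.

```java
/** Hypothetical helper: checks whether a class name resolves in a given class loader. */
final class LegacyClassProbe {

    /** Returns true if the class can be located, without running its static initializers. */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the exception in the original report.
        String legacyName = "org.apache.flink.runtime.checkpoint.savepoint.Savepoint";
        // Assumption: the context class loader is the one the job code is loaded with.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        System.out.println(legacyName + " loadable: " + isLoadable(legacyName, loader));
    }
}
```

A result of `false` would be consistent with the "Cannot find registered class" warning in the log above.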



From: Robert Metzger <rm...@apache.org>
Date: Wednesday, August 4, 2021 at 10:21 AM
To: Weston Woods <WW...@spireon.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>, Timo Walther <tw...@apache.org>
Subject: Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

Hi Weston,

Oh indeed, you are right! I quickly tried restoring a 1.9 savepoint on a 1.11 runtime and it worked. So in principle this seems to be supported.

I'm adding Timo to this thread; he has a lot of experience with the serializers.


Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

Posted by Robert Metzger <rm...@apache.org>.
Hi Weston,

Oh indeed, you are right! I quickly tried restoring a 1.9 savepoint on a
1.11 runtime and it worked. So in principle this seems to be supported.

I'm adding Timo to this thread; he has a lot of experience with the
serializers.

On Tue, Aug 3, 2021 at 6:59 PM Weston Woods <WW...@spireon.com> wrote:

> Robert,
>
>
>
> Thanks for your reply.    How should I interpret the savepoint
> compatibility table here
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#compatibility-table
> if a 1.9 savepoint cannot be restored into a 1.11 runtime?

Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

Posted by Weston Woods <WW...@spireon.com>.
Robert,

Thanks for your reply. How should I interpret the savepoint compatibility table here https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#compatibility-table if a 1.9 savepoint cannot be restored into a 1.11 runtime?



From: Robert Metzger <rm...@apache.org>
Date: Tuesday, August 3, 2021 at 11:52 AM
To: Weston Woods <WW...@spireon.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

Hi Weston,
I have never looked into the savepoint migration code paths myself, but I know that savepoint migration across multiple versions is not supported (1.9 can only migrate to 1.10, not 1.11). We have test coverage for these migrations, and I would be surprised if this "Savepoint" class migration is not covered in these tests.

Have you tried upgrading from 1.9 to 1.10, and then from 1.10 to 1.11?

On Fri, Jul 30, 2021 at 11:53 PM Weston Woods <WW...@spireon.com> wrote:
I am unable to restore a 1.9 savepoint into a 1.11 runtime for the very interesting reason that the Savepoint class was renamed and repackaged between those two releases. Apparently a Kryo serializer has that class registered in the 1.9 runtime. I can’t think of a good reason for that class to be registered with Kryo; none of the job operators reference any such thing. Yet there it is, causing the following exception and preventing the upgrade to a new runtime.