You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by shravan <my...@microfocus.com> on 2021/02/19 08:19:05 UTC

Flink 1.11.3 not able to restore with savepoint taken on Flink 1.9.3

Hi,

We are trying to upgrade Flink from version 1.9.3 to 1.11.3. As part of the
upgrade testing, we are observing below exception when Flink 1.11.3 tries to
restore from a savepoint taken with Flink 1.9.3.

java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:222)
	at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
	at
org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
	at
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$1(StreamTask.java:506)
	at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
	at
org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
	at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:526)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore operator
state backend for StreamSource_6ae2e79afadd77f926d57cdd7bfa1e1b_(1/8) from
any of the 1 provided restore options.
	at
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
	at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:283)
	at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:156)
	... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed
when trying to restore operator state backend
	at
org.apache.flink.runtime.state.DefaultOperatorStateBackendBuilder.build(DefaultOperatorStateBackendBuilder.java:86)
	at
org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createOperatorStateBackend(RocksDBStateBackend.java:552)
	at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$operatorStateBackend$0(StreamTaskStateInitializerImpl.java:274)
	at
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
	at
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
	... 11 more
Caused by: java.io.EOFException: No more bytes left.
	at
org.apache.flink.api.java.typeutils.runtime.NoFetchingInput.require(NoFetchingInput.java:79)
	at com.esotericsoftware.kryo.io.Input.readVarInt(Input.java:355)
	at com.esotericsoftware.kryo.io.Input.readInt(Input.java:350)
	at
com.esotericsoftware.kryo.serializers.UnsafeCacheFields$UnsafeIntField.read(UnsafeCacheFields.java:46)
	at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
	at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:346)
	at
org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:151)
	at
org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:37)
	at
org.apache.flink.runtime.state.OperatorStateRestoreOperation.deserializeOperatorStateValues(OperatorStateRestoreOperation.java:191)
	at
org.apache.flink.runtime.state.OperatorStateRestoreOperation.restore(OperatorStateRestoreOperation.java:165)
	at
org.apache.flink.runtime.state.DefaultOperatorStateBackendBuilder.build(DefaultOperatorStateBackendBuilder.java:83)
	... 15 more



The Flink pipeline and its operators are the same.

On checking in Flink docs, savepoint restore is supported between 1.9.x and
1.11.x versions.

Please provide inputs on how to resolve this issue.

Regards,
Shravan



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Flink 1.11.3 not able to restore with savepoint taken on Flink 1.9.3

Posted by Arvid Heise <ar...@apache.org>.
A common pitiful when upgrading a Flink application with savepoints is that
no explicit UIDs have been assigned to the operators. You can amend that by
first adding UIDs to the job in 1.9.3 and create a savepoint with UIDs.
Then try upgrading again.

On Fri, Feb 19, 2021 at 9:57 AM Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> Hi,
>
> I'm not aware of any breaking changes in the savepoint formats from 1.9.3
> to
> 1.11.3.
>
> Let's first try to rule out any obvious causes of this:
> - Were any data types / classes that were used in state changed across the
> restores? Remember that keys types are also written as part of state
> snapshots.
> - Did you register any Kryo types in the 1.9.3 execution, had changed those
> configuration across the restores?
> - Was unaligned checkpointing enabled in the 1.11.3 restore?
>
> As of now it's a bit hard to debug this with just an EOFException, as the
> corrupted read could have happened anywhere before that point. If it's
> possible to reproduce a minimal job of yours that has the same restore
> behaviour, that could also help a lot.
>
> Thanks,
> Gordon
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>

Re: Flink 1.11.3 not able to restore with savepoint taken on Flink 1.9.3

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi,

I'm not aware of any breaking changes in the savepoint formats from 1.9.3 to
1.11.3.

Let's first try to rule out any obvious causes of this:
- Were any data types / classes that were used in state changed across the
restores? Remember that keys types are also written as part of state
snapshots.
- Did you register any Kryo types in the 1.9.3 execution, had changed those
configuration across the restores?
- Was unaligned checkpointing enabled in the 1.11.3 restore?

As of now it's a bit hard to debug this with just an EOFException, as the
corrupted read could have happened anywhere before that point. If it's
possible to reproduce a minimal job of yours that has the same restore
behaviour, that could also help a lot.

Thanks,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/