Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/10/23 12:42:32 UTC

[GitHub] [flink] tzulitai opened a new pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

tzulitai opened a new pull request #13773:
URL: https://github.com/apache/flink/pull/13773


   This is a backport of #13772 to `release-1.11`. Only the last 2 commits are relevant for FLINK-19748.
   Please see #13772 for a detailed description.
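
As a rough illustration of the failure mode named in the title (this is not the code in this PR or in #13772): on restore, each key group appears to be located by seeking to a recorded offset in the raw keyed state stream, and a key group that was never written has no valid offset to seek to. The Java sketch below assumes a hypothetical sentinel `NO_OFFSET` for unwritten key groups and a hypothetical helper `readableKeyGroups`; neither name is taken from Flink, and the skip-on-sentinel behavior is only one plausible way to handle the case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of iterating key-group offsets; not Flink's implementation. */
public class KeyGroupOffsetsSketch {

    /** Assumed sentinel marking a key group for which nothing was written. */
    public static final long NO_OFFSET = -1L;

    /**
     * Returns the indices of the key groups that can be read from a stream of
     * the given length, skipping key groups that were never written.
     */
    public static List<Integer> readableKeyGroups(long[] offsets, long streamLength) {
        List<Integer> readable = new ArrayList<>();
        for (int keyGroup = 0; keyGroup < offsets.length; keyGroup++) {
            long offset = offsets[keyGroup];
            if (offset == NO_OFFSET) {
                // Nothing was written for this key group, so there is nothing to seek to.
                continue;
            }
            if (offset < 0 || offset > streamLength) {
                // Seeking here would fail, similar to the "position out of bounds" error.
                throw new IllegalStateException("position out of bounds: " + offset);
            }
            readable.add(keyGroup);
        }
        return readable;
    }

    public static void main(String[] args) {
        // Key groups 1 and 3 were never written.
        long[] offsets = {0L, NO_OFFSET, 16L, NO_OFFSET};
        System.out.println(readableKeyGroups(offsets, 32L)); // prints [0, 2]
    }
}
```

Without the sentinel check, an iterator that blindly seeks to every recorded offset would try to seek to an invalid position for the unwritten key groups, which matches the shape of the failure reported in FLINK-19692.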


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715370664


   ## CI report:
   
   * 86d9c5d8166af669754a8e8356a3cfced610c186 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8201) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715370664


   ## CI report:
   
   * 86d9c5d8166af669754a8e8356a3cfced610c186 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8201) 
   * 5b5b7d57cb6ffd630bab3c69b0e7d78a8e619c90 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8306) 
   



[GitHub] [flink] tzulitai edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717730004


   @Antti-Kaikkonen thank you for trying the branch out.
   
   I think the exceptions you encountered are expected in the experiments you've tried out.
   
   Could you adjust your experiments as follows, and then report back?
   
   Try out your application `FlinkStatefunCountTo1M` with a new build of StateFun that includes the changes in https://github.com/apache/flink-statefun/pull/168.
   
   You should be able to just pull that branch, do a clean build (`mvn clean install -DskipTests`), and then change the StateFun dependency in your application to `2.3-SNAPSHOT`.
   
   You should create a savepoint, and try to restore as you did in your previous test.
   Note that you should not need to apply any Flink fixes for this.
   
   ---
   
   Let me briefly explain our release plans here to address the issue you reported, and why the above adjustment makes sense:
   
   1. With the StateFun changes in https://github.com/apache/flink-statefun/pull/168 (and not including ANY Flink changes), we're expecting that restoring from checkpoints / savepoints should work properly now for all checkpoints / savepoints taken with a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168. This would already address FLINK-19692, and we're planning to push out a StateFun hotfix release immediately to unblock you and other users that may be encountering the same issue.
   
   2. What https://github.com/apache/flink-statefun/pull/168 doesn't yet solve is the ability to safely restore / upgrade from a savepoint taken with StateFun versions <= 2.2.0. This does not affect you if you don't have StateFun applications running in production yet. Enabling this requires the fixes in this PR and #13761 to land in Flink, a new Flink release, and ultimately yet another follow-up StateFun hotfix release that uses the new Flink version. That is a lengthier process, with an estimate of another 3-4 weeks, so we decided to go ahead with the above option first to move faster.
   
   ---
   
   TL;DR: It would be tremendously helpful if you could redo your experiment with a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168 and no Flink changes. Please do let me know the results!





[GitHub] [flink] flinkbot edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715370664


   ## CI report:
   
   * 86d9c5d8166af669754a8e8356a3cfced610c186 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8201) 
   * 5b5b7d57cb6ffd630bab3c69b0e7d78a8e619c90 UNKNOWN
   



[GitHub] [flink] Antti-Kaikkonen commented on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
Antti-Kaikkonen commented on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717756123


   @tzulitai I tried it and restoring from a savepoint worked. As FlinkStatefunCountTo1M doesn't actually use state I also tried with my other app that uses statefun-flink-datastream and it was able to restore from a savepoint without errors. Thank you very much! 
   
   I only tested with rocksdb state backend and rocksdb timers. The Flink version tested was 1.11.2 from https://flink.apache.org/downloads.html.





[GitHub] [flink] tzulitai commented on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai commented on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717730004


   @Antti-Kaikkonen thank you for trying the branch out.
   
   I think the exceptions you encountered are expected in the experiments you've tried out.
   
   Could you adjust your experiments as follows, and then report back?
   
   Try out your application `FlinkStatefunCountTo1M` with a new build of StateFun that includes the changes in https://github.com/apache/flink-statefun/pull/168.
   
   You should be able to just pull that branch, do a clean build (`mvn clean install -DskipTests`), and then change the StateFun dependency in your application to `2.3-SNAPSHOT`.
   
   ---
   
   Let me briefly explain our release plans here to address the issue you reported, and why the above adjustment makes sense:
   
   1. With the StateFun changes in https://github.com/apache/flink-statefun/pull/168 (and not including ANY Flink changes), we're expecting that restoring from checkpoints / savepoints should work properly now for all checkpoints / savepoints taken with a new version that includes https://github.com/apache/flink-statefun/pull/168. This would already address FLINK-19692, and we're planning to push out a StateFun hotfix release immediately to unblock you and other users that may be encountering the same issue.
   
   2. What https://github.com/apache/flink-statefun/pull/168 doesn't yet solve is the ability to safely restore / upgrade from a savepoint taken with StateFun versions <= 2.2.0. Enabling that requires the fixes in this PR and #13761 to land in Flink, a new Flink release, and ultimately yet another follow-up StateFun hotfix release that uses the new Flink version. That is a lengthier process, with an estimate of another 3-4 weeks.
   
   ---
   
   TL;DR: It would be tremendously helpful if you could redo your experiment with a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168 and no Flink changes. Please do let me know the results!





[GitHub] [flink] flinkbot edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715370664


   ## CI report:
   
   * 5b5b7d57cb6ffd630bab3c69b0e7d78a8e619c90 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8306) 
   



[GitHub] [flink] tzulitai edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717730004


   @Antti-Kaikkonen thank you for trying the branch out.
   
   I think the exceptions you encountered are expected in the experiments you've tried out.
   
   Could you adjust your experiments as follows, and then report back?
   
   Try out your application `FlinkStatefunCountTo1M` with a new build of StateFun that includes the changes in https://github.com/apache/flink-statefun/pull/168.
   
   You should be able to just pull that branch, do a clean build (`mvn clean install -DskipTests`), and then change the StateFun dependency in your application to `2.3-SNAPSHOT`.
   
   ---
   
   Let me briefly explain our release plans here to address the issue you reported, and why the above adjustment makes sense:
   
   1. With the StateFun changes in https://github.com/apache/flink-statefun/pull/168 (and not including ANY Flink changes), we're expecting that restoring from checkpoints / savepoints should work properly now for all checkpoints / savepoints taken with a new version that includes https://github.com/apache/flink-statefun/pull/168. This would already address FLINK-19692, and we're planning to push out a StateFun hotfix release immediately to unblock you and other users that may be encountering the same issue.
   
   2. What https://github.com/apache/flink-statefun/pull/168 doesn't yet solve is the ability to safely restore / upgrade from a savepoint taken with StateFun versions <= 2.2.0. Enabling that requires the fixes in this PR and #13761 to land in Flink, a new Flink release, and ultimately yet another follow-up StateFun hotfix release that uses the new Flink version. That is a lengthier process, with an estimate of another 3-4 weeks, so we decided to go ahead with the above option first to move faster.
   
   ---
   
   TL;DR: It would be tremendously helpful if you could redo your experiment with a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168 and no Flink changes. Please do let me know the results!





[GitHub] [flink] tzulitai commented on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai commented on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717757992


   @Antti-Kaikkonen great news, thank you. We'll keep you updated on the JIRA regarding an official release candidate that includes the fixes.





[GitHub] [flink] flinkbot commented on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715370664


   ## CI report:
   
   * 86d9c5d8166af669754a8e8356a3cfced610c186 UNKNOWN
   



[GitHub] [flink] Antti-Kaikkonen edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
Antti-Kaikkonen edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717293281


   I tried to build this from source and got an error when trying to restore a stateful function from a savepoint:
   
   1)
   ```
   git clone https://github.com/tzulitai/flink.git
   cd flink
   git checkout FLINK-19748-backport_1.11
   mvn clean package -DskipTests
   ```
   2)
   add to flink-conf.yaml:
   ```
   classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
   #optionally use rocksdb
   state.backend: rocksdb
   taskmanager.numberOfTaskSlots: 2
   parallelism.default: 2
   ```
   
   3)
   Run https://github.com/Antti-Kaikkonen/FlinkStatefunCountTo1M with parallelism 2
   
   4)
   create a savepoint
   
   5)
   try to restore from the savepoint; the error is thrown in the **feedback-union -> functions** task:
   ```
   java.lang.Exception: Exception while creating StreamOperatorStateContext.
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:222)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
   	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.StreamCorruptedException: invalid stream header: 008E0A20
   	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:918)
   	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:376)
   	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:69)
   	at org.apache.flink.util.InstantiationUtil$FailureTolerantObjectInputStream.<init>(InstantiationUtil.java:227)
   	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:572)
   	at org.apache.flink.streaming.api.operators.InternalTimersSnapshotReaderWriters$InternalTimersSnapshotReaderPreVersioned.restoreKeyAndNamespaceSerializers(InternalTimersSnapshotReaderWriters.java:308)
   	at org.apache.flink.streaming.api.operators.InternalTimersSnapshotReaderWriters$AbstractInternalTimersSnapshotReader.readTimersSnapshot(InternalTimersSnapshotReaderWriters.java:261)
   	at org.apache.flink.streaming.api.operators.InternalTimerServiceSerializationProxy.read(InternalTimerServiceSerializationProxy.java:115)
   	at org.apache.flink.core.io.PostVersionedIOReadableWritable.read(PostVersionedIOReadableWritable.java:76)
   	at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.restoreStateForKeyGroup(InternalTimeServiceManager.java:217)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:252)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:181)
   	... 9 more
   ```
   I'm getting the same error with the default state backend and the rocksdb state backend. When I tried with the rocksdb backend and heap timers, I got a different error already when creating a savepoint.
   
   **Edit:** Apparently I had accidentally built the FLINK-19741-backport_1.11 branch. I have now updated the above description to reflect this pull request (FLINK-19748-backport_1.11) and added the error that I got with FLINK-19741-backport_1.11 (pull request #13762) below:
   ```
   Exception: Exception while creating StreamOperatorStateContext.
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:220)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
   	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.StatePartitionStreamProvider.getStream(StatePartitionStreamProvider.java:58)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:251)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:179)
   	... 9 more
   Caused by: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.memory.ByteStreamStateHandle$ByteStateHandleInputStream.seek(ByteStreamStateHandle.java:124)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:458)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:411)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:244)
   	... 10 more
   ```
   which is the same error as in my original bug description https://issues.apache.org/jira/projects/FLINK/issues/FLINK-19692





[GitHub] [flink] flinkbot edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715370664


   ## CI report:
   
   * 5b5b7d57cb6ffd630bab3c69b0e7d78a8e619c90 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8306) 
   



[GitHub] [flink] tzulitai edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717730004


   @Antti-Kaikkonen thank you for trying the branch out.
   
   I think the exceptions you encountered are expected in the experiments you've tried out.
   
   Could you adjust your experiments as follows, and then report back?
   
   Try out your application `FlinkStatefunCountTo1M` with a new build of StateFun that includes the changes in https://github.com/apache/flink-statefun/pull/168.
   
   You should be able to just pull that branch, do a clean build (`mvn clean install -DskipTests`), and then change the StateFun dependency in your application to `2.3-SNAPSHOT`.
   
   You should create a savepoint, and try to restore as you did in your previous test.
   Note that you should not need to apply any Flink fixes for this.
   
   ---
   
   Let me briefly explain our release plans here to address the issue you reported, and why the above adjustment makes sense:
   
   1. With the StateFun changes in https://github.com/apache/flink-statefun/pull/168 (and not including ANY Flink changes), we're expecting that restoring from checkpoints / savepoints should work properly now for all checkpoints / savepoints taken with a new version that includes https://github.com/apache/flink-statefun/pull/168. This would already address FLINK-19692, and we're planning to push out a StateFun hotfix release immediately to unblock you and other users that may be encountering the same issue.
   
   2. What https://github.com/apache/flink-statefun/pull/168 doesn't yet solve is the ability to safely restore / upgrade from a savepoint taken with StateFun versions <= 2.2.0. This does not affect you if you don't have StateFun applications running in production yet. Enabling this requires the fixes in this PR and #13761 to land in Flink, a new Flink release, and ultimately yet another follow-up StateFun hotfix release that uses the new Flink version. That is a lengthier process, with an estimate of another 3-4 weeks, so we decided to go ahead with the above option first to move faster.
   
   ---
   
   TL;DR: It would be tremendously helpful if you could redo your experiment with a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168 and no Flink changes. Please do let me know the results!





[GitHub] [flink] Antti-Kaikkonen edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
Antti-Kaikkonen edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717293281


   I tried to build this from source and got the [same](https://issues.apache.org/jira/projects/FLINK/issues/FLINK-19692) error with slightly different line numbers when restoring a stateful function from a savepoint:
   
   1)
   ```
   git clone https://github.com/tzulitai/flink.git
   cd flink
   git checkout FLINK-19748-backport_1.11
   mvn clean package -DskipTests
   ```
   2)
   add to flink-conf.yaml:
   ```
   classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
   #optionally use rocksdb
   state.backend: rocksdb
   taskmanager.numberOfTaskSlots: 2
   parallelism.default: 2
   ```
   
   3)
   Run https://github.com/Antti-Kaikkonen/FlinkStatefunCountTo1M with parallelism 2
   
   4)
   create a savepoint
   
   5)
   try to restore from the savepoint; the error is thrown in the **feedback-union -> functions** task:
   ```
   Exception: Exception while creating StreamOperatorStateContext.
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:220)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
   	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.StatePartitionStreamProvider.getStream(StatePartitionStreamProvider.java:58)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:251)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:179)
   	... 9 more
   Caused by: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.memory.ByteStreamStateHandle$ByteStateHandleInputStream.seek(ByteStreamStateHandle.java:124)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:458)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:411)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:244)
   	... 10 more
   ```
   I'm getting the same error with the default state backend and the rocksdb state backend. When I tried the rocksdb backend with heap timers, I got a different error already when creating the savepoint.
   
   **Edit: I realized that I should have probably built from https://github.com/apache/flink/pull/13761 instead of this pull request. I'm testing it next**
   
   





[GitHub] [flink] tzulitai edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717730004


   @Antti-Kaikkonen thank you for trying the branch out.
   
   I think the exceptions you encountered are expected in the experiments you've tried out.
   
   Could you adjust your experiment as follows and then report back?
   
   Try out your application `FlinkStatefunCountTo1M` with a new build of StateFun that includes the changes in https://github.com/apache/flink-statefun/pull/168.
   
   You should be able to just pull that branch, do a clean build (`mvn clean install -DskipTests`), and then change the StateFun dependency in your application to `2.3-SNAPSHOT`.
   
   You should create a savepoint, and try to restore as you did in your previous test.
   Note that you should not need to apply any Flink fixes for this.
   
   ---
   
   Let me briefly explain our release plans here to address the issue you reported, and why the above adjustment makes sense:
   
   1. With the StateFun changes in https://github.com/apache/flink-statefun/pull/168 (and not including ANY Flink changes), we're expecting that restoring from checkpoints / savepoints should work properly now for all checkpoints / savepoints taken with a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168. This would already address FLINK-19692, and we're planning to push out a StateFun hotfix release immediately to unblock you and other users that may be encountering the same issue.
   
   2. What https://github.com/apache/flink-statefun/pull/168 doesn't yet solve is the ability to safely restore / upgrade from a savepoint taken with StateFun versions <= 2.2.0. This does not affect you if you don't have StateFun applications running in production yet. Enabling this requires the fixes in this PR and #13761 to be merged into Flink, a new Flink release, and finally yet another follow-up StateFun hotfix release that uses the new Flink version. That is a lengthier process, with an estimate of another 3-4 weeks, so we decided to go ahead with the above option first to move faster.
   
   ---
   
   TL;DR: It would be tremendously helpful if you could redo your experiment with only a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168 (no Flink changes). Please do let me know the results!





[GitHub] [flink] flinkbot commented on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-715318488


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 86d9c5d8166af669754a8e8356a3cfced610c186 (Fri Oct 23 12:44:40 UTC 2020)
   
   **Warnings:**
    * Documentation files were touched, but no `.zh.md` files: Update Chinese documentation or file Jira ticket.
    * **This pull request references an unassigned [Jira ticket](https://issues.apache.org/jira/browse/FLINK-19748).** According to the [code contribution guide](https://flink.apache.org/contributing/contribute-code.html), tickets need to be assigned before starting with the implementation work.
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.
   
   The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>





[GitHub] [flink] tzulitai edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
tzulitai edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717730004


   @Antti-Kaikkonen thank you for trying the branch out.
   
   I think the exceptions you encountered are expected in the experiments you've tried out.
   
   Could you adjust your experiment as follows and then report back?
   
   Try out your application `FlinkStatefunCountTo1M` with a new build of StateFun that includes the changes in https://github.com/apache/flink-statefun/pull/168.
   
   You should be able to just pull that branch, do a clean build (`mvn clean install -DskipTests`), and then change the StateFun dependency in your application to `2.3-SNAPSHOT`.
   
   You should create a savepoint, and try to restore as you did in your previous test.
   Note that you should not need to apply any Flink fixes for this.
   
   ---
   
   Let me briefly explain our release plans here to address the issue you reported, and why the above adjustment makes sense:
   
   1. With the StateFun changes in https://github.com/apache/flink-statefun/pull/168 (and not including ANY Flink changes), we're expecting that restoring from checkpoints / savepoints should work properly now for all checkpoints / savepoints taken with a new version that includes https://github.com/apache/flink-statefun/pull/168. This would already address FLINK-19692, and we're planning to push out a StateFun hotfix release immediately to unblock you and other users that may be encountering the same issue.
   
   2. What https://github.com/apache/flink-statefun/pull/168 doesn't yet solve is the ability to safely restore / upgrade from a savepoint taken with StateFun versions <= 2.2.0. Enabling that requires the fixes in this PR and #13761 to be merged into Flink, a new Flink release, and finally yet another follow-up StateFun hotfix release that uses the new Flink version. That is a lengthier process, with an estimate of another 3-4 weeks, so we decided to go ahead with the above option first to move faster.
   
   ---
   
   TL;DR: It would be tremendously helpful if you could redo your experiment with only a new StateFun build that includes https://github.com/apache/flink-statefun/pull/168 (no Flink changes). Please do let me know the results!





[GitHub] [flink] Antti-Kaikkonen commented on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
Antti-Kaikkonen commented on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717293281


   I tried to build this from source and got the [same](https://issues.apache.org/jira/projects/FLINK/issues/FLINK-19692) error with slightly different line numbers when restoring a stateful function from a savepoint:
   
   1)
   ```
   git clone https://github.com/tzulitai/flink.git
   cd flink
   git checkout FLINK-19748-backport_1.11
   mvn clean package -DskipTests
   ```
   2)
   add to flink-conf.yaml:
   ```
   classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
   #optionally use rocksdb
   state.backend: rocksdb
   taskmanager.numberOfTaskSlots: 2
   parallelism.default: 2
   ```
   
   3)
   Run https://github.com/Antti-Kaikkonen/FlinkStatefunCountTo1M with parallelism 2
   
   4)
   Create a savepoint
   
   5)
   Try to restore from the savepoint; the error is thrown in the **feedback-union -> functions** task:
   ```
   Exception: Exception while creating StreamOperatorStateContext.
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:220)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
   	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.StatePartitionStreamProvider.getStream(StatePartitionStreamProvider.java:58)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:251)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:179)
   	... 9 more
   Caused by: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.memory.ByteStreamStateHandle$ByteStateHandleInputStream.seek(ByteStreamStateHandle.java:124)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:458)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:411)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:244)
   	... 10 more
   ```
   I'm getting the same error with the default state backend and the rocksdb state backend. When I tried the rocksdb backend with heap timers, I got a different error already when creating the savepoint.
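
A minimal sketch of the failure mode named in the PR title and visible in the trace above (`seek` failing with `position out of bounds` while iterating key groups). It assumes, purely for illustration, that a key group that was never written is recorded with a sentinel offset of `-1`. The names `KeyGroupOffsetsSketch`, `SeekableBytes`, `readAllNaive` and `readAllSkippingUnwritten` are hypothetical and are not Flink's actual code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class KeyGroupOffsetsSketch {

    // Hypothetical sentinel for "this key group was never written" (an assumption of this sketch).
    static final long NO_OFFSET_SET = -1L;

    // Minimal seekable view over a byte array that rejects out-of-range positions.
    static final class SeekableBytes {
        private final byte[] data;
        private int pos;

        SeekableBytes(byte[] data) {
            this.data = data;
        }

        void seek(long desired) throws IOException {
            if (desired < 0 || desired > data.length) {
                throw new IOException("position out of bounds");
            }
            pos = (int) desired;
        }

        int read() {
            return pos < data.length ? (data[pos++] & 0xFF) : -1;
        }
    }

    // Seeks to every recorded offset, so it fails on the first unwritten key group.
    static List<Integer> readAllNaive(byte[] data, long[] offsets) throws IOException {
        SeekableBytes in = new SeekableBytes(data);
        List<Integer> firstBytes = new ArrayList<>();
        for (long offset : offsets) {
            in.seek(offset);
            firstBytes.add(in.read());
        }
        return firstBytes;
    }

    // Skips key groups whose offset is the sentinel before seeking.
    static List<Integer> readAllSkippingUnwritten(byte[] data, long[] offsets) throws IOException {
        SeekableBytes in = new SeekableBytes(data);
        List<Integer> firstBytes = new ArrayList<>();
        for (long offset : offsets) {
            if (offset == NO_OFFSET_SET) {
                continue;
            }
            in.seek(offset);
            firstBytes.add(in.read());
        }
        return firstBytes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {10, 20, 30};
        long[] offsets = {0L, NO_OFFSET_SET, 2L}; // the middle key group was never written

        System.out.println(readAllSkippingUnwritten(data, offsets)); // prints [10, 30]
        try {
            readAllNaive(data, offsets);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints position out of bounds
        }
    }
}
```

The point of the sketch is only that an iterator over key-group offsets has to treat "not written" as a distinct case before it seeks; how Flink represents and handles that case is defined by the PR itself.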
   
   
   
   





[GitHub] [flink] Antti-Kaikkonen edited a comment on pull request #13773: [backport-1.11] [FLINK-19748] Iterating key groups in raw keyed stream on restore fails if some key groups weren't written

Posted by GitBox <gi...@apache.org>.
Antti-Kaikkonen edited a comment on pull request #13773:
URL: https://github.com/apache/flink/pull/13773#issuecomment-717293281


   I tried to build this from source and got an error when trying to restore a stateful function from a savepoint:
   
   1)
   ```
   git clone https://github.com/tzulitai/flink.git
   cd flink
   git checkout FLINK-19748-backport_1.11
   mvn clean package -DskipTests
   ```
   2)
   add to flink-conf.yaml:
   ```
   classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
   #optionally use rocksdb
   state.backend: rocksdb
   taskmanager.numberOfTaskSlots: 2
   parallelism.default: 2
   ```
   
   3)
   Run https://github.com/Antti-Kaikkonen/FlinkStatefunCountTo1M with parallelism 2
   
   4)
   Create a savepoint
   
   5)
   Try to restore from the savepoint; the error is thrown in the **feedback-union -> functions** task:
   ```
   java.lang.Exception: Exception while creating StreamOperatorStateContext.
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:222)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
   	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.StreamCorruptedException: invalid stream header: 008E0A20
   	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:918)
   	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:376)
   	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:69)
   	at org.apache.flink.util.InstantiationUtil$FailureTolerantObjectInputStream.<init>(InstantiationUtil.java:227)
   	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:572)
   	at org.apache.flink.streaming.api.operators.InternalTimersSnapshotReaderWriters$InternalTimersSnapshotReaderPreVersioned.restoreKeyAndNamespaceSerializers(InternalTimersSnapshotReaderWriters.java:308)
   	at org.apache.flink.streaming.api.operators.InternalTimersSnapshotReaderWriters$AbstractInternalTimersSnapshotReader.readTimersSnapshot(InternalTimersSnapshotReaderWriters.java:261)
   	at org.apache.flink.streaming.api.operators.InternalTimerServiceSerializationProxy.read(InternalTimerServiceSerializationProxy.java:115)
   	at org.apache.flink.core.io.PostVersionedIOReadableWritable.read(PostVersionedIOReadableWritable.java:76)
   	at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.restoreStateForKeyGroup(InternalTimeServiceManager.java:217)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:252)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:181)
   	... 9 more
   ```
   I'm getting the same error with the default state backend and the rocksdb state backend. When I tried the rocksdb backend with heap timers, I got a different error already when creating the savepoint.
   
   **Edit: I realized that I should have probably built from https://github.com/apache/flink/pull/13761 instead of this pull request. I'm testing it next**
   
   **Edit2:** Apparently my previous attempt was already with FLINK-19741-backport_1.11. I have now tested both FLINK-19741-backport_1.11 and FLINK-19748-backport_1.11, and each throws a different error in the **feedback-union -> functions** task. The error with FLINK-19748-backport_1.11 is now updated in the description above, and the error with FLINK-19741-backport_1.11 is below:
   ```
   Exception: Exception while creating StreamOperatorStateContext.
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:220)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:248)
   	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.StatePartitionStreamProvider.getStream(StatePartitionStreamProvider.java:58)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:251)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:179)
   	... 9 more
   Caused by: java.io.IOException: position out of bounds
   	at org.apache.flink.runtime.state.memory.ByteStreamStateHandle$ByteStateHandleInputStream.seek(ByteStreamStateHandle.java:124)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:458)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl$KeyGroupStreamIterator.next(StreamTaskStateInitializerImpl.java:411)
   	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.internalTimeServiceManager(StreamTaskStateInitializerImpl.java:244)
   	... 10 more
   ```
   which is the same error as in my original bug description https://issues.apache.org/jira/projects/FLINK/issues/FLINK-19692
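
The `StreamCorruptedException: invalid stream header: 008E0A20` in the FLINK-19748-backport_1.11 trace above is the message `java.io.ObjectInputStream` produces when the first four bytes it reads are not the Java serialization magic number and version. A small sketch that reproduces the message with plain JDK classes follows. The class `StreamHeaderSketch` and its helper methods are hypothetical, and the input bytes are made up to match the reported header; they are not taken from a real savepoint.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;

public class StreamHeaderSketch {

    // Tries to open the bytes as a Java-serialized stream.
    // Returns null if the header is accepted, otherwise the exception message.
    static String headerError(byte[] bytes) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return null;
        } catch (StreamCorruptedException e) {
            return e.getMessage();
        }
    }

    // Serializes a value with standard Java serialization.
    static byte[] javaSerialized(Object value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Bytes that do not start with a Java serialization header (illustrative values only).
        byte[] notJavaSerialized = {0x00, (byte) 0x8E, 0x0A, 0x20, 0x01};
        System.out.println(headerError(notJavaSerialized)); // prints invalid stream header: 008E0A20

        // Bytes written by ObjectOutputStream pass the header check.
        System.out.println(headerError(javaSerialized("timer"))); // prints null
    }
}
```

In other words, the message by itself only says that the reader attempted Java deserialization at a position where the data was in some other format.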

