Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2022/04/13 09:32:00 UTC

[jira] [Comment Edited] (FLINK-27127) Local recovery is not triggered on task manager process restart

    [ https://issues.apache.org/jira/browse/FLINK-27127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521525#comment-17521525 ] 

Chesnay Schepler edited comment on FLINK-27127 at 4/13/22 9:31 AM:
-------------------------------------------------------------------

First off, thank you for providing such a good reproducer; it really made it easy to debug.

-You need to explicitly configure {{state.backend.rocksdb.localdir}} to some directory in the working directory, like {{/pv/rocksdb}}.-
-This is not mentioned in the documentation; I'll fix that. I will also check why this doesn't work out-of-the-box, or specifically why we write to {{/pv/tmp}} which is cleared on restart.-

After double-checking, this didn't solve the issue. Maybe it's got something to do with how the restart pans out...



was (Author: zentol):
First off, thank you for providing such a good reproducer; it really made it easy to debug.

You need to explicitly configure {{state.backend.rocksdb.localdir}} to some directory in the working directory, like {{/pv/rocksdb}}.

This is not mentioned in the documentation; I'll fix that. I will also check why this doesn't work out-of-the-box, or specifically why we write to {{/pv/tmp}} which is cleared on restart.


> Local recovery is not triggered on task manager process restart
> ---------------------------------------------------------------
>
>                 Key: FLINK-27127
>                 URL: https://issues.apache.org/jira/browse/FLINK-27127
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0
>            Reporter: Abdullah alkhawatrah
>            Assignee: Chesnay Schepler
>            Priority: Blocker
>
> Hey,
> I am experimenting with the support of local recovery after process restart introduced in 1.15. I am trying this on minikube.
> So far, it seems that every time a pod restarts, remote recovery is triggered.
> I have created a repo with everything needed to test it locally with minikube: [https://github.com/akhawatrahTW/flink-local-recovery-test].
> The readme contains the steps to reproduce.
>  
> Based on the documentation, I was expecting to have local recovery triggered on pod restarts since the needed configs are set: [https://github.com/akhawatrahTW/flink-local-recovery-test/blob/bfef14e45f475ba953a05b50b8829d9d33bdcec6/k8s/flink-configuration-configmap.yaml#L27.]
> So I was expecting to see something similar to this in the logs of the recreated task manager pod:
> *Expected:*
> {code:java}
> 2022-04-07 09:17:17,637 INFO  org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation [] - Starting to restore from state handle: IncrementalLocalKeyedStateHandle{metaDataState=File State: file:/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/c2756339-8938-4949-84ff-d7ee3f4c55cf [479 bytes]} DirectoryKeyedStateHandle{directoryStateHandle=DirectoryStateHandle{directory=/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/5455302ce9554a1f81365aee368f267e}, keyGroupRange=KeyGroupRange{startKeyGroup=86, endKeyGroup=127}} without rescaling.{code}
>  
>  
> But for some reason, remote recovery is triggered:
> *Actual:*
> {code:java}
> 2022-04-07 09:17:18,405 INFO  org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation [] - Finished restoring from state handle: IncrementalRemoteKeyedStateHandle{backendIdentifier=544f3300-36bd-40a6-9ee3-f78b0e47dfd6, stateHandleId=c2753d01-2f6b-49f0-9ca1-df6b54c61490, keyGroupRange=KeyGroupRange{startKeyGroup=0, endKeyGroup=42}, checkpointId=1208, sharedState={001526.sst=ByteStreamStateHandle{handleName='f5a113d0-8094-40e7-a1b1-adc4cfc690c2', dataBytes=23107}, 001527.sst=ByteStreamStateHandle{handleName='3806411e-8213-406a-bbd8-e498ab19d118', dataBytes=15579}, 001528.sst=ByteStreamStateHandle{handleName='4fef6ead-1522-4f61-a6ad-399b334b41ca', dataBytes=15839}, 001529.sst=ByteStreamStateHandle{handleName='f1324a0c-3eae-46b0-acc2-c03d32b0c24a', dataBytes=16055}}, privateState={OPTIONS-001237=ByteStreamStateHandle{handleName='2e36d07b-5f91-4c9d-9778-5a16bb6254d5', dataBytes=9924}, MANIFEST-001234=ByteStreamStateHandle{handleName='4c95b38a-4afa-4154-9c89-9518d6384a25', dataBytes=27356}, CURRENT=ByteStreamStateHandle{handleName='17bd5bab-c369-470a-bf29-e76279cef2ba', dataBytes=16}}, metaStateHandle=ByteStreamStateHandle{handleName='15827f44-0ab2-4562-b8eb-812b8d260206', dataBytes=479}, registered=false} without rescaling.{code}
>  
>  
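
For anyone re-running the reproducer, the two log lines quoted above differ in the state handle class they name (IncrementalLocalKeyedStateHandle vs. IncrementalRemoteKeyedStateHandle). The following is a minimal sketch, not part of Flink, that classifies task manager log lines on that basis. The function names classify_restore and summarize are hypothetical, and the sketch assumes the restore messages keep the wording shown in the quoted logs; other state handle types are ignored.

```python
import re

# Handle class names taken from the two log lines quoted in the issue.
LOCAL_HANDLE = "IncrementalLocalKeyedStateHandle"
REMOTE_HANDLE = "IncrementalRemoteKeyedStateHandle"

# Assumption: restore messages keep the wording seen in the quoted logs
# ("Starting to restore from state handle: ..." or
#  "Finished restoring from state handle: ...").
_RESTORE_RE = re.compile(
    r"(?:Starting to restore|Finished restoring) from state handle: (\w+)\{"
)


def classify_restore(line):
    """Return 'local', 'remote', or None for a single log line."""
    match = _RESTORE_RE.search(line)
    if match is None:
        return None
    handle = match.group(1)
    if handle == LOCAL_HANDLE:
        return "local"
    if handle == REMOTE_HANDLE:
        return "remote"
    # Any other handle type is outside the scope of this sketch.
    return None


def summarize(lines):
    """Count local and remote restore messages in an iterable of log lines."""
    counts = {"local": 0, "remote": 0}
    for line in lines:
        kind = classify_restore(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Running summarize over the log of the recreated task manager pod should show a non-zero "local" count if local recovery was triggered, and only "remote" entries in the situation described in this issue.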



--
This message was sent by Atlassian Jira
(v8.20.1#820001)