You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Dawid Wysakowicz (Jira)" <ji...@apache.org> on 2022/03/07 10:28:00 UTC
[jira] [Commented] (FLINK-26274) Test local recovery works across TaskManager process restarts
[ https://issues.apache.org/jira/browse/FLINK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502183#comment-17502183 ]
Dawid Wysakowicz commented on FLINK-26274:
------------------------------------------
I tried testing the feature in a following way:
1. I set up a minikube with Flink cluster following the documentation and a MinIO (for a dfs implementation)
2. I run the StateMachineExample job: {{./bin/flink run -p 6 -m localhost:8081 ./examples/streaming/StateMachineExample.jar --backend rocks --incremental-checkpoints true --checkpoint-dir s3://checkpoints/checkpoints}}
3. I scaled down taskmanagers from 3 to 2 than back from 2 to 3. Checked the new taskmanagers logs to see if it tried restoring from the local state which it did
4. I scaled down taskmanagers from 3 to 1 than from 1 to 3 to check slot allocation. I got into an infinite restart loop. I am attaching jobmanagers logs (jobmanager_local_restore_2.log). Have I done something wrong? Why can't it restore?
* flink-conf in ConfigMap:
{code}
"flink-conf.yaml": "state.backend.local-recovery: true
process.taskmanager.working-dir: /pv
jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 2
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
queryable-state.proxy.ports: 6125
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
parallelism.default: 2
s3.endpoint: http://minio-hl:9000
s3.path.style.access: true
s3.access.key: flinkflink
s3.secret.key: flinkflink
state.storage.fs.memory-threshold: 0
",
{code}
* taskmanager config:
{code}
containers:
- name: taskmanager
image: dawidwys/flink:1.15-SN-s3
args:
- taskmanager
- '-Dtaskmanager.resource-id=$(POD_NAME)'
{code}
* commit id: 26d7c09 [^jobmanager_local_restore_2.log]
> Test local recovery works across TaskManager process restarts
> -------------------------------------------------------------
>
> Key: FLINK-26274
> URL: https://issues.apache.org/jira/browse/FLINK-26274
> Project: Flink
> Issue Type: Technical Debt
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Till Rohrmann
> Assignee: Dawid Wysakowicz
> Priority: Blocker
> Labels: release-testing
> Fix For: 1.15.0
>
> Attachments: jobmanager_local_restore_2.log
>
>
> This ticket is a testing task for [FLIP-201|https://cwiki.apache.org/confluence/x/wJuqCw].
> When enabling local recovery and configuring a working directory that can be re-read after a process failure, Flink should now be able to recover locally. We should test whether this is the case. Please take a look at the documentation [1, 2] to see how to configure Flink to make use of it.
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/working_directory/
> [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#enabling-local-recovery-across-pod-restarts
--
This message was sent by Atlassian Jira
(v8.20.1#820001)