You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Dawid Wysakowicz (Jira)" <ji...@apache.org> on 2022/03/07 10:28:00 UTC

[jira] [Commented] (FLINK-26274) Test local recovery works across TaskManager process restarts

    [ https://issues.apache.org/jira/browse/FLINK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502183#comment-17502183 ] 

Dawid Wysakowicz commented on FLINK-26274:
------------------------------------------

I tried testing the feature in a following way:
1. I set up a minikube with Flink cluster following the documentation and a MinIO (for a dfs implementation)
2. I run the StateMachineExample job: {{./bin/flink run -p 6 -m localhost:8081 ./examples/streaming/StateMachineExample.jar --backend rocks --incremental-checkpoints true --checkpoint-dir s3://checkpoints/checkpoints}}
3. I scaled down taskmanagers from 3 to 2 than back from 2 to 3. Checked the new taskmanagers logs to see if it tried restoring from the local state which it did
4. I scaled down taskmanagers from 3 to 1 than from 1 to 3 to check slot allocation. I got into an infinite restart loop. I am attaching jobmanagers logs (jobmanager_local_restore_2.log). Have I done something wrong? Why can't it restore?

* flink-conf in ConfigMap:
{code}
	"flink-conf.yaml": "state.backend.local-recovery: true
		process.taskmanager.working-dir: /pv
		jobmanager.rpc.address: flink-jobmanager
		taskmanager.numberOfTaskSlots: 2
		blob.server.port: 6124
		jobmanager.rpc.port: 6123
		taskmanager.rpc.port: 6122
		queryable-state.proxy.ports: 6125
		jobmanager.memory.process.size: 1600m
		taskmanager.memory.process.size: 1728m
		parallelism.default: 2
		s3.endpoint: http://minio-hl:9000
		s3.path.style.access: true
		s3.access.key: flinkflink
		s3.secret.key: flinkflink
		state.storage.fs.memory-threshold: 0
		",
{code}

* taskmanager config:
{code}
      containers:
        - name: taskmanager
          image: dawidwys/flink:1.15-SN-s3
          args:
            - taskmanager
            - '-Dtaskmanager.resource-id=$(POD_NAME)'
{code}

* commit id: 26d7c09 [^jobmanager_local_restore_2.log] 

> Test local recovery works across TaskManager process restarts
> -------------------------------------------------------------
>
>                 Key: FLINK-26274
>                 URL: https://issues.apache.org/jira/browse/FLINK-26274
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>         Attachments: jobmanager_local_restore_2.log
>
>
> This ticket is a testing task for [FLIP-201|https://cwiki.apache.org/confluence/x/wJuqCw].
> When enabling local recovery and configuring a working directory that can be re-read after a process failure, Flink should now be able to recover locally. We should test whether this is the case. Please take a look at the documentation [1, 2] to see how to configure Flink to make use of it.
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/working_directory/
> [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#enabling-local-recovery-across-pod-restarts



--
This message was sent by Atlassian Jira
(v8.20.1#820001)