Posted to issues@flink.apache.org by "Yang Wang (Jira)" <ji...@apache.org> on 2022/03/14 02:12:00 UTC
[jira] [Comment Edited] (FLINK-26391) Release Testing: Application Mode recovery does not re-trigger a job which failed during cleanup (FLINK-11813)

    [ https://issues.apache.org/jira/browse/FLINK-26391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499874#comment-17499874 ] 

Yang Wang edited comment on FLINK-26391 at 3/14/22, 2:11 AM:
-------------------------------------------------------------

Tested this ticket on YARN with the following steps.

1. [PASS] Submit a Flink application with a specified high-availability cluster-id
{code:java}
./bin/flink run-application -t yarn-application -d -Dhigh-availability.cluster-id=test-job-result-store -Dhigh-availability=ZooKeeper -Dhigh-availability.zookeeper.quorum=i22xxxx:12181 -Dhigh-availability.storageDir=hdfs://flinkdev/tmp/flink-ha-yiqi -Djob-result-store.delete-on-commit=true examples/streaming/StateMachineExample.jar
{code}
2. [PASS] Wait for the job to be running, then cancel it via the web UI
3. [PASS] Verify the job result store at {{hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store/00000000000000000000000000000000.json}}
The content is as follows.
{code:java}
{"result":{"id":"00000000000000000000000000000000","application-status":"CANCELED","accumulator-results":{},"net-runtime":74563},"version":1}
{code}
4. [PASS] Rename the generated job result store file to mark it as dirty (from {{<job-id>.json}} to {{<job-id>_DIRTY.json}})
5. [PASS] Use the command from step 1 to submit a Flink application again
6. [PASS] Verify the job does not run again and finishes directly, and that the dirty job result store entry has been marked clean
7. [PASS] Start the Flink application again using the command from step 1 with {{job-result-store.delete-on-commit=true}}
8. [{color:#ff0000}NOT PASS{color}] Verify the clean job result store file is deleted
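
For reference, the rename in step 4 can be sketched as a small shell helper. This is only a sketch, assuming the JRS path from step 3 and the {{_DIRTY.json}} naming from the issue description. {{dirty_path}} is a hypothetical helper name, and the {{hdfs dfs -mv}} command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a clean JRS entry path to its dirty counterpart,
# i.e. <job-id>.json -> <job-id>_DIRTY.json (naming taken from the issue description).
dirty_path() {
  local clean="$1"
  printf '%s_DIRTY.json\n' "${clean%.json}"
}

# Path of the JRS entry from step 3 of this test run.
JRS_FILE="hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store/00000000000000000000000000000000.json"

# Print the rename command instead of running it, so it can be reviewed first.
echo "hdfs dfs -mv ${JRS_FILE} $(dirty_path "${JRS_FILE}")"
```

Printing the command rather than running it keeps the sketch safe to try on a machine without HDFS access.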

 

cc [~mapohl] I am not sure whether the result of step 8 is the expected behavior, because when I start a new Flink application with delete-on-commit=true from the very beginning, the job result store file is not retained at all.

 

 

Update:

Step 8 in the above comment is not a bug. Because I set {{job-result-store.delete-on-commit=true}} in the initial run, users need to clean up the job result store file manually.



> Release Testing: Application Mode recovery does not re-trigger a job which failed during cleanup (FLINK-11813)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26391
>                 URL: https://issues.apache.org/jira/browse/FLINK-26391
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Assignee: Yang Wang
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> FLINK-11813 is about not being able to determine whether a job has been terminated globally before a failover happened. Testing this behavior can be achieved by running a job in HA mode to enable the file-based {{JobResultStore}} (JRS).
> You can specify [job-result-store.storage-path|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-storage-path] to point to a directory which you can access. [job-result-store.delete-on-commit|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-delete-on-commit] can be used to keep the JRS artifacts from being deleted after a job has finished.
> You can let a job finish to generate the JRS artifact for this job in the specified directory. Renaming the generated file from {{<job-id>.json}} to {{<job-id>_DIRTY.json}} will simulate the job not having been cleaned up properly. Starting the job in application mode once more (by specifying the corresponding Job ID) should lead to the job not being started again (you might want to enable {{debug}} logging to verify this in the logs), i.e.:
> * Cleanup should be performed. 
> * No JobMaster-related logs should appear in the Flink logs.
> * Cleanup-related logs should appear in the Flink logs.
> * At the end, the {{_DIRTY.json}} file extension should have been removed from the JRS artifact again
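
Two of the checks listed above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: {{no_jobmaster_logs}} and {{no_dirty_entries}} are hypothetical helper names, the literal {{JobMaster}} search pattern is an assumption about what the relevant log lines contain, and the paths in the usage comments are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given log file is readable and
# contains no line mentioning "JobMaster" (the pattern is an assumption).
no_jobmaster_logs() {
  [ -r "$1" ] || return 2
  ! grep -q 'JobMaster' "$1"
}

# Hypothetical helper: read a file listing from stdin and succeed only if
# no listed name still ends in _DIRTY.json.
no_dirty_entries() {
  ! grep -q '_DIRTY\.json$'
}

# Example usage (placeholder paths, adjust to your setup):
#   no_jobmaster_logs /path/to/jobmanager.log && echo "no JobMaster logs found"
#   hdfs dfs -ls <jrs-dir> | no_dirty_entries && echo "no dirty entries left"
```

Both helpers only report through their exit status, so they can be chained with {{&&}} in a larger verification script.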


