Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/07/16 14:05:33 UTC

[GitHub] [flink] 1u0 opened a new pull request #9131: [FLINK-12858][checkpointing] Stop-with-savepoint, workaround: fail whole job when savepoint is declined by a task

URL: https://github.com/apache/flink/pull/9131
 
 
   ## What is the purpose of the change
   
   This pull request attempts to address a Flink job hanging when stop-with-savepoint fails because one of the job's tasks declines the savepoint. With this change, the job manager fails the whole execution graph in such cases (which may trigger a job restart).
   
   ## Brief change log
   
     - The `LegacyScheduler` is modified to track `CheckpointException`s in `stopWithSavepoint()` that originate from tasks, and to fail the execution graph when such an exception occurs.
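
The idea behind this change can be illustrated with a minimal, self-contained sketch. It is not the code from this pull request: `SavepointDeclinedException`, `ExecutionGraphStub`, and the `declinedByTask` flag are hypothetical stand-ins for Flink's `CheckpointException` and execution graph, and the `stopWithSavepoint` method here only mirrors the name of the scheduler method. The sketch assumes the savepoint result arrives as a `CompletableFuture` and shows one way to fail the graph only when the failure originates from a task.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicReference;

class StopWithSavepointSketch {

    /** Hypothetical stand-in for Flink's CheckpointException; records whether a task declined. */
    static class SavepointDeclinedException extends RuntimeException {
        final boolean declinedByTask;

        SavepointDeclinedException(String message, boolean declinedByTask) {
            super(message);
            this.declinedByTask = declinedByTask;
        }
    }

    /** Hypothetical stand-in for the execution graph; remembers the first global failure. */
    static class ExecutionGraphStub {
        private final AtomicReference<Throwable> failure = new AtomicReference<>();

        void failGlobal(Throwable cause) {
            failure.compareAndSet(null, cause);
        }

        Throwable getFailure() {
            return failure.get();
        }
    }

    /** Strips CompletionException wrappers so the original cause can be inspected. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /**
     * Observes the savepoint future and fails the graph only if a task declined the savepoint.
     * The returned future still carries the original result or error to the caller.
     */
    static CompletableFuture<String> stopWithSavepoint(
            CompletableFuture<String> savepointFuture, ExecutionGraphStub graph) {
        return savepointFuture.whenComplete((path, error) -> {
            Throwable cause = unwrap(error);
            if (cause instanceof SavepointDeclinedException
                    && ((SavepointDeclinedException) cause).declinedByTask) {
                graph.failGlobal(cause);
            }
        });
    }

    public static void main(String[] args) {
        ExecutionGraphStub graph = new ExecutionGraphStub();
        CompletableFuture<String> savepoint = new CompletableFuture<>();
        CompletableFuture<String> result = stopWithSavepoint(savepoint, graph);

        // Simulate a task declining the savepoint.
        savepoint.completeExceptionally(
                new SavepointDeclinedException("savepoint declined by task", true));

        System.out.println("stop-with-savepoint failed: " + result.isCompletedExceptionally());
        System.out.println("graph failure: " + graph.getFailure());
    }
}
```

In this sketch, failures that do not come from a task (for example, a hypothetical coordinator-side error with `declinedByTask` set to `false`) leave the graph untouched, so only task-originated declines escalate to a whole-job failure.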
   
   ## Verifying this change
   
   This change was partially tested with a manual e2e test:
    * a Flink cluster configured with savepoints/checkpoints and with `task.checkpoint.alignment.max-size: 64` set;
    * a test Flink job with two congested sources (joined in `map-1`) and events that exceed the configured limit (see the execution graph below).
   
   In the test job, `map-1` rejects checkpoints/savepoints with high probability due to `checkpointSizeLimitExceeded`.
   
   ![job-graph](https://user-images.githubusercontent.com/488251/61300624-9b368600-a7e2-11e9-8d9c-62e936689a79.png)
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services