Posted to user@flink.apache.org by Vijay Bhaskar <bh...@gmail.com> on 2020/05/25 06:20:23 UTC

Inconsistent Checkpoint API response

Hi
I am using Flink retained checkpoints together with the
jobs/:jobid/checkpoints API to retrieve the latest retained checkpoint.
Below is a response of the Flink checkpoints API.

My job's restart attempts are set to 5.
In the "latest" key of the checkpoint API response, the checkpoint file
names of the "restored" and "completed" values behave as follows:
1) If the job fails 3 times and recovers on the 4th attempt, both values
are the same.
2) If the job fails 4 times and recovers on the 5th attempt, both values
are the same.
3) If the job fails 5 times and recovers on the 6th attempt, both values
are the same.
4) If the job fails all 6 times and is marked FAILED, both values are
still the same.
5) If the job recovers within its 5 attempts, makes a few checkpoints,
and then fails a 6th time, the two values are different.

In cases (1) through (4) I never had any issue. Only in case (5) did I
have a severe issue in production, because the checkpoint referenced by
the "restored" field no longer exists.

Please suggest.



{
   "counts":{
      "restored":6,
      "total":3,
      "in_progress":0,
      "completed":3,
      "failed":0
   },
   "summary":{
      "state_size":{
         "min":4879,
         "max":4879,
         "avg":4879
      },
      "end_to_end_duration":{
         "min":25,
         "max":130,
         "avg":87
      },
      "alignment_buffered":{
         "min":0,
         "max":0,
         "avg":0
      }
   },
   "latest":{
      "completed":{
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      "savepoint":null,
      "failed":null,
      "restored":{
         "id":7093,
         "restore_timestamp":1590382478448,
         "is_savepoint":false,
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
      }
   },
   "history":[
      {
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7093,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382310195,
         "latest_ack_timestamp":1590382310220,
         "state_size":4879,
         "end_to_end_duration":25,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7092,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382190195,
         "latest_ack_timestamp":1590382190303,
         "state_size":4879,
         "end_to_end_duration":108,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
         "discarded":true
      }
   ]
}
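
For context, a minimal sketch of how such a response can be fetched and inspected from an external monitor. The endpoint path and the "latest"/"restored"/"completed" field names come from the response above; the JobManager address and port, the job id value, and the use of Python's requests library are assumptions made for this example.

import requests  # assumption: the requests library is available

# Assumptions for the sketch: address and job id are illustrative only;
# the endpoint and field names follow the response shown above.
FLINK_REST = "http://localhost:8081"
JOB_ID = "29ae7600aa4f7d53a0dc1a0a7b257c85"

resp = requests.get(f"{FLINK_REST}/jobs/{JOB_ID}/checkpoints", timeout=10)
resp.raise_for_status()
latest = resp.json()["latest"]

completed = latest.get("completed")  # newest successfully completed checkpoint
restored = latest.get("restored")    # checkpoint/savepoint used at the last restore

if completed:
    print("latest completed:", completed["id"], completed["external_path"])
if restored:
    print("last restored:", restored["id"], restored["external_path"])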

Re: Inconsistent Checkpoint API response

Posted by Vijay Bhaskar <bh...@gmail.com>.
I have created a JIRA for it: https://issues.apache.org/jira/browse/FLINK-17966

Regards
Bhaskar

Re: Inconsistent Checkpoint API response

Posted by Vijay Bhaskar <bh...@gmail.com>.
Thanks Yun. In that case it would be good to add a reference to that
documentation in the Flink REST API page:
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html
where the checkpoints endpoint is explained. Then anyone who wants to use
the REST API will have an easy pointer to the checkpoint monitoring
documentation, which gives them the complete picture. So I will open a
Jira with this requirement.

Regards
Bhaskar
>

Re: Inconsistent Checkpoint API response

Posted by Yun Tang <my...@live.com>.
To be honest, from my point of view the current description already gives enough explanation [1] in the "Overview Tab":
    Latest Completed Checkpoint: The latest successfully completed checkpoints.
    Latest Restore: There are two types of restore operations.

  *   Restore from Checkpoint: We restored from a regular periodic checkpoint.
  *   Restore from Savepoint: We restored from a savepoint.

You could still create a JIRA issue and describe your ideas there. If it is agreed to work on that ticket, you can create a PR that edits checkpoint_monitoring.md [2] and checkpoint_monitoring.zh.md [3] to update the related documentation.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/checkpoint_monitoring.html#overview-tab
[2] https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.md
[3] https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.zh.md

Best
Yun Tang
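
For readers using the REST API rather than the web UI, a small hedged sketch of how the two documentation entries above map onto the JSON fields shown earlier in the thread: "latest.restored" identifies what was restored, and its "is_savepoint" flag distinguishes the two restore types. The function name and wording are illustrative assumptions, not part of this thread.

def describe_latest_restore(checkpoints_response):
    """Summarize the "Latest Restore" entry from a jobs/:jobid/checkpoints response."""
    restored = checkpoints_response.get("latest", {}).get("restored")
    if restored is None:
        return "no restore has happened yet"
    kind = "savepoint" if restored.get("is_savepoint") else "regular periodic checkpoint"
    return f"restored from {kind} {restored['id']} at {restored['external_path']}"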

On Tue, May 26, 2020 at 8:14 AM Yun Tang <my...@live.com> wrote:
Hi Bhaskar

It seems I still don't fully understand your case (5). Your job failed 6 times and recovered from a previous checkpoint to restart again, yet you found that the REST API gave the wrong answer.
How did you determine that the "restored" field points at a wrong checkpoint file that is not the latest? Have you checked the JobManager log for the message "Restoring job xxx from latest valid checkpoint: x@xxxx" [1] to see exactly which checkpoint was chosen for the restore?

I think you could give a more concrete example, e.g. which checkpoint you expected versus which one was actually used for the restore, to tell your story.

[1] https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250

Best
Yun Tang
________________________________
From: Vijay Bhaskar <bh...@gmail.com>
Sent: Monday, May 25, 2020 17:01
To: Yun Tang <my...@live.com>
Cc: user <us...@flink.apache.org>
Subject: Re: In consistent Check point API response

Thanks Yun.
Here is the problem I am facing:

I am using the jobs/:jobID/checkpoints API to recover the failed job. We have a remote manager which monitors the jobs, and we use the "restored" field of the API response to get the latest checkpoint file to use. It returns the correct checkpoint file in all cases except case (5), where the "restored" field gives a checkpoint file that is not the latest. When we compare it with the checkpoint file returned by the "completed" field, the two are identical in all cases except case (5).
We can't use the Flink UI because of security reasons.

Regards
Bhaskar

On Mon, May 25, 2020 at 12:57 PM Yun Tang <my...@live.com> wrote:
Hi Vijay

If I understand correctly, do you mean your last "restored" checkpoint is null in the REST API when the job has failed 6 times and then recovered successfully with several more successful checkpoints?

First of all, if your job has just recovered successfully, can you observe the "last restored" checkpoint in the web UI?
Secondly, for how long after a successful recovery can you not see the "restored" field?
Last but not least, I cannot see the real difference among your cases; what is the core difference in case (5)?

From the Flink implementation, the checkpoint statistics are created without a restored checkpoint, and the restored checkpoint is assigned once the latest savepoint/checkpoint is restored. [1]

[1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285

Best
Yun Tang



Re: Inconsistent Checkpoint API response

Posted by Vijay Bhaskar <bh...@gmail.com>.
Thanks Yun. How can I contribute better documentation for this by opening a Jira?

Regards
Bhaskar

Re: Inconsistent Checkpoint API response

Posted by Yun Tang <my...@live.com>.
Hi Bhaskar

I think I have understood your scenario now, and I think this is the expected behavior in Flink.
Since you only allow your job to restart 5 times, "restored" records the checkpoint used for the 5th recovery, and that checkpoint id stays there.

"Restored" is the last restored checkpoint and "completed" is the last completed checkpoint; they are not the same thing.
The only scenario in which they refer to the same checkpoint is when Flink has just restored successfully and no new checkpoint has completed yet.

Best
Yun Tang
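
To make the distinction concrete, a hedged sketch of how an external monitor could choose a recovery point given this behavior: prefer the latest completed checkpoint and fall back to "restored" only when nothing has completed yet. The field names follow the REST response posted earlier in the thread; the function name and the fallback policy are illustrative assumptions, not a recommendation made in this thread.

def pick_recovery_checkpoint(checkpoints_response):
    """Illustrative policy: prefer the newest completed checkpoint, because
    "restored" keeps pointing at the checkpoint of the last restore and can
    therefore be older than the newest retained checkpoint."""
    latest = checkpoints_response.get("latest", {})
    completed = latest.get("completed")
    if completed and not completed.get("discarded", False):
        return completed["external_path"]
    restored = latest.get("restored")
    if restored:
        return restored["external_path"]
    return None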




Re: Inconsistent Checkpoint API response

Posted by Vijay Bhaskar <bh...@gmail.com>.
Hi Yun
I understand the issue now:
"restored" always shows the checkpoint that was used to restore the previous state.
While the number of attempts is < 6 (in my case max attempts are 5, so 6 is the last attempt):
  Flink HA restores the state, so "restored" and the latest completed checkpoint have the same value.
Once the last attempt == 6:
  The Flink job has already taken a few new checkpoints.
  The job then fails, Flink HA gives up, and the job state is marked "FAILED".
  At this point the "restored" value is the checkpoint from the 5th attempt, while the latest completed checkpoint is the newest retained one.

Shall I file a documentation improvement Jira? I want to add more documentation covering the above scenarios.

Regards
Bhaskar
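
As a concrete illustration of this case, using the ids from the response posted at the top of the thread (restored id 7093 versus latest completed id 7094), a monitor could detect the mismatch explicitly instead of assuming the two fields agree. This is only a sketch built on those example values, not code from the thread.

# Values copied from the "latest" section of the response posted earlier.
restored = {
    "id": 7093,
    "external_path": "file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
}
completed = {
    "id": 7094,
    "external_path": "file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
}

if restored["id"] != completed["id"]:
    # Case (5): checkpoints completed after the last restore, so "restored" is
    # stale; the newest retained checkpoint is the one under "completed".
    print("restored points at:", restored["external_path"])
    print("newest retained checkpoint:", completed["external_path"])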



On Tue, May 26, 2020 at 8:14 AM Yun Tang <my...@live.com> wrote:

> Hi Bhaskar
>
> It seems I still not understand your case-5 totally. Your job failed 6
> times, and recover from previous checkpoint to restart again. However, you
> found the REST API told the wrong answer.
> How do you ensure your "restored" field is giving the wrong checkpoint
> file which is not latest? Have you ever checked the log in JM to view
> related contents: "Restoring job xxx from latest valid checkpoint: x@xxxx"
> [1] to know exactly which checkpoint choose to restore?
>
> I think you could give a more concrete example e.g. which expected/actual
> checkpoint to restore, to tell your story.
>
> [1]
> https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250
>
> Best
> Yun Tang
> ------------------------------
> *From:* Vijay Bhaskar <bh...@gmail.com>
> *Sent:* Monday, May 25, 2020 17:01
> *To:* Yun Tang <my...@live.com>
> *Cc:* user <us...@flink.apache.org>
> *Subject:* Re: In consistent Check point API response
>
> Thanks Yun.
> Here is the problem i am facing:
>
> I am using  jobs/:jobID/checkpoints  API to recover the failed job. We
> have the remote manager which monitors the jobs.  We are using "restored"
> field of the API response to get the latest check point file to use. Its
> giving correct checkpoint file for all the 4 cases except the 5'th case.
> Where the "restored" field is giving the wrong check point file which is
> not latest.  When we compare the  check point file returned by  the
> "completed". field, both are giving identical checkpoints in all 4 cases,
> except 5'th case
> We can't use flink UI in because of security reasons
>
> Regards
> Bhaskar
>
> On Mon, May 25, 2020 at 12:57 PM Yun Tang <my...@live.com> wrote:
>
> Hi Vijay
>
> If I understand correctly, do you mean your last "restored" checkpoint is
> null via the REST API when the job fails 6 times and then recovers
> successfully with several more successful checkpoints?
>
> First of all, if your job has just recovered successfully, can you observe
> the "last restored" checkpoint in the web UI?
> Secondly, for how long can you not see the "restored" field after a
> successful recovery?
> Last but not least, I cannot see the real difference among your cases;
> what is the core difference in case (5)?
>
> From the implementation of Flink, it creates the checkpoint statistics
> without a restored checkpoint and assigns the restored checkpoint once the
> latest savepoint/checkpoint has been restored. [1]
>
> [1]
> https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Vijay Bhaskar <bh...@gmail.com>
> *Sent:* Monday, May 25, 2020 14:20
> *To:* user <us...@flink.apache.org>
> *Subject:* In consistent Check point API response
>
> Hi
> I am using flink retained check points and along with
>  jobs/:jobid/checkpoints API for retrieving the latest retained check point
> Following the response of Flink Checkpoints API:
>
> I have my jobs restart attempts are 5
>  check point API response in "latest" key, check point file name of both
> "restored" and "completed" values are having following behavior
> 1)Suppose the job is failed 3 times and recovered 4'th time, then both
> values are same
> 2)Suppose the job is failed 4 times and recovered 5'th time, then both
> values are same
> 3)Suppose the job is failed 5 times and recovered 6'th time, then both
> values are same
> 4) Suppose the job is failed all 6 times and the job marked failed. then
> also both the values are same
> 5)Suppose job is failed 6'th time , after recovering from 5 attempts
> and made few check points, then both values are different.
>
> During case (1), case (2), case (3) and case (4) i never had any issue.
> Only When case (5) i had severe issue in my production as the "restored "
> field check point doesn't exist
>
> Please suggest any
>
>
>
> {
>    "counts":{
>       "restored":6,
>       "total":3,
>       "in_progress":0,
>       "completed":3,
>       "failed":0
>    },
>    "summary":{
>       "state_size":{
>          "min":4879,
>          "max":4879,
>          "avg":4879
>       },
>       "end_to_end_duration":{
>          "min":25,
>          "max":130,
>          "avg":87
>       },
>       "alignment_buffered":{
>          "min":0,
>          "max":0,
>          "avg":0
>       }
>    },
>    "latest":{
>       "completed":{
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       "savepoint":null,
>       "failed":null,
>       "restored":{
>          "id":7093,
>          "restore_timestamp":1590382478448,
>          "is_savepoint":false,
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
>       }
>    },
>    "history":[
>       {
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7093,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382310195,
>          "latest_ack_timestamp":1590382310220,
>          "state_size":4879,
>          "end_to_end_duration":25,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7092,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382190195,
>          "latest_ack_timestamp":1590382190303,
>          "state_size":4879,
>          "end_to_end_duration":108,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
>          "discarded":true
>       }
>    ]
> }
>
>

Re: In consistent Check point API response

Posted by Yun Tang <my...@live.com>.
Hi Bhaskar

It seems I still do not fully understand your case 5. Your job failed 6 times and recovered from a previous checkpoint to restart again, but you found that the REST API gave the wrong answer.
How do you know that the "restored" field is giving a wrong checkpoint file which is not the latest? Have you checked the JobManager log for the message "Restoring job xxx from latest valid checkpoint: x@xxxx" [1] to see exactly which checkpoint was chosen for the restore?

I think you could give a more concrete example, e.g. the expected vs. actual checkpoint to restore, to illustrate the problem.

[1] https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250

Best
Yun Tang
________________________________
From: Vijay Bhaskar <bh...@gmail.com>
Sent: Monday, May 25, 2020 17:01
To: Yun Tang <my...@live.com>
Cc: user <us...@flink.apache.org>
Subject: Re: In consistent Check point API response

Thanks Yun.
Here is the problem I am facing:

I am using the jobs/:jobID/checkpoints API to recover the failed job. We have a remote manager which monitors the jobs. We use the "restored" field of the API response to get the latest checkpoint file to use. It returns the correct checkpoint file in the first 4 cases, but in the 5'th case the "restored" field returns a wrong checkpoint file which is not the latest. When we compare it with the checkpoint file returned by the "completed" field, both give identical checkpoints in all cases except the 5'th.
We can't use the Flink UI because of security reasons.

Regards
Bhaskar

On Mon, May 25, 2020 at 12:57 PM Yun Tang <my...@live.com>> wrote:
Hi Vijay

If I understand correctly, do you mean your last "restored" checkpoint is null via the REST API when the job fails 6 times and then recovers successfully with several more successful checkpoints?

First of all, if your job has just recovered successfully, can you observe the "last restored" checkpoint in the web UI?
Secondly, for how long can you not see the "restored" field after a successful recovery?
Last but not least, I cannot see the real difference among your cases; what is the core difference in case (5)?

From the implementation of Flink, it creates the checkpoint statistics without a restored checkpoint and assigns the restored checkpoint once the latest savepoint/checkpoint has been restored. [1]

[1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285

Best
Yun Tang

________________________________
From: Vijay Bhaskar <bh...@gmail.com>>
Sent: Monday, May 25, 2020 14:20
To: user <us...@flink.apache.org>>
Subject: In consistent Check point API response

Hi
I am using flink retained check points and along with   jobs/:jobid/checkpoints API for retrieving the latest retained check point
Following the response of Flink Checkpoints API:

I have my jobs restart attempts are 5
 check point API response in "latest" key, check point file name of both "restored" and "completed" values are having following behavior
1)Suppose the job is failed 3 times and recovered 4'th time, then both values are same
2)Suppose the job is failed 4 times and recovered 5'th time, then both values are same
3)Suppose the job is failed 5 times and recovered 6'th time, then both values are same
4) Suppose the job is failed all 6 times and the job marked failed. then also both the values are same
5)Suppose job is failed 6'th time , after recovering from 5 attempts and made few check points, then both values are different.

During case (1), case (2), case (3) and case (4) i never had any issue. Only When case (5) i had severe issue in my production as the "restored " field check point doesn't exist

Please suggest any



{
   "counts":{
      "restored":6,
      "total":3,
      "in_progress":0,
      "completed":3,
      "failed":0
   },
   "summary":{
      "state_size":{
         "min":4879,
         "max":4879,
         "avg":4879
      },
      "end_to_end_duration":{
         "min":25,
         "max":130,
         "avg":87
      },
      "alignment_buffered":{
         "min":0,
         "max":0,
         "avg":0
      }
   },
   "latest":{
      "completed":{
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      "savepoint":null,
      "failed":null,
      "restored":{
         "id":7093,
         "restore_timestamp":1590382478448,
         "is_savepoint":false,
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
      }
   },
   "history":[
      {
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7093,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382310195,
         "latest_ack_timestamp":1590382310220,
         "state_size":4879,
         "end_to_end_duration":25,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7092,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382190195,
         "latest_ack_timestamp":1590382190303,
         "state_size":4879,
         "end_to_end_duration":108,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
         "discarded":true
      }
   ]
}


Re: In consistent Check point API response

Posted by Vijay Bhaskar <bh...@gmail.com>.
Thanks Yun.
Here is the problem I am facing:

I am using the jobs/:jobID/checkpoints API to recover the failed job. We
have a remote manager which monitors the jobs. We use the "restored" field
of the API response to get the latest checkpoint file to use. It returns
the correct checkpoint file in the first 4 cases, but in the 5'th case the
"restored" field returns a wrong checkpoint file which is not the latest.
When we compare it with the checkpoint file returned by the "completed"
field, both give identical checkpoints in all cases except the 5'th.
We can't use the Flink UI because of security reasons.
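To make the flow concrete, here is an illustrative sketch of what our remote
manager does (the host/port, class name, and job id are placeholders rather
than our real code; it assumes Java 11+ java.net.http and Jackson for JSON
parsing):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CheckpointMonitor {

    public static void main(String[] args) throws Exception {
        // Job id taken from the example response below; in reality the
        // remote manager discovers it from the jobs overview.
        String jobId = "29ae7600aa4f7d53a0dc1a0a7b257c85";
        URI uri = URI.create("http://localhost:8081/jobs/" + jobId + "/checkpoints");

        // Fetch the checkpoint statistics from the Flink REST API.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(),
                      HttpResponse.BodyHandlers.ofString());

        JsonNode latest = new ObjectMapper().readTree(response.body()).path("latest");

        // The field we currently rely on to resubmit a failed job:
        String restoredPath = latest.path("restored").path("external_path").asText(null);
        // The field that still points at the newest retained checkpoint in case (5):
        String completedPath = latest.path("completed").path("external_path").asText(null);

        System.out.println("restored  -> " + restoredPath);
        System.out.println("completed -> " + completedPath);
    }
}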

Regards
Bhaskar

On Mon, May 25, 2020 at 12:57 PM Yun Tang <my...@live.com> wrote:

> Hi Vijay
>
> If I understand correctly, do you mean your last "restored" checkpoint is
> null via the REST API when the job fails 6 times and then recovers
> successfully with several more successful checkpoints?
>
> First of all, if your job has just recovered successfully, can you observe
> the "last restored" checkpoint in the web UI?
> Secondly, for how long can you not see the "restored" field after a
> successful recovery?
> Last but not least, I cannot see the real difference among your cases;
> what is the core difference in case (5)?
>
> From the implementation of Flink, it creates the checkpoint statistics
> without a restored checkpoint and assigns the restored checkpoint once the
> latest savepoint/checkpoint has been restored. [1]
>
> [1]
> https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Vijay Bhaskar <bh...@gmail.com>
> *Sent:* Monday, May 25, 2020 14:20
> *To:* user <us...@flink.apache.org>
> *Subject:* In consistent Check point API response
>
> Hi
> I am using flink retained check points and along with
>  jobs/:jobid/checkpoints API for retrieving the latest retained check point
> Following the response of Flink Checkpoints API:
>
> I have my jobs restart attempts are 5
>  check point API response in "latest" key, check point file name of both
> "restored" and "completed" values are having following behavior
> 1)Suppose the job is failed 3 times and recovered 4'th time, then both
> values are same
> 2)Suppose the job is failed 4 times and recovered 5'th time, then both
> values are same
> 3)Suppose the job is failed 5 times and recovered 6'th time, then both
> values are same
> 4) Suppose the job is failed all 6 times and the job marked failed. then
> also both the values are same
> 5)Suppose job is failed 6'th time , after recovering from 5 attempts
> and made few check points, then both values are different.
>
> During case (1), case (2), case (3) and case (4) i never had any issue.
> Only When case (5) i had severe issue in my production as the "restored "
> field check point doesn't exist
>
> Please suggest any
>
>
>
> {
>    "counts":{
>       "restored":6,
>       "total":3,
>       "in_progress":0,
>       "completed":3,
>       "failed":0
>    },
>    "summary":{
>       "state_size":{
>          "min":4879,
>          "max":4879,
>          "avg":4879
>       },
>       "end_to_end_duration":{
>          "min":25,
>          "max":130,
>          "avg":87
>       },
>       "alignment_buffered":{
>          "min":0,
>          "max":0,
>          "avg":0
>       }
>    },
>    "latest":{
>       "completed":{
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       "savepoint":null,
>       "failed":null,
>       "restored":{
>          "id":7093,
>          "restore_timestamp":1590382478448,
>          "is_savepoint":false,
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
>       }
>    },
>    "history":[
>       {
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7093,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382310195,
>          "latest_ack_timestamp":1590382310220,
>          "state_size":4879,
>          "end_to_end_duration":25,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7092,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382190195,
>          "latest_ack_timestamp":1590382190303,
>          "state_size":4879,
>          "end_to_end_duration":108,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{
>
>          },
>
>  "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
>          "discarded":true
>       }
>    ]
> }
>
>

Re: In consistent Check point API response

Posted by Yun Tang <my...@live.com>.
Hi Vijay

If I understand correctly, do you mean your last "restored" checkpoint is null via the REST API when the job fails 6 times and then recovers successfully with several more successful checkpoints?

First of all, if your job has just recovered successfully, can you observe the "last restored" checkpoint in the web UI?
Secondly, for how long can you not see the "restored" field after a successful recovery?
Last but not least, I cannot see the real difference among your cases; what is the core difference in case (5)?

From the implementation of Flink, it creates the checkpoint statistics without a restored checkpoint and assigns the restored checkpoint once the latest savepoint/checkpoint has been restored. [1]

[1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285

Best
Yun Tang

________________________________
From: Vijay Bhaskar <bh...@gmail.com>
Sent: Monday, May 25, 2020 14:20
To: user <us...@flink.apache.org>
Subject: In consistent Check point API response

Hi
I am using flink retained check points and along with   jobs/:jobid/checkpoints API for retrieving the latest retained check point
Following the response of Flink Checkpoints API:

I have my jobs restart attempts are 5
 check point API response in "latest" key, check point file name of both "restored" and "completed" values are having following behavior
1)Suppose the job is failed 3 times and recovered 4'th time, then both values are same
2)Suppose the job is failed 4 times and recovered 5'th time, then both values are same
3)Suppose the job is failed 5 times and recovered 6'th time, then both values are same
4) Suppose the job is failed all 6 times and the job marked failed. then also both the values are same
5)Suppose job is failed 6'th time , after recovering from 5 attempts and made few check points, then both values are different.

During case (1), case (2), case (3) and case (4) i never had any issue. Only When case (5) i had severe issue in my production as the "restored " field check point doesn't exist

Please suggest any



{
   "counts":{
      "restored":6,
      "total":3,
      "in_progress":0,
      "completed":3,
      "failed":0
   },
   "summary":{
      "state_size":{
         "min":4879,
         "max":4879,
         "avg":4879
      },
      "end_to_end_duration":{
         "min":25,
         "max":130,
         "avg":87
      },
      "alignment_buffered":{
         "min":0,
         "max":0,
         "avg":0
      }
   },
   "latest":{
      "completed":{
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      "savepoint":null,
      "failed":null,
      "restored":{
         "id":7093,
         "restore_timestamp":1590382478448,
         "is_savepoint":false,
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
      }
   },
   "history":[
      {
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7093,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382310195,
         "latest_ack_timestamp":1590382310220,
         "state_size":4879,
         "end_to_end_duration":25,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7092,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382190195,
         "latest_ack_timestamp":1590382190303,
         "state_size":4879,
         "end_to_end_duration":108,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
         "discarded":true
      }
   ]
}