Posted to user@flink.apache.org by Vishal Santoshi <vi...@gmail.com> on 2019/03/11 21:16:13 UTC

K8s job cluster and cancel and resume from a save point ?

There are some issues I see and would want to get some feedback

1. On cancellation with savepoint with a target directory, the k8s job
does not exit ( it is not a deployment ). I would assume that on
cancellation the JVM should exit, after cleanup etc., and thus the pod
should too. That does not happen, and so the job pod remains live. Is that
expected ?

2. To resume from a save point, it seems that I have to delete the job id
( 0000000000.... )  from ZooKeeper ( this is HA ); otherwise it defaults to the
latest checkpoint no matter what


I am kind of curious as to what, in 1.7.2, the tested process of
cancelling with a save point and resuming is, and what the cogent story is
around the job id ( which defaults to 000000000000.. ). Note that --job-id does
not work with 1.7.2, so even though that default does not make sense, I still
cannot provide a new job id.

Regards,

Vishal.

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vijay Bhaskar <bh...@gmail.com>.
Oh, yeah, this is a larger issue indeed :)

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:51 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> Thanks Vijay,
>
> This is the larger issue.  The cancellation routine is itself broken.
>
> On cancellation, Flink does remove the checkpoint counter
>
> *2019-03-12 14:12:13,143
> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Removing /checkpoint-counter/00000000000000000000000000000000 from
> ZooKeeper *
>
> but exits with a non-zero code
>
> *2019-03-12 14:12:13,477
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
> exit code 1444.*
>
>
> That I think is an issue. A cancelled job is a complete job and thus the
> exit code should be 0 for k8s to mark it complete.
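
An editor's aside, not from the thread: Kubernetes marks a Job pod Succeeded only when its container exits with status 0, and process exit statuses are truncated to 8 bits, so the 1444 above would actually surface as 1444 % 256 = 164. Either way it is non-zero, so the Job is treated as failed. A minimal shell illustration:

```shell
# exit statuses are reported modulo 256; anything non-zero fails a k8s Job
sh -c 'exit 0';    echo "first:  $?"   # first:  0   (pod would be marked Succeeded)
sh -c 'exit 1444'; echo "second: $?"   # second: 164 (1444 mod 256; pod marked Failed)
```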
>
>
>
>
>
>
>
>
>
> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bh...@gmail.com>
> wrote:
>
>> Yes Vishal. Thats correct.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> This is really not cool, but here you go. This seems to work. Agreed that
>>> it cannot be this painful. The cancel does not exit with an exit code of
>>> 0, and thus the job has to be deleted manually. Vijay, does this align with
>>> what you have had to do ?
>>>
>>>
>>>    - Take a save point . This returns a request id
>>>
>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>
>>>
>>>
>>>    - Make sure the save point succeeded
>>>
>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>
>>>
>>>
>>>    - cancel the job
>>>
>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>
>>>
>>>
>>>    - Delete the job and deployment
>>>
>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>
>>>    args: ["job-cluster",
>>>
>>>                   "--fromSavepoint",
>>>
>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>                   "--job-classname", .........
>>>
>>>
>>>
>>>    - Restart
>>>
>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>    - Make sure from the UI, that it restored from the specific save
>>>    point.
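
The curl steps above can be sketched end to end. To keep the sketch runnable anywhere, nothing below talks to a real cluster: the build_* helpers only print the requests, and parse_location pulls the savepoint path out of a sample status response. The host is a placeholder, and the field names ("request-id", "operation"/"location") are my reading of the Flink 1.7 REST docs, so treat them as assumptions:

```shell
BASE="https://flink.example.com"        # placeholder ingress host
JOB="00000000000000000000000000000000"  # a 1.7 job cluster always uses the zero job id

# step 1: trigger a savepoint (the response carries a request-id to poll)
build_trigger() {
  printf "curl -s -H 'Content-Type: application/json' -X POST -d '%s' %s\n" \
    "{\"target-directory\":\"$1\",\"cancel-job\":false}" \
    "$BASE/jobs/$JOB/savepoints"
}
# step 2: poll the trigger id until the status reports COMPLETED
build_status() { printf "curl -s %s/jobs/%s/savepoints/%s\n" "$BASE" "$JOB" "$1"; }
# step 3: cancel the job
build_cancel() { printf "curl -s -X PATCH '%s/jobs/%s?mode=cancel'\n" "$BASE" "$JOB"; }

# crude extraction of operation.location from the step-2 JSON response
parse_location() { sed -n 's/.*"location":"\([^"]*\)".*/\1/p'; }

build_trigger "hdfs://nn-crunchy:8020/tmp/xyz14"
build_status  "2c053ce3bea31276aa25e63784629687"
build_cancel
echo '{"status":{"id":"COMPLETED"},"operation":{"location":"hdfs:///tmp/xyz14/savepoint-000000-1d4f71345e22"}}' \
  | parse_location
```

The path printed at the end is what goes into the --fromSavepoint argument in the manifest edit above.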
>>>
>>>
>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bh...@gmail.com>
>>> wrote:
>>>
>>>> Yes, it's supposed to work. But unfortunately it was not working. The
>>>> Flink community needs to respond to this behavior.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> Aah.
>>>>> Let me try this out and will get back to you.
>>>>> Though I would assume that save point with cancel is a single atomic
>>>>> step, rather than a save point *followed*  by a cancellation ( else
>>>>> why would that be an option ).
>>>>> Thanks again.
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> yarn-cancel is not meant only for a yarn cluster. It works for all
>>>>>> clusters. It's the recommended command.
>>>>>>
>>>>>> Use the following command to issue save point.
>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>> --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>> "cancel-job":false}'  \ https://
>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>
>>>>>> Then issue yarn-cancel.
>>>>>> After that  follow the process to restore save point
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Vijay,
>>>>>>>
>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>> deployment ( rather than yarn ), but maybe they follow the same lifecycle.
>>>>>>> I issue a *save point with cancel* as documented here
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>> a straight up
>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>> --data
>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>> \ https://
>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>
>>>>>>> I would assume that after taking the save point, the JVM should
>>>>>>> exit; after all, the k8s deployment is of kind: Job, and if it is a job
>>>>>>> cluster then a cancellation should exit the JVM and hence the pod. It does
>>>>>>> seem to do some things right. It stops a bunch of things ( the JobMaster,
>>>>>>> the SlotPool, the ZooKeeper coordinator etc. ). It also removes the checkpoint
>>>>>>> counter but does not exit the job. And after a little bit the job is
>>>>>>> restarted, which does not make sense and is absolutely not the right thing to
>>>>>>> do ( to me at least ).
>>>>>>>
>>>>>>> Further if I delete the deployment and the job from k8s and restart
>>>>>>> the job and deployment fromSavePoint, it refuses to honor the
>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>> point.
>>>>>>>
>>>>>>>
>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>>>> cluster deployment  seems to be
>>>>>>>
>>>>>>>    - cancel with save point as defined here
>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>    - delete the job manager job and task manager deployments from
>>>>>>>    k8s almost immediately.
>>>>>>>    - clear the ZK chroot for the 0000000...... job and maybe the
>>>>>>>    checkpoints directory.
>>>>>>>    - restart with --fromSavepoint
>>>>>>>
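
A sketch of the "clear the ZK chroot" step above. The znode paths are lifted from the logs later in this message; the chroot prefix actually comes from the high-availability ZooKeeper path settings in flink-conf.yaml, so both paths here are assumptions to adjust. The script only prints the zkCli commands rather than executing them:

```shell
JOB="00000000000000000000000000000000"

# znodes that pin the old job's state in ZooKeeper (paths assumed from the logs)
ha_paths() {
  echo "/k8s_anomalyecho/k8s_anomalyecho/checkpoints/$1"
  echo "/checkpoint-counter/$1"
}

# print, do not run: 'deleteall' is ZooKeeper 3.5+; on 3.4 the command is 'rmr'
ha_paths "$JOB" | while read -r p; do
  echo "zkCli.sh -server zk:2181 deleteall $p"
done
```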
>>>>>>> Can somebody confirm that this indeed is the process ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  Logs are attached.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>> Savepoint stored in
>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>
>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>>>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>>>>>>> to CANCELLING.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>     - Completed checkpoint 10 for job
>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>> CANCELLING to CANCELED.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>> 00000000000000000000000000000000.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>> - Shutting down
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>       - Checkpoint with ID 8 at
>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>> discarded.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>> - Removing
>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>> from ZooKeeper
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>       - Checkpoint with ID 10 at
>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>> discarded.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>> Shutting down.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>> Stopping the JobMaster for job
>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>> Shutting down rest endpoint.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>> /leader/resource_manager_lock.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>> JobManager is shutting down..
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>> Suspending SlotPool.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>> Stopping SlotPool.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>
>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>
>>>>>>>
>>>>>>> After a little bit
>>>>>>>
>>>>>>>
>>>>>>> Starting the job-cluster
>>>>>>>
>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>>> `jobmanager.heap.size`
>>>>>>>
>>>>>>> Starting standalonejob as a console application on host
>>>>>>> anomalyecho-mmg6t.
>>>>>>>
>>>>>>> ..
>>>>>>>
>>>>>>> ..
>>>>>>>
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vishal
>>>>>>>>
>>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>>> which is not a stable API. It always exits with 404. The best way to issue it is:
>>>>>>>>
>>>>>>>> a) First issue save point REST API
>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>> )
>>>>>>>> c) Then, after resuming your job, provide the save point path
>>>>>>>> returned by (a) as an argument to the run-jar REST API.
>>>>>>>> The above is the smoother way.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Gary Yao <ga...@ververica.com>.
Hi Vishal,

I'm afraid not, but there are open pull requests for that. You can track the
progress here:

    https://issues.apache.org/jira/browse/FLINK-9953

Best,
Gary

On Tue, Mar 12, 2019 at 3:32 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> :) That makes so much more sense. Is k8s-native Flink a part of this
> release ?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released with
>> 1.8. If
>> the Flink job gets cancelled, the JVM should exit with code 0. There is a
>> release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
I confirm that 1.8.0 fixes all the above issues. The JM process exits with
code 0 and the pod exits ( TERMINATED state ). The above is true for both
PATCH cancel and POST save point with cancel, as above.

Thank you for fixing this issue.


On Wed, Mar 13, 2019 at 10:17 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> BTW, does 1.8 also solve the issue where we cancel with a save point ?
> That too is broken in 1.7.2
>
> curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":*true*}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>
>
> On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi <
> vishal.santoshi@gmail.com> wrote:
>
>> Awesome, thanks!
>>
>> On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <ga...@ververica.com> wrote:
>>
>>> The RC artifacts are only deployed to the Maven Central Repository when
>>> the RC
>>> is promoted to a release. As written in the 1.8.0 RC1 voting email [1],
>>> you
>>> can find the maven artifacts, and the Flink binaries here:
>>>
>>>     -
>>> https://repository.apache.org/content/repositories/orgapacheflink-1210/
>>>     - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/
>>>
>>> Alternatively, you can apply the patch yourself, and build Flink 1.7 from
>>> sources [2]. On my machine this takes around 10 minutes if tests are
>>> skipped.
>>>
>>> Best,
>>> Gary
>>>
>>> [1]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
>>>
>>> On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> Do you have an mvn repository ( at mvn central ) set up for the 1.8 release
>>>> candidate. We could test it for you.
>>>>
>>>> Without 1.8 and this exit code fix we are essentially held up.
>>>>
>>>> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <ga...@ververica.com> wrote:
>>>>
>>>>> Nobody can tell with 100% certainty. We want to give the RC some
>>>>> exposure
>>>>> first, and there is also a release process that is prescribed by the
>>>>> ASF [1].
>>>>> You can look at past releases to get a feeling for how long the release
>>>>> process lasts [2].
>>>>>
>>>>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>>>>> [2]
>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> And when is the 1.8.0 release expected ?
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> :) That makes so much more sense. Is  k8s native flink a part of
>>>>>>> this release ?
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> This issue was fixed recently [1], and the patch will be released
>>>>>>>> with 1.8. If
>>>>>>>> the Flink job gets cancelled, the JVM should exit with code 0.
>>>>>>>> There is a
>>>>>>>> release candidate [2], which you can test.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Gary
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>>>>>> [2]
>>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Vijay,
>>>>>>>>>
>>>>>>>>> This is the larger issue.  The cancellation routine is itself
>>>>>>>>> broken.
>>>>>>>>>
>>>>>>>>> On cancellation flink does remove the checkpoint counter
>>>>>>>>>
>>>>>>>>> *2019-03-12 14:12:13,143
>>>>>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>>>>>> ZooKeeper *
>>>>>>>>>
>>>>>>>>> but exits with a non-zero code
>>>>>>>>>
>>>>>>>>> *2019-03-12 14:12:13,477
>>>>>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>>>>>>> exit code 1444.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That I think is an issue. A cancelled job is a complete job and
>>>>>>>>> thus the exit code should be 0 for k8s to mark it complete.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes Vishal. That's correct.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>>>>>> that this cannot be this painful. The cancel does not exit with an exit
>>>>>>>>>>> code of 0 and thus the job has to be deleted manually. Vijay, does this
>>>>>>>>>>> align with what you have had to do ?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Take a save point . This returns a request id
>>>>>>>>>>>
>>>>>>>>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Make sure the save point succeeded
>>>>>>>>>>>
>>>>>>>>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - cancel the job
>>>>>>>>>>>
>>>>>>>>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Delete the job and deployment
>>>>>>>>>>>
>>>>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>>>>
>>>>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>>>>>>
>>>>>>>>>>>    args: ["job-cluster",
>>>>>>>>>>>
>>>>>>>>>>>                   "--fromSavepoint",
>>>>>>>>>>>
>>>>>>>>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>>>>>                   "--job-classname", .........
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Restart
>>>>>>>>>>>
>>>>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>>>>
>>>>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Make sure from the UI, that it restored from the specific
>>>>>>>>>>>    save point.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, it's supposed to work. But unfortunately it was not
>>>>>>>>>>>> working. The Flink community needs to respond to this behavior.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Aah.
>>>>>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>>>>>>> atomic step, rather than a save point *followed* by a
>>>>>>>>>>>>> cancellation ( else why would that be an option ).
>>>>>>>>>>>>> Thanks again.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vishal,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> yarn-cancel is not meant only for YARN clusters. It works for
>>>>>>>>>>>>>> all clusters. It's the recommended command.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Use the following command to issue save point.
>>>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request
>>>>>>>>>>>>>> POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>>>>>>>>> "cancel-job":false}'  \ https://
>>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>>>>>> After that follow the process to restore the save point.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                Thank you for the reply. This though is k8s
>>>>>>>>>>>>>>> deployment ( rather than yarn ) but maybe they follow the same lifecycle.
>>>>>>>>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>>>>>> a straight up
>>>>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request
>>>>>>>>>>>>>>> POST --data
>>>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>>>>>>>>> \ https://
>>>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would assume that after taking the save point, the JVM
>>>>>>>>>>>>>>> should exit; after all, the k8s deployment is of kind: job, and if it
>>>>>>>>>>>>>>> is a job cluster then a cancellation should exit the JVM and hence the
>>>>>>>>>>>>>>> pod. It does seem to do some things right. It stops a bunch of
>>>>>>>>>>>>>>> components ( the JobMaster, the SlotPool, the ZooKeeper coordinator
>>>>>>>>>>>>>>> etc. ). It also removes the checkpoint counter but does not exit the
>>>>>>>>>>>>>>> job. And after a little bit the job is restarted, which does not make
>>>>>>>>>>>>>>> sense and is absolutely not the right thing to do ( to me at least ).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Further if I delete the deployment and the job from k8s and
>>>>>>>>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a
>>>>>>>>>>>>>>> k8s job cluster deployment  seems to be
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - cancel with save point as defined here
>>>>>>>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>>>>>>    - delete the job manager job and task manager
>>>>>>>>>>>>>>>    deployments from k8s almost immediately.
>>>>>>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and may
>>>>>>>>>>>>>>>    be the checkpoints directory.
>>>>>>>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can somebody confirm that this indeed is the process ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Logs are attached.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster
>>>>>>>>>>>>>>>   - Savepoint stored in
>>>>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>>   - Job anomaly_echo (00000000000000000000000000000000)
>>>>>>>>>>>>>>> switched from state RUNNING to CANCELLING.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>>   - Source: Barnacle Anomalies Kafka topic -> Map -> Sink:
>>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING
>>>>>>>>>>>>>>> to CANCELING.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>>   - Source: Barnacle Anomalies Kafka topic -> Map -> Sink:
>>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from
>>>>>>>>>>>>>>> CANCELING to CANCELED.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>>   - Job anomaly_echo (00000000000000000000000000000000)
>>>>>>>>>>>>>>> switched from state CANCELLING to CANCELED.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>>>>> - Shutting down
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>>>>>>> discarded.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>>>>> - Removing
>>>>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>>>>>>>> discarded.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>>>>> - Shutting down.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>>>>> - Removing
>>>>>>>>>>>>>>> /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher
>>>>>>>>>>>>>>>   - Job 00000000000000000000000000000000 reached globally
>>>>>>>>>>>>>>> terminal state CANCELED.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster
>>>>>>>>>>>>>>>   - Stopping the JobMaster for job
>>>>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint
>>>>>>>>>>>>>>> - Shutting down rest endpoint.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster
>>>>>>>>>>>>>>>   - Close ResourceManager connection
>>>>>>>>>>>>>>> d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool
>>>>>>>>>>>>>>>   - Suspending SlotPool.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool
>>>>>>>>>>>>>>>   - Stopping SlotPool.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for
>>>>>>>>>>>>>>> job 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After a little bit
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace
>>>>>>>>>>>>>>> with key `jobmanager.heap.size`
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST
>>>>>>>>>>>>>>>> API, which is not a stable API. It always exits with 404. The best
>>>>>>>>>>>>>>>> way to issue it is:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> a) First issue save point REST API
>>>>>>>>>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>>>>>>> ( returned by (a) ) as an argument to the run-jar REST API
>>>>>>>>>>>>>>>> Above is the smoother way
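[Editor's note: a step is implied between (a) and (b): waiting for the savepoint trigger to complete and capturing its path. A minimal sketch of that polling step follows; it is not from the thread. The sed-based parser assumes the single-line JSON shape Flink returns, and a real client should prefer jq.]

```shell
set -eu

# Extract the savepoint "location" field from a completed
# /jobs/:jobid/savepoints/:requestid response. Assumes single-line JSON.
savepoint_location() {  # $1 = JSON response body
  printf '%s\n' "$1" | sed -n 's/.*"location":"\([^"]*\)".*/\1/p'
}

# Poll the trigger status URL until the savepoint completes, then print its
# path; returns non-zero if the savepoint failed.
wait_for_savepoint() {  # $1 = status URL, e.g. <jm>/jobs/<id>/savepoints/<req>
  local resp
  while true; do
    resp=$(curl -s "$1")
    case "$resp" in
      *'"COMPLETED"'*) savepoint_location "$resp"; return 0 ;;
      *'"FAILED"'*)    return 1 ;;
    esac
    sleep 2
  done
}
```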
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are some issues I see and would want to get some
>>>>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory
>>>>>>>>>>>>>>>>> , the k8s  job  does not exit ( it is not a deployment ) . I would assume
>>>>>>>>>>>>>>>>> that on cancellation the jvm should exit, after cleanup etc, and thus the
>>>>>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod remains live. Is
>>>>>>>>>>>>>>>>> that expected ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to
>>>>>>>>>>>>>>>>> delete the job id ( 0000000000.... )  from ZooKeeper ( this is HA ), else
>>>>>>>>>>>>>>>>> it defaults to the latest checkpoint no matter what
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>>>>>> process of cancelling with a save point and resuming  and what is the
>>>>>>>>>>>>>>>>> cogent story around job id ( defaults to 000000000000.. ). Note that
>>>>>>>>>>>>>>>>> --job-id does not work with 1.7.2 so even though that does not make sense,
>>>>>>>>>>>>>>>>> I still can not provide a new job id.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
BTW, does 1.8 also solve the issue where we cancel with a save point ?
That too is broken in 1.7.2

curl  --header "Content-Type: application/json" --request POST --data
'{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":*true*}'
   https://*************/jobs/00000000000000000000000000000000/savepoints


On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> Awesome, thanks!
>
> On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <ga...@ververica.com> wrote:
>
>> The RC artifacts are only deployed to the Maven Central Repository when
>> the RC
>> is promoted to a release. As written in the 1.8.0 RC1 voting email [1],
>> you
>> can find the maven artifacts, and the Flink binaries here:
>>
>>     -
>> https://repository.apache.org/content/repositories/orgapacheflink-1210/
>>     - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/
>>
>> Alternatively, you can apply the patch yourself, and build Flink 1.7 from
>> sources [2]. On my machine this takes around 10 minutes if tests are
>> skipped.
>>
>> Best,
>> Gary
>>
>> [1]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
>>
>> On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> Do you have an mvn repository ( at mvn central ) set up for the 1.8 release
>>> candidate. We could test it for you.
>>>
>>> Without 1.8 and this exit code fix we are essentially held up.
>>>
>>> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <ga...@ververica.com> wrote:
>>>
>>>> Nobody can tell with 100% certainty. We want to give the RC some
>>>> exposure
>>>> first, and there is also a release process that is prescribed by the
>>>> ASF [1].
>>>> You can look at past releases to get a feeling for how long the release
>>>> process lasts [2].
>>>>
>>>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>>>> [2]
>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>>>
>>>>
>>>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> And when is the 1.8.0 release expected ?
>>>>>
>>>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> :) That makes so much more sense. Is  k8s native flink a part of this
>>>>>> release ?
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> This issue was fixed recently [1], and the patch will be released
>>>>>>> with 1.8. If
>>>>>>> the Flink job gets cancelled, the JVM should exit with code 0. There
>>>>>>> is a
>>>>>>> release candidate [2], which you can test.
>>>>>>>
>>>>>>> Best,
>>>>>>> Gary
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>>>>> [2]
>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Vijay,
>>>>>>>>
>>>>>>>> This is the larger issue.  The cancellation routine is itself
>>>>>>>> broken.
>>>>>>>>
>>>>>>>> On cancellation flink does remove the checkpoint counter
>>>>>>>>
>>>>>>>> *2019-03-12 14:12:13,143
>>>>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>>>>> ZooKeeper *
>>>>>>>>
>>>>>>>> but exits with a non-zero code
>>>>>>>>
>>>>>>>> *2019-03-12 14:12:13,477
>>>>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>>>>>> exit code 1444.*
>>>>>>>>
>>>>>>>>
>>>>>>>> That I think is an issue. A cancelled job is a complete job and
>>>>>>>> thus the exit code should be 0 for k8s to mark it complete.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes Vishal. That's correct.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>>>>> that this cannot be this painful. The cancel does not exit with an exit
>>>>>>>>>> code of 0 and thus the job has to be deleted manually. Vijay, does this
>>>>>>>>>> align with what you have had to do ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Take a save point . This returns a request id
>>>>>>>>>>
>>>>>>>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Make sure the save point succeeded
>>>>>>>>>>
>>>>>>>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - cancel the job
>>>>>>>>>>
>>>>>>>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Delete the job and deployment
>>>>>>>>>>
>>>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>>>
>>>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>>>>>
>>>>>>>>>>    args: ["job-cluster",
>>>>>>>>>>
>>>>>>>>>>                   "--fromSavepoint",
>>>>>>>>>>
>>>>>>>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>>>>                   "--job-classname", .........
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Restart
>>>>>>>>>>
>>>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>>>
>>>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Make sure from the UI, that it restored from the specific
>>>>>>>>>>    save point.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, it's supposed to work. But unfortunately it was not working.
>>>>>>>>>>> The Flink community needs to respond to this behavior.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Aah.
>>>>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>>>>>> atomic step, rather than a save point *followed* by a
>>>>>>>>>>>> cancellation ( else why would that be an option ).
>>>>>>>>>>>> Thanks again.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Vishal,
>>>>>>>>>>>>>
>>>>>>>>>>>>> yarn-cancel is not meant only for YARN clusters. It works for
>>>>>>>>>>>>> all clusters. It's the recommended command.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Use the following command to issue save point.
>>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request
>>>>>>>>>>>>> POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>>>>>>>> "cancel-job":false}'  \ https://
>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>>>>> After that follow the process to restore the save point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                Thank you for the reply. This though is k8s
>>>>>>>>>>>>>> deployment ( rather than yarn ) but maybe they follow the same lifecycle.
>>>>>>>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>>>>> a straight up
>>>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request
>>>>>>>>>>>>>> POST --data
>>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>>>>>>>> \ https://
>>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would assume that after taking the save point, the JVM
>>>>>>>>>>>>>> should exit; after all, the k8s deployment is of kind: job, and if it
>>>>>>>>>>>>>> is a job cluster then a cancellation should exit the JVM and hence the
>>>>>>>>>>>>>> pod. It does seem to do some things right. It stops a bunch of
>>>>>>>>>>>>>> components ( the JobMaster, the SlotPool, the ZooKeeper coordinator
>>>>>>>>>>>>>> etc. ). It also removes the checkpoint counter but does not exit the
>>>>>>>>>>>>>> job. And after a little bit the job is restarted, which does not make
>>>>>>>>>>>>>> sense and is absolutely not the right thing to do ( to me at least ).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Further if I delete the deployment and the job from k8s and
>>>>>>>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a
>>>>>>>>>>>>>> k8s job cluster deployment  seems to be
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - cancel with save point as defined here
>>>>>>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>>>>>    - delete the job manager job and task manager deployments
>>>>>>>>>>>>>>    from k8s almost immediately.
>>>>>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and may
>>>>>>>>>>>>>>    be the checkpoints directory.
>>>>>>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can somebody confirm that this indeed is the process ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Logs are attached.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster
>>>>>>>>>>>>>>   - Savepoint stored in
>>>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>   - Job anomaly_echo (00000000000000000000000000000000)
>>>>>>>>>>>>>> switched from state RUNNING to CANCELLING.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>   - Source: Barnacle Anomalies Kafka topic -> Map -> Sink:
>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING
>>>>>>>>>>>>>> to CANCELING.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>   - Source: Barnacle Anomalies Kafka topic -> Map -> Sink:
>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from
>>>>>>>>>>>>>> CANCELING to CANCELED.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>>>>>>>   - Job anomaly_echo (00000000000000000000000000000000)
>>>>>>>>>>>>>> switched from state CANCELLING to CANCELED.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>>>> - Shutting down
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>>>>>> discarded.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>>>> - Removing
>>>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>>>>>>> discarded.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>>>> - Shutting down.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>>>> - Removing
>>>>>>>>>>>>>> /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher
>>>>>>>>>>>>>>   - Job 00000000000000000000000000000000 reached globally
>>>>>>>>>>>>>> terminal state CANCELED.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster
>>>>>>>>>>>>>>   - Stopping the JobMaster for job
>>>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint
>>>>>>>>>>>>>> - Shutting down rest endpoint.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster
>>>>>>>>>>>>>>   - Close ResourceManager connection
>>>>>>>>>>>>>> d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool
>>>>>>>>>>>>>>   - Suspending SlotPool.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool
>>>>>>>>>>>>>>   - Stopping SlotPool.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After a little bit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with
>>>>>>>>>>>>>> key `jobmanager.heap.size`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Save point with cancellation internally uses the  /cancel  REST
>>>>>>>>>>>>>>> API, which is not a stable API.  It always exits with 404. The best way to
>>>>>>>>>>>>>>> issue it is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> a) First issue save point REST API
>>>>>>>>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>>>>>> returned by (a) as an argument to the run-jar REST API
>>>>>>>>>>>>>>> Above is the smoother way
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are some issues I see and would want to get some
>>>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory ,
>>>>>>>>>>>>>>>> the k8s  job  does not exit ( it is not a deployment ) . I would assume
>>>>>>>>>>>>>>>> that on cancellation the jvm should exit, after cleanup etc, and thus the
>>>>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod remains live. Is
>>>>>>>>>>>>>>>> that expected ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to
>>>>>>>>>>>>>>>> delete the job id ( 0000000000.... )  from ZooKeeper ( this is HA ), else
>>>>>>>>>>>>>>>> it defaults to the latest checkpoint no matter what
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>>>>> process of cancelling with a save point and resuming  and what is the
>>>>>>>>>>>>>>>> cogent story around job id ( defaults to 000000000000.. ). Note that
>>>>>>>>>>>>>>>> --job-id does not work with 1.7.2 so even though that does not make sense,
>>>>>>>>>>>>>>>> I still can not provide a new job id.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
Awesome, thanks!

On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <ga...@ververica.com> wrote:

> The RC artifacts are only deployed to the Maven Central Repository when
> the RC
> is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you
> can find the maven artifacts, and the Flink binaries here:
>
>     -
> https://repository.apache.org/content/repositories/orgapacheflink-1210/
>     - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/
>
> Alternatively, you can apply the patch yourself, and build Flink 1.7 from
> sources [2]. On my machine this takes around 10 minutes if tests are
> skipped.
>
> Best,
> Gary
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
>
> On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vi...@gmail.com>
> wrote:
>
>> Do you have a mvn repository ( at mvn central )  set up for the 1.8 release
>> candidate? We could test it for you.
>>
>> Without 1.8 and this exit-code fix we are essentially held up.
>>
>> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <ga...@ververica.com> wrote:
>>
>>> Nobody can tell with 100% certainty. We want to give the RC some exposure
>>> first, and there is also a release process that is prescribed by the ASF
>>> [1].
>>> You can look at past releases to get a feeling for how long the release
>>> process lasts [2].
>>>
>>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>>> [2]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>>
>>>
>>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> And when is the 1.8.0 release expected ?
>>>>
>>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> :) That makes so much more sense. Is  k8s native flink a part of this
>>>>> release ?
>>>>>
>>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> This issue was fixed recently [1], and the patch will be released
>>>>>> with 1.8. If
>>>>>> the Flink job gets cancelled, the JVM should exit with code 0. There
>>>>>> is a
>>>>>> release candidate [2], which you can test.
>>>>>>
>>>>>> Best,
>>>>>> Gary
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>>>> [2]
>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Vijay,
>>>>>>>
>>>>>>> This is the larger issue.  The cancellation routine is itself broken.
>>>>>>>
>>>>>>> On cancellation flink does remove the checkpoint counter
>>>>>>>
>>>>>>> *2019-03-12 14:12:13,143
>>>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>>>> ZooKeeper *
>>>>>>>
>>>>>>> but exits with a non-zero code
>>>>>>>
>>>>>>> *2019-03-12 14:12:13,477
>>>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>>>>> exit code 1444.*
>>>>>>>
>>>>>>>
>>>>>>> That I think is an issue. A cancelled job is a complete job and thus
>>>>>>> the exit code should be 0 for k8s to mark it complete.
>>>>>>>
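[Editorial note: the desired behavior described above can be sketched as a tiny, hypothetical mapping. The function and status names below are illustrative (the names follow Flink's ApplicationStatus enum); this is not Flink's actual implementation, only the semantics being asked for: a globally terminal CANCELED or SUCCEEDED job should let the pod exit 0 so Kubernetes marks the Job complete.]

```python
# Hypothetical sketch of the entrypoint exit-code mapping being requested:
# jobs that terminate normally (including CANCELED) should exit 0 so a
# k8s Job is marked complete. In 1.7.2 the entrypoint instead exited
# non-zero (1444 in the log above).
def entrypoint_exit_code(application_status):
    normal_termination = {"SUCCEEDED", "CANCELED"}
    return 0 if application_status in normal_termination else 1

print(entrypoint_exit_code("CANCELED"))  # 0 -- k8s would mark the Job complete
```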
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes Vishal. That's correct.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>>>> that it should not be this painful. The cancel does not exit with an exit
>>>>>>>>> code of 0, so the job has to be deleted manually. Vijay, does this align
>>>>>>>>> with what you have had to do ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Take a save point . This returns a request id
>>>>>>>>>
>>>>>>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Make sure the save point succeeded
>>>>>>>>>
>>>>>>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - cancel the job
>>>>>>>>>
>>>>>>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Delete the job and deployment
>>>>>>>>>
>>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>>
>>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>>>>
>>>>>>>>>    args: ["job-cluster",
>>>>>>>>>
>>>>>>>>>                   "--fromSavepoint",
>>>>>>>>>
>>>>>>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>>>                   "--job-classname", .........
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Restart
>>>>>>>>>
>>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>>
>>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Make sure from the UI, that it restored from the specific
>>>>>>>>>    save point.
>>>>>>>>>
>>>>>>>>>
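[Editorial note: the REST portion of the sequence quoted above can be sketched as ordered calls. Nothing is executed here; the base URL, target directory, and trigger id are placeholders, not values from this thread. The job id is the fixed all-zero id a 1.7 standalone job cluster assigns.]

```python
# Build (but do not send) the REST calls for savepoint -> poll -> cancel.
import json

BASE_URL = "https://flink-jobmanager.example.com"  # placeholder host
JOB_ID = "0" * 32  # fixed job id of a 1.7 standalone job cluster

def plan_cancel_with_savepoint(target_dir, trigger_id="<trigger-id>"):
    """Return (method, path, body) tuples in the order they would be issued."""
    return [
        # 1. Trigger the savepoint; cancellation is done separately below.
        ("POST", f"/jobs/{JOB_ID}/savepoints",
         json.dumps({"target-directory": target_dir, "cancel-job": False})),
        # 2. Poll the trigger handle until the savepoint status is COMPLETED.
        ("GET", f"/jobs/{JOB_ID}/savepoints/{trigger_id}", None),
        # 3. Cancel the job only after the savepoint is confirmed.
        ("PATCH", f"/jobs/{JOB_ID}?mode=cancel", None),
    ]

for method, path, _ in plan_cancel_with_savepoint("hdfs://namenode:8020/tmp/savepoints"):
    print(method, BASE_URL + path)
```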
>>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, it's supposed to work.  But unfortunately it was not working.
>>>>>>>>>> Flink community needs to respond to this behavior.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Aah.
>>>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>>>>> atomic step, rather than a save point *followed*  by a
>>>>>>>>>>> cancellation ( else why would that be an option ).
>>>>>>>>>>> Thanks again.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vishal,
>>>>>>>>>>>>
>>>>>>>>>>>> yarn-cancel isn't meant only for yarn clusters. It works for
>>>>>>>>>>>> all clusters. It's the recommended command.
>>>>>>>>>>>>
>>>>>>>>>>>> Use the following command to issue save point.
>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>>>>>> --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>>>>>>> "cancel-job":false}'  \ https://
>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>
>>>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>>>> After that  follow the process to restore save point
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>>>
>>>>>>>>>>>>>                Thank you for the reply. This though is k8s
>>>>>>>>>>>>> deployment ( rather than yarn ) but maybe they follow the same lifecycle.
>>>>>>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>>>> a straight up
>>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request
>>>>>>>>>>>>> POST --data
>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>>>>>>> \ https://
>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would assume that after taking the save point, the jvm
>>>>>>>>>>>>> should exit; after all, the k8s deployment is of kind: job, and if it is a
>>>>>>>>>>>>> job cluster then a cancellation should exit the jvm and hence the pod. It
>>>>>>>>>>>>> does seem to do some things right. It stops a bunch of stuff ( the
>>>>>>>>>>>>> JobMaster, the SlotPool, the zookeeper coordinator etc ). It also removes the
>>>>>>>>>>>>> checkpoint counter but does not exit the job. And after a little bit the
>>>>>>>>>>>>> job is restarted, which does not make sense and is absolutely not the right
>>>>>>>>>>>>> thing to do  ( to me at least ).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Further if I delete the deployment and the job from k8s and
>>>>>>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>>>>>>> point.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s
>>>>>>>>>>>>> job cluster deployment  seems to be
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - cancel with save point as defined hre
>>>>>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>>>>    - delete the job manger job  and task manager deployments
>>>>>>>>>>>>>    from k8s almost immediately.
>>>>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and may
>>>>>>>>>>>>>    be the checkpoints directory.
>>>>>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can somebody confirm that this is indeed the process ?
>>>>>>>>>>>>>
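[Editorial note: the delete -> clear-HA-state -> resume cycle listed above can be sketched as a command plan. All manifest names, the ZooKeeper chroot, and the job class name below are placeholders, and the zkCli command differs by ZooKeeper version (`rmr` in 3.4, `deleteall` in 3.5+); nothing here is executed.]

```python
# Build (but do not run) the commands for the quoted resume-from-savepoint
# procedure. Every file name, chroot, and class name is a placeholder.
def resume_plan(savepoint_path, zk_chroot="/flink-ha"):
    return {
        "teardown": [
            "kubectl delete -f job-cluster-job-deployment.yaml",
            "kubectl delete -f task-manager-deployment.yaml",
        ],
        # Clear HA state for the fixed all-zero job id so the restored run
        # honors --fromSavepoint instead of the latest checkpoint.
        "clear_ha": ["zkCli.sh rmr %s" % zk_chroot],  # 'deleteall' on ZK 3.5+
        # Entrypoint args to put in the job-cluster container spec.
        "entrypoint_args": [
            "job-cluster",
            "--fromSavepoint", savepoint_path,
            "--job-classname", "com.example.AnomalyEcho",  # placeholder class
        ],
        "restart": [
            "kubectl create -f job-cluster-job-deployment.yaml",
            "kubectl create -f task-manager-deployment.yaml",
        ],
    }

plan = resume_plan("hdfs://namenode:8020/tmp/savepoints/savepoint-000000-abcdef")
print(plan["entrypoint_args"])
```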
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Logs are attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>>> Savepoint stored in
>>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>>> - Shutting down
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>>>>> discarded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>>> - Removing
>>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>>>>>> discarded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>>> - Shutting down.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>>> - Removing
>>>>>>>>>>>>> /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>>>>>>> CANCELED.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint
>>>>>>>>>>>>> - Shutting down rest endpoint.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>>>>>> JobManager is shutting down..
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>>>> Suspending SlotPool.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>>>> Stopping SlotPool.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> After a little bit
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>>>>
>>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with
>>>>>>>>>>>>> key `jobmanager.heap.size`
>>>>>>>>>>>>>
>>>>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ..
>>>>>>>>>>>>>
>>>>>>>>>>>>> ..
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Save point with cancellation internally uses the  /cancel  REST
>>>>>>>>>>>>>> API, which is not a stable API.  It always exits with 404. The best way to
>>>>>>>>>>>>>> issue it is:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> a) First issue save point REST API
>>>>>>>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>>>>>>> )
>>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>>>>> returned by (a) as an argument to the run-jar REST API
>>>>>>>>>>>>>> Above is the smoother way
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are some issues I see and would want to get some
>>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory ,
>>>>>>>>>>>>>>> the k8s  job  does not exit ( it is not a deployment ) . I would assume
>>>>>>>>>>>>>>> that on cancellation the jvm should exit, after cleanup etc, and thus the
>>>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod remains live. Is
>>>>>>>>>>>>>>> that expected ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete
>>>>>>>>>>>>>>> the job id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it
>>>>>>>>>>>>>>> defaults to the latest checkpoint no matter what
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>>>> process of cancelling with a save point and resuming  and what is the
>>>>>>>>>>>>>>> cogent story around job id ( defaults to 000000000000.. ). Note that
>>>>>>>>>>>>>>> --job-id does not work with 1.7.2 so even though that does not make sense,
>>>>>>>>>>>>>>> I still can not provide a new job id.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Gary Yao <ga...@ververica.com>.
The RC artifacts are only deployed to the Maven Central Repository when the
RC
is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you
can find the maven artifacts, and the Flink binaries here:

    -
https://repository.apache.org/content/repositories/orgapacheflink-1210/
    - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself, and build Flink 1.7 from
sources [2]. On my machine this takes around 10 minutes if tests are
skipped.

Best,
Gary

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink

On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> Do you have a mvn repository ( at mvn central )  set up for the 1.8 release
> candidate? We could test it for you.
>
> Without 1.8 and this exit-code fix we are essentially held up.
>
> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <ga...@ververica.com> wrote:
>
>> Nobody can tell with 100% certainty. We want to give the RC some exposure
>> first, and there is also a release process that is prescribed by the ASF
>> [1].
>> You can look at past releases to get a feeling for how long the release
>> process lasts [2].
>>
>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>
>>
>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> And when is the 1.8.0 release expected ?
>>>
>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> :) That makes so much more sense. Is  k8s native flink a part of this
>>>> release ?
>>>>
>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> This issue was fixed recently [1], and the patch will be released with
>>>>> 1.8. If
>>>>> the Flink job gets cancelled, the JVM should exit with code 0. There
>>>>> is a
>>>>> release candidate [2], which you can test.
>>>>>
>>>>> Best,
>>>>> Gary
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>>> [2]
>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> Thanks Vijay,
>>>>>>
>>>>>> This is the larger issue.  The cancellation routine is itself broken.
>>>>>>
>>>>>> On cancellation flink does remove the checkpoint counter
>>>>>>
>>>>>> *2019-03-12 14:12:13,143
>>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>>> ZooKeeper *
>>>>>>
>>>>>> but exits with a non-zero code
>>>>>>
>>>>>> *2019-03-12 14:12:13,477
>>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>>>> exit code 1444.*
>>>>>>
>>>>>>
>>>>>> That I think is an issue. A cancelled job is a complete job and thus
>>>>>> the exit code should be 0 for k8s to mark it complete.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>
>>>>>>> Yes Vishal. That's correct.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>>> that it should not be this painful. The cancel does not exit with an exit
>>>>>>>> code of 0, so the job has to be deleted manually. Vijay, does this align
>>>>>>>> with what you have had to do ?
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Take a save point . This returns a request id
>>>>>>>>
>>>>>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Make sure the save point succeeded
>>>>>>>>
>>>>>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - cancel the job
>>>>>>>>
>>>>>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Delete the job and deployment
>>>>>>>>
>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>
>>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>>>
>>>>>>>>    args: ["job-cluster",
>>>>>>>>
>>>>>>>>                   "--fromSavepoint",
>>>>>>>>
>>>>>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>>                   "--job-classname", .........
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Restart
>>>>>>>>
>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>
>>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Make sure from the UI, that it restored from the specific
>>>>>>>>    save point.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, it's supposed to work.  But unfortunately it was not working.
>>>>>>>>> Flink community needs to respond to this behavior.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Aah.
>>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>>>> atomic step, rather than a save point *followed* by a
>>>>>>>>>> cancellation ( else why would that be an option ).
>>>>>>>>>> Thanks again.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vishal,
>>>>>>>>>>>
>>>>>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for
>>>>>>>>>>> all clusters, and it's the recommended command.
>>>>>>>>>>>
>>>>>>>>>>> Use the following command to issue save point.
>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>>>>> --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>>>>>> "cancel-job":false}'  \ https://
>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>
>>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>>> After that, follow the process to restore the save point.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>>
>>>>>>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>>>>>>> deployment ( rather than yarn ), but maybe they follow the same lifecycle.
>>>>>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>>> a straight up
>>>>>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>>>>>> --data
>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>>>>>> \ https://
>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>
>>>>>>>>>>>> I would assume that after taking the save point, the jvm should
>>>>>>>>>>>> exit; after all, the k8s deployment is of kind: job, and if it is a job
>>>>>>>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It does
>>>>>>>>>>>> seem to do some things right. It stops a bunch of stuff ( the JobMaster,
>>>>>>>>>>>> the SlotPool, the zookeeper coordinator etc ). It also removes the checkpoint
>>>>>>>>>>>> counter but does not exit the job. And after a little bit the job is
>>>>>>>>>>>> restarted, which does not make sense and is absolutely not the right thing to
>>>>>>>>>>>> do ( to me at least ).
>>>>>>>>>>>>
>>>>>>>>>>>> Further, if I delete the deployment and the job from k8s and
>>>>>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>>>>>> point.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s
>>>>>>>>>>>> job cluster deployment  seems to be
>>>>>>>>>>>>
>>>>>>>>>>>>    - cancel with save point as defined here
>>>>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>>>    - delete the job manger job  and task manager deployments
>>>>>>>>>>>>    from k8s almost immediately.
>>>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and maybe
>>>>>>>>>>>>    the checkpoints directory.
>>>>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>>>>
>>>>>>>>>>>> Can somebody confirm that this is indeed the process?
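For the "clear the ZK chroot" step above, a small helper that only prints the zkCli commands for review (rather than executing them) can avoid typos. The znode layout below is an assumption inferred from the HA paths in the logs quoted in this thread; adjust it to your own `high-availability.zookeeper.path.root` setup before piping anything into `zkCli.sh`:

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the zkCli.sh commands needed
# to clear a job's HA state before resuming from a savepoint. The znode
# paths mirror the ones in this thread's logs and are an assumption about
# this particular HA configuration.

zk_cleanup_cmds() {
  ha_root=$1   # e.g. /k8s_anomalyecho/k8s_anomalyecho
  job_id=$2    # e.g. 00000000000000000000000000000000
  echo "rmr ${ha_root}/checkpoints/${job_id}"
  echo "rmr ${ha_root}/checkpoint-counter/${job_id}"
  echo "rmr ${ha_root}/leader/${job_id}"
}

zk_cleanup_cmds /k8s_anomalyecho/k8s_anomalyecho 00000000000000000000000000000000
```

Eyeball the printed commands first, then feed them to `zkCli.sh` by hand.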
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Logs are attached.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>> Savepoint stored in
>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>> - Shutting down
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>>>> discarded.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>> - Removing
>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>>>>> discarded.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>> - Shutting down.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>> - Removing /checkpoint-counter/00000000000000000000000000000000
>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>>>>>> CANCELED.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>>>>>> Shutting down rest endpoint.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>>>>> JobManager is shutting down..
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>>> Suspending SlotPool.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>>> Stopping SlotPool.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After a little bit
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>>>
>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with
>>>>>>>>>>>> key `jobmanager.heap.size`
>>>>>>>>>>>>
>>>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>>>
>>>>>>>>>>>> ..
>>>>>>>>>>>>
>>>>>>>>>>>> ..
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>>
>>>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST
>>>>>>>>>>>>> API, which is not a stable API. It always exits with 404. The best
>>>>>>>>>>>>> way to issue it is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) First issue save point REST API
>>>>>>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>>>>>> )
>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>>>> returned by (a) as an argument to the run-jar REST API.
>>>>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are some issues I see and would want to get some
>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory ,
>>>>>>>>>>>>>> the k8s  job  does not exit ( it is not a deployment ) . I would assume
>>>>>>>>>>>>>> that on cancellation the jvm should exit, after cleanup etc, and thus the
>>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod remains live. Is
>>>>>>>>>>>>>> that expected ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete
>>>>>>>>>>>>>> the job id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it
>>>>>>>>>>>>>> defaults to the latest checkpoint no matter what
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>>> process of cancelling with a save point and resuming  and what is the
>>>>>>>>>>>>>> cogent story around job id ( defaults to 000000000000.. ). Note that
>>>>>>>>>>>>>> --job-id does not work with 1.7.2, so even though that does not make sense,
>>>>>>>>>>>>>> I still cannot provide a new job id.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
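The trigger-then-poll part of the quoted procedure returns small JSON bodies. Below is a sketch of extracting the trigger request id and the completion status with plain sed; the canned responses are stand-ins shaped like Flink 1.7's REST replies, not output from a live cluster, and the savepoint location is a made-up path:

```shell
#!/bin/sh
# Sketch of parsing the savepoint trigger/status replies from the thread's
# curl steps. The JSON below is canned sample data, not a live response.

# Extract the trigger request id from the POST /jobs/:jobid/savepoints reply.
extract_request_id() {
  # Input shape: {"request-id":"2c053ce3bea31276aa25e63784629687"}
  sed -n 's/.*"request-id":"\([^"]*\)".*/\1/p'
}

# Extract the completion status from GET /jobs/:jobid/savepoints/:triggerid.
extract_status() {
  # Input shape: {"status":{"id":"COMPLETED"},"operation":{...}}
  sed -n 's/.*"status":{"id":"\([^"]*\)"}.*/\1/p'
}

trigger_reply='{"request-id":"2c053ce3bea31276aa25e63784629687"}'
status_reply='{"status":{"id":"COMPLETED"},"operation":{"location":"hdfs://nn/tmp/xyz14/savepoint-000000-1d4f71345e22"}}'

req_id=$(printf '%s' "$trigger_reply" | extract_request_id)
state=$(printf '%s' "$status_reply" | extract_status)
echo "request-id=$req_id status=$state"
```

In a real script you would loop on `extract_status` until it prints COMPLETED before moving on to the cancel and kubectl delete steps.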

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
Do you have a Maven repository ( at Maven Central ) set up for the 1.8 release
candidate? We could test it for you.

Without 1.8 and this exit code fix we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <ga...@ververica.com> wrote:

> Nobody can tell with 100% certainty. We want to give the RC some exposure
> first, and there is also a release process that is prescribed by the ASF
> [1].
> You can look at past releases to get a feeling for how long the release
> process lasts [2].
>
> [1] http://www.apache.org/legal/release-policy.html#release-approval
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>
>
> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vi...@gmail.com>
> wrote:
>
>> And when is the 1.8.0 release expected ?
>>
>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> :) That makes so much more sense. Is  k8s native flink a part of this
>>> release ?
>>>
>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> This issue was fixed recently [1], and the patch will be released with
>>>> 1.8. If
>>>> the Flink job gets cancelled, the JVM should exit with code 0. There is
>>>> a
>>>> release candidate [2], which you can test.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>> [2]
>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>
>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> Thanks Vijay,
>>>>>
>>>>> This is the larger issue.  The cancellation routine is itself broken.
>>>>>
>>>>> On cancellation flink does remove the checkpoint counter
>>>>>
>>>>> *2019-03-12 14:12:13,143
>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>> ZooKeeper *
>>>>>
>>>>> but exits with a non-zero code
>>>>>
>>>>> *2019-03-12 14:12:13,477
>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>>> exit code 1444.*
>>>>>
>>>>>
>>>>> That I think is an issue. A cancelled job is a complete job and thus
>>>>> the exit code should be 0 for k8s to mark it complete.
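Until the 1.8 fix is available, one blunt workaround for the k8s Job semantics discussed here is to wrap the entrypoint and remap its non-zero exit on cancellation to 0 so the Job pod is marked complete. Everything below is hypothetical: the stand-in function only simulates the observed behavior, and note that this remap cannot tell a cancellation apart from a genuine failure:

```shell
#!/bin/sh
# Hypothetical wrapper around a pre-1.8 Flink job-cluster entrypoint.
# A cancelled job exits non-zero, so a k8s Job pod never completes cleanly;
# this remaps the non-zero exit to 0.

# Stand-in for the real standalone-job.sh invocation; it only simulates
# the non-zero exit observed in the logs.
run_entrypoint() {
  echo "Terminating cluster entrypoint process StandaloneJobClusterEntryPoint."
  return 164
}

# Map a non-zero entrypoint exit code to 0. CAUTION: this also masks
# genuine failures; the real fix is FLINK-10743, shipped in 1.8.
remap_exit_code() {
  if [ "$1" -ne 0 ]; then
    echo 0
  else
    echo "$1"
  fi
}

run_entrypoint
final=$(remap_exit_code $?)
echo "wrapper exiting with code $final"
```

This is only a stopgap for a cancel-then-redeploy workflow; upgrading to 1.8 is the proper answer.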
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>
>>>>>> Yes Vishal. That's correct.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>> that it cannot be this painful. The cancel does not exit with an exit
>>>>>>> code of 0, and thus the job has to be deleted manually. Vijay, does this
>>>>>>> align with what you have had to do ?

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Gary Yao <ga...@ververica.com>.
Nobody can tell with 100% certainty. We want to give the RC some exposure
first, and there is also a release process that is prescribed by the ASF
[1].
You can look at past releases to get a feeling for how long the release
process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0


On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> And when is the 1.8.0 release expected ?
>
> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
> vishal.santoshi@gmail.com> wrote:
>
>> :) That makes so much more sense. Is  k8s native flink a part of this
>> release ?
>>
>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>>
>>> Hi Vishal,
>>>
>>> This issue was fixed recently [1], and the patch will be released with
>>> 1.8. If
>>> the Flink job gets cancelled, the JVM should exit with code 0. There is a
>>> release candidate [2], which you can test.
>>>
>>> Best,
>>> Gary
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>> [2]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>
>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> Thanks Vijay,
>>>>
>>>> This is the larger issue.  The cancellation routine is itself broken.
>>>>
>>>> On cancellation flink does remove the checkpoint counter
>>>>
>>>> *2019-03-12 14:12:13,143
>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>> ZooKeeper *
>>>>
>>>> but exist with a non zero code
>>>>
>>>> *2019-03-12 14:12:13,477
>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>> exit code 1444.*
>>>>
>>>>
>>>> That I think is an issue. A cancelled job is a complete job and thus
>>>> the exit code should be 0 for k8s to mark it complete.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>
>>>>> Yes Vishal. Thats correct.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> This really not cool but here you go. This seems to work. Agreed that
>>>>>> this cannot be this painful. The cancel does not exit with an exit code pf
>>>>>> 0 and thus the job has to manually delete. Vijay does this align with what
>>>>>> you have had to do ?
>>>>>>
>>>>>>
>>>>>>    - Take a save point . This returns a request id
>>>>>>
>>>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - Make sure the save point succeeded
>>>>>>
>>>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - cancel the job
>>>>>>
>>>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - Delete the job and deployment
>>>>>>
>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>
>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>
>>>>>>    args: ["job-cluster",
>>>>>>
>>>>>>                   "--fromSavepoint",
>>>>>>
>>>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>                   "--job-classname", .........
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - Restart
>>>>>>
>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>
>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - Make sure from the UI, that it restored from the specific save
>>>>>>    point.
>>>>>>
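Condensed, the sequence above is roughly the following sketch ( the endpoint, save point directory and manifest names below are placeholders, not the redacted values above ):

```shell
#!/bin/sh
# Sketch of the manual save-point-and-redeploy cycle (Flink 1.7 REST API).
# FLINK, SP_DIR and the manifest names are placeholders, not the real
# ( redacted ) values from this thread.
FLINK="https://flink-jm.example.com"
JOB_ID="00000000000000000000000000000000"
SP_DIR="hdfs:///tmp/savepoints"

savepoints_url() { printf '%s/jobs/%s/savepoints' "$FLINK" "$JOB_ID"; }
cancel_url()     { printf '%s/jobs/%s?mode=cancel' "$FLINK" "$JOB_ID"; }

run() {
  # 1. Trigger the save point without cancelling; the response carries a
  #    request id that should be polled until the save point completes.
  curl -s -H 'Content-Type: application/json' -X POST \
       -d "{\"target-directory\":\"$SP_DIR\",\"cancel-job\":false}" \
       "$(savepoints_url)"
  # 2. Cancel the job only once the save point is confirmed.
  curl -s -X PATCH "$(cancel_url)"
  # 3. Tear down, point the job args at the save point, and redeploy.
  kubectl delete -f job-cluster-job-deployment.yaml
  kubectl delete -f task-manager-deployment.yaml
  # ( edit yaml: args: ["job-cluster", "--fromSavepoint", "<path>", ...] )
  kubectl create -f job-cluster-job-deployment.yaml
  kubectl create -f task-manager-deployment.yaml
}
```

`run` is deliberately not invoked; the point is the ordering: trigger, verify, cancel, then redeploy with --fromSavepoint.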
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, it's supposed to work, but unfortunately it was not working. The
>>>>>>> Flink community needs to respond to this behavior.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> Aah.
>>>>>>>> Let me try this out and will get back to you.
>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>> atomic step, rather than a save point *followed* by a
>>>>>>>> cancellation ( else why would that be an option ).
>>>>>>>> Thanks again.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vishal,
>>>>>>>>>
>>>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for all
>>>>>>>>> clusters. It's the recommended command.
>>>>>>>>>
>>>>>>>>> Use the following command to issue save point.
>>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>>> --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>>>> "cancel-job":false}'  \ https://
>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>> After that  follow the process to restore save point
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Vijay,
>>>>>>>>>>
>>>>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>>>>> deployment ( rather than yarn ), but maybe they follow the same lifecycle.
>>>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>> a straight up
>>>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>>>> --data
>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>>>> \ https://
>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>
>>>>>>>>>> I would assume that after taking the save point, the jvm should
>>>>>>>>>> exit; after all, the k8s deployment is of kind: job, and if it is a job
>>>>>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It does
>>>>>>>>>> seem to do some things right. It stops a bunch of stuff ( the JobMaster,
>>>>>>>>>> the SlotPool, zookeeper coordinator etc ). It also removes the checkpoint
>>>>>>>>>> counter but does not exit the job. And after a little bit the job is
>>>>>>>>>> restarted, which does not make sense and is absolutely not the right thing
>>>>>>>>>> to do ( to me at least ).
>>>>>>>>>>
>>>>>>>>>> Further if I delete the deployment and the job from k8s and
>>>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>>>> point.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s
>>>>>>>>>> job cluster deployment  seems to be
>>>>>>>>>>
>>>>>>>>>>    - cancel with save point as defined hre
>>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>    - delete the job manager job and task manager deployments
>>>>>>>>>>    from k8s almost immediately.
>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and may be
>>>>>>>>>>    the checkpoints directory.
>>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>>
>>>>>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>>>>>
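The ZK cleanup in the third bullet would look roughly like this; ZK_ROOT follows the chroot seen in the logs below, and `deleteall` assumes a recent zkCli.sh, so treat it as a sketch and check your own high-availability.zookeeper.path.root. The script only prints the commands:

```shell
#!/bin/sh
# Print the zkCli.sh commands that would clear the HA state for the fixed
# job id before resuming from a save point. ZK_ROOT follows the chroot seen
# in the logs in this thread; adjust it to your flink-conf.yaml.
ZK_ROOT="/k8s_anomalyecho/k8s_anomalyecho"
JOB_ID="00000000000000000000000000000000"

zk_cleanup_cmds() {
  printf 'deleteall %s/checkpoints/%s\n'        "$ZK_ROOT" "$JOB_ID"
  printf 'deleteall %s/checkpoint-counter/%s\n' "$ZK_ROOT" "$JOB_ID"
  printf 'deleteall %s/leader/%s\n'             "$ZK_ROOT" "$JOB_ID"
}

# e.g.  zk_cleanup_cmds | zkCli.sh -server zookeeper:2181
zk_cleanup_cmds
```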
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Logs are attached.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>> Savepoint stored in
>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>> - Shutting down
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>> discarded.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>> - Removing
>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>> from ZooKeeper
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>>> discarded.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>> - Shutting down.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>> - Removing /checkpoint-counter/00000000000000000000000000000000
>>>>>>>>>> from ZooKeeper
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>>>> CANCELED.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>>>> Shutting down rest endpoint.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>>> JobManager is shutting down..
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>> Suspending SlotPool.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>> Stopping SlotPool.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> After a little bit
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>
>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>>>>>> `jobmanager.heap.size`
>>>>>>>>>>
>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>
>>>>>>>>>> ..
>>>>>>>>>>
>>>>>>>>>> ..
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>
>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>>>>>> which is not a stable API. It always exits with 404. The best way to issue it is:
>>>>>>>>>>>
>>>>>>>>>>> a) First issue save point REST API
>>>>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>>>> )
>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>> returned by (a) as an argument to the run-jar REST API.
>>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>>
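Assuming the 1.7 REST API, step (a) returns a trigger id whose status can be polled before issuing the cancel. A minimal sketch ( the endpoint and the "COMPLETED" marker in the async status payload are assumptions to verify against your version ):

```shell
#!/bin/sh
# Poll a save point trigger until Flink reports it finished; only then is it
# safe to cancel. FLINK is a placeholder endpoint, and the "COMPLETED" marker
# in the async status payload is an assumption to verify against your version.
FLINK="https://flink-jm.example.com"
JOB_ID="00000000000000000000000000000000"

is_completed() {
  case "$1" in *'"COMPLETED"'*) return 0 ;; *) return 1 ;; esac
}

wait_for_savepoint() {
  trigger_id="$1"
  while :; do
    body="$(curl -s "$FLINK/jobs/$JOB_ID/savepoints/$trigger_id")"
    if is_completed "$body"; then printf '%s\n' "$body"; return 0; fi
    sleep 2
  done
}
```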
>>>>>>>>>>> Regards
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>>>>>
>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the
>>>>>>>>>>>> k8s  job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>>>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>>>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>>>>>>>> expected ?
>>>>>>>>>>>>
>>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete
>>>>>>>>>>>> the job id ( 0000000000.... ) from ZooKeeper ( this is HA ), else it
>>>>>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process
>>>>>>>>>>>> of cancelling with a save point and resuming  and what is the cogent story
>>>>>>>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>>>>>>>> work with 1.7.2, so even though that does not make sense, I still cannot
>>>>>>>>>>>> provide a new job id.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>
>>>>>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
And when is the 1.8.0 release expected?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> :) That makes so much more sense. Is k8s native flink a part of this
> release?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released with
>> 1.8. If
>> the Flink job gets cancelled, the JVM should exit with code 0. There is a
>> release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>
>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> Thanks Vijay,
>>>
>>> This is the larger issue.  The cancellation routine is itself broken.
>>>
>>> On cancellation flink does remove the checkpoint counter
>>>
>>> *2019-03-12 14:12:13,143
>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>> ZooKeeper *
>>>
>>> but exits with a non-zero code
>>>
>>> *2019-03-12 14:12:13,477
>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>> exit code 1444.*
>>>
>>>
>>> That I think is an issue. A cancelled job is a complete job and thus the
>>> exit code should be 0 for k8s to mark it complete.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bh...@gmail.com>
>>> wrote:
>>>
>>>> Yes Vishal. That's correct.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> This is really not cool, but here you go. This seems to work. Agreed that
>>>>> it should not be this painful. The cancel does not exit with an exit code
>>>>> of 0, so the k8s job has to be deleted manually. Vijay, does this align
>>>>> with what you have had to do?
>>>>>
>>>>>
>>>>>    - Take a save point . This returns a request id
>>>>>
>>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>>
>>>>>
>>>>>    - Make sure the save point succeeded
>>>>>
>>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>
>>>>>
>>>>>
>>>>>    - cancel the job
>>>>>
>>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>
>>>>>
>>>>>
>>>>>    - Delete the job and deployment
>>>>>
>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>
>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>
>>>>>
>>>>>
>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>
>>>>>    args: ["job-cluster",
>>>>>
>>>>>                   "--fromSavepoint",
>>>>>
>>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>                   "--job-classname", .........
>>>>>
>>>>>
>>>>>
>>>>>    - Restart
>>>>>
>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>
>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>
>>>>>
>>>>>
>>>>>    - Make sure from the UI, that it restored from the specific save
>>>>>    point.
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>
>>>>>> Yes, it's supposed to work, but unfortunately it was not working. The
>>>>>> Flink community needs to respond to this behavior.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> Aah.
>>>>>>> Let me try this out and will get back to you.
>>>>>>> Though I would assume that save point with cancel is a single atomic
>>>>>>> step, rather than a save point *followed* by a cancellation ( else
>>>>>>> why would that be an option ).
>>>>>>> Thanks again.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for all
>>>>>>>> clusters. It's the recommended command.
>>>>>>>>
>>>>>>>> Use the following command to issue save point.
>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>> --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>>> "cancel-job":false}'  \ https://
>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>> Then issue yarn-cancel.
>>>>>>>> After that  follow the process to restore save point
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello Vijay,
>>>>>>>>>
>>>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>>>> deployment ( rather than yarn ), but maybe they follow the same lifecycle.
>>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>> a straight up
>>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>>> --data
>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>>> \ https://
>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>> I would assume that after taking the save point, the jvm should
>>>>>>>>> exit; after all, the k8s deployment is of kind: job, and if it is a job
>>>>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It does
>>>>>>>>> seem to do some things right. It stops a bunch of stuff ( the JobMaster,
>>>>>>>>> the SlotPool, zookeeper coordinator etc ). It also removes the checkpoint
>>>>>>>>> counter but does not exit the job. And after a little bit the job is
>>>>>>>>> restarted, which does not make sense and is absolutely not the right thing
>>>>>>>>> to do ( to me at least ).
>>>>>>>>>
>>>>>>>>> Further if I delete the deployment and the job from k8s and
>>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>>> point.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>>>>>> cluster deployment  seems to be
>>>>>>>>>
>>>>>>>>>    - cancel with save point as defined hre
>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>    - delete the job manager job and task manager deployments from
>>>>>>>>>    k8s almost immediately.
>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and may be
>>>>>>>>>    the checkpoints directory.
>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>
>>>>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Logs are attached.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>> Savepoint stored in
>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>> - Shutting down
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>> discarded.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>> - Removing
>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>> from ZooKeeper
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>> discarded.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>>> Shutting down.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>>> CANCELED.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>>> Shutting down rest endpoint.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>> JobManager is shutting down..
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>> Suspending SlotPool.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>> Stopping SlotPool.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After a little bit
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Starting the job-cluster
>>>>>>>>>
>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>>>>> `jobmanager.heap.size`
>>>>>>>>>
>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vishal
>>>>>>>>>>
>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>>>>> which is not a stable API. It always exits with 404. The best way to issue it is:
>>>>>>>>>>
>>>>>>>>>> a) First issue save point REST API
>>>>>>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>>> )
>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>> returned by (a) as an argument to the run-jar REST API.
>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>>>>
>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the
>>>>>>>>>>> k8s  job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>>>>>>> expected ?
>>>>>>>>>>>
>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete the
>>>>>>>>>>> job id ( 0000000000.... ) from ZooKeeper ( this is HA ), else it defaults
>>>>>>>>>>> to the latest checkpoint no matter what.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process
>>>>>>>>>>> of cancelling with a save point and resuming  and what is the cogent story
>>>>>>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>>>>>>> work with 1.7.2, so even though that does not make sense, I still cannot
>>>>>>>>>>> provide a new job id.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Vishal.
>>>>>>>>>>>
>>>>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
:) That makes so much more sense. Is k8s native flink a part of this
release?

On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <ga...@ververica.com> wrote:

> Hi Vishal,
>
> This issue was fixed recently [1], and the patch will be released with
> 1.8. If
> the Flink job gets cancelled, the JVM should exit with code 0. There is a
> release candidate [2], which you can test.
>
> Best,
> Gary
>
> [1] https://issues.apache.org/jira/browse/FLINK-10743
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>
> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vi...@gmail.com>
> wrote:
>
>> Thanks Vijay,
>>
>> This is the larger issue.  The cancellation routine is itself broken.
>>
>> On cancellation flink does remove the checkpoint counter
>>
>> *2019-03-12 14:12:13,143
>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>> ZooKeeper *
>>
>> but exits with a non-zero code
>>
>> *2019-03-12 14:12:13,477
>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>> exit code 1444.*
>>
>>
>> That I think is an issue. A cancelled job is a complete job and thus the
>> exit code should be 0 for k8s to mark it complete.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bh...@gmail.com>
>> wrote:
>>
>>> Yes Vishal. That's correct.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> This is really not cool, but here you go. This seems to work. Agreed that
>>>> it should not be this painful. The cancel does not exit with an exit code
>>>> of 0, so the k8s job has to be deleted manually. Vijay, does this align
>>>> with what you have had to do?
>>>>
>>>>
>>>>    - Take a save point . This returns a request id
>>>>
>>>>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>>
>>>>
>>>>    - Make sure the save point succeeded
>>>>
>>>>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>
>>>>
>>>>
>>>>    - cancel the job
>>>>
>>>>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>
>>>>
>>>>
>>>>    - Delete the job and deployment
>>>>
>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>
>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>
>>>>
>>>>
>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>
>>>>    args: ["job-cluster",
>>>>
>>>>                   "--fromSavepoint",
>>>>
>>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>                   "--job-classname", .........
>>>>
>>>>
>>>>
>>>>    - Restart
>>>>
>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>
>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>
>>>>
>>>>
>>>>    - Make sure from the UI, that it restored from the specific save
>>>>    point.
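The REST steps in the sequence above can be scripted. Below is a minimal stdlib-only sketch; the base URL, target directory, and trigger id are placeholders, and the kubectl delete/create steps and YAML edits still happen out of band:

```python
# Sketch of the savepoint -> verify -> cancel REST calls described above.
# The job id is the fixed one a 1.7 job cluster uses; the REST paths follow
# the Flink 1.7 monitoring API. Hostnames and HDFS paths are placeholders.
import json
import urllib.request

JOB_ID = "00000000000000000000000000000000"

def savepoint_request(base, target_dir, cancel=False):
    """Build the POST that triggers a savepoint (optionally with cancel)."""
    body = json.dumps({"target-directory": target_dir,
                       "cancel-job": cancel}).encode()
    return urllib.request.Request(
        f"{base}/jobs/{JOB_ID}/savepoints", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

def status_url(base, trigger_id):
    """URL to poll for the savepoint trigger id returned by the POST above."""
    return f"{base}/jobs/{JOB_ID}/savepoints/{trigger_id}"

def cancel_request(base):
    """PATCH that cancels the job once the savepoint has completed."""
    return urllib.request.Request(
        f"{base}/jobs/{JOB_ID}?mode=cancel", method="PATCH")
```

Each request would be sent with `urllib.request.urlopen(...)`; polling `status_url` until the response reports COMPLETED corresponds to the "make sure the save point succeeded" step.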
>>>>
>>>>
>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, it's supposed to work.  But unfortunately it was not working. The
>>>>> Flink community needs to respond to this behavior.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> Aah.
>>>>>> Let me try this out and will get back to you.
>>>>>> Though I would assume that save point with cancel is a single atomic
>>>>>> step, rather than a save point *followed*  by a cancellation ( else
>>>>>> why would that be an option ).
>>>>>> Thanks again.
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for all
>>>>>>> clusters. It's the recommended command.
>>>>>>>
>>>>>>> Use the following command to issue save point.
>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>> --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>>>>> "cancel-job":false}'  \ https://
>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>
>>>>>>> Then issue yarn-cancel.
>>>>>>> After that  follow the process to restore save point
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello Vijay,
>>>>>>>>
>>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>>> deployment ( rather than yarn ), but maybe they follow the same lifecycle.
>>>>>>>> I issue a* save point with cancel*  as documented here
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>> a straight up
>>>>>>>>  curl  --header "Content-Type: application/json" --request POST
>>>>>>>> --data
>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>>>>> \ https://
>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>> I would assume that after taking the save point, the jvm should
>>>>>>>> exit; after all, the k8s deployment is of kind: job, and if it is a job
>>>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It does
>>>>>>>> seem to do some things right. It stops a bunch of stuff ( the JobMaster,
>>>>>>>> the SlotPool, the ZooKeeper coordinator etc ). It also removes the checkpoint
>>>>>>>> counter but does not exit the job. And after a little bit the job is
>>>>>>>> restarted, which does not make sense and is absolutely not the right thing to
>>>>>>>> do ( to me at least ).
>>>>>>>>
>>>>>>>> Further if I delete the deployment and the job from k8s and restart
>>>>>>>> the job and deployment fromSavePoint, it refuses to honor the
>>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the save
>>>>>>>> point.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>>>>> cluster deployment  seems to be
>>>>>>>>
>>>>>>>>    - cancel with save point as defined here
>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>    - delete the job manager job  and task manager deployments from
>>>>>>>>    k8s almost immediately.
>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and maybe the
>>>>>>>>    checkpoints directory.
>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>
>>>>>>>> If somebody can confirm that this is indeed the process ?
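The "clear the ZK chroot" step amounts to deleting a few well-known znodes. A small sketch of the paths involved, inferred from the log excerpts below; the doubled cluster-id layout ("k8s_anomalyecho" here) and the path shapes are deployment-specific assumptions, not a general Flink guarantee:

```python
# Znodes that pin an HA job cluster to its latest ZK-recorded checkpoint.
# Clearing these before restarting lets --fromSavepoint take effect instead
# of the checkpoint stored under the fixed job id. Paths are inferred from
# the log lines in this thread and may differ per deployment.
def ha_znodes(cluster_id, job_id="00000000000000000000000000000000"):
    return [
        f"/{cluster_id}/{cluster_id}/checkpoints/{job_id}",
        f"/checkpoint-counter/{job_id}",
        f"/leader/{job_id}/job_manager_lock",
    ]
```

Each path could then be removed with a ZK client, e.g. `zkCli.sh` (`rmr` or `deleteall`, depending on the ZooKeeper version).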
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>  Logs are attached.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>> Savepoint stored in
>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>> - Shutting down
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>> discarded.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>> - Removing
>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>> from ZooKeeper
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>> discarded.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>> Shutting down.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>> CANCELED.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>> Stopping the JobMaster for job
>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>> Shutting down rest endpoint.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>> JobManager is shutting down..
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>> Suspending SlotPool.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>> Stopping SlotPool.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>
>>>>>>>>
>>>>>>>> After a little bit
>>>>>>>>
>>>>>>>>
>>>>>>>> Starting the job-cluster
>>>>>>>>
>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>>>> `jobmanager.heap.size`
>>>>>>>>
>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>
>>>>>>>> ..
>>>>>>>>
>>>>>>>> ..
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vishal
>>>>>>>>>
>>>>>>>>> Save point with cancellation internally uses the  /cancel  REST API,
>>>>>>>>> which is not a stable API.  It always exits with 404. The best  way to issue it is:
>>>>>>>>>
>>>>>>>>> a) First issue the save point REST API
>>>>>>>>> b) Then issue the  /yarn-cancel  REST API ( as described in
>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>>>> )
>>>>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>>>>> by (a) as an argument to the run-jar REST API
>>>>>>>>> The above is the smoother way
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>>>
>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the
>>>>>>>>>> k8s  job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>>>>>> expected ?
>>>>>>>>>>
>>>>>>>>>> 2. To resume from a save point it seems that I have to delete the
>>>>>>>>>> job id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults
>>>>>>>>>> to the latest checkpoint no matter what
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process
>>>>>>>>>> of cancelling with a save point and resuming  and what is the cogent story
>>>>>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>>>>>> work with 1.7.2, so even though that does not make sense, I still cannot
>>>>>>>>>> provide a new job id.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Vishal.
>>>>>>>>>>
>>>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Gary Yao <ga...@ververica.com>.
Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8. If
the Flink job gets cancelled, the JVM should exit with code 0. There is a
release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
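The behavior change can be summarized as a mapping from terminal application status to the entrypoint's process exit code. The sketch below is illustrative only, not Flink's source; the numeric codes are assumptions based on the 1444 seen in this thread's logs:

```python
# Illustrative mapping -- not Flink's actual implementation. Before the
# FLINK-10743 fix, a CANCELED job terminated the entrypoint with a non-zero
# code (1444 in the logs in this thread), so a k8s Job resource never
# reached Completed; with the fix, cancellation is treated as a clean
# shutdown and the JVM exits with 0.
def entrypoint_exit_code(status: str, fixed: bool = True) -> int:
    codes = {"SUCCEEDED": 0,
             "CANCELED": 0 if fixed else 1444,
             "FAILED": 1443}
    return codes.get(status, 1445)  # unknown statuses fall through
```

With an exit code of 0 on cancellation, Kubernetes marks the Job complete instead of restarting the pod.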


Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
Thanks Vijay,

This is the larger issue.  The cancellation routine is itself broken.

On cancellation flink does remove the checkpoint counter

*2019-03-12 14:12:13,143
INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
Removing /checkpoint-counter/00000000000000000000000000000000 from
ZooKeeper *

but exits with a non-zero code

*2019-03-12 14:12:13,477
INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
exit code 1444.*


That I think is an issue. A cancelled job is a complete job and thus the
exit code should be 0 for k8s to mark it complete.









>>>>>> discarded.
>>>>>>
>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>> - Removing
>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>> from ZooKeeper
>>>>>>
>>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>       - Checkpoint with ID 10 at
>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>> discarded.
>>>>>>
>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>> Shutting down.
>>>>>>
>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>
>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>
>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>> Stopping the JobMaster for job
>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>
>>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>> application status CANCELED. Diagnostics null.
>>>>>>
>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>> Shutting down rest endpoint.
>>>>>>
>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>> /leader/resource_manager_lock.
>>>>>>
>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>> JobManager is shutting down..
>>>>>>
>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>> Suspending SlotPool.
>>>>>>
>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>> Stopping SlotPool.
>>>>>>
>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>
>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>
>>>>>>
>>>>>> After a little bit
>>>>>>
>>>>>>
>>>>>> Starting the job-cluster
>>>>>>
>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>> `jobmanager.heap.size`
>>>>>>
>>>>>> Starting standalonejob as a console application on host
>>>>>> anomalyecho-mmg6t.
>>>>>>
>>>>>> ..
>>>>>>
>>>>>> ..
>>>>>>
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vishal
>>>>>>>
>>>>>>> Save point with cancellation internally uses the  /cancel  REST API,
>>>>>>> which is not a stable API; it always exits with 404. The best way to issue it is:
>>>>>>>
>>>>>>> a) First issue save point REST API
>>>>>>> b) Then issue the  /yarn-cancel  REST API ( as described in
>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>>> )
>>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>>> by (a) as an argument to the run-jar REST API
>>>>>>> The above is the smoother way.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>
>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the
>>>>>>>> k8s  job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>>>> expected ?
>>>>>>>>
>>>>>>>> 2. To resume from a save point it seems that I have to delete the
>>>>>>>> job id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults
>>>>>>>> to the latest checkpoint no matter what
>>>>>>>>
>>>>>>>>
>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process of
>>>>>>>> cancelling with a save point and resuming  and what is the cogent story
>>>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>>>> work with 1.7.2 so even though that does not make sense, I still can not
>>>>>>>> provide a new job id.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Vishal.
>>>>>>>>
>>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vijay Bhaskar <bh...@gmail.com>.
Yes Vishal. That's correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> This is really not cool, but here you go. This seems to work. Agreed that it
> cannot be this painful. The cancel does not exit with an exit code of 0, and
> thus the job has to be deleted manually. Vijay, does this align with what you
> have had to do ?
>
>
>    - Take a save point . This returns a request id
>
>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>
>
>
>    - Make sure the save point succeeded
>
>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>    - cancel the job
>
>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>
>
>
>    - Delete the job and deployment
>
>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>
>    args: ["job-cluster",
>
>                   "--fromSavepoint",
>
>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>                   "--job-classname", .........
>
>
>
>    - Restart
>
>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>    - Make sure from the UI, that it restored from the specific save point.
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bh...@gmail.com>
> wrote:
>
>> Yes, it's supposed to work.  But unfortunately it was not working. Flink
>> community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that save point with cancel is a single atomic
>>> step, rather than a save point *followed*  by a cancellation ( else why
>>> would that be an option ).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bh...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> yarn-cancel isn't meant only for yarn clusters. It works for all
>>>> clusters. It's the recommended command.
>>>>
>>>> Use the following command to issue save point.
>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>> "cancel-job":false}'  \ https://
>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> Then issue yarn-cancel.
>>>> After that  follow the process to restore save point
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> Hello Vijay,
>>>>>
>>>>>                Thank you for the reply. This, though, is a k8s deployment
>>>>> ( rather than yarn ), but maybe they follow the same lifecycle. I issue a
>>>>> *save point with cancel* as documented here
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>> a straight up
>>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>> \ https://
>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> I would assume that after taking the save point, the JVM should exit;
>>>>> after all, the k8s deployment is of kind: job, and if it is a job cluster
>>>>> then a cancellation should exit the JVM and hence the pod. It does seem to
>>>>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>>>>> SlotPool, the ZooKeeper coordinator, etc. ). It also removes the checkpoint
>>>>> counter but does not exit the job. And after a little while the job is
>>>>> restarted, which does not make sense and is absolutely not the right thing
>>>>> to do ( to me at least ).
>>>>>
>>>>> Further, if I delete the deployment and the job from k8s and restart
>>>>> the job and deployment with fromSavepoint, it refuses to honor the
>>>>> fromSavepoint. I have to delete the ZK chroot for it to consider the
>>>>> save point.
>>>>>
>>>>>
>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>> cluster deployment  seems to be
>>>>>
>>>>>    - cancel with save point as defined here
>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>    - delete the job manager job  and task manager deployments from k8s
>>>>>    almost immediately.
>>>>>    - clear the ZK chroot for the 0000000...... job  and maybe the
>>>>>    checkpoints directory.
>>>>>    - resumeFromCheckPoint
>>>>>
>>>>> Can somebody confirm that this is indeed the process?
>>>>>
>>>>>
>>>>>
>>>>>  Logs are attached.
>>>>>
>>>>>
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>> Savepoint stored in
>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>> cancelling 00000000000000000000000000000000.
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>>>>> to CANCELLING.
>>>>>
>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>     - Completed checkpoint 10 for job
>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>
>>>>> 2019-03-12 08:10:44,232 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>
>>>>> 2019-03-12 08:10:44,274 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,276 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>>> CANCELLING to CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>     - Stopping checkpoint coordinator for job
>>>>> 00000000000000000000000000000000.
>>>>>
>>>>> 2019-03-12 08:10:44,277 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>> - Shutting down
>>>>>
>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>       - Checkpoint with ID 8 at
>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>> discarded.
>>>>>
>>>>> 2019-03-12 08:10:44,437 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>> - Removing
>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>> from ZooKeeper
>>>>>
>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>       - Checkpoint with ID 10 at
>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>> discarded.
>>>>>
>>>>> 2019-03-12 08:10:44,447 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Shutting down.
>>>>>
>>>>> 2019-03-12 08:10:44,447 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>
>>>>> 2019-03-12 08:10:44,463 INFO
>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,467 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>> Stopping the JobMaster for job
>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>
>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>> application status CANCELED. Diagnostics null.
>>>>>
>>>>> 2019-03-12 08:10:44,468 INFO
>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>> Shutting down rest endpoint.
>>>>>
>>>>> 2019-03-12 08:10:44,473 INFO
>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>> /leader/resource_manager_lock.
>>>>>
>>>>> 2019-03-12 08:10:44,475 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>>>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>>>>> shutting down..
>>>>>
>>>>> 2019-03-12 08:10:44,475 INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>> Suspending SlotPool.
>>>>>
>>>>> 2019-03-12 08:10:44,476 INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>> Stopping SlotPool.
>>>>>
>>>>> 2019-03-12 08:10:44,476 INFO
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>
>>>>> 2019-03-12 08:10:44,477 INFO
>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>
>>>>>
>>>>> After a little bit
>>>>>
>>>>>
>>>>> Starting the job-cluster
>>>>>
>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>> `jobmanager.heap.size`
>>>>>
>>>>> Starting standalonejob as a console application on host
>>>>> anomalyecho-mmg6t.
>>>>>
>>>>> ..
>>>>>
>>>>> ..
>>>>>
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>
>>>>>> Hi Vishal
>>>>>>
>>>>>> Save point with cancellation internally uses the  /cancel  REST API,
>>>>>> which is not a stable API; it always exits with 404. The best way to issue it is:
>>>>>>
>>>>>> a) First issue save point REST API
>>>>>> b) Then issue the  /yarn-cancel  REST API ( as described in
>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>> )
>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>> by (a) as an argument to the run-jar REST API
>>>>>> The above is the smoother way.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>
>>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the k8s
>>>>>>> job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>>> expected ?
>>>>>>>
>>>>>>> 2. To resume from a save point it seems that I have to delete the job
>>>>>>> id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults to
>>>>>>> the latest checkpoint no matter what
>>>>>>>
>>>>>>>
>>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process of
>>>>>>> cancelling with a save point and resuming  and what is the cogent story
>>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>>> work with 1.7.2 so even though that does not make sense, I still can not
>>>>>>> provide a new job id.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Vishal.
>>>>>>>
>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
I must add that there has to be more love for k8s flink deployments. IMHO
that is the way to go.  Maintaining a captive/session cluster, if you have
k8s on premise is pretty much a no go  for various reasons.

On Tue, Mar 12, 2019 at 9:44 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> This is really not cool, but here you go. This seems to work. Agreed that it
> cannot be this painful. The cancel does not exit with an exit code of 0, and
> thus the job has to be deleted manually. Vijay, does this align with what you
> have had to do ?
>
>
>    - Take a save point . This returns a request id
>
>    curl  --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'    https://*************/jobs/00000000000000000000000000000000/savepoints
>
>
>
>    - Make sure the save point succeeded
>
>    curl  --request GET   https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>    - cancel the job
>
>    curl  --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>
>
>
>    - Delete the job and deployment
>
>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>
>    args: ["job-cluster",
>
>                   "--fromSavepoint",
>
>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>                   "--job-classname", .........
>
>
>
>    - Restart
>
>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>    - Make sure from the UI, that it restored from the specific save point.
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bh...@gmail.com>
> wrote:
>
>> Yes, it's supposed to work.  But unfortunately it was not working. Flink
>> community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that save point with cancel is a single atomic
>>> step, rather than a save point *followed*  by a cancellation ( else why
>>> would that be an option ).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bh...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> yarn-cancel isn't meant only for yarn clusters. It works for all
>>>> clusters. It's the recommended command.
>>>>
>>>> Use the following command to issue save point.
>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1",
>>>> "cancel-job":false}'  \ https://
>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> Then issue yarn-cancel.
>>>> After that  follow the process to restore save point
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> Hello Vijay,
>>>>>
>>>>>                Thank you for the reply. This, though, is a k8s deployment
>>>>> ( rather than yarn ), but maybe they follow the same lifecycle. I issue a
>>>>> *save point with cancel* as documented here
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>> a straight up
>>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>>> \ https://
>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> I would assume that after taking the save point, the JVM should exit;
>>>>> after all, the k8s deployment is of kind: job, and if it is a job cluster
>>>>> then a cancellation should exit the JVM and hence the pod. It does seem to
>>>>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>>>>> SlotPool, the ZooKeeper coordinator, etc. ). It also removes the checkpoint
>>>>> counter but does not exit the job. And after a little while the job is
>>>>> restarted, which does not make sense and is absolutely not the right thing
>>>>> to do ( to me at least ).
>>>>>
>>>>> Further, if I delete the deployment and the job from k8s and restart
>>>>> the job and deployment with fromSavepoint, it refuses to honor the
>>>>> fromSavepoint. I have to delete the ZK chroot for it to consider the
>>>>> save point.
>>>>>
>>>>>
>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>> cluster deployment  seems to be
>>>>>
>>>>>    - cancel with save point as defined here
>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>    - delete the job manager job  and task manager deployments from k8s
>>>>>    almost immediately.
>>>>>    - clear the ZK chroot for the 0000000...... job  and maybe the
>>>>>    checkpoints directory.
>>>>>    - resumeFromCheckPoint
>>>>>
>>>>> Can somebody confirm that this is indeed the process?
>>>>>
>>>>>
>>>>>
>>>>>  Logs are attached.
>>>>>
>>>>>
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>> Savepoint stored in
>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>> cancelling 00000000000000000000000000000000.
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>>>>> to CANCELLING.
>>>>>
>>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>     - Completed checkpoint 10 for job
>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>
>>>>> 2019-03-12 08:10:44,232 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>
>>>>> 2019-03-12 08:10:44,274 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,276 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>>> CANCELLING to CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>     - Stopping checkpoint coordinator for job
>>>>> 00000000000000000000000000000000.
>>>>>
>>>>> 2019-03-12 08:10:44,277 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>> - Shutting down
>>>>>
>>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>       - Checkpoint with ID 8 at
>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>> discarded.
>>>>>
>>>>> 2019-03-12 08:10:44,437 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>> - Removing
>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>> from ZooKeeper
>>>>>
>>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>       - Checkpoint with ID 10 at
>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>> discarded.
>>>>>
>>>>> 2019-03-12 08:10:44,447 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Shutting down.
>>>>>
>>>>> 2019-03-12 08:10:44,447 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>
>>>>> 2019-03-12 08:10:44,463 INFO
>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,467 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>> Stopping the JobMaster for job
>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>
>>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>> application status CANCELED. Diagnostics null.
>>>>>
>>>>> 2019-03-12 08:10:44,468 INFO
>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>> Shutting down rest endpoint.
>>>>>
>>>>> 2019-03-12 08:10:44,473 INFO
>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>> /leader/resource_manager_lock.
>>>>>
>>>>> 2019-03-12 08:10:44,475 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>>>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>>>>> shutting down..
>>>>>
>>>>> 2019-03-12 08:10:44,475 INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>> Suspending SlotPool.
>>>>>
>>>>> 2019-03-12 08:10:44,476 INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>> Stopping SlotPool.
>>>>>
>>>>> 2019-03-12 08:10:44,476 INFO
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>
>>>>> 2019-03-12 08:10:44,477 INFO
>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>
>>>>>
>>>>> After a little bit
>>>>>
>>>>>
>>>>> Starting the job-cluster
>>>>>
>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>> `jobmanager.heap.size`
>>>>>
>>>>> Starting standalonejob as a console application on host
>>>>> anomalyecho-mmg6t.
>>>>>
>>>>> ..
>>>>>
>>>>> ..
>>>>>
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>> bhaskar.ebay77@gmail.com> wrote:
>>>>>
>>>>>> Hi Vishal
>>>>>>
>>>>>> Save point with cancellation internally uses the  /cancel  REST API,
>>>>>> which is not a stable API; it always exits with 404. The best way to issue it is:
>>>>>>
>>>>>> a) First issue save point REST API
>>>>>> b) Then issue the  /yarn-cancel  REST API ( as described in
>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>>> )
>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>> by (a) as an argument to the run-jar REST API
>>>>>> The above is the smoother way.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>
>>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the k8s
>>>>>>> job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>>> expected ?
>>>>>>>
>>>>>>> 2. To resume from a save point it seems that I have to delete the job
>>>>>>> id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults to
>>>>>>> the latest checkpoint no matter what
>>>>>>>
>>>>>>>
>>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process of
>>>>>>> cancelling with a save point and resuming  and what is the cogent story
>>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>>> work with 1.7.2 so even though that does not make sense, I still can not
>>>>>>> provide a new job id.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Vishal.
>>>>>>>
>>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
This is really not cool, but here you go. This seems to work. Agreed that it
cannot be this painful. The cancel does not exit with an exit code of 0, and
thus the job has to be deleted manually. Vijay, does this align with what you
have had to do?


   - Take a save point. This returns a request id

   curl  --header "Content-Type: application/json" --request POST
--data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
   https://*************/jobs/00000000000000000000000000000000/savepoints



   - Make sure the save point succeeded

   curl  --request GET
https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687



   - cancel the job

   curl  --request PATCH
https://***************/jobs/00000000000000000000000000000000?mode=cancel



   - Delete the job and deployment

   kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

   kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml



   - Edit the job-cluster-job-deployment.yaml. Add/Edit

   args: ["job-cluster",

                  "--fromSavepoint",

                  "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
                  "--job-classname", .........



   - Restart

   kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

   kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml



   - Make sure from the UI, that it restored from the specific save point.
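The manual steps above can be stitched into a small script. This is a hedged sketch, not a vetted tool: the base URL is a placeholder for your ingress, the manifest paths are the ones from the steps above, and the `request-id` / `COMPLETED` JSON shapes are assumptions based on the 1.7 REST docs. The live calls are gated behind a `run` argument so nothing fires accidentally.

```shell
#!/usr/bin/env bash
# Sketch of the cancel-with-savepoint workaround above.
# BASE_URL is a hypothetical ingress host -- substitute your own.
set -euo pipefail

BASE_URL="${BASE_URL:-https://flink-jm.example.com}"
JOB_ID="00000000000000000000000000000000"
TARGET="hdfs://nn-crunchy:8020/tmp/xyz14"

# Pull the trigger id out of the savepoint trigger response, which
# looks like {"request-id":"2c053ce3bea31276aa25e63784629687"}.
trigger_id() {
  sed -n 's/.*"request-id" *: *"\([0-9a-f]*\)".*/\1/p'
}

main() {
  # 1. Trigger the savepoint (cancel-job:false, per the thread).
  resp=$(curl -s -H "Content-Type: application/json" -X POST \
    -d "{\"target-directory\":\"${TARGET}\",\"cancel-job\":false}" \
    "${BASE_URL}/jobs/${JOB_ID}/savepoints")
  tid=$(printf '%s' "$resp" | trigger_id)

  # 2. Poll until the savepoint status reports COMPLETED.
  until curl -s "${BASE_URL}/jobs/${JOB_ID}/savepoints/${tid}" \
      | grep -q '"COMPLETED"'; do
    sleep 5
  done

  # 3. Cancel the job; the pods do not exit on their own, so tear
  #    them down by hand.
  curl -s -X PATCH "${BASE_URL}/jobs/${JOB_ID}?mode=cancel"
  kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
  kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
  # 4. Edit --fromSavepoint in the job manifest, then kubectl create both.
}

if [ "${1:-}" = "run" ]; then main; fi
```

Run it with the single argument `run` once the placeholders are filled in; the final restore still means editing `--fromSavepoint` in the manifest by hand, as in the steps above.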


On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bh...@gmail.com>
wrote:

> Yes, it's supposed to work. But unfortunately it was not working. The Flink
> community needs to respond to this behavior.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vi...@gmail.com>
> wrote:
>
>> Aah.
>> Let me try this out and will get back to you.
>> Though I would assume that save point with cancel is a single atomic
>> step, rather than a save point *followed* by a cancellation ( else why
>> would that be an option ).
>> Thanks again.
>>
>>
>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bh...@gmail.com>
>> wrote:
>>
>>> Hi Vishal,
>>>
>>> yarn-cancel is not meant only for a YARN cluster. It works for all
>>> clusters. It's the recommended command.
>>>
>>> Use the following command to issue save point.
>>>  curl  --header "Content-Type: application/json" --request POST --data
>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}'
>>> \ https://
>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>
>>> Then issue yarn-cancel.
>>> After that  follow the process to restore save point
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> Hello Vijay,
>>>>
>>>>                Thank you for the reply. This though is k8s deployment (
>>>> rather than yarn ) but maybe they follow the same lifecycle.  I issue a*
>>>> save point with cancel*  as documented here
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>> a straight up
>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>>> \ https://
>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> I would assume that after taking the save point, the jvm should exit,
>>>> after all the k8s deployment is of kind: job and if it is a job cluster
>>>> then a cancellation should exit the jvm and hence the pod. It does seem to
>>>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>>>> SlotPool, ZooKeeper coordinator, etc. ). It also removes the checkpoint
>>>> counter but does not exit the job. And after a little bit the job is
>>>> restarted which does not make sense and absolutely not the right thing to
>>>> do  ( to me at least ).
>>>>
>>>> Further if I delete the deployment and the job from k8s and restart the
>>>> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
>>>> have to delete the zk chroot for it to consider the save point.
>>>>
>>>>
>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>> cluster deployment  seems to be
>>>>
>>>>    - cancel with save point as defined here
>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>    - delete the job manager job and task manager deployments from k8s
>>>>    almost immediately.
>>>>    - clear the ZK chroot for the 0000000...... job  and maybe the
>>>>    checkpoints directory.
>>>>    - resume with --fromSavepoint
>>>>
>>>> Can somebody confirm that this is indeed the process?
>>>>
>>>>
>>>>
>>>>  Logs are attached.
>>>>
>>>>
>>>>
>>>> 2019-03-12 08:10:43,871 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>> Savepoint stored in
>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>> cancelling 00000000000000000000000000000000.
>>>>
>>>> 2019-03-12 08:10:43,871 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>>>> to CANCELLING.
>>>>
>>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>     - Completed checkpoint 10 for job 00000000000000000000000000000000
>>>> (7238 bytes in 311 ms).
>>>>
>>>> 2019-03-12 08:10:44,232 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>
>>>> 2019-03-12 08:10:44,274 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>
>>>> 2019-03-12 08:10:44,276 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>> CANCELLING to CANCELED.
>>>>
>>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>     - Stopping checkpoint coordinator for job
>>>> 00000000000000000000000000000000.
>>>>
>>>> 2019-03-12 08:10:44,277 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Shutting down
>>>>
>>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>       - Checkpoint with ID 8 at
>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>> discarded.
>>>>
>>>> 2019-03-12 08:10:44,437 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Removing
>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>> from ZooKeeper
>>>>
>>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>       - Checkpoint with ID 10 at
>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>> discarded.
>>>>
>>>> 2019-03-12 08:10:44,447 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>> Shutting down.
>>>>
>>>> 2019-03-12 08:10:44,447 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>
>>>> 2019-03-12 08:10:44,463 INFO
>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>
>>>> 2019-03-12 08:10:44,467 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>> Stopping the JobMaster for job
>>>> anomaly_echo(00000000000000000000000000000000).
>>>>
>>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>> application status CANCELED. Diagnostics null.
>>>>
>>>> 2019-03-12 08:10:44,468 INFO
>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>> Shutting down rest endpoint.
>>>>
>>>> 2019-03-12 08:10:44,473 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>> /leader/resource_manager_lock.
>>>>
>>>> 2019-03-12 08:10:44,475 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>>>> shutting down..
>>>>
>>>> 2019-03-12 08:10:44,475 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>> Suspending SlotPool.
>>>>
>>>> 2019-03-12 08:10:44,476 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>> Stopping SlotPool.
>>>>
>>>> 2019-03-12 08:10:44,476 INFO
>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>> 00000000000000000000000000000000 from the resource manager.
>>>>
>>>> 2019-03-12 08:10:44,477 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>> - Stopping ZooKeeperLeaderElectionService
>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>
>>>>
>>>> After a little bit
>>>>
>>>>
>>>> Starting the job-cluster
>>>>
>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>> `jobmanager.heap.size`
>>>>
>>>> Starting standalonejob as a console application on host
>>>> anomalyecho-mmg6t.
>>>>
>>>> ..
>>>>
>>>> ..
>>>>
>>>>
>>>> Regards.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Vishal
>>>>>
>>>>> Save point with cancellation internally uses the /cancel REST API, which
>>>>> is not a stable API. It always exits with 404. The best way to issue it is:
>>>>>
>>>>> a) First issue save point REST API
>>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>>> )
>>>>> c) Then, after resuming your job, provide the save point path returned
>>>>> by (a) as an argument to the run-jar REST API
>>>>> The above is the smoother way
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>
>>>>>> There are some issues I see and would want to get some feedback
>>>>>>
>>>>>> 1. On Cancellation With SavePoint with a Target Directory , the k8s
>>>>>> job  does not exit ( it is not a deployment ) . I would assume that on
>>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>>> expected ?
>>>>>>
>>>>>> 2. To resume from a save point it seems that I have to delete the job
>>>>>> id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults to
>>>>>> the latest checkpoint no matter what
>>>>>>
>>>>>>
>>>>>> I am kind of curious as to what in 1.7.2 is the tested  process of
>>>>>> cancelling with a save point and resuming  and what is the cogent story
>>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>>> work with 1.7.2 so even though that does not make sense, I still can not
>>>>>> provide a new job id.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Vishal.
>>>>>>
>>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vijay Bhaskar <bh...@gmail.com>.
Yes, it's supposed to work. But unfortunately it was not working. The Flink
community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> Aah.
> Let me try this out and will get back to you.
> Though I would assume that save point with cancel is a single atomic step,
> rather than a save point *followed* by a cancellation ( else why would
> that be an option ).
> Thanks again.
>
>
> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bh...@gmail.com>
> wrote:
>
>> Hi Vishal,
>>
>> yarn-cancel is not meant only for a YARN cluster. It works for all
>> clusters. It's the recommended command.
>>
>> Use the following command to issue save point.
>>  curl  --header "Content-Type: application/json" --request POST --data
>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}'
>> \ https://
>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>
>> Then issue yarn-cancel.
>> After that  follow the process to restore save point
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> Hello Vijay,
>>>
>>>                Thank you for the reply. This though is k8s deployment (
>> rather than yarn ) but maybe they follow the same lifecycle.  I issue a*
>>> save point with cancel*  as documented here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>> a straight up
>>>  curl  --header "Content-Type: application/json" --request POST --data
>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>>> \ https://
>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>
>>> I would assume that after taking the save point, the jvm should exit,
>>> after all the k8s deployment is of kind: job and if it is a job cluster
>>> then a cancellation should exit the jvm and hence the pod. It does seem to
>>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>> SlotPool, ZooKeeper coordinator, etc. ). It also removes the checkpoint
>> counter but does not exit the job. And after a little bit the job is
>>> restarted which does not make sense and absolutely not the right thing to
>>> do  ( to me at least ).
>>>
>>> Further if I delete the deployment and the job from k8s and restart the
>>> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
>>> have to delete the zk chroot for it to consider the save point.
>>>
>>>
>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>> cluster deployment  seems to be
>>>
>>    - cancel with save point as defined here
>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>    - delete the job manager job and task manager deployments from k8s
>>>    almost immediately.
>>    - clear the ZK chroot for the 0000000...... job  and maybe the
>>>    checkpoints directory.
>>    - resume with --fromSavepoint
>>>
>> Can somebody confirm that this is indeed the process?
>>>
>>>
>>>
>>>  Logs are attached.
>>>
>>>
>>>
>>> 2019-03-12 08:10:43,871 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>> Savepoint stored in
>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>> cancelling 00000000000000000000000000000000.
>>>
>>> 2019-03-12 08:10:43,871 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>>> to CANCELLING.
>>>
>>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>     - Completed checkpoint 10 for job 00000000000000000000000000000000
>>> (7238 bytes in 311 ms).
>>>
>>> 2019-03-12 08:10:44,232 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
>>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>
>>> 2019-03-12 08:10:44,274 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
>>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>
>>> 2019-03-12 08:10:44,276 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>> CANCELLING to CANCELED.
>>>
>>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>     - Stopping checkpoint coordinator for job
>>> 00000000000000000000000000000000.
>>>
>>> 2019-03-12 08:10:44,277 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>> Shutting down
>>>
>>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>       - Checkpoint with ID 8 at
>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>> discarded.
>>>
>>> 2019-03-12 08:10:44,437 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>> Removing
>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>> from ZooKeeper
>>>
>>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>       - Checkpoint with ID 10 at
>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>> discarded.
>>>
>>> 2019-03-12 08:10:44,447 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Shutting down.
>>>
>>> 2019-03-12 08:10:44,447 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>
>>> 2019-03-12 08:10:44,463 INFO
>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>
>>> 2019-03-12 08:10:44,467 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>> Stopping the JobMaster for job
>>> anomaly_echo(00000000000000000000000000000000).
>>>
>>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>         - Shutting StandaloneJobClusterEntryPoint down with application
>>> status CANCELED. Diagnostics null.
>>>
>>> 2019-03-12 08:10:44,468 INFO
>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>> Shutting down rest endpoint.
>>>
>>> 2019-03-12 08:10:44,473 INFO
>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>> - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>
>>> 2019-03-12 08:10:44,475 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>>> shutting down..
>>>
>>> 2019-03-12 08:10:44,475 INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>> Suspending SlotPool.
>>>
>>> 2019-03-12 08:10:44,476 INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>> Stopping SlotPool.
>>>
>>> 2019-03-12 08:10:44,476 INFO
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>> 00000000000000000000000000000000 from the resource manager.
>>>
>>> 2019-03-12 08:10:44,477 INFO
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>> Stopping ZooKeeperLeaderElectionService
>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>
>>>
>>> After a little bit
>>>
>>>
>>> Starting the job-cluster
>>>
>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>> `jobmanager.heap.size`
>>>
>>> Starting standalonejob as a console application on host
>>> anomalyecho-mmg6t.
>>>
>>> ..
>>>
>>> ..
>>>
>>>
>>> Regards.
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bh...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vishal
>>>>
>>>> Save point with cancellation internally uses the /cancel REST API, which
>>>> is not a stable API. It always exits with 404. The best way to issue it is:
>>>>
>>>> a) First issue save point REST API
>>>> b) Then issue  /yarn-cancel  rest API( As described in
>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>>> )
>>>> c) Then, after resuming your job, provide the save point path returned
>>>> by (a) as an argument to the run-jar REST API
>>>> The above is the smoother way
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> There are some issues I see and would want to get some feedback
>>>>>
>>>>> 1. On Cancellation With SavePoint with a Target Directory , the k8s
>>>>> job  does not exit ( it is not a deployment ) . I would assume that on
>>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>>> expected ?
>>>>>
>>>>> 2. To resume from a save point it seems that I have to delete the job
>>>>> id ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults to
>>>>> the latest checkpoint no matter what
>>>>>
>>>>>
>>>>> I am kind of curious as to what in 1.7.2 is the tested  process of
>>>>> cancelling with a save point and resuming  and what is the cogent story
>>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>>> work with 1.7.2 so even though that does not make sense, I still can not
>>>>> provide a new job id.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Vishal.
>>>>>
>>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
Aah.
Let me try this out and will get back to you.
Though I would assume that save point with cancel is a single atomic step,
rather than a save point *followed* by a cancellation ( else why would
that be an option ).
Thanks again.


On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bh...@gmail.com>
wrote:

> Hi Vishal,
>
> yarn-cancel is not meant only for a YARN cluster. It works for all
> clusters. It's the recommended command.
>
> Use the following command to issue save point.
>  curl  --header "Content-Type: application/json" --request POST --data
> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}'
> \ https://
> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>
> Then issue yarn-cancel.
> After that  follow the process to restore save point
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vi...@gmail.com>
> wrote:
>
>> Hello Vijay,
>>
>>                Thank you for the reply. This though is k8s deployment (
>> rather than yarn ) but maybe they follow the same lifecycle.  I issue a*
>> save point with cancel*  as documented here
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>> a straight up
>>  curl  --header "Content-Type: application/json" --request POST --data
>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}'
>> \ https://
>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>
>> I would assume that after taking the save point, the jvm should exit,
>> after all the k8s deployment is of kind: job and if it is a job cluster
>> then a cancellation should exit the jvm and hence the pod. It does seem to
>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>> SlotPool, ZooKeeper coordinator, etc. ). It also removes the checkpoint
>> counter but does not exit the job. And after a little bit the job is
>> restarted which does not make sense and absolutely not the right thing to
>> do  ( to me at least ).
>>
>> Further if I delete the deployment and the job from k8s and restart the
>> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
>> have to delete the zk chroot for it to consider the save point.
>>
>>
>> Thus the process of cancelling and resuming from a SP on a k8s job
>> cluster deployment  seems to be
>>
>>    - cancel with save point as defined here
>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>    - delete the job manager job and task manager deployments from k8s
>>    almost immediately.
>>    - clear the ZK chroot for the 0000000...... job  and maybe the
>>    checkpoints directory.
>>    - resume with --fromSavepoint
>>
>> Can somebody confirm that this is indeed the process?
>>
>>
>>
>>  Logs are attached.
>>
>>
>>
>> 2019-03-12 08:10:43,871 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>> Savepoint stored in
>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>> cancelling 00000000000000000000000000000000.
>>
>> 2019-03-12 08:10:43,871 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>> to CANCELLING.
>>
>> 2019-03-12 08:10:44,227 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>     - Completed checkpoint 10 for job 00000000000000000000000000000000
>> (7238 bytes in 311 ms).
>>
>> 2019-03-12 08:10:44,232 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>
>> 2019-03-12 08:10:44,274 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>
>> 2019-03-12 08:10:44,276 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> anomaly_echo (00000000000000000000000000000000) switched from state
>> CANCELLING to CANCELED.
>>
>> 2019-03-12 08:10:44,276 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>     - Stopping checkpoint coordinator for job
>> 00000000000000000000000000000000.
>>
>> 2019-03-12 08:10:44,277 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>> Shutting down
>>
>> 2019-03-12 08:10:44,323 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>       - Checkpoint with ID 8 at
>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>> discarded.
>>
>> 2019-03-12 08:10:44,437 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>> Removing
>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>> from ZooKeeper
>>
>> 2019-03-12 08:10:44,437 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>       - Checkpoint with ID 10 at
>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>> discarded.
>>
>> 2019-03-12 08:10:44,447 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Shutting down.
>>
>> 2019-03-12 08:10:44,447 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>
>> 2019-03-12 08:10:44,463 INFO
>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>
>> 2019-03-12 08:10:44,467 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
>> the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>
>> 2019-03-12 08:10:44,468 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>         - Shutting StandaloneJobClusterEntryPoint down with application
>> status CANCELED. Diagnostics null.
>>
>> 2019-03-12 08:10:44,468 INFO
>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>> Shutting down rest endpoint.
>>
>> 2019-03-12 08:10:44,473 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>> - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>
>> 2019-03-12 08:10:44,475 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>> shutting down..
>>
>> 2019-03-12 08:10:44,475 INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>> Suspending SlotPool.
>>
>> 2019-03-12 08:10:44,476 INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
>> SlotPool.
>>
>> 2019-03-12 08:10:44,476 INFO
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>> 00000000000000000000000000000000 from the resource manager.
>>
>> 2019-03-12 08:10:44,477 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Stopping ZooKeeperLeaderElectionService
>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>
>>
>> After a little bit
>>
>>
>> Starting the job-cluster
>>
>> used deprecated key `jobmanager.heap.mb`, please replace with key
>> `jobmanager.heap.size`
>>
>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>
>> ..
>>
>> ..
>>
>>
>> Regards.
>>
>>
>>
>>
>>
>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bh...@gmail.com>
>> wrote:
>>
>>> Hi Vishal
>>>
>>> Save point with cancellation internally uses the /cancel REST API, which is
>>> not a stable API. It always exits with 404. The best way to issue it is:
>>>
>>> a) First issue save point REST API
>>> b) Then issue  /yarn-cancel  rest API( As described in
>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
>>> )
>>> c) Then, after resuming your job, provide the save point path returned by
>>> (a) as an argument to the run-jar REST API
>>> The above is the smoother way
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> There are some issues I see and would want to get some feedback
>>>>
>>>> 1. On Cancellation With SavePoint with a Target Directory , the k8s
>>>> job  does not exit ( it is not a deployment ) . I would assume that on
>>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>>> should too. That does not happen and thus the job pod remains live. Is that
>>>> expected ?
>>>>
>>>> 2. To resume from a save point it seems that I have to delete the job id
>>>> ( 0000000000.... )  from ZooKeeper ( this is HA ), else it defaults to the
>>>> latest checkpoint no matter what
>>>>
>>>>
>>>> I am kind of curious as to what in 1.7.2 is the tested  process of
>>>> cancelling with a save point and resuming  and what is the cogent story
>>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>>> work with 1.7.2 so even though that does not make sense, I still can not
>>>> provide a new job id.
>>>>
>>>> Regards,
>>>>
>>>> Vishal.
>>>>
>>>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vijay Bhaskar <bh...@gmail.com>.
Hi Vishal,

yarn-cancel is not meant only for a YARN cluster. It works for all clusters.
It's the recommended command.

Use the following command to issue save point.
curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel.
After that, follow the process to restore from the savepoint.
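The trigger-then-cancel sequence can be sketched with plain curl as below.
This is a hedged illustration against the 1.7 REST API: the endpoint, the
target directory, and the namenode host are placeholder assumptions, not
values from this thread.

```shell
#!/bin/sh
# Hedged sketch against the Flink 1.7 REST API. FLINK, the target
# directory, and the namenode host are placeholders.
FLINK="${FLINK:-http://localhost:8081}"
JOB="00000000000000000000000000000000"   # fixed job id of a job cluster

# (a) Trigger a savepoint WITHOUT cancelling; the JSON response carries a
# "request-id" that can be polled.
trigger_savepoint() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"target-directory":"hdfs://namenode:8020/flink/savepoints","cancel-job":false}' \
    "$FLINK/jobs/$JOB/savepoints"
}

# Poll the trigger until it reports COMPLETED and a savepoint location.
savepoint_status() {
  curl -s "$FLINK/jobs/$JOB/savepoints/$1"   # $1 = request-id from above
}

# (b) Cancel via the yarn-cancel endpoint (despite the name, not
# YARN-specific, per the linked thread).
yarn_cancel() {
  curl -s "$FLINK/jobs/$JOB/yarn-cancel"
}
```

Once `savepoint_status` reports COMPLETED, take the savepoint location from
its response before issuing the cancel.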

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vi...@gmail.com>
wrote:


Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vishal Santoshi <vi...@gmail.com>.
Hello Vijay,

               Thank you for the reply. This, though, is a k8s deployment
(rather than YARN), but maybe they follow the same lifecycle. I issue a
*savepoint with cancel* as documented here
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
a straight-up
curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the savepoint the JVM should exit; after
all, the k8s deployment is of kind: Job, and if it is a job cluster then a
cancellation should exit the JVM and hence the pod. It does seem to do some
things right: it stops a bunch of stuff (the JobMaster, the SlotPool, the
ZooKeeper coordinator, etc.). It also removes the checkpoint counter, but
it does not exit the job. And after a little while the job is restarted,
which does not make sense and is absolutely not the right thing to do (to
me at least).

Further, if I delete the deployment and the job from k8s and restart the
job and deployment fromSavepoint, it refuses to honor the fromSavepoint; I
have to delete the ZK chroot for it to consider the savepoint.


Thus the process of cancelling and resuming from a savepoint on a k8s job
cluster deployment seems to be:

   - cancel with savepoint as defined here
   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
   - delete the job manager job and the task manager deployment from k8s
   almost immediately.
   - clear the ZK chroot for the 0000000...... job, and maybe the
   checkpoints directory.
   - resume from the savepoint.

Can somebody confirm that this is indeed the process?
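If that is indeed the process, it could be scripted roughly as follows. All
resource names, the ZooKeeper address and chroot, the manifest file names,
and the --fromSavepoint handling are hypothetical placeholders for
illustration, not values confirmed by this thread.

```shell
#!/bin/sh
# Hedged sketch of the cancel-and-resume sequence listed above. Every
# resource name, the ZK address/chroot, and the manifests are hypothetical.
JM_JOB="${JM_JOB:-anomalyecho}"                    # k8s Job running the job cluster
TM_DEPLOY="${TM_DEPLOY:-anomalyecho-taskmanager}"  # TaskManager Deployment
ZKCLI="${ZKCLI:-zkCli.sh}"
ZK="${ZK:-zk-host:2181}"
CHROOT="${CHROOT:-/k8s_anomalyecho/k8s_anomalyecho}"

cancel_and_resume() {
  # $1 = savepoint path returned by the cancel-with-savepoint call.
  # Step 2: tear the pods down right after the savepoint completes.
  kubectl delete job "$JM_JOB"
  kubectl delete deployment "$TM_DEPLOY"
  # Step 3: clear HA state for the all-zeros job id, else Flink restores
  # the latest checkpoint instead of the savepoint (rmr = recursive delete).
  "$ZKCLI" -server "$ZK" rmr "$CHROOT"
  # Step 4: redeploy; the Job manifest's entrypoint args would carry
  # --fromSavepoint "$1" (assumed supported by the standalone job entrypoint).
  kubectl apply -f jobmanager-job.yaml
  kubectl apply -f taskmanager-deployment.yaml
}
```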



 Logs are attached.



2019-03-12 08:10:43,871 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Savepoint stored in
hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
cancelling 00000000000000000000000000000000.

2019-03-12 08:10:43,871 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
to CANCELLING.

2019-03-12 08:10:44,227 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
    - Completed checkpoint 10 for job 00000000000000000000000000000000
(7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
(e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
(e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
anomaly_echo (00000000000000000000000000000000) switched from state
CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
    - Stopping checkpoint coordinator for job
00000000000000000000000000000000.

2019-03-12 08:10:44,277 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
Shutting down

2019-03-12 08:10:44,323 INFO
org.apache.flink.runtime.checkpoint.CompletedCheckpoint
      - Checkpoint with ID 8 at
'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
discarded.

2019-03-12 08:10:44,437 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
Removing
/k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
from ZooKeeper

2019-03-12 08:10:44,437 INFO
org.apache.flink.runtime.checkpoint.CompletedCheckpoint
      - Checkpoint with ID 10 at
'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
discarded.

2019-03-12 08:10:44,447 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
Shutting down.

2019-03-12 08:10:44,447 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,463 INFO
org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
00000000000000000000000000000000 reached globally terminal state CANCELED.

2019-03-12 08:10:44,467 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Stopping the JobMaster for job
anomaly_echo(00000000000000000000000000000000).

2019-03-12 08:10:44,468 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint
        - Shutting StandaloneJobClusterEntryPoint down with application
status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO
org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  - Shutting
down rest endpoint.

2019-03-12 08:10:44,473 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.

2019-03-12 08:10:44,475 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Close ResourceManager connection
d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..

2019-03-12 08:10:44,475 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
SlotPool.

2019-03-12 08:10:44,476 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
SlotPool.

2019-03-12 08:10:44,476 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
00000000000000000000000000000000 from the resource manager.

2019-03-12 08:10:44,477 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.


After a little bit


Starting the job-cluster

used deprecated key `jobmanager.heap.mb`, please replace with key
`jobmanager.heap.size`

Starting standalonejob as a console application on host anomalyecho-mmg6t.

..

..


Regards.





On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bh...@gmail.com>
wrote:

>>

Re: K8s job cluster and cancel and resume from a save point ?

Posted by Vijay Bhaskar <bh...@gmail.com>.
Hi Vishal

Savepoint-with-cancellation internally uses the /cancel REST API, which is
not a stable API; it always exits with 404. The best way is:

a) First issue the savepoint REST API call.
b) Then issue the /yarn-cancel REST API call (as described in
http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3C0ffa63f4-e6ed-42d8-1928-37a7adaaac0e@apache.org%3E
).
c) Then, when resuming your job, provide the savepoint path returned by (a)
as an argument to the run-jar REST API.
The above is the smoother way.
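Step (c) might look like the sketch below for a session cluster with an
uploaded jar. The endpoint, jar id, and JSON field names are assumptions
against the 1.7 REST API, not values from this thread; a k8s job cluster
would instead pass the savepoint path via its entrypoint arguments.

```shell
#!/bin/sh
# Hedged sketch of step (c): resubmit an uploaded jar, restoring from the
# savepoint path from step (a). FLINK and JARID are placeholders.
FLINK="${FLINK:-http://localhost:8081}"
JARID="${JARID:-xxxx_myjob.jar}"   # id returned by POST /jars/upload

run_from_savepoint() {
  # $1 = savepoint path, e.g. hdfs://.../savepoint-000000-xxxxxxxxxxxx
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"savepointPath\":\"$1\",\"allowNonRestoredState\":false}" \
    "$FLINK/jars/$JARID/run"
}
```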

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vi...@gmail.com>
wrote:
