Posted to user@flink.apache.org by Timothy Victor <vi...@gmail.com> on 2019/06/29 13:35:12 UTC

Why did JM fail on K8s (see original thread below)

This is slightly off topic, so I'm changing the subject to not conflate the
original issue you brought up.   But do we know why JM crashed in the first
place?

We are also thinking of moving to K8s, but to be honest we had tons of
stability issues in our first rodeo.  That could just be our lack of a
deeper understanding of K8s (a lot had to do with DNS).  But cases like
this make me wonder what happened to cause the crash in the first place.

Thanks

Tim


On Sat, Jun 29, 2019, 8:25 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> Another point: the JM had terminated. The policy on K8s for the Job Cluster is
>
> spec:
>   restartPolicy: OnFailure
>
>
> *2019-06-29 00:33:14,308 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster
> entrypoint process StandaloneJobClusterEntryPoint with exit code 1443.*
>
>
>
> On Sat, Jun 29, 2019 at 9:04 AM Vishal Santoshi <vi...@gmail.com>
> wrote:
>
>> I have not tried on bare metal. We have no option but k8s.
>>
>> And this is a job cluster.
>>
>> On Sat, Jun 29, 2019 at 9:01 AM Timothy Victor <vi...@gmail.com> wrote:
>>
>>> Hi Vishal, can this be reproduced on a bare metal instance as well?
>>>  Also is this a job or a session cluster?
>>>
>>> Thanks
>>>
>>> Tim
>>>
>>> On Sat, Jun 29, 2019, 7:50 AM Vishal Santoshi <vi...@gmail.com>
>>> wrote:
>>>
>>>> OK this happened again and it is bizarre ( and is definitely not what I
>>>> think should happen )
>>>>
>>>>
>>>>
>>>>
>>>> The job failed and I see the logs below ( in essence it keeps the last 5
>>>> externalized checkpoints but deletes the ZK checkpoints directory ):
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *06.28.2019 20:33:13.738    2019-06-29 00:33:13,736 INFO
>>>>  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint
>>>> with ID 5654 at
>>>> 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5654'
>>>> not discarded.    06.28.2019 20:33:13.788    2019-06-29 00:33:13,786 INFO
>>>>  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint
>>>> with ID 5655 at
>>>> 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5655'
>>>> not discarded.    06.28.2019 20:33:13.838    2019-06-29 00:33:13,836 INFO
>>>>  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint
>>>> with ID 5656 at
>>>> 'xxxxxxxxxx8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5656'
>>>> not discarded.    06.28.2019 20:33:13.888    2019-06-29 00:33:13,886 INFO
>>>>  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint
>>>> with ID 5657 at
>>>> 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5657'
>>>> not discarded.    06.28.2019 20:33:13.938    2019-06-29 00:33:13,936 INFO
>>>>  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint
>>>> with ID 5658 at
>>>> 'xxxxxxxxxx8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5658'
>>>> not discarded.    06.28.2019 20:33:13.938    2019-06-29 00:33:13,936 INFO
>>>>  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing
>>>> /kafka-to-hdfs-v2/kafka-to-hdfs-v2/k8s/checkpoints/00000000000000000000000000000005
>>>> from ZooKeeper*
>>>>
>>>> The job restarts and this is bizarre. It does not find the ZK checkpoint
>>>> directory, but instead of going to the state.checkpoints.dir to get its
>>>> last checkpoint, it restarts from the savepoint that we started this job
>>>> with ( resetting the checkpoint ID ) about 15 days ago.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *    06.28.2019 20:33:20.047    2019-06-29 00:33:20,045 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Recovering checkpoints from ZooKeeper.    06.28.2019 20:33:20.053
>>>>  2019-06-29 00:33:20,051 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Trying to fetch 0 checkpoints from storage.    06.28.2019 20:33:20.053
>>>>  2019-06-29 00:33:20,051 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Found 0 checkpoints in ZooKeeper.    06.28.2019 20:33:20.054    2019-06-29
>>>> 00:33:20,053 INFO
>>>>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Starting
>>>> job 00000000000000000000000000000005 from savepoint
>>>> hdfs://nn-crunchy:8020/flink-savepoints_k8s/prod/kafka-to-hdfs/savepoint-000000-128f419cdc6f
>>>> ()    06.28.2019 20:33:20.540    2019-06-29 00:33:20,538 INFO
>>>>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Reset the
>>>> checkpoint ID of job 00000000000000000000000000000005 to 4203.
>>>> 06.28.2019 20:33:20.540    2019-06-29 00:33:20,538 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Recovering checkpoints from ZooKeeper.    06.28.2019 20:33:20.550
>>>>  2019-06-29 00:33:20,549 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Trying to retrieve checkpoint 4202.    06.28.2019 20:33:20.550
>>>>  2019-06-29 00:33:20,548 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Found 1 checkpoints in ZooKeeper.    06.28.2019 20:33:20.550    2019-06-29
>>>> 00:33:20,549 INFO
>>>>  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Trying to fetch 1 checkpoints from storage.*
>>>>
>>>>
>>>>
>>>> This just does not make sense....
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jun 5, 2019 at 9:29 AM Vishal Santoshi <
>>>> vishal.santoshi@gmail.com> wrote:
>>>>
>>>>> Ok, I will do that.
>>>>>
>>>>> On Wed, Jun 5, 2019, 8:25 AM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Can you provide us the jobmanager logs?
>>>>>>
>>>>>> After the first restart the JM should have started deleting older
>>>>>> checkpoints as new ones were created.
>>>>>> After the second restart the JM should have recovered all 10
>>>>>> checkpoints, started from the latest, and started pruning old ones as
>>>>>> new ones were created.
>>>>>>
>>>>>> So you're running into 2 separate issues here, which is a bit odd.
>>>>>>
>>>>>> On 05/06/2019 13:44, Vishal Santoshi wrote:
>>>>>>
>>>>>> Anyone?
>>>>>>
>>>>>> On Tue, Jun 4, 2019, 2:41 PM Vishal Santoshi <
>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>
>>>>>>> The above is flink 1.8
>>>>>>>
>>>>>>> On Tue, Jun 4, 2019 at 12:32 PM Vishal Santoshi <
>>>>>>> vishal.santoshi@gmail.com> wrote:
>>>>>>>
>>>>>>>> I had a sequence of events that created this issue.
>>>>>>>>
>>>>>>>> * I started a job with state.checkpoints.num-retained: 5
>>>>>>>>
>>>>>>>> * As expected I have 5 latest checkpoints retained in my hdfs
>>>>>>>> backend.
>>>>>>>>
>>>>>>>>
>>>>>>>> * The JM dies ( K8s limit etc. ) without cleaning the HDFS directory.
>>>>>>>> The K8s job restores from the latest checkpoint ( I think ) but as it
>>>>>>>> creates new checkpoints it does not delete the older ones. At the end
>>>>>>>> there are now 10 checkpoints: 5 from the old run, which remain static,
>>>>>>>> and the 5 latest representing the ongoing pipeline.
>>>>>>>>
>>>>>>>> * The JM dies again and restarts from the latest of the 5 old
>>>>>>>> checkpoints.
>>>>>>>>
>>>>>>>> This looks like a bug in the Job Cluster implementation of Flink. It
>>>>>>>> looks like it is taking the 5th checkpoint from the beginning based on
>>>>>>>> the num-retained value. Note that it has the same job ID and does not
>>>>>>>> scope to a new directory.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/apache/flink/blob/1dfdaa417ab7cdca9bef1efe6381c7eb67022aaf/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L109
>>>>>>>>
>>>>>>>> Please tell me if this does not make sense.
>>>>>>>>
>>>>>>>> Vishal
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>

Re: Why did JM fail on K8s (see original thread below)

Posted by Vishal Santoshi <vi...@gmail.com>.
even though the job's configuration shows:
Max. number of execution retries: Restart with fixed delay (240000 ms). #20
restart attempts.
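
For reference, a minimal sketch of what we believe the equivalent
flink-conf.yaml settings look like ( key names as documented for Flink 1.8;
values taken from the figures above ):

restart-strategy: fixed-delay
# 20 attempts, 4 minutes (240000 ms) between attempts
restart-strategy.fixed-delay.attempts: 20
restart-strategy.fixed-delay.delay: 240000 ms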

On Sat, Jun 29, 2019 at 10:44 AM Vishal Santoshi <vi...@gmail.com>
wrote:

> This is strange; the retry strategy was 20 times with a 4-minute delay.
> This job tried once ( we had a Hadoop NameNode hiccup ) but I think it
> could not even get to the NN and gave up ( as in, it did not retry the next
> 19 times ) ....
>
> *2019-06-29 00:33:13,680 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart
> the job Kafka-to-HDFS (00000000000000000000000000000005) because the
> restart strategy prevented it.*

Re: Why did JM fail on K8s (see original thread below)

Posted by Vishal Santoshi <vi...@gmail.com>.
This is strange; the retry strategy was 20 times with a 4-minute delay.  This
job tried once ( we had a Hadoop NameNode hiccup ) but I think it could
not even get to the NN and gave up ( as in, it did not retry the next 19 times )
....

*2019-06-29 00:33:13,680 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart
the job Kafka-to-HDFS (00000000000000000000000000000005) because the
restart strategy prevented it.*


Re: Why did JM fail on K8s (see original thread below)

Posted by Vishal Santoshi <vi...@gmail.com>.
We are investigating that.  But is the above theory plausible ( Flink
gurus ), even though this, as in forcing restartPolicy: Never, pretty much
nullifies HA on the JM if it is a Job cluster ( at least on K8s )?
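
For concreteness, a minimal sketch of the kind of K8s Job manifest under
discussion ( the name and image below are placeholders, not our actual
deployment ):

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-job-cluster        # placeholder name
spec:
  template:
    spec:
      # OnFailure has the kubelet restart a crashed JM container in place;
      # Never does not restart the container after a failure.
      restartPolicy: OnFailure
      containers:
        - name: jobmanager
          image: flink:1.8       # placeholder image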


As for the reason we are investigating that.

One thing we are looking at is the QoS (
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/ ),
which was Burstable because a limit was not set on the job. The below, for
example, would have a QoS of Guaranteed. So one reason could be this.

resources:
  requests:
    cpu: "8.0"
    memory: "16Gi"
  limits:
    cpu: "8.0"
    memory: "16Gi"
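
For contrast, a sketch of what we believe we were effectively running
( requests only, no limits ), which Kubernetes classifies as Burstable and
evicts ahead of Guaranteed pods under node memory pressure:

resources:
  requests:
    cpu: "8.0"
    memory: "16Gi"
  # no limits section, so the QoS class is Burstable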




