Posted to user@flink.apache.org by Fritz Budiyanto <fb...@icloud.com> on 2019/03/26 23:35:22 UTC

Help debugging Kafka connection leaks after job failure/cancelation

Hi All,

We're using Flink 1.4.2 and noticed many dangling connections to Kafka after job deletion/recreation. The trigger here is job cancelation/failure due to a network-down event, followed by job recreation.

Our Flink job has checkpointing disabled, and upon job failure (due to network failure) the job got deleted and re-created. There were network failure events impacting communication between task managers, as well as between the task managers and the job manager. Our custom job controller monitored this condition and tried to cancel the job, then recreated it (after a minute or so).

Because of the network failure, the above steps were repeated many times, and eventually the Flink Docker container's socket file descriptors were exhausted.
It looks like there were many Kafka connections from the Flink task manager to the local Kafka broker.

netstat  -ntap | grep 9092 | grep java | wc -l
2235
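
For reference, a couple of extra checks we ran to tie those sockets to the task manager JVM and to watch the file descriptor count grow (the <tm-pid> placeholders are just from our Docker setup, so adjust as needed):

# Show which process owns the port-9092 sockets
ss -tnp | grep 9092

# Count the open 9092 connections held by the task manager JVM itself
lsof -nP -p <tm-pid> | grep :9092 | wc -l

# Total open file descriptors of that process; this is what eventually hits the limit
ls /proc/<tm-pid>/fd | wc -l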

Is this a known issue that has already been fixed in a later release? If yes, could someone point out the Jira link?
If this is a new issue, could someone let me know how to move forward and debug it? It looks like the Kafka consumers were not cleaned up properly upon job cancelation.
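
In case it is useful, here is the rough check we used to see whether consumer threads stick around after cancelation (the pgrep pattern and the "kafka" grep are assumptions about our 1.4.2 setup and may need adjusting):

# Find the task manager JVM PID (pattern depends on how the container starts Flink)
TM_PID=$(pgrep -f org.apache.flink.runtime.taskmanager.TaskManager | head -n1)

# Thread-name lines in a jstack dump start with a double quote; count the ones
# mentioning Kafka. If this number keeps growing across cancel/recreate cycles,
# the consumers are likely not being shut down.
jstack "$TM_PID" | grep '^"' | grep -ci kafka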

Thanks,
Fritz

Re: Help debugging Kafka connection leaks after job failure/cancelation

Posted by Fritz Budiyanto <fb...@icloud.com>.
Thank you!

> On Mar 26, 2019, at 6:51 PM, Steven Wu <st...@gmail.com> wrote:
> 
> it might be related to this issue
> https://issues.apache.org/jira/browse/FLINK-10774
> 
> On Tue, Mar 26, 2019 at 4:35 PM Fritz Budiyanto <fbudiyan@icloud.com> wrote:
> Hi All,
> 
> We're using Flink 1.4.2 and noticed many dangling connections to Kafka after job deletion/recreation. The trigger here is job cancelation/failure due to a network-down event, followed by job recreation.
> 
> Our Flink job has checkpointing disabled, and upon job failure (due to network failure) the job got deleted and re-created. There were network failure events impacting communication between task managers, as well as between the task managers and the job manager. Our custom job controller monitored this condition and tried to cancel the job, then recreated it (after a minute or so).
> 
> Because of the network failure, the above steps were repeated many times, and eventually the Flink Docker container's socket file descriptors were exhausted.
> It looks like there were many Kafka connections from the Flink task manager to the local Kafka broker.
> 
> netstat  -ntap | grep 9092 | grep java | wc -l
> 2235
> 
> Is this a known issue that has already been fixed in a later release? If yes, could someone point out the Jira link?
> If this is a new issue, could someone let me know how to move forward and debug it? It looks like the Kafka consumers were not cleaned up properly upon job cancelation.
> 
> Thanks,
> Fritz


Re: Help debugging Kafka connection leaks after job failure/cancelation

Posted by Steven Wu <st...@gmail.com>.
it might be related to this issue
https://issues.apache.org/jira/browse/FLINK-10774

On Tue, Mar 26, 2019 at 4:35 PM Fritz Budiyanto <fb...@icloud.com> wrote:

> Hi All,
>
> We're using Flink 1.4.2 and noticed many dangling connections to Kafka
> after job deletion/recreation. The trigger here is job cancelation/failure
> due to a network-down event, followed by job recreation.
>
> Our Flink job has checkpointing disabled, and upon job failure (due to
> network failure) the job got deleted and re-created. There were network
> failure events impacting communication between task managers, as well as
> between the task managers and the job manager. Our custom job controller
> monitored this condition and tried to cancel the job, then recreated it
> (after a minute or so).
>
> Because of the network failure, the above steps were repeated many times,
> and eventually the Flink Docker container's socket file descriptors were
> exhausted.
> It looks like there were many Kafka connections from the Flink task manager
> to the local Kafka broker.
>
> netstat  -ntap | grep 9092 | grep java | wc -l
> 2235
>
> Is this a known issue that has already been fixed in a later release? If
> yes, could someone point out the Jira link?
> If this is a new issue, could someone let me know how to move forward and
> debug it? It looks like the Kafka consumers were not cleaned up properly
> upon job cancelation.
>
> Thanks,
> Fritz