Posted to user@flink.apache.org by "Yuan,Youjun" <yu...@baidu.com> on 2018/08/23 14:54:54 UTC

jobmanager holds too many CLOSE_WAIT connections to datanode

Hi,

After running for a while, my job manager holds thousands of CLOSE_WAIT TCP connections to the HDFS datanode; the number is growing slowly and will likely hit the max open file limit. My jobs checkpoint to HDFS every minute.
If I run lsof -i -a -p $JMPID, I get tons of output like the following:
java    9433  iot  408u  IPv4 4060901898      0t0  TCP jmHost:17922->datanode:50010 (CLOSE_WAIT)
java    9433  iot  409u  IPv4 4061478455      0t0  TCP jmHost:52854->datanode:50010 (CLOSE_WAIT)
java    9433  iot  410r  IPv4 4063170767      0t0  TCP jmHost:49384->datanode:50010 (CLOSE_WAIT)
java    9433  iot  411w  IPv4 4063188376      0t0  TCP jmHost:50516->datanode:50010 (CLOSE_WAIT)
java    9433  iot  412u  IPv4 4061459881      0t0  TCP jmHost:51651->datanode:50010 (CLOSE_WAIT)
java    9433  iot  413u  IPv4 4063737603      0t0  TCP jmHost:31318->datanode:50010 (CLOSE_WAIT)
java    9433  iot  414w  IPv4 4062030625      0t0  TCP jmHost:34033->datanode:50010 (CLOSE_WAIT)
java    9433  iot  415u  IPv4 4062049134      0t0  TCP jmHost:35156->datanode:50010 (CLOSE_WAIT)
java    9433  iot  416u  IPv4 4062615550      0t0  TCP jmHost:16962->datanode:50010 (CLOSE_WAIT)
java    9433  iot  417r  IPv4 4063757056      0t0  TCP jmHost:32553->datanode:50010 (CLOSE_WAIT)
java    9433  iot  418w  IPv4 4064304789      0t0  TCP jmHost:13375->datanode:50010 (CLOSE_WAIT)
java    9433  iot  419u  IPv4 4062599328      0t0  TCP jmHost:15915->datanode:50010 (CLOSE_WAIT)
java    9433  iot  420w  IPv4 4065462963      0t0  TCP jmHost:30432->datanode:50010 (CLOSE_WAIT)
java    9433  iot  421u  IPv4 4067178257      0t0  TCP jmHost:28334->datanode:50010 (CLOSE_WAIT)
java    9433  iot  422u  IPv4 4066022066      0t0  TCP jmHost:11843->datanode:50010 (CLOSE_WAIT)
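
For reference, here is a quick way to track the count against the process's open-file limit (a sketch, assuming a Linux host and the same $JMPID as above):

lsof -i -a -p $JMPID | grep -c CLOSE_WAIT       # number of CLOSE_WAIT sockets held by the JM
grep "Max open files" /proc/$JMPID/limits       # the limit this count will eventually hit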


I know restarting the job manager should clean up those connections, but I wonder if there is a better solution?
Btw, I am using Flink 1.4.0 and running a standalone cluster.

Thanks
Youjun

Re: jobmanager holds too many CLOSE_WAIT connections to datanode

Posted by "Yuan,Youjun" <yu...@baidu.com>.
A safer approach is to execute cancel with savepoint on all jobs first
>> This sounds great!

Thanks
Youjun

From: vino yang <ya...@gmail.com>
Sent: Friday, August 24, 2018 2:43 PM
To: Yuan,Youjun <yu...@baidu.com>; user <us...@flink.apache.org>
Subject: Re: jobmanager holds too many CLOSE_WAIT connections to datanode

Hi Youjun,

You can check whether there is any real data transfer on these connections.
I suspect there may be a connection leak here, and if so, it's a bug.
On the other hand, version 1.4 is a bit old; can you check whether the same problem exists on 1.5 or 1.6?
I suggest you create an issue on JIRA, where you may get more feedback.
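
For example, the per-socket counters can be inspected with something like this (a sketch, assuming a Linux host with iproute2/ss; port 50010 is the datanode port from the lsof output):

ss -tinp state close-wait '( dport = :50010 )'   # per-socket details, including bytes_acked / bytes_received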

Regarding how to force these connections to be closed:
If you have configured HA mode and checkpoints are enabled for the job, you can try taking down the JM leader and letting ZooKeeper conduct a leader election so that the JM switches over.

But please be cautious with this process. A safer approach is to execute cancel with savepoint on all jobs first, then switch the JM.
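
With the Flink CLI that would look roughly like this (a sketch; the savepoint directory and job jar are only examples):

bin/flink list                                        # find the IDs of the running jobs
bin/flink cancel -s hdfs:///flink/savepoints <jobId>  # cancel each job with a savepoint
# after the JM switch, resume each job from its savepoint:
bin/flink run -s <savepointPath> your-job.jar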

Thanks, vino.

Yuan,Youjun <yu...@baidu.com> wrote on Friday, August 24, 2018 at 1:06 PM:
Hi vino,

My jobs have been running for months now, on a standalone cluster, using Flink 1.4.0.
The connections accumulated over time, not within a short period. There is no timeout error in the JobManager log.

So there are two questions:
1. How can I force-close those connections, ideally without restarting the running jobs?
2. In the future, how can I avoid the jobmanager holding so many apparently unnecessary TCP connections?

Thanks
Youjun

From: vino yang <ya...@gmail.com>
Sent: Friday, August 24, 2018 10:26 AM
To: Yuan,Youjun <yu...@baidu.com>
Cc: user <us...@flink.apache.org>
Subject: Re: jobmanager holds too many CLOSE_WAIT connections to datanode

Hi Youjun,

How long has your job been running?
As far as I know, over a short period the jobmanager should not generate so many connections to HDFS just for checkpoints.
What is your Flink cluster environment? Standalone or Flink on YARN?
In addition, does the JM's log show any timeout information? Have checkpoints timed out?
If you can provide more information, it will help locate the problem.
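
For example, something like this could check for checkpoint timeouts (a sketch; the log file name depends on your setup):

grep -i "checkpoint" log/flink-*-jobmanager-*.log | grep -iE "expired|timed out|fail"   # look for expired or failed checkpoints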

Thanks, vino.

