Posted to issues@flink.apache.org by "chenlf (JIRA)" <ji...@apache.org> on 2018/10/16 12:17:00 UTC

[jira] [Commented] (FLINK-10564) tm costs too much time when communicating with jm

    [ https://issues.apache.org/jira/browse/FLINK-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651588#comment-16651588 ] 

chenlf commented on FLINK-10564:
--------------------------------

I wrote a Flink client that extends the original client, packaged it into a jar, and ran the jar in the cluster. Here is the log. There is an obvious delay of roughly 9s between 18:18:23.650 and 18:18:32.462.
18:18:23.365 [main] INFO  c.s.r.a.f.c.MyLeaderRetrievalUtils - [getActorGatewayFuture,278] - futureActorGateway is Future(<not completed>)
18:18:23.366 [main-EventThread] DEBUG o.a.f.r.l.ZooKeeperLeaderRetrievalService - [nodeChanged,105] - Leader node has changed.
18:18:23.366 [main-EventThread] DEBUG o.a.f.r.l.ZooKeeperLeaderRetrievalService - [nodeChanged,132] - New leader information: Leader=akka.tcp://flink@10.247.13.2:45332/user/jobmanager, session ID=77779ebb-3fcd-4413-87f7-e0bb1b16c82e.
18:18:23.620 [flink-akka.actor.default-dispatcher-6] DEBUG a.s.Serialization(akka://flink) - [apply$mcV$sp,77] - Using serializer[akka.serialization.JavaSerializer] for message [akka.actor.Identify]
18:18:23.650 [flink-akka.actor.default-dispatcher-6] DEBUG a.r.EndpointWriter - [apply$mcV$sp,77] - Drained buffer with maxWriteCount: 50, fullBackoffCount: 1, smallBackoffCount: 0, noBackoffCount: 0 , adaptiveBackoff: 1000
18:18:32.462 [flink-akka.actor.default-dispatcher-6] INFO  c.s.r.a.f.c.MyLeaderRetrievalUtils - [apply,230] - ActorGateway apply
18:18:32.464 [flink-akka.actor.default-dispatcher-6] INFO  c.s.r.a.f.c.MyLeaderRetrievalUtils - [onComplete,238] - retrieve with out failure in onCompleteakka.tcp://flink@10.247.13.2:45332/user/jobmanager.
18:18:32.464 [main] INFO  o.a.f.r.l.ZooKeeperLeaderRetrievalService - [stop,94] - Stopping ZooKeeperLeaderRetrievalService.
18:18:32.464 [main] DEBUG o.a.f.s.o.a.c.f.i.CuratorFrameworkImpl - [close,271] - Closing
18:18:32.465 [main] DEBUG o.a.f.s.o.a.c.CuratorZookeeperClient - [close,197] - Closing
18:18:32.465 [main] DEBUG o.a.f.s.o.a.c.ConnectionState - [close,108] - Closing
18:18:32.465 [main] DEBUG o.a.z.ZooKeeper - [close,673] - Closing session: 0x163296b56cf84ef
18:18:32.466 [main] DEBUG o.a.z.ClientCnxn - [close,1370] - Closing client for session: 0x163296b56cf84ef
18:18:32.468 [main-SendThread(10.103.66.27:2181)] DEBUG o.a.z.ClientCnxn - [readResponse,843] - Reading reply sessionid:0x163296b56cf84ef, packet:: clientPath:null serverPath:null finished:false header:: 7,-11  replyHeader:: 7,32202255054,0  request:: null response:: null
18:18:32.468 [main] DEBUG o.a.z.ClientCnxn - [disconnect,1354] - Disconnecting client for session: 0x163296b56cf84e
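
To make the stall easier to spot, the gap between consecutive log timestamps can be computed directly. This is only a minimal sketch: the class name LogGap and the method largestGap are hypothetical helpers, not part of Flink, and the timestamps are copied by hand from the log above.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.List;

// Hypothetical helper (not part of Flink): finds the largest pause
// between consecutive log timestamps of the form HH:mm:ss.SSS.
public class LogGap {

    static Duration largestGap(List<String> timestamps) {
        Duration max = Duration.ZERO;
        for (int i = 1; i < timestamps.size(); i++) {
            LocalTime prev = LocalTime.parse(timestamps.get(i - 1));
            LocalTime curr = LocalTime.parse(timestamps.get(i));
            Duration gap = Duration.between(prev, curr);
            if (gap.compareTo(max) > 0) {
                max = gap;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Timestamps taken from the log lines above, in order.
        List<String> ts = List.of(
                "18:18:23.365", "18:18:23.366", "18:18:23.620",
                "18:18:23.650", "18:18:32.462", "18:18:32.464");
        System.out.println(largestGap(ts).toMillis() + " ms");
    }
}
```

For these timestamps the largest gap is the one between the EndpointWriter line at 18:18:23.650 and the "ActorGateway apply" line at 18:18:32.462, which is where the client waits for the gateway future to complete.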

> tm costs too much time when communicating with  jm
> --------------------------------------------------
>
>                 Key: FLINK-10564
>                 URL: https://issues.apache.org/jira/browse/FLINK-10564
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, JobManager, TaskManager
>         Environment: configs are as follows:
> jm
> high-availability	zookeeper
> taskmanager.heap.mb	16384
> taskmanager.memory.preallocate	false
> taskmanager.numberOfTaskSlots	64
> tm
> slots 128
> free slots 0-128
> cpu core 40 
> Physical Memory 95gb
> free Memory 32gb-50gb
> Flink Managed Memory 22gb-35gb
>            Reporter: chenlf
>            Priority: Major
>         Attachments: timeout.log
>
>
> It works fine until the number of tasks exceeds about 400.
> There are 600+ tasks (each handling billions of records) running in our cluster now, and the problem is that submitting, canceling, or querying a task takes too long (or even times out).
> Resources such as memory and CPU are at normal levels.
> After debugging, we found that this method is the culprit:
> org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)