Posted to issues@flink.apache.org by "Chesnay Schepler (JIRA)" <ji...@apache.org> on 2019/05/10 09:33:00 UTC

[jira] [Commented] (FLINK-12426) TM occasionally hang in deploying state

    [ https://issues.apache.org/jira/browse/FLINK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837098#comment-16837098 ] 

Chesnay Schepler commented on FLINK-12426:
------------------------------------------

Could this be a simple case of the BlobServer being overloaded with connections? There are likely no fairness guarantees, so if new requests keep coming in, a TM might simply get starved.
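
For reference, the connect-then-read pattern mentioned in the quoted description below can be sketched in isolation. This is not Flink's BlobClient code; the class ReadTimeoutSketch and the method fetchFirstByte are hypothetical names used only for illustration, and the "server" is just a local listening socket that never answers. The sketch shows how a read timeout turns a silent peer into a SocketTimeoutException instead of a thread parked in socketRead0.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not Flink code: a client that bounds both connect and read.
public class ReadTimeoutSketch {

    /** Connects, then reads one byte; both steps are bounded by a timeout. */
    public static String fetchFirstByte(InetSocketAddress server,
                                        int connectTimeoutMs,
                                        int readTimeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            // Bounds the TCP handshake.
            socket.connect(server, connectTimeoutMs);
            // Bounds each individual blocking read() call on this socket.
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            int b = in.read();
            return b < 0 ? "closed" : "byte " + b;
        } catch (SocketTimeoutException e) {
            // Raised by connect() or read() once the respective timeout elapses.
            return "timed out";
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an overloaded server: it listens but never calls accept(),
        // so the OS completes the handshake from the backlog and no data arrives.
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silent = new ServerSocket(0, 50, loopback)) {
            InetSocketAddress addr =
                    new InetSocketAddress(loopback, silent.getLocalPort());
            System.out.println(fetchFirstByte(addr, 1000, 500));
        }
    }
}
```

Run against the silent listener, this should print "timed out" after roughly half a second. Note that SO_TIMEOUT only bounds each single read() call, so a server that keeps trickling bytes would never trigger it; that may or may not be related to the timeout attempt described in the report.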

> TM occasionally hang in deploying state
> ---------------------------------------
>
>                 Key: FLINK-12426
>                 URL: https://issues.apache.org/jira/browse/FLINK-12426
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Qi
>            Priority: Major
>
> Hi all,
>   
>  We use Flink batch and start thousands of jobs per day. Occasionally we observe stuck jobs, caused by some TMs hanging in the “DEPLOYING” state.
>   
>  It seems that the TM is calling BlobClient to download jars from the JM/BlobServer. Under the hood it’s calling Socket.connect() and then Socket.read() to retrieve results.
>   
>  These jobs usually have many TM slots (1~2k). We checked the TM log and dumped the TM threads. The TM was indeed hung on a socket read while downloading a jar from the BlobServer.
>   
>  We're using Flink 1.5, but this may also affect later versions since the related code has not changed much. We've tried adding a socket timeout in BlobClient, but still no luck.
>   
>  ————————
>  TM log
>  ————————
>  ...
>  INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
> INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
> INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
> INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
> INFO org.apache.flink.runtime.blob.BlobClient - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
> no more logs...
>   
>  ————————
>  TM thread dump:
>  ————————
>  "DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
>     java.lang.Thread.State: RUNNABLE
>          at java.net.SocketInputStream.socketRead0(Native Method)
>          at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>          at java.net.SocketInputStream.read(SocketInputStream.java:171)
>          at java.net.SocketInputStream.read(SocketInputStream.java:141)
>          at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
>          at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
>          at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
>          at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>          at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>          at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>          - locked <0x000000078ab60ba8> (a java.lang.Object)
>          at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
>          at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>          at java.lang.Thread.run(Thread.java:748)
>  ————————
>   


