Posted to user@flink.apache.org by qi luo <lu...@gmail.com> on 2019/05/07 09:09:06 UTC

Re: TM occasionally hang in deploying state in Flink 1.5

Thanks Dawid, I’ve created an issue for this: https://jira.apache.org/jira/browse/FLINK-12426. We’re using 1.5, but this may affect later versions as well.

I’m still investigating the root cause but have no result yet. This happens occasionally and isn't easy to reproduce.
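
For reference, here is a minimal sketch of the kind of timeout discussed in this thread. It is not Flink's BlobClient code; the class and method names (SocketTimeoutSketch, openWithTimeouts, readTimesOut) and the timeout values are made up for illustration. It only uses the standard java.net calls Socket.connect(SocketAddress, int) and Socket.setSoTimeout(int), and simulates an unresponsive server with a local ServerSocket that never sends any data.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not taken from the Flink code base.
public class SocketTimeoutSketch {

    /** Opens a socket whose connect and blocking reads are both bounded in time. */
    static Socket openWithTimeouts(InetSocketAddress server, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        Socket socket = new Socket();
        try {
            // Fails with SocketTimeoutException if the connection is not established in time.
            socket.connect(server, connectTimeoutMs);
            // Subsequent read() calls throw SocketTimeoutException instead of blocking forever.
            socket.setSoTimeout(readTimeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    /** Returns true if a read against a server that never responds ends in a timeout. */
    static boolean readTimesOut(int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server socket listens but never accepts or writes, so the client read has nothing to receive.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback)) {
            InetSocketAddress address = new InetSocketAddress(loopback, silentServer.getLocalPort());
            try (Socket client = openWithTimeouts(address, 1000, readTimeoutMs);
                 InputStream in = client.getInputStream()) {
                in.read(); // would block indefinitely without setSoTimeout
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a read timeout in place, a stalled download would surface as an exception that the caller can handle or retry, rather than leaving the task stuck in DEPLOYING.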

> On Apr 25, 2019, at 6:40 PM, Dawid Wysakowicz <dw...@apache.org> wrote:
> 
> Hi,
> 
> Feel free to open a JIRA for this issue. By the way, have you investigated the root cause of the hang?
> 
> Best,
> 
> Dawid
> 
> On 25/04/2019 08:55, qi luo wrote:
>> Hello,
>> 
>> This issue occurred again and we took a thread dump of the TM. It was indeed hung on a socket read while downloading a jar from the Blob server:
>> 
>> "DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
>>    java.lang.Thread.State: RUNNABLE
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
>>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
>>         at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
>>         at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>>         at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>>         at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>>         - locked <0x000000078ab60ba8> (a java.lang.Object)
>>         at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
>>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>>         at java.lang.Thread.run(Thread.java:748)
>> 
>> I checked the latest master code. There’s still no socket timeout in the Blob client. Should I create an issue to add this timeout?
>> 
>> Regards,
>> Qi 
>> 
>>> On Apr 19, 2019, at 7:49 PM, qi luo <luoqi.bd@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we observe stuck jobs, caused by a TM hanging in the “DEPLOYING” state.
>>> 
>>> The TM log shows that it is stuck downloading jars in BlobClient:
>>> 
>>> ————
>>> ...
>>> INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
>>> INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
>>> INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
>>> INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
>>> INFO  org.apache.flink.runtime.blob.BlobClient                          - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
>>> 
>>> no more logs...
>>> ————
>>> 
>>> It seems that the TM is calling BlobClient to download jars from the JM/BlobServer. Under the hood, it calls Socket.connect() and then reads from the socket's input stream to retrieve the results.
>>> 
>>> Should we add timeouts to the socket operations in BlobClient to resolve this issue?
>>> 
>>> Thanks,
>>> Qi
>>