Posted to dev@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2015/05/26 09:45:17 UTC

[jira] [Created] (FLINK-2091) Lock contention during release of network buffer pools

Ufuk Celebi created FLINK-2091:
----------------------------------

             Summary: Lock contention during release of network buffer pools
                 Key: FLINK-2091
                 URL: https://issues.apache.org/jira/browse/FLINK-2091
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Runtime
    Affects Versions: master
            Reporter: Ufuk Celebi
            Assignee: Ufuk Celebi


[~rmetzger] reported the following stack traces while cancelling high-parallelism jobs:

{code}
13:43:46,803 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'DataSource (at main(Job.java:59) (org.apache.flink.api.java.io.TextInputFormat)) (4/16)' did not react to cancelling signal, but is stuck in method:
 org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:238)
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:268)
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:218)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
java.lang.Thread.run(Thread.java:745)
{code}

{code}
13:42:57,595 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'DataSource (at main(Job.java:59) (org.apache.flink.api.java.io.TextInputFormat)) (16/16)' did not react to cancelling signal, but is stuck in method:
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:212)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
java.lang.Thread.run(Thread.java:745)
{code}

The issue is that when high-parallelism jobs are cancelled, many tasks release their buffer pools at the same time, and the locks for buffer pool management become highly contended.
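
To make the contention pattern concrete, here is a minimal sketch of the locking structure that the stack traces suggest. It is a simplified model under stated assumptions, not Flink's actual code: the class names (SketchLocalPool, SketchGlobalPool, BufferPoolContentionSketch), the single global lock, and the even-split redistribution are all hypothetical. In this model every destroy call holds the global lock while it takes the lock of each remaining local pool, so concurrent releases are serialized and each one costs time proportional to the number of pools still registered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-task buffer pool; not Flink's LocalBufferPool.
class SketchLocalPool {
    private final Object lock = new Object();
    private int numBuffers;

    // Called by the global pool during redistribution; takes the local lock.
    void setNumBuffers(int n) {
        synchronized (lock) {
            numBuffers = n;
        }
    }

    int getNumBuffers() {
        synchronized (lock) {
            return numBuffers;
        }
    }
}

// Hypothetical stand-in for the shared pool; not Flink's NetworkBufferPool.
class SketchGlobalPool {
    private final Object globalLock = new Object();
    private final List<SketchLocalPool> pools = new ArrayList<>();
    private final int totalBuffers;

    SketchGlobalPool(int totalBuffers) {
        this.totalBuffers = totalBuffers;
    }

    SketchLocalPool createPool() {
        synchronized (globalLock) {
            SketchLocalPool pool = new SketchLocalPool();
            pools.add(pool);
            redistribute();
            return pool;
        }
    }

    // Every release waits for the global lock, then touches all remaining pools.
    void destroyPool(SketchLocalPool pool) {
        synchronized (globalLock) {
            pools.remove(pool);
            redistribute();
        }
    }

    int numPools() {
        synchronized (globalLock) {
            return pools.size();
        }
    }

    // Assumed policy: split the buffers evenly. Must be called with globalLock held.
    private void redistribute() {
        if (pools.isEmpty()) {
            return;
        }
        int share = totalBuffers / pools.size();
        for (SketchLocalPool p : pools) {
            p.setNumBuffers(share);
        }
    }
}

class BufferPoolContentionSketch {
    public static void main(String[] args) throws InterruptedException {
        SketchGlobalPool global = new SketchGlobalPool(64);
        List<SketchLocalPool> pools = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            pools.add(global.createPool());
        }

        // Simulate cancellation: half of the tasks release their pools at once.
        List<Thread> threads = new ArrayList<>();
        for (SketchLocalPool pool : pools.subList(0, 8)) {
            Thread t = new Thread(() -> global.destroyPool(pool));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }

        System.out.println("remaining pools: " + global.numPools());
        System.out.println("buffers per pool: " + pools.get(15).getNumBuffers());
    }
}
```

In this model, releasing n pools at once performs on the order of n^2 local lock acquisitions, all behind one global lock. That is one plausible reading of why tasks show up as stuck in destroyBufferPool and setNumBuffers, although the sketch does not reproduce the reported traces.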
