You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2020/04/16 15:21:00 UTC

[jira] [Comment Edited] (TEZ-3967) DAGImpl: dag lock is unfair and can starve the writers

    [ https://issues.apache.org/jira/browse/TEZ-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084977#comment-17084977 ] 

László Bodor edited comment on TEZ-3967 at 4/16/20, 3:20 PM:
-------------------------------------------------------------

[~jeagles]: attached  [^TEZ-3967.02.patch] , which contains the following basically:
1. DagStatusHandler: which encapsulates the dag status refresh logic
2. I noticed that getCachedCounters were invalid, [it didn't refresh cachedCounters at all|https://github.com/apache/tez/blob/214c544472f80997755b62c382f016fbe8373177/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]...but with this patch I started to use cached counters

the basic idea is to refresh different fields with a different frequency (please let me know which should be configurable in your opinion)
2a) dag status refresh: 500ms - DAG_STATUS_BUILD_INTERVAL_MS
2b) counters refresh: >1000ms - COUNTER_CACHE_INTERVAL_THRESHOLD_FAST_MS

3. DrainDispatcher minor fix (I've found sometimes that it was hanging in the while loop, even when queue was empty), this class is used only in tests

4. TestDAGImpl: added some dag.entityUpdateTracker.stop(), becuase we manually set entityUpdateTracker into DAGImpl, but there is another one, which [already started its thread|https://github.com/apache/tez/blob/214c544472f80997755b62c382f016fbe8373177/tez-dag/src/main/java/org/apache/tez/dag/app/dag/StateChangeNotifier.java#L91] (I haven't indentified error because of this, so just for clarity's sake)


was (Author: abstractdog):
[~jeagles]: attached  [^TEZ-3967.02.patch] , which contains the following basically:
1. DagStatusHandler: which encapsulates the dag status refresh logic
2. I noticed that getCachedCounters were invalid, [it didn't refresh cachedCounters at all|https://github.com/apache/tez/blob/214c544472f80997755b62c382f016fbe8373177/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]...but with this patch I started to use cached counters

the basic idea is to refresh different fields with a different frequency (please let me know which should be configurable in your opinion)
2a) dag status refresh: 500ms - DAG_STATUS_BUILD_INTERVAL_MS
2b) counters refresh: >1000ms - COUNTER_CACHE_INTERVAL_THRESHOLD_FAST_MS

3. DrainDispatcher minor fix (I've found sometimes that it was hanging in the while loop, even when queue was empty), this class is used only in tests

4. TestDAGImpl: added some dag.entityUpdateTracker.stop(), becuase we manually set entityUpdateTracker into DAGImpl, but there is another one, which already started its thread (I haven't indentified error because of this, so just for clarity's sake)

> DAGImpl: dag lock is unfair and can starve the writers
> ------------------------------------------------------
>
>                 Key: TEZ-3967
>                 URL: https://issues.apache.org/jira/browse/TEZ-3967
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Gopal Vijayaraghavan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-3967.01.patch, TEZ-3967.02.patch
>
>
> Found when debugging HIVE-20103, that a reader arriving when another reader is active can postpone a writer from obtaining a write-lock.
> This is fundamentally bad for the DAGImpl as useful progress can only happen when the writeLock is held.
> {code}
>   public void handle(DAGEvent event) {
> ...
>     try {
>       writeLock.lock();
> {code}
> {code}
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007efb02246f40> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>         at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1162)
>         at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:149)
>         at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2251)
>         at org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:2242)
>         at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
>         at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> while read-lock is passed around between 
> {code}
>        at org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:901)
>         at org.apache.tez.dag.app.dag.impl.DAGImpl.getDAGStatus(DAGImpl.java:940)
>         at org.apache.tez.dag.api.client.DAGClientHandler.getDAGStatus(DAGClientHandler.java:73)
> {code}
> calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)