Posted to issues@tez.apache.org by "Jiechuan Chen (Jira)" <ji...@apache.org> on 2022/05/25 08:34:00 UTC

[jira] [Created] (TEZ-4416) Dead lock triggered by ShuffleScheduler

Jiechuan Chen created TEZ-4416:
----------------------------------

             Summary: Dead lock triggered by ShuffleScheduler
                 Key: TEZ-4416
                 URL: https://issues.apache.org/jira/browse/TEZ-4416
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.10.1
            Reporter: Jiechuan Chen
         Attachments: container.jstack, screenshot.PNG

How this bug was found:

I was running a SQL query with Hive on Tez on a cluster with low disk capacity. One of the reducers threw an exception during execution because it failed to read intermediate files. The task did not stop normally but kept hanging for a long time, so I captured a jstack dump and investigated. Here is what I found.

(The jstack file and a screenshot of the corresponding jstack segment are attached.)

 

How this deadlock is triggered:
 # A file copy on HDFS fails, which triggers copyFailed() from FetcherOrderedGrouped.copyFromHost(). copyFailed() is a synchronized method on the ShuffleScheduler instance.
 # The call from step 1 eventually reaches ShuffleScheduler.close(), which tries to stop the Referee thread by calling referee.interrupt() and referee.join().
 # Meanwhile, the Referee is waiting in its run() method for the ShuffleScheduler lock, which is held by the method called in step 1. The Referee cannot acquire the lock until that method returns, and that method cannot return until join() does, so a deadlock occurs.
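
The three steps above can be reproduced in isolation. The following is only a minimal sketch of the lock pattern, not the actual Tez code: the class ShuffleDeadlockSketch, the nested Scheduler class, refereeLoop(), demo(), and the timeout parameter are all hypothetical stand-ins. It assumes the Referee needs the scheduler's monitor inside its loop, as described in step 3. The join here is bounded so that the demo terminates. With an unbounded join(), as in the report, the calling thread would wait forever.

```java
/**
 * Minimal sketch of the reported lock pattern. All names are hypothetical
 * stand-ins and do not correspond to the real Tez classes.
 */
final class ShuffleDeadlockSketch {

    static final class Scheduler {
        // Stand-in for the Referee thread owned by the scheduler.
        final Thread referee = new Thread(this::refereeLoop, "Referee");

        void start() {
            referee.start();
        }

        private void refereeLoop() {
            try {
                while (true) {
                    Thread.sleep(1);      // interruptible wait, stands in for waiting for work
                    synchronized (this) { // monitor entry does not respond to interrupt()
                        // the real thread would update scheduler state here
                    }
                }
            } catch (InterruptedException e) {
                // The interrupt is only observed outside the monitor wait; exit the thread.
            }
        }

        // Step 1: a synchronized method on the scheduler handles the failure.
        synchronized boolean copyFailed(long joinTimeoutMs) throws InterruptedException {
            // Demo only: spin until the referee is blocked on this monitor,
            // so the outcome does not depend on thread timing.
            while (referee.getState() != Thread.State.BLOCKED) {
                Thread.yield();
            }
            return close(joinTimeoutMs);
        }

        // Step 2: close() runs while the caller still holds the monitor.
        private boolean close(long joinTimeoutMs) throws InterruptedException {
            referee.interrupt();
            referee.join(joinTimeoutMs); // bounded only so that this demo terminates
            return referee.isAlive();    // true means the referee could not stop
        }
    }

    /** Returns {alive while the lock was held, alive after the lock was released}. */
    static boolean[] demo() {
        Scheduler scheduler = new Scheduler();
        scheduler.start();
        try {
            boolean aliveWhileLockHeld = scheduler.copyFailed(200);
            // copyFailed() has returned, so the monitor is free and the referee can exit.
            scheduler.referee.join(5000);
            boolean aliveAfterRelease = scheduler.referee.isAlive();
            return new boolean[] {aliveWhileLockHeld, aliveAfterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("referee alive while lock held: " + result[0]);
        System.out.println("referee alive after lock released: " + result[1]);
    }
}
```

In this sketch the referee only terminates once copyFailed() has returned and released the monitor, which suggests that the join needs to happen outside the scheduler's lock (or the referee must not need that lock to shut down). Whether that is the right fix for Tez itself is not something this sketch can establish.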



--
This message was sent by Atlassian Jira
(v8.20.7#820007)