Posted to issues@spark.apache.org by "Nicholas Brown (JIRA)" <ji...@apache.org> on 2016/08/05 23:31:20 UTC

[jira] [Created] (SPARK-16929) Bad synchronization with regard to speculation

Nicholas Brown created SPARK-16929:
--------------------------------------

             Summary: Bad synchronization with regard to speculation
                 Key: SPARK-16929
                 URL: https://issues.apache.org/jira/browse/SPARK-16929
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
            Reporter: Nicholas Brown


Our cluster has been running slowly since I got speculation working.  I looked into it and noticed that stderr reported some tasks taking almost an hour to run, even though the application logs on the nodes showed that each of those tasks took only a minute or so.  Digging into the thread dump for the master node, I noticed that a number of threads are blocked, apparently by the speculation thread.  At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while it looks through the tasks to see what needs to be rerun.  Unfortunately that code loops through each of the tasks, so when you have even just a couple hundred thousand tasks to run, the loop can be prohibitively slow to run inside a synchronized block.  Once I disabled speculation, the job went back to having acceptable performance.
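
To illustrate the pattern described above, here is a minimal, hypothetical Java sketch. It is not the TaskSchedulerImpl code: the class, fields, and methods are all invented for illustration. It contrasts scanning every task while holding a shared lock with copying the task list under the lock and scanning the copy afterwards. Whether the second approach would be safe in the real scheduler is exactly the open question, so treat it only as a sketch of the general idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scheduler; none of these names come from Spark.
class SpeculationLockSketch {
    private final Object lock = new Object();
    private final List<Long> taskRuntimesMs = new ArrayList<>();

    // Other threads (e.g. status updates) need the same lock to make progress.
    public void addTask(long runtimeMs) {
        synchronized (lock) {
            taskRuntimesMs.add(runtimeMs);
        }
    }

    // Pattern from the report: the entire scan over all tasks runs inside
    // the synchronized block, so every other lock user waits for the full loop.
    public int countSlowTasksHoldingLock(long thresholdMs) {
        synchronized (lock) {
            return countSlow(taskRuntimesMs, thresholdMs);
        }
    }

    // One possible alternative: hold the lock only long enough to copy the
    // list, then do the per-task work on the snapshot without the lock.
    // The copy is still linear in the number of tasks, but it is the only
    // work done while other threads are blocked.
    public int countSlowTasksWithSnapshot(long thresholdMs) {
        List<Long> snapshot;
        synchronized (lock) {
            snapshot = new ArrayList<>(taskRuntimesMs);
        }
        return countSlow(snapshot, thresholdMs);
    }

    private static int countSlow(List<Long> runtimes, long thresholdMs) {
        int slow = 0;
        for (long runtime : runtimes) {
            if (runtime > thresholdMs) {
                slow++;
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        SpeculationLockSketch sketch = new SpeculationLockSketch();
        sketch.addTask(60_000L);     // about a minute
        sketch.addTask(3_500_000L);  // almost an hour
        System.out.println(sketch.countSlowTasksHoldingLock(600_000L));
        System.out.println(sketch.countSlowTasksWithSnapshot(600_000L));
    }
}
```

Both methods return the same count. The difference is only in how long the lock is held, which is what matters when other threads are waiting on it.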

There are no comments around that lock indicating why it was added, and the git history seems to include a couple of refactorings, so it's hard to find where it was added.  I'm tempted to believe it is the result of someone assuming that an extra synchronized block never hurt anyone, as it looks too broad to be actually guarding against any particular concurrency issue.  (In reality I've probably seen just as many bugs caused by over-synchronization as by too little.)  But since concurrency issues can be tricky to reproduce (and yes, I understand that's an extreme understatement), I'm not sure that blindly removing it without being familiar with the history is necessarily safe.

Can someone look into this?  Or at least make a note in the documentation that speculation should not be used with large clusters?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)