Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:15:27 UTC

[jira] [Resolved] (SPARK-22213) Spark to detect slow executors on nodes with problematic hardware

     [ https://issues.apache.org/jira/browse/SPARK-22213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-22213.
----------------------------------
    Resolution: Incomplete

> Spark to detect slow executors on nodes with problematic hardware
> -----------------------------------------------------------------
>
>                 Key: SPARK-22213
>                 URL: https://issues.apache.org/jira/browse/SPARK-22213
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.0.0
>         Environment: - AWS EMR clusters 
> - window time is 60s
> - several millions of events processed per minute
>            Reporter: Oleksandr Konopko
>            Priority: Major
>              Labels: bulk-closed
>
> Sometimes a newly created cluster contains 1-2 slow nodes. When the average task finishes in 5 seconds, it can take up to 50 seconds to finish on a slow node. As a result, batch processing time increases by 45s.
> In order to avoid that we could use the `speculation` feature, but it seems that it can be improved:
>  
> - The 1st issue with `speculation` is that we do not want to use `speculation` on all tasks, since we have tens of thousands of them during the processing of one batch; spawning several thousand extra tasks would not be resource-efficient. I suggest creating a new parameter, `spark.speculation.mintime`, which would specify the minimal task run time for speculation to be enabled for that task.
> - The 2nd issue is that even if Spark spawns speculative tasks only for long-running ones (longer than 10s, for example), the task on the slow node will still run for some significant time before it is killed, which still makes batch processing time longer than it should be. The solution is to enable `blacklisting` for slow nodes. With speculation and blacklisting combined, only the first 1-2 batches would take more time than expected; after the faulty node is blacklisted, batch processing time is as expected.
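
The proposal above can be sketched as a small decision function. This is not actual Spark scheduler code; `min_time_s` models the hypothetical `spark.speculation.mintime` parameter proposed in this ticket, while `multiplier` and `quantile` mirror the behavior of the existing `spark.speculation.multiplier` and `spark.speculation.quantile` settings.

```python
def should_speculate(runtime_s, finished_runtimes_s, total_tasks,
                     multiplier=1.5, quantile=0.75, min_time_s=10.0):
    """Decide whether to launch a speculative copy of a running task.

    Sketch of the logic proposed in SPARK-22213, not real Spark code.
    """
    # Existing behavior: speculation only kicks in after a quantile
    # of the stage's tasks have finished (spark.speculation.quantile).
    if len(finished_runtimes_s) < quantile * total_tasks:
        return False
    # Existing behavior: a task is a speculation candidate when it is
    # much slower than the median (spark.speculation.multiplier).
    median = sorted(finished_runtimes_s)[len(finished_runtimes_s) // 2]
    is_slow = runtime_s > multiplier * median
    # Proposed behavior: skip speculation entirely for short tasks,
    # so tens of thousands of quick tasks never spawn extra copies.
    is_long_enough = runtime_s >= min_time_s
    return is_slow and is_long_enough
```

With the ticket's numbers (average task ~5s, slow-node task ~50s), a 50s task would qualify for speculation, while a slightly-slow 8s task would be filtered out by the 10s minimum even though it exceeds the multiplier threshold.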



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org