Posted to issues@spark.apache.org by "Al M (JIRA)" <ji...@apache.org> on 2018/07/04 16:18:00 UTC
[jira] [Comment Edited] (SPARK-24474) Cores are left idle when there are a lot of tasks to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532910#comment-16532910 ]
Al M edited comment on SPARK-24474 at 7/4/18 4:17 PM:
------------------------------------------------------
My initial tests suggest that this stops the issue from happening. Thanks! I will perform more tests to make 100% sure that it does not still occur.
I am surprised that this config makes a difference. My tasks are usually quite big, normally taking about a minute each. I would not have expected changing the per-task wait from 3s to 0s to make such a huge difference.
Do you know if there is any unexpected logic around this config setting?
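(For reference: the setting is not named in this excerpt, but the "3s per task" default mentioned above matches spark.locality.wait. A minimal sketch of disabling it, assuming that is the config in question:

    import org.apache.spark.sql.SparkSession

    // Sketch only: assumes the config under discussion is spark.locality.wait
    // (default 3s). Setting it to 0s tells the scheduler not to hold a free
    // core open while waiting for a data-local task slot.
    val spark = SparkSession.builder()
      .appName("locality-wait-test")
      .config("spark.locality.wait", "0s")
      .getOrCreate()

The same value can also be passed on the command line with --conf spark.locality.wait=0s.)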
was (Author: alrocks46):
My initial tests suggest that this stops the issue from happening. Thanks! I will perform more tests to make 100% sure that it does not still occur.
I am surprised that this config makes a difference. My tasks are usually quite big; normally taking about a minute each. I would not have expected a change from waiting 3s per task to 0s per task to make such a huge difference.
Do you know if there is any unusual behaviour around this config setting?
> Cores are left idle when there are a lot of tasks to run
> --------------------------------------------------------
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.2.0
> Reporter: Al M
> Priority: Major
>
> I've observed an issue happening consistently when:
> * A job contains a join of two datasets
> * One dataset is much larger than the other
> * Both datasets require some processing before they are joined
> What I have observed is:
> * 2 stages are initially active to run processing on the two datasets
> ** These stages are run in parallel
> ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks)
> ** Spark allocates a similar (though not exactly equal) number of cores to each stage
> * First stage completes (for the smaller dataset)
> ** Now there is only one stage running
> ** It still has many tasks left (usually > 20k tasks)
> ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
> ** This continues until the second stage completes
> * Second stage completes, and third begins (the stage that actually joins the data)
> ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200)
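> A rough sketch of the shape of job described above (dataset paths, the join key and partition counts are illustrative, not taken from the actual workload):
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.functions.{col, lower}
>
>     val spark = SparkSession.builder().appName("idle-cores-repro").getOrCreate()
>
>     // Larger dataset (e.g. ~30k partitions) with some pre-join processing
>     val big = spark.read.parquet("/data/big").withColumn("key", lower(col("key")))
>
>     // Much smaller dataset (e.g. ~2k partitions) with its own processing
>     val small = spark.read.parquet("/data/small").withColumn("key", lower(col("key")))
>
>     // Two independent stages compute the two inputs; once the smaller one
>     // finishes, roughly half the cores sit idle until the larger one
>     // completes. The join stage itself keeps all cores busy.
>     big.join(small, "key").write.parquet("/data/joined")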
> Other interesting things about this:
> * It seems that when multiple stages are active and one of them finishes, the cores it frees are not released to the still-running stages
> * Once all active stages are done, all cores are released to new stages
> * I can't reproduce this locally on my machine, only on a cluster with YARN enabled
> * It happens when dynamic allocation is enabled, and when it is disabled
> * The stage that hangs (referred to as "Second stage" above) has a lower 'Stage Id' than the first one that completes
> * This happens with spark.shuffle.service.enabled set to true and false
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org