Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/03 15:25:34 UTC

[GitHub] [spark] tgravescs commented on issue #26696: [WIP][SPARK-18886][CORE] Only reset scheduling delay timer if allocated slots are fully utilized

tgravescs commented on issue #26696: [WIP][SPARK-18886][CORE] Only reset scheduling delay timer if allocated slots are fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-561217451
 
 
   
   Can you please expand on the description?  It sounds like you are saying you fill all available slots regardless of locality and then reset the timer?  That would essentially bypass the locality wait for every task in the first "round" of scheduling, which is not what we want.
   
   > , if its task duration is shorter than the locality wait time (3 seconds by default).
   
   It doesn't have to be strictly shorter than the wait time; as long as the harmonics of when tasks finish and new ones get started keep the gaps shorter than the wait, the timer keeps getting reset. I've seen that happen before (toy simulation below).
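   
   To make that concrete, here is a toy simulation (not Spark code; the task duration and launch stagger are made-up numbers) of how staggered local launches can keep resetting the delay timer even though every individual task outlives the wait:
   
   ```scala
   // Toy model of the current reset behavior: every local task launch resets
   // the stage's locality timer. All numbers here are hypothetical.
   object TimerResetHarmonics {
     def main(args: Array[String]): Unit = {
       val localityWaitMs = 3000L  // default spark.locality.wait
       val taskDurationMs = 5000L  // each task runs LONGER than the wait
       val staggerMs      = 2000L  // but local launches happen every 2s
   
       var lastLocalLaunchMs = 0L
       var timerEverExpired  = false
   
       // Ten staggered local launches; check whether the timer ever had a
       // chance to expire between consecutive resets.
       for (i <- 1 to 10) {
         val launchMs = i * staggerMs
         if (launchMs - lastLocalLaunchMs > localityWaitMs) timerEverExpired = true
         lastLocalLaunchMs = launchMs  // each local launch resets the timer
       }
       println(s"tasks run ${taskDurationMs}ms > ${localityWaitMs}ms wait, " +
         s"yet the timer expired: $timerEverExpired") // prints: ... false
     }
   }
   ```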
   
   > A simple solution is: we never reset the timer. When a stage has been waiting long enough for locality, this stage should not wait for locality anymore. However, this may hurt performance if the last task is scheduled to a non-preferred location, and a preferred location becomes available right after this task gets scheduled, 
   
   You're always going to have a race condition where, if you had waited a bit longer, you would have gotten locality; there is no way around that.  I don't think just waiting a flat timeout makes sense, though.  For instance, if I have 10,000 tasks, you are saying we only wait 3 seconds (by default) and then all of them can be scheduled anywhere.  If you only have 10 executors (with 1 core each), only 10 can run in parallel, so that doesn't even try to get locality for the other 9,990 tasks.
   
   I've been thinking about options here, and the one that seems the most straightforward to me is making the delay scheduling per "slot".  Here a "slot" is a place you can put a task.  For example, if I have 1 executor with 10 cores (default 1 cpu per task) then I have 10 slots.  When a slot becomes available, you set a timer based on whatever the current locality level is for that slot.  The reason to use slots and not entire executors is that I could schedule 1 task on an executor with 10 slots, and if the timer were per executor the other 9 slots would be wasted.
   When the timer goes off for that slot, you check whether anything can be scheduled on it at that locality level; if not, you re-arm the timer and fall back to the next locality level (node -> rack).
   This way you aren't wasting available resources.  If you really want your tasks to wait for locality, just set the timeout on those higher.  A rough sketch of the idea is below.
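   
   Something like this (not the real scheduler API; the type names, fallback map, and launch callback are all just illustrative):
   
   ```scala
   // Rough sketch of per-slot delay scheduling. One timer per slot; on expiry
   // the slot falls back one locality level and re-arms. Illustrative only.
   object PerSlotDelaySketch {
     sealed trait LocalityLevel
     case object NodeLocal extends LocalityLevel
     case object RackLocal extends LocalityLevel
     case object AnyLevel  extends LocalityLevel
   
     // Fallback order when a timer expires with no launch: node -> rack -> any.
     private val fallback: Map[LocalityLevel, LocalityLevel] =
       Map(NodeLocal -> RackLocal, RackLocal -> AnyLevel)
   
     // A slot tracks its current locality level and when its timer was armed.
     final case class Slot(executorId: String,
                           level: LocalityLevel,
                           timerArmedAtMs: Long)
   
     /** Called when a slot's timer fires: try to launch at the slot's current
      *  level; if nothing fits, fall back one level and re-arm the timer. */
     def onTimerExpired(slot: Slot,
                        nowMs: Long,
                        tryLaunch: (Slot, LocalityLevel) => Boolean): Slot =
       if (tryLaunch(slot, slot.level)) {
         slot  // a task launched; the slot is busy until that task finishes
       } else {
         fallback.get(slot.level) match {
           case Some(next) => slot.copy(level = next, timerArmedAtMs = nowMs)
           case None       => slot  // already at AnyLevel: take anything pending
         }
       }
   }
   ```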
   
   Another option is a timer per task, but I think that involves more tracking logic and would use more memory.
   
   I think the per-slot logic would also work fine in the job scheduler scenario with the Fair scheduler.
   
   The way Spark does it now just seems broken to me, even in the job scheduler case.  At some point you just want the tasks to run; if you don't, then again, you can simply increase the timeout (example below).  I'm going to go re-read the JIRAs to make sure I'm not missing anything, though.  Note that if someone is really worried about getting rid of the existing logic, we could keep it behind a config, but I personally think we shouldn't.
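   
   For reference, raising the waits with the existing logic looks like this (these are the standard spark.locality.* keys; the 30s values are just examples):
   
   ```scala
   import org.apache.spark.SparkConf
   
   // Make tasks wait longer for locality before falling back a level.
   val conf = new SparkConf()
     .set("spark.locality.wait", "30s")       // base wait, default 3s
     .set("spark.locality.wait.node", "30s")  // node-local -> rack-local
     .set("spark.locality.wait.rack", "30s")  // rack-local -> any
   ```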
   
   thoughts?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org