Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2021/08/01 16:53:38 UTC

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

Hi Andreas,

just to understand the question first: what do you want to achieve by
splitting the map operations across the GPU and CPU?

Also, it would be wonderful to understand the version of Spark you are
using, and your GPU details, a bit more.


Regards,
Gourav

On Sat, Jul 31, 2021 at 9:57 AM Andreas Kunft <an...@gmail.com>
wrote:

> I have a setup with two work-intensive tasks: one map using the GPU,
> followed by a map using only the CPU.
>
> Using stage level resource scheduling, I request a GPU node, but I would
> also like to execute the consecutive CPU map on a different executor, so
> that the GPU node is not blocked.
>
> However, Spark will always combine the two maps due to the narrow
> dependency, and thus I cannot define two different resource requirements.
>
> So the question is: can I force the two map functions onto different
> executors without shuffling? Or, even better, is there a plan to enable
> this by assigning different resource requirements?
>
> Best
>
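
For context, here is a minimal PySpark 3.1 sketch of the setup being
described (untested; gpu_map_fn, cpu_map_fn, and the discovery script path
are placeholders, and stage-level scheduling at 3.1 needs dynamic
allocation on YARN or Kubernetes). It shows how a stage-level GPU request
is attached, and why the narrow dependency keeps both maps in one stage:

    from pyspark import SparkContext
    from pyspark.resource import (ExecutorResourceRequests,
                                  ResourceProfileBuilder,
                                  TaskResourceRequests)

    sc = SparkContext.getOrCreate()

    def gpu_map_fn(x):   # placeholder for the GPU-heavy map
        return x * 2

    def cpu_map_fn(x):   # placeholder for the CPU-heavy map
        return x + 1

    # Request executors with one GPU each, and one GPU per task.
    ereqs = (ExecutorResourceRequests()
             .cores(4)
             .resource("gpu", 1, discoveryScript="/opt/spark/getGpus.sh"))
    treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

    rdd = sc.parallelize(range(1000), 8)

    # Both maps are narrow dependencies, so Spark fuses them into a single
    # stage, and that one stage carries the GPU profile; there is no hook
    # here to give cpu_map_fn a different profile.
    result = (rdd.withResources(gpu_profile)
                 .map(gpu_map_fn)
                 .map(cpu_map_fn)
                 .collect())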

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Andreas,
I know that the NVIDIA team is a wonderful team to reach out to; they
respond quite quickly and help you along the way.

I am not quite sure whether the Spark community leaders will be willing to
let the overall Spark community build native integrations with deep
learning systems. ray.io, which is headed by Berkeley labs, will gain wider
adoption if they stop the Spark community from building those native
integrations: we would then end up using Ray for such integrations, which
would massively increase the adoption of Ray.

But I think that Spark should be building the native integrations with
deep learning libraries, and should not be forced to depend on Horovod,
Ray, or other solutions. NVIDIA is trying to lead the efforts on this
integration, I think, and we should all congratulate them :)

Regards,
Gourav



Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

Posted by Sean Owen <sr...@gmail.com>.
Oh I see, I missed that. You can specify resources at the stage level,
nice. I think what you really want is to break these operations into two
stages. You can do that with a persist or something, which has a cost but
may work fine.
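
One way to read the "persist or something", using the placeholder names
from the sketch earlier in the thread (untested): materialize the GPU
stage's output, then run the CPU map as a separate job. The bluntest
variant writes the intermediate result out and reads it back, so nothing
in the second job requests GPUs at all:

    # Job 1: only the GPU map runs, under the GPU profile.
    (rdd.withResources(gpu_profile)
        .map(gpu_map_fn)
        .saveAsPickleFile("hdfs:///tmp/gpu_out"))   # path is a placeholder

    # Job 2: a plain CPU stage; no RDD here carries a GPU profile, so it
    # runs on default executors and the GPU node is free again.
    (sc.pickleFile("hdfs:///tmp/gpu_out")
       .map(cpu_map_fn)
       .saveAsPickleFile("hdfs:///tmp/final_out"))

A persist() plus a count() would split the stages with less I/O, though it
is less obvious which resource profile the follow-up stage resolves to
when the cached RDD still carries the GPU profile.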

Does it actually help much with GPU utilization? In theory yes, but I
wonder whether these two stages, being so tightly bound, typically execute
at meaningfully different times. Your use case also seems to entail moving
the work across executors, which would add overhead.

A stage is pretty much the lowest level of granularity, so no: Spark does
not schedule functions, it plans stages. I think this is a question of
splitting apart things that can be in one stage (usually a very good
thing) in this rarer case, not a change in Spark.

Yes, DL workloads are important, but distributed DL on Spark is already
well handled by third-party libraries. I'm not sure this is about DL
specifically anyway. Not everything should be in Spark itself.




Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

Posted by Andreas Kunft <an...@gmail.com>.
Hi,

@Sean: Since Spark 3.x, stage level resource scheduling is available:
https://databricks.com/session_na21/stage-level-scheduling-improving-big-data-and-ai-integration

@Gourav: I'm using the latest version of Spark, 3.1.2. I want to split the
two maps onto different executors, as both the GPU function and the CPU
function take quite some time, so it would be great to have element n
being processed in the CPU function while element n + 1 is already being
computed in the GPU function. As a workaround, I write the results of the
GPU task to a queue, which is consumed by another job that executes the
CPU task.
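
The overlap described here is ordinary producer/consumer pipelining. A toy
stand-in (plain Python, no Spark, with made-up sleep times for the two
functions) shows where the throughput gain comes from:

    import queue
    import threading
    import time

    q = queue.Queue(maxsize=4)   # bounded queue gives back-pressure

    def gpu_job():               # stand-in for the producing GPU job
        for n in range(10):
            time.sleep(0.1)      # pretend GPU work on element n
            q.put(n)
        q.put(None)              # sentinel: no more elements

    def cpu_job():               # stand-in for the consuming CPU job
        while True:
            n = q.get()
            if n is None:
                break
            time.sleep(0.1)      # CPU work on n overlaps GPU work on n + 1

    t = threading.Thread(target=gpu_job)
    t.start()
    cpu_job()                    # wall time is roughly 1.1s, not 2.0s
    t.join()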

Do you have any idea whether resource-assignment-based scheduling for
individual functions is a planned feature for the future?

Best
Andreas

