Posted to user@spark.apache.org by rajat kumar <ku...@gmail.com> on 2022/08/13 13:15:35 UTC

Spark with GPU

Hello,

I have been hearing about GPU support in Spark 3.

For batch jobs, will it help to improve performance? Also, is GPU
support available only on Databricks, or also on cloud-based Spark clusters?

I am new to this, so any insight would help.

Thanks
Rajat

Re: Spark with GPU

Posted by Gourav Sengupta <go...@gmail.com>.
One of the best things that could have happened to Spark (now mostly an
overhyped ETL tool with small incremental optimisation changes and no
large-scale innovation) is NVIDIA's release of GPU processing support. You
need some time to get your head around it, but it is supported quite easily
in AWS EMR with a few configuration changes. You can see massive gains,
given that AWS offers several varieties of GPUs.

We can end up saving a lot of time and money by running a few selected
processes on the GPU. There is obviously an easy fallback option to the
CPU.

If you are on AWS, try Athena, Redshift, or Snowflake; they get a lot more
done with less overhead and heartache. I particularly like how native
integration with ML systems like SageMaker works via Redshift queries and
Aurora PostgreSQL; that is true unified data analytics at work.


Regards,
Gourav Sengupta





Re: Spark with GPU

Posted by Alessandro Bellina <ab...@gmail.com>.
This thread may be better suited as a discussion in our Spark plug-in’s
repo:
https://github.com/NVIDIA/spark-rapids/discussions.

Just to answer the questions that were asked so far:

I would recommend checking our documentation for what is supported as of
our latest release (22.06):
https://nvidia.github.io/spark-rapids/docs/supported_ops.html, as we have
quite a bit of support for decimal and also nested types and keep adding
coverage.

For UDFs, if you are willing to rewrite them to use the RAPIDS cuDF API, we
do have support and examples on how to do this; please check out:
https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html.
Automatically translating UDFs to GPUs is not easy. We have a Scala UDF to
Catalyst transpiler that can handle very simple UDFs where every operation
has a corresponding Catalyst expression; it may be worth
checking out:
https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html.
This transpiler falls back if it can’t translate any part of the UDF.
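
As a rough illustration of the all-or-nothing fallback described above, here
is a toy sketch in Python. It is an analogy only, not the spark-rapids
transpiler: the function name to_sql_expr and the table of supported
operators are made up for this example, and it works on Python expression
strings rather than compiled Scala UDFs.

```python
import ast

# Toy analogy only: NOT how the spark-rapids transpiler is implemented.
# Hypothetical table of operations that have a known SQL-like equivalent.
_BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def to_sql_expr(source):
    """Translate a tiny Python expression into a SQL-like string.

    Returns None if any part of the expression has no known equivalent,
    so the caller keeps running the original function unchanged.
    """

    def visit(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            left, right = visit(node.left), visit(node.right)
            if left is None or right is None:
                return None  # one untranslatable part rejects the whole thing
            return f"({left} {_BINOPS[type(node.op)]} {right})"
        if isinstance(node, ast.Name):
            return node.id  # treat bare names as column references
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return repr(node.value)
        return None  # unsupported node type: fall back

    return visit(ast.parse(source, mode="eval").body)


if __name__ == "__main__":
    print(to_sql_expr("x * 2 + 1"))   # ((x * 2) + 1)
    print(to_sql_expr("len(x) + 1"))  # None, because len() is not in the table
```

The point of the sketch is only the shape of the decision: either every
operation maps to an expression, or the original UDF is left as it is.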

The plug-in will not fail in cases where it can’t run part of a query on the
GPU; it will fall back and run on the CPU for the parts of the query that
are not supported. It will also output what it can’t optimize on the driver
(on .explain), which should help narrow down an expression or exec that
should be looked at further.
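
To make the fallback reporting above a bit more concrete, here is a small
Python sketch that assembles the settings I would expect to need. The keys
spark.plugins, spark.rapids.sql.enabled and spark.rapids.sql.explain are
taken from the spark-rapids configuration docs as I understand them, so
please check them against the docs for your release. The helper names
rapids_conf and to_defaults_lines are made up for this example, and
deploying the plug-in jar is not shown.

```python
# Sketch only: helper names are hypothetical; verify keys for your release.
_EXPLAIN_MODES = {"NONE", "NOT_ON_GPU", "ALL"}


def rapids_conf(explain="NOT_ON_GPU"):
    """Return Spark settings that enable the plug-in and fallback reporting."""
    if explain not in _EXPLAIN_MODES:
        raise ValueError(f"explain must be one of {sorted(_EXPLAIN_MODES)}")
    return {
        # Load the RAPIDS Accelerator as a Spark plug-in.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Turn on GPU acceleration of SQL and DataFrame operations.
        "spark.rapids.sql.enabled": "true",
        # NOT_ON_GPU asks for a report of what stayed on the CPU.
        "spark.rapids.sql.explain": explain,
    }


def to_defaults_lines(conf):
    """Render settings as whitespace-separated spark-defaults.conf lines."""
    return [f"{key} {value}" for key, value in sorted(conf.items())]


if __name__ == "__main__":
    print("\n".join(to_defaults_lines(rapids_conf())))
```

The same key/value pairs can also be passed with --conf on spark-submit
instead of going into spark-defaults.conf.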

There are other resources all linked from here:
https://nvidia.github.io/spark-rapids/ (of interest may be the
Qualification Tool, and our Getting Started guide for different cloud
providers and distros).

I’d say let’s continue this in the discussions or as issues in the
spark-rapids repo if you have further questions or run into issues, as it’s
not specific to Apache Spark.

Thanks!

Alessandro


Re: Spark with GPU

Posted by Sean Owen <sr...@gmail.com>.
This isn't a Spark question, but rather a question about whatever Spark
application you are talking about. RAPIDS?


Re: Spark with GPU

Posted by rajat kumar <ku...@gmail.com>.
Thanks Sean.

Also, I observed that lots of things are not supported on the GPU by NVIDIA,
e.g. nested types, decimal types, UDFs, etc.
So, will it automatically use the CPU for tasks that require
nested types, or will it run on the GPU and fail?

Thanks
Rajat


Re: Spark with GPU

Posted by Sean Owen <sr...@gmail.com>.
Spark does not use GPUs itself, but tasks you run on Spark can.
The only 'support' there is is for requesting GPUs as resources for tasks,
so it's just a question of resource management. That's in OSS.
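
For what "requesting GPUs as resources for tasks" can look like in practice,
here is a minimal Python sketch that builds the relevant spark-submit flags.
The three config keys are the Spark 3 resource-scheduling settings as I
understand them. The helper names, the amounts and the discovery script path
are placeholders for illustration, not recommendations.

```python
# Sketch only: helper names, amounts and the script path are placeholders.


def gpu_resource_conf(executor_gpus, task_gpus, discovery_script):
    """Return Spark settings that request GPUs for executors and tasks."""
    return {
        # Number of GPUs each executor asks the cluster manager for.
        "spark.executor.resource.gpu.amount": str(executor_gpus),
        # GPU share each task needs; a fraction lets tasks share one GPU.
        "spark.task.resource.gpu.amount": str(task_gpus),
        # Script the executor runs to find the GPU addresses it owns.
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }


def to_submit_args(conf):
    """Flatten settings into ['--conf', 'key=value', ...] for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    conf = gpu_resource_conf(1, 0.25, "/path/to/getGpusResources.sh")
    print(" ".join(to_submit_args(conf)))
```

With these amounts, four tasks would share the single GPU on each executor.
What those tasks actually do with the GPU is up to the application or
library running inside them.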
