Posted to dev@spark.apache.org by Alessandro Bellina <ab...@gmail.com> on 2023/12/04 14:39:32 UTC
Re: [DISCUSS] SPIP: ShuffleManager short name registration via SparkPlugin

Hello devs,

We are going to table the SPIP proposal, given that we have not seen
responses in the discussion thread. Based on interactions with our users,
we still believe that making custom ShuffleManagers easier to configure is
worthwhile, but we can revisit this later. If anyone on the list has any
additional comments, please feel free to share.

Thank you

Alessandro


On Sun, Nov 5, 2023 at 8:11 AM Alessandro Bellina <ab...@gmail.com>
wrote:

> Thanks for the comments, Reynold. This is an ease-of-use change, and it is
> not absolutely required (as other ease-of-use changes are not required
> either). That said, do we not want to invest in making Spark easier to
> configure for the average user, or even the user that is trying out Spark?
>
> Here are my thoughts:
>
> - Why can we use short names for SortShuffleManager ("sort"), but the same
> can't be extended? If spark.shuffle.manager is meant to be a pluggable API,
> it seems this mapping should be pluggable as well.
>
> - Plugin developers (such as my project) would like to produce a simple
> plugin jar that can be used for all versions of Spark we support, but
> ShuffleManager APIs can change in binary-incompatible ways (it's a
> private API). As a result we document setting spark.shuffle.manager to a
> fully qualified class that is built for each version of Spark we bundle,
> guaranteeing a binary-compatible implementation. Having the ability to
> produce a short name for a fully qualified shuffle manager would remove
> the need for users to look up this mapping.
>
> - ShuffleManager is very flexible (for good reasons) and it can be used to
> move shuffle in several ways, such as RDMA, caching, external stores, etc.
> With this flexibility comes working with other open source projects (such
> as UCX) that have their own configuration system. In this specific example,
> environment variables are needed to set up UCX for use from the JVM and with
> defaults that are particular to our shuffle usage. These configurations, as
> of today, need to be looked up by the user and applied to their
> application, and having a way to set up defaults would greatly improve the
> user experience.
>
> Thanks again for your feedback!
>
> Alessandro
>
> On Sat, Nov 4, 2023 at 6:04 PM Reynold Xin <rx...@databricks.com> wrote:
>
>> Why do we need this? The reason data source APIs need it is because it
>> will be used by very unsophisticated end users and used all the time (for
>> each connection / query). Shuffle is something you set up once, presumably
>> by fairly sophisticated admins / engineers.
>>
>>
>>
>> On Sat, Nov 04, 2023 at 2:42 PM, Alessandro Bellina <ab...@gmail.com>
>> wrote:
>>
>>> Hello devs,
>>>
>>> I would like to start a discussion on the SPIP "ShuffleManager short name
>>> registration via SparkPlugin".
>>>
>>> The idea behind this change is to allow a driver plugin (spark.plugins)
>>> to export ShuffleManagers via short names, along with sensible default
>>> configurations. Users can then use this short name to enable this
>>> ShuffleManager + configs using spark.shuffle.manager.
>>>
>>> SPIP:
>>> https://docs.google.com/document/d/1flijDjMMAAGh2C2k-vg1u651RItaRquLGB_sVudxf6I/edit#heading=h.vqpecs4nrsto
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-45792
>>>
>>> I look forward to hearing your feedback.
>>>
>>> Thanks
>>>
>>> Alessandro
>>>
>>
>>
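
For readers following the thread, the short-name idea can be sketched roughly
as below. This is a minimal illustration in plain Python, not the SPIP's
design and not Spark code. The two built-in entries mirror the short names
Spark already resolves for spark.shuffle.manager ("sort" and "tungsten-sort",
matched case-insensitively). Everything else is an assumption made for the
example: the "example" short name, the com.example class name, the
EXAMPLE_VAR environment variable, the registry shape, and the choice to let
built-in names take precedence.

```python
# Short names Spark itself resolves for spark.shuffle.manager.
BUILTIN_SHORT_NAMES = {
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
    "tungsten-sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
}

# Hypothetical registry a plugin might export: short name -> (class, defaults).
# The class name and the default config below are invented for illustration.
EXAMPLE_PLUGIN_REGISTRY = {
    "example": (
        "com.example.shuffle.spark350.ExampleShuffleManager",
        {"spark.executorEnv.EXAMPLE_VAR": "value"},
    ),
}


def resolve_shuffle_manager(name, plugin_registry=None):
    """Return (fully qualified class name, default configs) for a setting.

    Built-in short names win, then plugin-registered ones; anything else is
    treated as a fully qualified class name and passed through unchanged.
    """
    key = name.lower()
    if key in BUILTIN_SHORT_NAMES:
        return BUILTIN_SHORT_NAMES[key], {}
    registry = plugin_registry or {}
    if key in registry:
        class_name, defaults = registry[key]
        return class_name, dict(defaults)  # copy so callers cannot mutate the registry
    return name, {}
```

With a registry like this, a user would set spark.shuffle.manager to "example"
instead of looking up the version-specific class name, and the defaults would
be merged into the application's configuration unless the user overrides them.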