Posted to dev@spark.apache.org by Yuval Degani <yu...@gmail.com> on 2017/10/10 01:40:28 UTC

[SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine

Dear Spark community,

I would like to call for the review of SPARK-22229: "RDMA Accelerated
Shuffle Engine".

The purpose of this request is to embed an RDMA-accelerated Shuffle Manager
into mainstream Spark.



Such an implementation is available as an external plugin as part of the
"SparkRDMA" project: https://github.com/Mellanox/SparkRDMA.

SparkRDMA has already demonstrated enormous potential for seamlessly
accelerating shuffles, both in benchmarks and in actual production environments.



Adding RDMA capabilities to Spark will be one more important step toward
enabling lower-level acceleration, in the spirit of the "Tungsten" project.



SparkRDMA will be presented at Spark Summit 2017 in Dublin (
https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/
).



JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229

PDF version:
https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf



*Overview*

An RDMA-accelerated shuffle engine can provide enormous performance
benefits to shuffle-intensive Spark jobs, as demonstrated by the
open-source “SparkRDMA” plugin project (
https://github.com/Mellanox/SparkRDMA).

Using RDMA for shuffle significantly improves CPU utilization and reduces
I/O processing overhead by bypassing the kernel and networking stack and
by avoiding memory copies entirely. Those valuable CPU cycles are instead
consumed directly by the actual Spark workloads, helping to reduce job
runtime significantly.

This performance gain has been demonstrated both with the industry-standard
HiBench TeraSort benchmark (a 1.5x speedup in sorting) and with
shuffle-intensive customer applications.




*Background and Motivation*

Spark's current Shuffle engine implementation over “Netty” faces many of the
performance issues commonly seen in other socket-based applications.

Using the standard TCP/IP socket-based communication model for heavy data
transfers usually requires copying the data multiple times and making many
system calls in the I/O path. These consume significant amounts of CPU
cycles and memory that could otherwise be assigned to the actual job at
hand. This becomes even more critical for latency-sensitive Spark Streaming
and Deep Learning applications over SparkML.

RDMA (Remote Direct Memory Access) is already a commodity technology,
supported on most mid-range to high-end network adapter cards from various
manufacturers. Furthermore, RDMA-capable networks are already offered on
public clouds such as Microsoft Azure, and will likely be supported on AWS
soon to appeal to MPI users. Existing Spark users on Microsoft Azure can
already get the benefits of RDMA by running this plugin on a suitable
instance, without any application changes.
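
For reference, enabling the plugin in its current external form is purely a
matter of configuration. The sketch below assumes the RdmaShuffleManager
class name documented by the SparkRDMA project, with the SparkRDMA jar
already on the driver and executor classpaths:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark's pluggable shuffle at the external RDMA implementation.
    val conf = new SparkConf()
      .setAppName("RdmaShuffleExample")
      .set("spark.shuffle.manager",
           "org.apache.spark.shuffle.rdma.RdmaShuffleManager")

    val sc = new SparkContext(conf)

    // Any shuffle-heavy stage (e.g. a wide reduceByKey) now moves its
    // shuffle blocks over RDMA instead of TCP sockets.
    sc.parallelize(1 to 1000000)
      .map(i => (i % 1024, 1L))
      .reduceByKey(_ + _)
      .count()

    sc.stop()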

RDMA provides a unique approach to accessing memory locations over the
network, without copying on either the transmitter or the receiver side.

These remote memory read and write operations are enabled by a standard
interface that has been part of mainstream Linux releases for many years.
This standardized interface allows direct access to remote memory from
user space, skipping costly system calls in the I/O path. Thanks to these
virtues, RDMA has become a standard data-transfer protocol in HPC (High
Performance Computing) applications, with MPI being the most prominent
example.

RDMA has traditionally been associated with InfiniBand networks, but with
the standardization of RDMA over Converged Ethernet (RoCE), it has been
supported and widely used on Ethernet networks for many years.

Since Spark is all about performing everything in memory, RDMA seems like a
perfect fit for filling the gap of transferring intermediate in-memory
data between the participating nodes. SparkRDMA (
https://github.com/Mellanox/SparkRDMA) is an exemplar of how dramatically
shuffle performance can improve with the use of RDMA. Today it is gaining
significant traction with many users, and has demonstrated major
performance improvements on production applications in high-profile
technology companies. SparkRDMA is a generic and easy-to-use plugin;
however, for its effortless acceleration to gain wide adoption, it must be
integrated into Apache Spark itself.

The purpose of this SPIP is to introduce RDMA into mainstream Spark to
improve Shuffle performance, and to pave the way for further accelerations
such as GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMeoF (NVMe
over Fabrics), and more.




*Target Personas*

Any Spark user who cares about performance.



*Goals*

Use RDMA to improve Shuffle data transfer performance and reduce total job
runtime.

Automatically activate RDMA-accelerated shuffles where supported.



*Non-Goals*

This SPIP limits the use of RDMA to Shuffle data transfers only. It is the
first step in introducing RDMA to Spark, and opens a range of possibilities
for future improvements. Future SPIPs can address other network consumers
that can benefit from RDMA, such as, but not limited to: Broadcast, RDD
transfers, RPC messaging, storage access, GPU access, and an HDFS-RDMA
interface.



*API Changes*

No API changes are required.



*Proposed Design*

SparkRDMA currently utilizes the ShuffleManager interface (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala)
to provide RDMA-accelerated Shuffles. This interface is sufficient for
network-savvy users to take advantage of RDMA capabilities in the context
of Spark. However, to make this technology more easily accessible, we
propose to add the code to mainstream Spark and to implement a mechanism
that automatically selects the RdmaShuffleManager when RDMA is supported on
the system. This way, any Spark user who already has hardware support for
RDMA can seamlessly enjoy its performance benefits.
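
For context, the ShuffleManager interface referenced above is small;
abridged from the linked file:

    private[spark] trait ShuffleManager {
      def registerShuffle[K, V, C](
          shuffleId: Int,
          numMaps: Int,
          dependency: ShuffleDependency[K, V, C]): ShuffleHandle
      def getWriter[K, V](
          handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]
      def getReader[K, C](
          handle: ShuffleHandle,
          startPartition: Int,
          endPartition: Int,
          context: TaskContext): ShuffleReader[K, C]
      def unregisterShuffle(shuffleId: Int): Boolean
      def shuffleBlockResolver: ShuffleBlockResolver
      def stop(): Unit
    }

One possible shape for the automatic selection, sketched here purely for
illustration (the probing path and fallback logic are assumptions, not the
proposal's final design), is to probe for RDMA devices at startup and fall
back to the default sort-based shuffle:

    import java.nio.file.{Files, Paths}

    // Hypothetical auto-selection: prefer the RDMA shuffle manager when the
    // host exposes an RDMA-capable device (Linux lists them under sysfs),
    // otherwise keep the default sort-based shuffle.
    def rdmaDevicePresent: Boolean = {
      val sysfs = Paths.get("/sys/class/infiniband")
      Files.isDirectory(sysfs) && Files.list(sysfs).count() > 0
    }

    val shuffleManagerClass =
      if (rdmaDevicePresent) "org.apache.spark.shuffle.rdma.RdmaShuffleManager"
      else "org.apache.spark.shuffle.sort.SortShuffleManager"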

Furthermore, SparkRDMA in its current plugin form is limited by several
constraints that can be removed once it is introduced into mainstream
Spark. Among those are:

·         SparkRDMA manages its own memory off-heap. When integrated into
Spark, it could use Tungsten physical memory for all of its needs, allowing
faster allocations and memory registrations that can increase performance
significantly. Moreover, any data that already resides in Tungsten memory
could be transferred with almost no overhead (see the sketch after this
list).

·         MapStatuses become redundant, eliminating extra transfers that
take precious seconds in many jobs.
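
To make the Tungsten point concrete, here is a minimal sketch of drawing a
transfer buffer from Tungsten's off-heap allocator. Mapping "Tungsten
physical memory" onto Spark's MemoryAllocator is this author's
illustration, and the NIC registration call is vendor-specific and omitted:

    import org.apache.spark.unsafe.memory.{MemoryAllocator, MemoryBlock}

    // Allocate a 4 MiB block from Tungsten's off-heap (unsafe) allocator
    // instead of a plugin-private memory pool.
    val block: MemoryBlock = MemoryAllocator.UNSAFE.allocate(4L * 1024 * 1024)

    // Off-heap blocks have no base object, so the offset is a raw address.
    // An integrated engine could register (address, length) with the RDMA
    // NIC once and then serve zero-copy transfers out of this region.
    val address: Long = block.getBaseOffset
    val length: Long = block.size()

    // ... hand (address, length) to the RDMA transfer layer ...

    MemoryAllocator.UNSAFE.free(block)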



*Rejected Designs*

Support RDMA with the SparkRDMA plugin:

·         The SparkRDMA plugin approach introduces limitations and overhead
that reduce performance

·         Plugins are awkward to build, install, and deploy, and are
therefore usually avoided

·         Forward compatibility is difficult to maintain for plugins
outside the upstream project, especially for a rapidly changing project
like Spark

To ensure maximum performance and to allow mass adoption of this general
solution, RDMA capabilities must be introduced into Spark itself.