Posted to dev@spark.apache.org by Michael Slavitch <sl...@gmail.com> on 2016/04/01 20:32:05 UTC

Eliminating shuffle write and spill disk IO reads/writes in Spark

Hello;

I’m working on Spark with very large memory systems (2TB+) and notice that Spark spills to disk during shuffle.  Is there a way to force Spark to stay in memory when doing shuffle operations?  The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x) and never touch the IO subsystem.  I am willing to have the job fail if it runs out of RAM.

The spark.shuffle.spill setting is deprecated in 1.6 and does not work with the Tungsten sort shuffle manager in 1.5.x:

"WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this is ignored by the tungsten-sort shuffle manager; its optimized shuffles will continue to spill to disk when necessary.”

If this is impossible via configuration changes, what code changes would be needed to accomplish this?
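
For reference, this is roughly the configuration I have been trying (a
sketch only; the app name and off-heap size are placeholders, not my
actual job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the settings attempted; values here are placeholders.
    val conf = new SparkConf()
      .setAppName("shuffle-in-memory-test")
      .set("spark.shuffle.manager", "tungsten-sort")  // 1.5.x name; folded into "sort" in 1.6
      .set("spark.shuffle.spill", "false")            // deprecated in 1.6, ignored by tungsten-sort
      .set("spark.memory.offHeap.enabled", "true")    // 1.6.x off-heap memory
      .set("spark.memory.offHeap.size", "512g")
    val sc = new SparkContext(conf)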





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Reynold Xin <rx...@databricks.com>.
If you work for a certain hardware vendor that builds expensive,
high-performance nodes, and want to use Spark to demonstrate the
performance gains of your great new systems, you will of course totally
disagree.

Anyway - I offered you a simple solution to pick the low-hanging fruit.
Feel free to totally disagree and reject it. Yes, you might see problems
with the kernel being unable to manage the buffer pool as well as Spark
itself could, but you also might not, because most of the software stack
(not just Spark, but software in general) is inefficient and far from
what the hardware can do at its limit, so minor, or sometimes even major,
imperfections somewhere in the stack aren't necessarily a problem.

For example, you will find that the network software stack in Spark (or
in the majority of open source projects) actually won't be able to
saturate a 10G network in practical jobs, let alone 40G. Decryption,
deserialization, and data processing themselves can be expensive enough
that it doesn't really matter how high your disk or network throughput
is.

While I think 40G networks are coming, they are far from ubiquitous. Does
that mean we shouldn't care? No. But it takes time and resources to
address these things, and in most cases they are not actually the
bottleneck. It is not as simple as putting the data in memory, because
we'd need to build a bunch of machinery to share that limited memory with
the execution part, which has been by far the largest bottleneck.


So what does it take to improve this?

First and foremost, we would need to substantially speed up the execution
engine itself. We are making great progress on this in Spark 2.0. For a
lot of common SQL-like operations, Spark 2.0 can be pretty fast (e.g.
filtering 1 billion records a second, or joining 100 million records a
second, using a single core).

However, I still don't think disk vs. memory matters much for temporary
shuffle files in a moderately sized cluster with SSDs, until we rewrite
the network stack to be able to sustainably saturate 10G links. Spark was
able to do that two years ago when I first implemented the current
network module, but I'm sure that after two years of feature development,
bug fixes, and security improvements, the network module can no longer do
that. Why haven't we fixed it yet? Because most workloads don't shuffle
enormous amounts of data, and when they do, they are not bounded by a
slow network stack (either software or hardware).


One environment where this issue would matter a lot is clusters with a
small number of nodes. In the most extreme case, with only a single node,
the current way we do shuffle in Spark is one to two orders of magnitude
slower than a simple in-memory data partitioning algorithm (e.g. one
based on radix sort). Addressing that could speed up certain Spark
workloads (large joins, large aggregations) quite a bit.
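
To make that last point concrete, here is a toy sketch (plain Scala, not
Spark internals) of the kind of single-node, in-memory partitioning I
mean; a real implementation would work on binary rows and use radix
partitioning on the key bytes rather than hashing into growable buffers:

    import scala.collection.mutable.ArrayBuffer

    // Toy single-node "shuffle": partition key/value pairs entirely in memory.
    def partitionInMemory[K, V](records: Iterator[(K, V)],
                                numPartitions: Int): Array[ArrayBuffer[(K, V)]] = {
      val buckets = Array.fill(numPartitions)(new ArrayBuffer[(K, V)])
      records.foreach { case (k, v) =>
        // Non-negative modulo of the key's hash picks the bucket.
        val p = ((k.hashCode % numPartitions) + numPartitions) % numPartitions
        buckets(p) += ((k, v))
      }
      buckets
    }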




On Fri, Apr 1, 2016 at 2:22 PM, Reynold Xin <rx...@databricks.com> wrote:

> Sure - feel free to totally disagree.
>
>
> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
> wrote:
>
>> I totally disagree that it’s not a problem.
>>
>> - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
>> drives.
>> - What Spark is depending on is Linux’s IO cache as an effective buffer
>> pool  This is fine for small jobs but not for jobs with datasets in the
>> TB/node range.
>> - On larger jobs flushing the cache causes Linux to block.
>> - On a modern 56-hyperthread 2-socket host the latency caused by multiple
>> executors writing out to disk increases greatly.
>>
>> I thought the whole point of Spark was in-memory computing?  It’s in fact
>> in-memory for some things but  use spark.local.dir as a buffer pool of
>> others.
>>
>> *Hence, the performance of  Spark is gated by the performance of
>> spark.local.dir, even on large memory systems.*
>>
>> "Currently it is not possible to not write shuffle files to disk.”
>>
>> What changes >would< make it possible?
>>
>> The only one that seems possible is to clone the shuffle service and make
>> it in-memory.
>>
>>
>>
>>
>>
>> On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> spark.shuffle.spill actually has nothing to do with whether we write
>> shuffle files to disk. Currently it is not possible to not write shuffle
>> files to disk, and typically it is not a problem because the network fetch
>> throughput is lower than what disks can sustain. In most cases, especially
>> with SSDs, there is little difference between putting all of those in
>> memory and on disk.
>>
>> However, it is becoming more common to run Spark on a few number of beefy
>> nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>> improving performance for those. Meantime, you can setup local ramdisks on
>> each node for shuffle writes.
>>
>>
>>
>> On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
>> wrote:
>>
>>> Hello;
>>>
>>> I’m working on spark with very large memory systems (2TB+) and notice
>>> that Spark spills to disk in shuffle.  Is there a way to force spark to
>>> stay in memory when doing shuffle operations?   The goal is to keep the
>>> shuffle data either in the heap or in off-heap memory (in 1.6.x) and never
>>> touch the IO subsystem.  I am willing to have the job fail if it runs out
>>> of RAM.
>>>
>>> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>> Tungsten sort in 1.5.x
>>>
>>> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
>>> this is ignored by the tungsten-sort shuffle manager; its optimized
>>> shuffles will continue to spill to disk when necessary.”
>>>
>>> If this is impossible via configuration changes what code changes would
>>> be needed to accomplish this?
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>>
>

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Reynold Xin <rx...@databricks.com>.
Sure - feel free to totally disagree.


On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com> wrote:

> I totally disagree that it’s not a problem.
>
> - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
> drives.
> - What Spark is depending on is Linux’s IO cache as an effective buffer
> pool  This is fine for small jobs but not for jobs with datasets in the
> TB/node range.
> - On larger jobs flushing the cache causes Linux to block.
> - On a modern 56-hyperthread 2-socket host the latency caused by multiple
> executors writing out to disk increases greatly.
>
> I thought the whole point of Spark was in-memory computing?  It’s in fact
> in-memory for some things but  use spark.local.dir as a buffer pool of
> others.
>
> *Hence, the performance of  Spark is gated by the performance of
> spark.local.dir, even on large memory systems.*
>
> "Currently it is not possible to not write shuffle files to disk.”
>
> What changes >would< make it possible?
>
> The only one that seems possible is to clone the shuffle service and make
> it in-memory.
>
>
>
>
>
> On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> spark.shuffle.spill actually has nothing to do with whether we write
> shuffle files to disk. Currently it is not possible to not write shuffle
> files to disk, and typically it is not a problem because the network fetch
> throughput is lower than what disks can sustain. In most cases, especially
> with SSDs, there is little difference between putting all of those in
> memory and on disk.
>
> However, it is becoming more common to run Spark on a few number of beefy
> nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
> improving performance for those. Meantime, you can setup local ramdisks on
> each node for shuffle writes.
>
>
>
> On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
> wrote:
>
>> Hello;
>>
>> I’m working on spark with very large memory systems (2TB+) and notice
>> that Spark spills to disk in shuffle.  Is there a way to force spark to
>> stay in memory when doing shuffle operations?   The goal is to keep the
>> shuffle data either in the heap or in off-heap memory (in 1.6.x) and never
>> touch the IO subsystem.  I am willing to have the job fail if it runs out
>> of RAM.
>>
>> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>> Tungsten sort in 1.5.x
>>
>> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
>> this is ignored by the tungsten-sort shuffle manager; its optimized
>> shuffles will continue to spill to disk when necessary.”
>>
>> If this is impossible via configuration changes what code changes would
>> be needed to accomplish this?
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Saisai Shao <sa...@gmail.com>.
So I think a ramdisk is a simple way to try.

Besides, I think Reynold's suggestion is quite valid: on such a high-end
machine, putting everything in memory might not improve performance as
much as assumed, since the bottleneck will shift elsewhere, e.g. memory
bandwidth, NUMA, or CPU efficiency (serialization/deserialization, data
processing, ...). The code design would also need to take such a usage
scenario into account to use the resources more efficiently.
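
For example, a minimal sketch of pointing Spark's scratch space at a
ramdisk (assuming a tmpfs mount such as /mnt/ramdisk already exists on
every node; the path is only an example):

    import org.apache.spark.SparkConf

    // Shuffle files and spill files go to spark.local.dir, so pointing it
    // at a tmpfs mount keeps them in RAM. The mount must be large enough
    // for the whole shuffle output. On YARN the node manager's local dirs
    // are used instead of this setting.
    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/ramdisk/spark")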

Thanks
Saisai

On Sat, Apr 2, 2016 at 7:27 AM, Michael Slavitch <sl...@gmail.com> wrote:

> Yes we see it on final write.  Our preference is to eliminate this.
>
>
> On Fri, Apr 1, 2016, 7:25 PM Saisai Shao <sa...@gmail.com> wrote:
>
>> Hi Michael, shuffle data (mapper output) have to be materialized into
>> disk finally, no matter how large memory you have, it is the design purpose
>> of Spark. In you scenario, since you have a big memory, shuffle spill
>> should not happen frequently, most of the disk IO you see might be final
>> shuffle file write.
>>
>> So if you want to avoid this disk IO, you could use ramdisk as Reynold
>> suggested. If you want to avoid FS overhead of ramdisk, you could try to
>> hack a new shuffle implementation, since shuffle framework is pluggable.
>>
>>
>> On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <sl...@gmail.com>
>> wrote:
>>
>>> As I mentioned earlier this flag is now ignored.
>>>
>>>
>>> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <sl...@gmail.com>
>>> wrote:
>>>
>>>> Shuffling a 1tb set of keys and values (aka sort by key)  results in
>>>> about 500gb of io to disk if compression is enabled. Is there any way to
>>>> eliminate shuffling causing io?
>>>>
>>>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>>> Michael - I'm not sure if you actually read my email, but spill has
>>>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>>>> memory when data doesn't fit in memory.
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> RAMdisk is a fine interim step but there is a lot of layers
>>>>>> eliminated by keeping things in memory unless there is need for spillover.
>>>>>>   At one time there was support for turning off spilling.  That was
>>>>>> eliminated.  Why?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think Reynold's suggestion of using ram disk would be a good way to
>>>>>>> test if these are the bottlenecks or something else is.
>>>>>>> For most practical purposes, pointing local dir to ramdisk should
>>>>>>> effectively give you 'similar' performance as shuffling from memory.
>>>>>>>
>>>>>>> Are there concerns with taking that approach to test ? (I dont see
>>>>>>> any, but I am not sure if I missed something).
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
>>>>>>> wrote:
>>>>>>> > I totally disagree that it’s not a problem.
>>>>>>> >
>>>>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput
>>>>>>> of NVME
>>>>>>> > drives.
>>>>>>> > - What Spark is depending on is Linux’s IO cache as an effective
>>>>>>> buffer pool
>>>>>>> > This is fine for small jobs but not for jobs with datasets in the
>>>>>>> TB/node
>>>>>>> > range.
>>>>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>>>>> multiple
>>>>>>> > executors writing out to disk increases greatly.
>>>>>>> >
>>>>>>> > I thought the whole point of Spark was in-memory computing?  It’s
>>>>>>> in fact
>>>>>>> > in-memory for some things but  use spark.local.dir as a buffer
>>>>>>> pool of
>>>>>>> > others.
>>>>>>> >
>>>>>>> > Hence, the performance of  Spark is gated by the performance of
>>>>>>> > spark.local.dir, even on large memory systems.
>>>>>>> >
>>>>>>> > "Currently it is not possible to not write shuffle files to disk.”
>>>>>>> >
>>>>>>> > What changes >would< make it possible?
>>>>>>> >
>>>>>>> > The only one that seems possible is to clone the shuffle service
>>>>>>> and make it
>>>>>>> > in-memory.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > spark.shuffle.spill actually has nothing to do with whether we
>>>>>>> write shuffle
>>>>>>> > files to disk. Currently it is not possible to not write shuffle
>>>>>>> files to
>>>>>>> > disk, and typically it is not a problem because the network fetch
>>>>>>> throughput
>>>>>>> > is lower than what disks can sustain. In most cases, especially
>>>>>>> with SSDs,
>>>>>>> > there is little difference between putting all of those in memory
>>>>>>> and on
>>>>>>> > disk.
>>>>>>> >
>>>>>>> > However, it is becoming more common to run Spark on a few number
>>>>>>> of beefy
>>>>>>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>>>>>>> improving
>>>>>>> > performance for those. Meantime, you can setup local ramdisks on
>>>>>>> each node
>>>>>>> > for shuffle writes.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <
>>>>>>> slavitch@gmail.com>
>>>>>>> > wrote:
>>>>>>> >>
>>>>>>> >> Hello;
>>>>>>> >>
>>>>>>> >> I’m working on spark with very large memory systems (2TB+) and
>>>>>>> notice that
>>>>>>> >> Spark spills to disk in shuffle.  Is there a way to force spark
>>>>>>> to stay in
>>>>>>> >> memory when doing shuffle operations?   The goal is to keep the
>>>>>>> shuffle data
>>>>>>> >> either in the heap or in off-heap memory (in 1.6.x) and never
>>>>>>> touch the IO
>>>>>>> >> subsystem.  I am willing to have the job fail if it runs out of
>>>>>>> RAM.
>>>>>>> >>
>>>>>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work
>>>>>>> in
>>>>>>> >> Tungsten sort in 1.5.x
>>>>>>> >>
>>>>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false,
>>>>>>> but this
>>>>>>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>>>>>>> shuffles will
>>>>>>> >> continue to spill to disk when necessary.”
>>>>>>> >>
>>>>>>> >> If this is impossible via configuration changes what code changes
>>>>>>> would be
>>>>>>> >> needed to accomplish this?
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>> >> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>> >>
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>> --
>>>>>> Michael Slavitch
>>>>>> 62 Renfrew Ave.
>>>>>> Ottawa Ontario
>>>>>> K1S 1Z5
>>>>>>
>>>>>
>>>>> --
>>>> Michael Slavitch
>>>> 62 Renfrew Ave.
>>>> Ottawa Ontario
>>>> K1S 1Z5
>>>>
>>> --
>>> Michael Slavitch
>>> 62 Renfrew Ave.
>>> Ottawa Ontario
>>> K1S 1Z5
>>>
>>
>> --
> Michael Slavitch
> 62 Renfrew Ave.
> Ottawa Ontario
> K1S 1Z5
>

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Michael Slavitch <sl...@gmail.com>.
Yes, we see it on the final write.  Our preference is to eliminate this.

On Fri, Apr 1, 2016, 7:25 PM Saisai Shao <sa...@gmail.com> wrote:

> Hi Michael, shuffle data (mapper output) have to be materialized into disk
> finally, no matter how large memory you have, it is the design purpose of
> Spark. In you scenario, since you have a big memory, shuffle spill should
> not happen frequently, most of the disk IO you see might be final shuffle
> file write.
>
> So if you want to avoid this disk IO, you could use ramdisk as Reynold
> suggested. If you want to avoid FS overhead of ramdisk, you could try to
> hack a new shuffle implementation, since shuffle framework is pluggable.
>
>
> On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <sl...@gmail.com>
> wrote:
>
>> As I mentioned earlier this flag is now ignored.
>>
>>
>> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <sl...@gmail.com> wrote:
>>
>>> Shuffling a 1tb set of keys and values (aka sort by key)  results in
>>> about 500gb of io to disk if compression is enabled. Is there any way to
>>> eliminate shuffling causing io?
>>>
>>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> Michael - I'm not sure if you actually read my email, but spill has
>>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>>> memory when data doesn't fit in memory.
>>>>
>>>>
>>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com>
>>>> wrote:
>>>>
>>>>> RAMdisk is a fine interim step but there is a lot of layers eliminated
>>>>> by keeping things in memory unless there is need for spillover.   At one
>>>>> time there was support for turning off spilling.  That was eliminated.
>>>>> Why?
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think Reynold's suggestion of using ram disk would be a good way to
>>>>>> test if these are the bottlenecks or something else is.
>>>>>> For most practical purposes, pointing local dir to ramdisk should
>>>>>> effectively give you 'similar' performance as shuffling from memory.
>>>>>>
>>>>>> Are there concerns with taking that approach to test ? (I dont see
>>>>>> any, but I am not sure if I missed something).
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
>>>>>> wrote:
>>>>>> > I totally disagree that it’s not a problem.
>>>>>> >
>>>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput
>>>>>> of NVME
>>>>>> > drives.
>>>>>> > - What Spark is depending on is Linux’s IO cache as an effective
>>>>>> buffer pool
>>>>>> > This is fine for small jobs but not for jobs with datasets in the
>>>>>> TB/node
>>>>>> > range.
>>>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>>>> multiple
>>>>>> > executors writing out to disk increases greatly.
>>>>>> >
>>>>>> > I thought the whole point of Spark was in-memory computing?  It’s
>>>>>> in fact
>>>>>> > in-memory for some things but  use spark.local.dir as a buffer pool
>>>>>> of
>>>>>> > others.
>>>>>> >
>>>>>> > Hence, the performance of  Spark is gated by the performance of
>>>>>> > spark.local.dir, even on large memory systems.
>>>>>> >
>>>>>> > "Currently it is not possible to not write shuffle files to disk.”
>>>>>> >
>>>>>> > What changes >would< make it possible?
>>>>>> >
>>>>>> > The only one that seems possible is to clone the shuffle service
>>>>>> and make it
>>>>>> > in-memory.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > spark.shuffle.spill actually has nothing to do with whether we
>>>>>> write shuffle
>>>>>> > files to disk. Currently it is not possible to not write shuffle
>>>>>> files to
>>>>>> > disk, and typically it is not a problem because the network fetch
>>>>>> throughput
>>>>>> > is lower than what disks can sustain. In most cases, especially
>>>>>> with SSDs,
>>>>>> > there is little difference between putting all of those in memory
>>>>>> and on
>>>>>> > disk.
>>>>>> >
>>>>>> > However, it is becoming more common to run Spark on a few number of
>>>>>> beefy
>>>>>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>>>>>> improving
>>>>>> > performance for those. Meantime, you can setup local ramdisks on
>>>>>> each node
>>>>>> > for shuffle writes.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <
>>>>>> slavitch@gmail.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> Hello;
>>>>>> >>
>>>>>> >> I’m working on spark with very large memory systems (2TB+) and
>>>>>> notice that
>>>>>> >> Spark spills to disk in shuffle.  Is there a way to force spark to
>>>>>> stay in
>>>>>> >> memory when doing shuffle operations?   The goal is to keep the
>>>>>> shuffle data
>>>>>> >> either in the heap or in off-heap memory (in 1.6.x) and never
>>>>>> touch the IO
>>>>>> >> subsystem.  I am willing to have the job fail if it runs out of
>>>>>> RAM.
>>>>>> >>
>>>>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>>>>> >> Tungsten sort in 1.5.x
>>>>>> >>
>>>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false,
>>>>>> but this
>>>>>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>>>>>> shuffles will
>>>>>> >> continue to spill to disk when necessary.”
>>>>>> >>
>>>>>> >> If this is impossible via configuration changes what code changes
>>>>>> would be
>>>>>> >> needed to accomplish this?
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> ---------------------------------------------------------------------
>>>>>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>> >> For additional commands, e-mail: user-help@spark.apache.org
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>>
>>>>> --
>>>>> Michael Slavitch
>>>>> 62 Renfrew Ave.
>>>>> Ottawa Ontario
>>>>> K1S 1Z5
>>>>>
>>>>
>>>> --
>>> Michael Slavitch
>>> 62 Renfrew Ave.
>>> Ottawa Ontario
>>> K1S 1Z5
>>>
>> --
>> Michael Slavitch
>> 62 Renfrew Ave.
>> Ottawa Ontario
>> K1S 1Z5
>>
>
> --
Michael Slavitch
62 Renfrew Ave.
Ottawa Ontario
K1S 1Z5

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Saisai Shao <sa...@gmail.com>.
Hi Michael, shuffle data (mapper output) has to be materialized to disk
eventually, no matter how much memory you have; that is by design in
Spark. In your scenario, since you have a lot of memory, shuffle spill
should not happen frequently, and most of the disk IO you see is probably
the final shuffle file write.

So if you want to avoid this disk IO, you could use a ramdisk as Reynold
suggested. If you want to avoid the filesystem overhead of a ramdisk, you
could try to write a new shuffle implementation, since the shuffle
framework is pluggable.
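
If you go the custom-shuffle route, the entry point is the
spark.shuffle.manager setting; besides the built-in short names ("sort",
"hash", and "tungsten-sort" in 1.5.x) it should also accept a fully
qualified class name, roughly like this (the class below is hypothetical
and would have to implement org.apache.spark.shuffle.ShuffleManager):

    import org.apache.spark.SparkConf

    // Hypothetical in-memory shuffle manager plugged in by class name;
    // Spark instantiates it via reflection when the SparkEnv is created.
    val conf = new SparkConf()
      .set("spark.shuffle.manager", "com.example.shuffle.InMemoryShuffleManager")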


On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <sl...@gmail.com> wrote:

> As I mentioned earlier this flag is now ignored.
>
>
> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <sl...@gmail.com> wrote:
>
>> Shuffling a 1tb set of keys and values (aka sort by key)  results in
>> about 500gb of io to disk if compression is enabled. Is there any way to
>> eliminate shuffling causing io?
>>
>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Michael - I'm not sure if you actually read my email, but spill has
>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>> memory when data doesn't fit in memory.
>>>
>>>
>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com>
>>> wrote:
>>>
>>>> RAMdisk is a fine interim step but there is a lot of layers eliminated
>>>> by keeping things in memory unless there is need for spillover.   At one
>>>> time there was support for turning off spilling.  That was eliminated.
>>>> Why?
>>>>
>>>>
>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think Reynold's suggestion of using ram disk would be a good way to
>>>>> test if these are the bottlenecks or something else is.
>>>>> For most practical purposes, pointing local dir to ramdisk should
>>>>> effectively give you 'similar' performance as shuffling from memory.
>>>>>
>>>>> Are there concerns with taking that approach to test ? (I dont see
>>>>> any, but I am not sure if I missed something).
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
>>>>> wrote:
>>>>> > I totally disagree that it’s not a problem.
>>>>> >
>>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
>>>>> NVME
>>>>> > drives.
>>>>> > - What Spark is depending on is Linux’s IO cache as an effective
>>>>> buffer pool
>>>>> > This is fine for small jobs but not for jobs with datasets in the
>>>>> TB/node
>>>>> > range.
>>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>>> multiple
>>>>> > executors writing out to disk increases greatly.
>>>>> >
>>>>> > I thought the whole point of Spark was in-memory computing?  It’s in
>>>>> fact
>>>>> > in-memory for some things but  use spark.local.dir as a buffer pool
>>>>> of
>>>>> > others.
>>>>> >
>>>>> > Hence, the performance of  Spark is gated by the performance of
>>>>> > spark.local.dir, even on large memory systems.
>>>>> >
>>>>> > "Currently it is not possible to not write shuffle files to disk.”
>>>>> >
>>>>> > What changes >would< make it possible?
>>>>> >
>>>>> > The only one that seems possible is to clone the shuffle service and
>>>>> make it
>>>>> > in-memory.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>>> >
>>>>> > spark.shuffle.spill actually has nothing to do with whether we write
>>>>> shuffle
>>>>> > files to disk. Currently it is not possible to not write shuffle
>>>>> files to
>>>>> > disk, and typically it is not a problem because the network fetch
>>>>> throughput
>>>>> > is lower than what disks can sustain. In most cases, especially with
>>>>> SSDs,
>>>>> > there is little difference between putting all of those in memory
>>>>> and on
>>>>> > disk.
>>>>> >
>>>>> > However, it is becoming more common to run Spark on a few number of
>>>>> beefy
>>>>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>>>>> improving
>>>>> > performance for those. Meantime, you can setup local ramdisks on
>>>>> each node
>>>>> > for shuffle writes.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <
>>>>> slavitch@gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> Hello;
>>>>> >>
>>>>> >> I’m working on spark with very large memory systems (2TB+) and
>>>>> notice that
>>>>> >> Spark spills to disk in shuffle.  Is there a way to force spark to
>>>>> stay in
>>>>> >> memory when doing shuffle operations?   The goal is to keep the
>>>>> shuffle data
>>>>> >> either in the heap or in off-heap memory (in 1.6.x) and never touch
>>>>> the IO
>>>>> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
>>>>> >>
>>>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>>>> >> Tungsten sort in 1.5.x
>>>>> >>
>>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false,
>>>>> but this
>>>>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>>>>> shuffles will
>>>>> >> continue to spill to disk when necessary.”
>>>>> >>
>>>>> >> If this is impossible via configuration changes what code changes
>>>>> would be
>>>>> >> needed to accomplish this?
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> ---------------------------------------------------------------------
>>>>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> >> For additional commands, e-mail: user-help@spark.apache.org
>>>>> >>
>>>>> >
>>>>> >
>>>>>
>>>> --
>>>> Michael Slavitch
>>>> 62 Renfrew Ave.
>>>> Ottawa Ontario
>>>> K1S 1Z5
>>>>
>>>
>>> --
>> Michael Slavitch
>> 62 Renfrew Ave.
>> Ottawa Ontario
>> K1S 1Z5
>>
> --
> Michael Slavitch
> 62 Renfrew Ave.
> Ottawa Ontario
> K1S 1Z5
>

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Michael Slavitch <sl...@gmail.com>.
As I mentioned earlier, this flag is now ignored.

On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <sl...@gmail.com> wrote:

> Shuffling a 1tb set of keys and values (aka sort by key)  results in about
> 500gb of io to disk if compression is enabled. Is there any way to
> eliminate shuffling causing io?
>
> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <rx...@databricks.com> wrote:
>
>> Michael - I'm not sure if you actually read my email, but spill has
>> nothing to do with the shuffle files on disk. It was for the partitioning
>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>> memory when data doesn't fit in memory.
>>
>>
>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com>
>> wrote:
>>
>>> RAMdisk is a fine interim step but there is a lot of layers eliminated
>>> by keeping things in memory unless there is need for spillover.   At one
>>> time there was support for turning off spilling.  That was eliminated.
>>> Why?
>>>
>>>
>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com>
>>> wrote:
>>>
>>>> I think Reynold's suggestion of using ram disk would be a good way to
>>>> test if these are the bottlenecks or something else is.
>>>> For most practical purposes, pointing local dir to ramdisk should
>>>> effectively give you 'similar' performance as shuffling from memory.
>>>>
>>>> Are there concerns with taking that approach to test ? (I dont see
>>>> any, but I am not sure if I missed something).
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
>>>> wrote:
>>>> > I totally disagree that it’s not a problem.
>>>> >
>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
>>>> NVME
>>>> > drives.
>>>> > - What Spark is depending on is Linux’s IO cache as an effective
>>>> buffer pool
>>>> > This is fine for small jobs but not for jobs with datasets in the
>>>> TB/node
>>>> > range.
>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>> multiple
>>>> > executors writing out to disk increases greatly.
>>>> >
>>>> > I thought the whole point of Spark was in-memory computing?  It’s in
>>>> fact
>>>> > in-memory for some things but  use spark.local.dir as a buffer pool of
>>>> > others.
>>>> >
>>>> > Hence, the performance of  Spark is gated by the performance of
>>>> > spark.local.dir, even on large memory systems.
>>>> >
>>>> > "Currently it is not possible to not write shuffle files to disk.”
>>>> >
>>>> > What changes >would< make it possible?
>>>> >
>>>> > The only one that seems possible is to clone the shuffle service and
>>>> make it
>>>> > in-memory.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>> >
>>>> > spark.shuffle.spill actually has nothing to do with whether we write
>>>> shuffle
>>>> > files to disk. Currently it is not possible to not write shuffle
>>>> files to
>>>> > disk, and typically it is not a problem because the network fetch
>>>> throughput
>>>> > is lower than what disks can sustain. In most cases, especially with
>>>> SSDs,
>>>> > there is little difference between putting all of those in memory and
>>>> on
>>>> > disk.
>>>> >
>>>> > However, it is becoming more common to run Spark on a few number of
>>>> beefy
>>>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>>>> improving
>>>> > performance for those. Meantime, you can setup local ramdisks on each
>>>> node
>>>> > for shuffle writes.
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavitch@gmail.com
>>>> >
>>>> > wrote:
>>>> >>
>>>> >> Hello;
>>>> >>
>>>> >> I’m working on spark with very large memory systems (2TB+) and
>>>> notice that
>>>> >> Spark spills to disk in shuffle.  Is there a way to force spark to
>>>> stay in
>>>> >> memory when doing shuffle operations?   The goal is to keep the
>>>> shuffle data
>>>> >> either in the heap or in off-heap memory (in 1.6.x) and never touch
>>>> the IO
>>>> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
>>>> >>
>>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>>> >> Tungsten sort in 1.5.x
>>>> >>
>>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false,
>>>> but this
>>>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>>>> shuffles will
>>>> >> continue to spill to disk when necessary.”
>>>> >>
>>>> >> If this is impossible via configuration changes what code changes
>>>> would be
>>>> >> needed to accomplish this?
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> >> For additional commands, e-mail: user-help@spark.apache.org
>>>> >>
>>>> >
>>>> >
>>>>
>>> --
>>> Michael Slavitch
>>> 62 Renfrew Ave.
>>> Ottawa Ontario
>>> K1S 1Z5
>>>
>>
>> --
> Michael Slavitch
> 62 Renfrew Ave.
> Ottawa Ontario
> K1S 1Z5
>
-- 
Michael Slavitch
62 Renfrew Ave.
Ottawa Ontario
K1S 1Z5


Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Michael Slavitch <sl...@gmail.com>.
Shuffling a 1 TB set of keys and values (i.e. a sort by key) results in about
500 GB of I/O to disk even with compression enabled. Is there any way to
eliminate the I/O caused by shuffling?
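
To make the scenario concrete, below is a minimal sketch of that kind of job
(the object name, data size and key distribution are assumptions, not the
actual workload); sortByKey forces a full shuffle whose map-side output lands
under spark.local.dir:

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical repro sketch, not the real job.
  object SortByKeyRepro {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("sortByKey-shuffle-repro"))
      // Random keys force a full repartitioning when sorting.
      val pairs = sc.parallelize(1 to 100000000, numSlices = 2000)
        .map(i => (scala.util.Random.nextLong(), i))
      // sortByKey triggers a range-partitioned shuffle; its map-side output is
      // written as shuffle files under spark.local.dir before reducers fetch it.
      val sorted = pairs.sortByKey()
      println(sorted.count())
      sc.stop()
    }
  }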

On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <rx...@databricks.com> wrote:

> Michael - I'm not sure if you actually read my email, but spill has
> nothing to do with the shuffle files on disk. It was for the partitioning
> (i.e. sorting) process. If that flag is off, Spark will just run out of
> memory when data doesn't fit in memory.
>
>
> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com>
> wrote:
>
>> RAMdisk is a fine interim step but there is a lot of layers eliminated by
>> keeping things in memory unless there is need for spillover.   At one time
>> there was support for turning off spilling.  That was eliminated.  Why?
>>
>>
>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com>
>> wrote:
>>
>>> I think Reynold's suggestion of using ram disk would be a good way to
>>> test if these are the bottlenecks or something else is.
>>> For most practical purposes, pointing local dir to ramdisk should
>>> effectively give you 'similar' performance as shuffling from memory.
>>>
>>> Are there concerns with taking that approach to test ? (I dont see
>>> any, but I am not sure if I missed something).
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>>
>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
>>> wrote:
>>> > I totally disagree that it’s not a problem.
>>> >
>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
>>> NVME
>>> > drives.
>>> > - What Spark is depending on is Linux’s IO cache as an effective
>>> buffer pool
>>> > This is fine for small jobs but not for jobs with datasets in the
>>> TB/node
>>> > range.
>>> > - On larger jobs flushing the cache causes Linux to block.
>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>> multiple
>>> > executors writing out to disk increases greatly.
>>> >
>>> > I thought the whole point of Spark was in-memory computing?  It’s in
>>> fact
>>> > in-memory for some things but  use spark.local.dir as a buffer pool of
>>> > others.
>>> >
>>> > Hence, the performance of  Spark is gated by the performance of
>>> > spark.local.dir, even on large memory systems.
>>> >
>>> > "Currently it is not possible to not write shuffle files to disk.”
>>> >
>>> > What changes >would< make it possible?
>>> >
>>> > The only one that seems possible is to clone the shuffle service and
>>> make it
>>> > in-memory.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>>> >
>>> > spark.shuffle.spill actually has nothing to do with whether we write
>>> shuffle
>>> > files to disk. Currently it is not possible to not write shuffle files
>>> to
>>> > disk, and typically it is not a problem because the network fetch
>>> throughput
>>> > is lower than what disks can sustain. In most cases, especially with
>>> SSDs,
>>> > there is little difference between putting all of those in memory and
>>> on
>>> > disk.
>>> >
>>> > However, it is becoming more common to run Spark on a few number of
>>> beefy
>>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>>> improving
>>> > performance for those. Meantime, you can setup local ramdisks on each
>>> node
>>> > for shuffle writes.
>>> >
>>> >
>>> >
>>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello;
>>> >>
>>> >> I’m working on spark with very large memory systems (2TB+) and notice
>>> that
>>> >> Spark spills to disk in shuffle.  Is there a way to force spark to
>>> stay in
>>> >> memory when doing shuffle operations?   The goal is to keep the
>>> shuffle data
>>> >> either in the heap or in off-heap memory (in 1.6.x) and never touch
>>> the IO
>>> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
>>> >>
>>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>>> >> Tungsten sort in 1.5.x
>>> >>
>>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
>>> this
>>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>>> shuffles will
>>> >> continue to spill to disk when necessary.”
>>> >>
>>> >> If this is impossible via configuration changes what code changes
>>> would be
>>> >> needed to accomplish this?
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> >> For additional commands, e-mail: user-help@spark.apache.org
>>> >>
>>> >
>>> >
>>>
>> --
>> Michael Slavitch
>> 62 Renfrew Ave.
>> Ottawa Ontario
>> K1S 1Z5
>>
>
> --
Michael Slavitch
62 Renfrew Ave.
Ottawa Ontario
K1S 1Z5

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Reynold Xin <rx...@databricks.com>.
It's spark.local.dir.
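
A minimal sketch of pointing it at a ramdisk (the mount point and size are
assumptions; SPARK_LOCAL_DIRS on standalone or LOCAL_DIRS on YARN take
precedence over the property):

  // Assumes a tmpfs mount already exists on every node, e.g. created with:
  //   sudo mount -t tmpfs -o size=200g tmpfs /mnt/spark-ramdisk
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("shuffle-on-ramdisk")
    // Shuffle output and spill files are written under this directory.
    // Note: SPARK_LOCAL_DIRS (standalone) or LOCAL_DIRS (YARN) override this property.
    .set("spark.local.dir", "/mnt/spark-ramdisk")
  val sc = new SparkContext(conf)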


On Fri, Apr 1, 2016 at 3:37 PM, Yong Zhang <ja...@hotmail.com> wrote:

> Is there a configuration in the Spark of location of "shuffle spilling"? I
> didn't recall ever see that one. Can you share it out?
>
> It will be good for a test writing to RAM Disk if that configuration is
> available.
>
> Thanks
>
> Yong
>
> ------------------------------
> From: rxin@databricks.com
> Date: Fri, 1 Apr 2016 15:32:23 -0700
> Subject: Re: Eliminating shuffle write and spill disk IO reads/writes in
> Spark
> To: slavitch@gmail.com
> CC: mridul@gmail.com; dev@spark.apache.org; user@spark.apache.org
>
>
> Michael - I'm not sure if you actually read my email, but spill has
> nothing to do with the shuffle files on disk. It was for the partitioning
> (i.e. sorting) process. If that flag is off, Spark will just run out of
> memory when data doesn't fit in memory.
>
>
> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com>
> wrote:
>
> RAMdisk is a fine interim step but there is a lot of layers eliminated by
> keeping things in memory unless there is need for spillover.   At one time
> there was support for turning off spilling.  That was eliminated.  Why?
>
>
> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com> wrote:
>
> I think Reynold's suggestion of using ram disk would be a good way to
> test if these are the bottlenecks or something else is.
> For most practical purposes, pointing local dir to ramdisk should
> effectively give you 'similar' performance as shuffling from memory.
>
> Are there concerns with taking that approach to test ? (I dont see
> any, but I am not sure if I missed something).
>
>
> Regards,
> Mridul
>
>
>
>
> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
> wrote:
> > I totally disagree that it’s not a problem.
> >
> > - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
> > drives.
> > - What Spark is depending on is Linux’s IO cache as an effective buffer
> pool
> > This is fine for small jobs but not for jobs with datasets in the TB/node
> > range.
> > - On larger jobs flushing the cache causes Linux to block.
> > - On a modern 56-hyperthread 2-socket host the latency caused by multiple
> > executors writing out to disk increases greatly.
> >
> > I thought the whole point of Spark was in-memory computing?  It’s in fact
> > in-memory for some things but  use spark.local.dir as a buffer pool of
> > others.
> >
> > Hence, the performance of  Spark is gated by the performance of
> > spark.local.dir, even on large memory systems.
> >
> > "Currently it is not possible to not write shuffle files to disk.”
> >
> > What changes >would< make it possible?
> >
> > The only one that seems possible is to clone the shuffle service and
> make it
> > in-memory.
> >
> >
> >
> >
> >
> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
> >
> > spark.shuffle.spill actually has nothing to do with whether we write
> shuffle
> > files to disk. Currently it is not possible to not write shuffle files to
> > disk, and typically it is not a problem because the network fetch
> throughput
> > is lower than what disks can sustain. In most cases, especially with
> SSDs,
> > there is little difference between putting all of those in memory and on
> > disk.
> >
> > However, it is becoming more common to run Spark on a few number of beefy
> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
> improving
> > performance for those. Meantime, you can setup local ramdisks on each
> node
> > for shuffle writes.
> >
> >
> >
> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
> > wrote:
> >>
> >> Hello;
> >>
> >> I’m working on spark with very large memory systems (2TB+) and notice
> that
> >> Spark spills to disk in shuffle.  Is there a way to force spark to stay
> in
> >> memory when doing shuffle operations?   The goal is to keep the shuffle
> data
> >> either in the heap or in off-heap memory (in 1.6.x) and never touch the
> IO
> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
> >>
> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
> >> Tungsten sort in 1.5.x
> >>
> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
> this
> >> is ignored by the tungsten-sort shuffle manager; its optimized shuffles
> will
> >> continue to spill to disk when necessary.”
> >>
> >> If this is impossible via configuration changes what code changes would
> be
> >> needed to accomplish this?
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: user-help@spark.apache.org
> >>
> >
> >
>
> --
> Michael Slavitch
> 62 Renfrew Ave.
> Ottawa Ontario
> K1S 1Z5
>
>
>

RE: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Yong Zhang <ja...@hotmail.com>.
Is there a configuration in Spark for the location of "shuffle spilling"? I don't recall ever seeing one. Can you share it?
It would be good for testing writes to a RAM disk, if that configuration is available.
Thanks
Yong

From: rxin@databricks.com
Date: Fri, 1 Apr 2016 15:32:23 -0700
Subject: Re: Eliminating shuffle write and spill disk IO reads/writes in Spark
To: slavitch@gmail.com
CC: mridul@gmail.com; dev@spark.apache.org; user@spark.apache.org

Michael - I'm not sure if you actually read my email, but spill has nothing to do with the shuffle files on disk. It was for the partitioning (i.e. sorting) process. If that flag is off, Spark will just run out of memory when data doesn't fit in memory. 

On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com> wrote:
RAMdisk is a fine interim step but there is a lot of layers eliminated by keeping things in memory unless there is need for spillover.   At one time there was support for turning off spilling.  That was eliminated.  Why? 

On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com> wrote:
I think Reynold's suggestion of using ram disk would be a good way to

test if these are the bottlenecks or something else is.

For most practical purposes, pointing local dir to ramdisk should

effectively give you 'similar' performance as shuffling from memory.



Are there concerns with taking that approach to test ? (I dont see

any, but I am not sure if I missed something).





Regards,

Mridul









On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com> wrote:

> I totally disagree that it’s not a problem.

>

> - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME

> drives.

> - What Spark is depending on is Linux’s IO cache as an effective buffer pool

> This is fine for small jobs but not for jobs with datasets in the TB/node

> range.

> - On larger jobs flushing the cache causes Linux to block.

> - On a modern 56-hyperthread 2-socket host the latency caused by multiple

> executors writing out to disk increases greatly.

>

> I thought the whole point of Spark was in-memory computing?  It’s in fact

> in-memory for some things but  use spark.local.dir as a buffer pool of

> others.

>

> Hence, the performance of  Spark is gated by the performance of

> spark.local.dir, even on large memory systems.

>

> "Currently it is not possible to not write shuffle files to disk.”

>

> What changes >would< make it possible?

>

> The only one that seems possible is to clone the shuffle service and make it

> in-memory.

>

>

>

>

>

> On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:

>

> spark.shuffle.spill actually has nothing to do with whether we write shuffle

> files to disk. Currently it is not possible to not write shuffle files to

> disk, and typically it is not a problem because the network fetch throughput

> is lower than what disks can sustain. In most cases, especially with SSDs,

> there is little difference between putting all of those in memory and on

> disk.

>

> However, it is becoming more common to run Spark on a few number of beefy

> nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into improving

> performance for those. Meantime, you can setup local ramdisks on each node

> for shuffle writes.

>

>

>

> On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>

> wrote:

>>

>> Hello;

>>

>> I’m working on spark with very large memory systems (2TB+) and notice that

>> Spark spills to disk in shuffle.  Is there a way to force spark to stay in

>> memory when doing shuffle operations?   The goal is to keep the shuffle data

>> either in the heap or in off-heap memory (in 1.6.x) and never touch the IO

>> subsystem.  I am willing to have the job fail if it runs out of RAM.

>>

>> spark.shuffle.spill true  is deprecated in 1.6 and does not work in

>> Tungsten sort in 1.5.x

>>

>> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this

>> is ignored by the tungsten-sort shuffle manager; its optimized shuffles will

>> continue to spill to disk when necessary.”

>>

>> If this is impossible via configuration changes what code changes would be

>> needed to accomplish this?

>>

>>

>>

>>

>>

>> ---------------------------------------------------------------------

>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org

>> For additional commands, e-mail: user-help@spark.apache.org

>>

>

>

-- 
Michael Slavitch
62 Renfrew Ave.
Ottawa Ontario
K1S 1Z5

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Reynold Xin <rx...@databricks.com>.
Michael - I'm not sure if you actually read my email, but spill has nothing
to do with the shuffle files on disk. It was for the partitioning (i.e.
sorting) process. If that flag is off, Spark will just run out of memory
when data doesn't fit in memory.
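
A rough way to see the distinction for yourself is to log the per-task spill
counters (a sketch with a hypothetical listener class); they stay at zero when
everything fits in memory, even though shuffle files are still written:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Hypothetical helper: prints how much each task spilled while sorting/aggregating.
  // Shuffle *write* bytes are a separate metric (visible in the UI) and will be
  // non-zero even when the spill counters below stay at zero.
  class SpillLogger extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val m = taskEnd.taskMetrics
      if (m != null) {
        println(s"task ${taskEnd.taskInfo.taskId}: " +
          s"memoryBytesSpilled=${m.memoryBytesSpilled} diskBytesSpilled=${m.diskBytesSpilled}")
      }
    }
  }

  // Usage: sc.addSparkListener(new SpillLogger)   // sc is an existing SparkContext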


On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <sl...@gmail.com> wrote:

> RAMdisk is a fine interim step but there is a lot of layers eliminated by
> keeping things in memory unless there is need for spillover.   At one time
> there was support for turning off spilling.  That was eliminated.  Why?
>
>
> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com> wrote:
>
>> I think Reynold's suggestion of using ram disk would be a good way to
>> test if these are the bottlenecks or something else is.
>> For most practical purposes, pointing local dir to ramdisk should
>> effectively give you 'similar' performance as shuffling from memory.
>>
>> Are there concerns with taking that approach to test ? (I dont see
>> any, but I am not sure if I missed something).
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
>> wrote:
>> > I totally disagree that it’s not a problem.
>> >
>> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
>> NVME
>> > drives.
>> > - What Spark is depending on is Linux’s IO cache as an effective buffer
>> pool
>> > This is fine for small jobs but not for jobs with datasets in the
>> TB/node
>> > range.
>> > - On larger jobs flushing the cache causes Linux to block.
>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>> multiple
>> > executors writing out to disk increases greatly.
>> >
>> > I thought the whole point of Spark was in-memory computing?  It’s in
>> fact
>> > in-memory for some things but  use spark.local.dir as a buffer pool of
>> > others.
>> >
>> > Hence, the performance of  Spark is gated by the performance of
>> > spark.local.dir, even on large memory systems.
>> >
>> > "Currently it is not possible to not write shuffle files to disk.”
>> >
>> > What changes >would< make it possible?
>> >
>> > The only one that seems possible is to clone the shuffle service and
>> make it
>> > in-memory.
>> >
>> >
>> >
>> >
>> >
>> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>> >
>> > spark.shuffle.spill actually has nothing to do with whether we write
>> shuffle
>> > files to disk. Currently it is not possible to not write shuffle files
>> to
>> > disk, and typically it is not a problem because the network fetch
>> throughput
>> > is lower than what disks can sustain. In most cases, especially with
>> SSDs,
>> > there is little difference between putting all of those in memory and on
>> > disk.
>> >
>> > However, it is becoming more common to run Spark on a few number of
>> beefy
>> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
>> improving
>> > performance for those. Meantime, you can setup local ramdisks on each
>> node
>> > for shuffle writes.
>> >
>> >
>> >
>> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
>> > wrote:
>> >>
>> >> Hello;
>> >>
>> >> I’m working on spark with very large memory systems (2TB+) and notice
>> that
>> >> Spark spills to disk in shuffle.  Is there a way to force spark to
>> stay in
>> >> memory when doing shuffle operations?   The goal is to keep the
>> shuffle data
>> >> either in the heap or in off-heap memory (in 1.6.x) and never touch
>> the IO
>> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
>> >>
>> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>> >> Tungsten sort in 1.5.x
>> >>
>> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
>> this
>> >> is ignored by the tungsten-sort shuffle manager; its optimized
>> shuffles will
>> >> continue to spill to disk when necessary.”
>> >>
>> >> If this is impossible via configuration changes what code changes
>> would be
>> >> needed to accomplish this?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: user-help@spark.apache.org
>> >>
>> >
>> >
>>
> --
> Michael Slavitch
> 62 Renfrew Ave.
> Ottawa Ontario
> K1S 1Z5
>

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Michael Slavitch <sl...@gmail.com>.
A RAMdisk is a fine interim step, but a lot of layers are eliminated by keeping
things in memory unless there is a need for spillover. At one time there was
support for turning off spilling. That was eliminated. Why?
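
For reference, a sketch of how that knob used to be set; as the warning quoted
earlier in the thread notes, the tungsten-sort shuffle manager now ignores it:

  import org.apache.spark.{SparkConf, SparkContext}

  // Pre-1.6 style: ask Spark not to spill during sorting/aggregation.
  // Deprecated in 1.6 and ignored by the tungsten-sort shuffle manager, which
  // "will continue to spill to disk when necessary".
  val conf = new SparkConf()
    .setAppName("no-spill-attempt")
    .set("spark.shuffle.spill", "false")
  val sc = new SparkContext(conf)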

On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mr...@gmail.com> wrote:

> I think Reynold's suggestion of using ram disk would be a good way to
> test if these are the bottlenecks or something else is.
> For most practical purposes, pointing local dir to ramdisk should
> effectively give you 'similar' performance as shuffling from memory.
>
> Are there concerns with taking that approach to test ? (I dont see
> any, but I am not sure if I missed something).
>
>
> Regards,
> Mridul
>
>
>
>
> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com>
> wrote:
> > I totally disagree that it’s not a problem.
> >
> > - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
> > drives.
> > - What Spark is depending on is Linux’s IO cache as an effective buffer
> pool
> > This is fine for small jobs but not for jobs with datasets in the TB/node
> > range.
> > - On larger jobs flushing the cache causes Linux to block.
> > - On a modern 56-hyperthread 2-socket host the latency caused by multiple
> > executors writing out to disk increases greatly.
> >
> > I thought the whole point of Spark was in-memory computing?  It’s in fact
> > in-memory for some things but  use spark.local.dir as a buffer pool of
> > others.
> >
> > Hence, the performance of  Spark is gated by the performance of
> > spark.local.dir, even on large memory systems.
> >
> > "Currently it is not possible to not write shuffle files to disk.”
> >
> > What changes >would< make it possible?
> >
> > The only one that seems possible is to clone the shuffle service and
> make it
> > in-memory.
> >
> >
> >
> >
> >
> > On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
> >
> > spark.shuffle.spill actually has nothing to do with whether we write
> shuffle
> > files to disk. Currently it is not possible to not write shuffle files to
> > disk, and typically it is not a problem because the network fetch
> throughput
> > is lower than what disks can sustain. In most cases, especially with
> SSDs,
> > there is little difference between putting all of those in memory and on
> > disk.
> >
> > However, it is becoming more common to run Spark on a few number of beefy
> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
> improving
> > performance for those. Meantime, you can setup local ramdisks on each
> node
> > for shuffle writes.
> >
> >
> >
> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
> > wrote:
> >>
> >> Hello;
> >>
> >> I’m working on spark with very large memory systems (2TB+) and notice
> that
> >> Spark spills to disk in shuffle.  Is there a way to force spark to stay
> in
> >> memory when doing shuffle operations?   The goal is to keep the shuffle
> data
> >> either in the heap or in off-heap memory (in 1.6.x) and never touch the
> IO
> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
> >>
> >> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
> >> Tungsten sort in 1.5.x
> >>
> >> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but
> this
> >> is ignored by the tungsten-sort shuffle manager; its optimized shuffles
> will
> >> continue to spill to disk when necessary.”
> >>
> >> If this is impossible via configuration changes what code changes would
> be
> >> needed to accomplish this?
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: user-help@spark.apache.org
> >>
> >
> >
>
-- 
Michael Slavitch
62 Renfrew Ave.
Ottawa Ontario
K1S 1Z5

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Mridul Muralidharan <mr...@gmail.com>.
I think Reynold's suggestion of using a ram disk would be a good way to
test whether these are the bottlenecks or something else is.
For most practical purposes, pointing the local dir to a ramdisk should
effectively give you performance 'similar' to shuffling from memory.

Are there concerns with taking that approach to test? (I don't see
any, but I am not sure if I missed something.)


Regards,
Mridul




On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <sl...@gmail.com> wrote:
> I totally disagree that it’s not a problem.
>
> - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
> drives.
> - What Spark is depending on is Linux’s IO cache as an effective buffer pool
> This is fine for small jobs but not for jobs with datasets in the TB/node
> range.
> - On larger jobs flushing the cache causes Linux to block.
> - On a modern 56-hyperthread 2-socket host the latency caused by multiple
> executors writing out to disk increases greatly.
>
> I thought the whole point of Spark was in-memory computing?  It’s in fact
> in-memory for some things but  use spark.local.dir as a buffer pool of
> others.
>
> Hence, the performance of  Spark is gated by the performance of
> spark.local.dir, even on large memory systems.
>
> "Currently it is not possible to not write shuffle files to disk.”
>
> What changes >would< make it possible?
>
> The only one that seems possible is to clone the shuffle service and make it
> in-memory.
>
>
>
>
>
> On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> spark.shuffle.spill actually has nothing to do with whether we write shuffle
> files to disk. Currently it is not possible to not write shuffle files to
> disk, and typically it is not a problem because the network fetch throughput
> is lower than what disks can sustain. In most cases, especially with SSDs,
> there is little difference between putting all of those in memory and on
> disk.
>
> However, it is becoming more common to run Spark on a few number of beefy
> nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into improving
> performance for those. Meantime, you can setup local ramdisks on each node
> for shuffle writes.
>
>
>
> On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
> wrote:
>>
>> Hello;
>>
>> I’m working on spark with very large memory systems (2TB+) and notice that
>> Spark spills to disk in shuffle.  Is there a way to force spark to stay in
>> memory when doing shuffle operations?   The goal is to keep the shuffle data
>> either in the heap or in off-heap memory (in 1.6.x) and never touch the IO
>> subsystem.  I am willing to have the job fail if it runs out of RAM.
>>
>> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
>> Tungsten sort in 1.5.x
>>
>> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this
>> is ignored by the tungsten-sort shuffle manager; its optimized shuffles will
>> continue to spill to disk when necessary.”
>>
>> If this is impossible via configuration changes what code changes would be
>> needed to accomplish this?
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Michael Slavitch <sl...@gmail.com>.
I totally disagree that it’s not a problem.

- Network fetch throughput on 40G Ethernet (roughly 5 GB/s) exceeds the throughput of NVMe drives.
- What Spark depends on is Linux’s IO cache as an effective buffer pool. This is fine for small jobs but not for jobs with datasets in the TB/node range.
- On larger jobs, flushing that cache causes Linux to block.
- On a modern 56-hyperthread, 2-socket host, the latency caused by multiple executors writing out to disk increases greatly.

I thought the whole point of Spark was in-memory computing?  It is in fact in-memory for some things, but it uses spark.local.dir as a buffer pool for others.

Hence, the performance of Spark is gated by the performance of spark.local.dir, even on large-memory systems.

“Currently it is not possible to not write shuffle files to disk.”

What changes >would< make it possible?

The only one that seems possible is to clone the shuffle service and make it in-memory.





> On Apr 1, 2016, at 4:57 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to not write shuffle files to disk, and typically it is not a problem because the network fetch throughput is lower than what disks can sustain. In most cases, especially with SSDs, there is little difference between keeping all of those in memory and writing them to disk.
> 
> However, it is becoming more common to run Spark on a small number of beefy nodes (e.g. 2 nodes, each with 1TB of RAM). We do want to look into improving performance for those. In the meantime, you can set up local ramdisks on each node for shuffle writes.
> 
> 
> 
> On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavitch@gmail.com> wrote:
> Hello;
> 
> I’m working on spark with very large memory systems (2TB+) and notice that Spark spills to disk in shuffle.  Is there a way to force spark to stay in memory when doing shuffle operations?   The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x) and never touch the IO subsystem.  I am willing to have the job fail if it runs out of RAM.
> 
> spark.shuffle.spill true  is deprecated in 1.6 and does not work in Tungsten sort in 1.5.x
> 
> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this is ignored by the tungsten-sort shuffle manager; its optimized shuffles will continue to spill to disk when necessary.”
> 
> If this is impossible via configuration changes what code changes would be needed to accomplish this?
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 
> 


Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Posted by Reynold Xin <rx...@databricks.com>.
spark.shuffle.spill actually has nothing to do with whether we write
shuffle files to disk. Currently it is not possible to not write shuffle
files to disk, and typically it is not a problem because the network fetch
throughput is lower than what disks can sustain. In most cases, especially
with SSDs, there is little difference between keeping all of those in
memory and writing them to disk.

However, it is becoming more common to run Spark on a small number of
beefy nodes (e.g. 2 nodes, each with 1TB of RAM). We do want to look into
improving performance for those. In the meantime, you can set up local
ramdisks on each node for shuffle writes.
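
If changing the job's own code is not an option, the same setting can be
injected from the launcher side instead. A sketch using the SparkLauncher
API; the jar path, main class and master URL below are placeholders, and
SPARK_HOME must be set (or supplied via setSparkHome):

    import org.apache.spark.launcher.SparkLauncher

    // Launch an existing application with its shuffle/spill scratch space
    // redirected to a ramdisk; equivalent to passing
    // --conf spark.local.dir=/mnt/ramdisk/spark-local to spark-submit.
    val process = new SparkLauncher()
      .setAppResource("/path/to/your-app.jar")   // placeholder jar
      .setMainClass("com.example.YourJob")       // placeholder main class
      .setMaster("spark://master:7077")          // placeholder master URL
      .setConf("spark.local.dir", "/mnt/ramdisk/spark-local")
      .launch()

    process.waitFor()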



On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <sl...@gmail.com>
wrote:

> Hello;
>
> I’m working on spark with very large memory systems (2TB+) and notice that
> Spark spills to disk in shuffle.  Is there a way to force spark to stay in
> memory when doing shuffle operations?   The goal is to keep the shuffle
> data either in the heap or in off-heap memory (in 1.6.x) and never touch
> the IO subsystem.  I am willing to have the job fail if it runs out of RAM.
>
> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
> Tungsten sort in 1.5.x
>
> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this
> is ignored by the tungsten-sort shuffle manager; its optimized shuffles
> will continue to spill to disk when necessary.”
>
> If this is impossible via configuration changes what code changes would be
> needed to accomplish this?
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
