Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2014/07/14 09:01:16 UTC

better compression codecs for shuffle blocks?

Hi Spark devs,

I was looking into the memory usage of shuffle, and one annoying thing about
the default compression codec (LZF) is that the implementation we use
allocates buffers pretty generously. I did a simple experiment and found
that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190MB). If
we have a shuffle task that uses 10k reducers and 32 threads running
concurrently, the memory used by the LZF streams alone would be ~60GB.

In comparison, Snappy only allocates ~65MB for every
1k SnappyOutputStreams. However, Snappy's compression ratio is slightly lower
than LZF's; in my experience, it leads to a 10 - 20% increase in size.
Compression ratio does matter here because we are sending data across the network.

In future releases we will likely change the shuffle implementation to open
fewer streams. Until that happens, I'm looking for compression codec
implementations that are fast, allocate small buffers, and have a decent
compression ratio.

Does anybody on this list have any suggestions? If not, I will submit a
patch for 1.1 that replaces LZF with Snappy for the default compression
codec to lower memory usage.


allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
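
For anyone who wants a rough local reproduction without the instrumented
harness in the gist, here is a minimal sketch. It only compares heap usage
before and after holding 1000 streams, so the figures are approximate; the
object name and the ByteArrayOutputStream sink are just illustrative.

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream
    import org.xerial.snappy.SnappyOutputStream

    object StreamAllocationSketch {
      private val rt = Runtime.getRuntime
      private def usedHeap(): Long = { System.gc(); rt.totalMemory() - rt.freeMemory() }

      def main(args: Array[String]): Unit = {
        val before = usedHeap()
        // Keep strong references so the internal buffers stay live while we measure.
        val streams = (1 to 1000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
        val after = usedHeap()
        println(s"approx bytes held by 1000 LZFOutputStreams: ${after - before}")
        // Swap in `new SnappyOutputStream(new ByteArrayOutputStream())` to compare codecs.
        streams.foreach(_.close())
      }
    }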

Re: better compression codecs for shuffle blocks?

Posted by Sandy Ryza <sa...@cloudera.com>.
Stephen,
Often the shuffle is bound by writes to disk, so even if disks have enough
space to store the uncompressed data, the shuffle can complete faster by
writing less data.

Reynold,
This isn't a big help in the short term, but if we switch to a sort-based
shuffle, we'll only need a single LZFOutputStream per map task.


On Mon, Jul 14, 2014 at 3:30 PM, Stephen Haberman <
stephen.haberman@gmail.com> wrote:

>
> Just a comment from the peanut gallery, but these buffers are a real
> PITA for us as well. Probably 75% of our non-user-error job failures
> are related to them.
>
> Just naively, what about not doing compression on the fly? E.g. during
> the shuffle just write straight to disk, uncompressed?
>
> For us, we always have plenty of disk space, and if you're concerned
> about network transmission, you could add a separate compress step
> after the blocks have been written to disk, but before being sent over
> the wire.
>
> Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> see work in this area!
>
> - Stephen
>
>

Re: better compression codecs for shuffle blocks?

Posted by Reynold Xin <rx...@databricks.com>.
FYI dev,

I submitted a PR making Snappy as the default compression codec:
https://github.com/apache/spark/pull/1415

Also submitted a separate PR to add lz4 support:
https://github.com/apache/spark/pull/1416
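
For anyone who doesn't want to wait for the default to change, the existing
spark.io.compression.codec setting already lets you pick a codec explicitly.
A minimal sketch (the app name is illustrative; the fully qualified class name
is what current releases accept, and lz4 only becomes an option once the
second PR is merged):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("snappy-shuffle-test")
      // Use Snappy for block compression (shuffle, spills, broadcast).
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
    val sc = new SparkContext(conf)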


On Mon, Jul 14, 2014 at 5:06 PM, Aaron Davidson <il...@gmail.com> wrote:

> One of the core problems here is the number of open streams we have, which
> is (# cores * # reduce partitions), which can easily climb into the tens of
> thousands for large jobs. This is a more general problem that we are
> planning on fixing for our largest shuffles, as even moderate buffer sizes
> can explode to use huge amounts of memory at that scale.
>
>
> On Mon, Jul 14, 2014 at 4:53 PM, Jon Hartlaub <jh...@gmail.com> wrote:
>
> > Is the held memory due to just instantiating the LZFOutputStream?  If so,
> > I'm surprised and I consider that a bug.
> >
> > I suspect the held memory may be due to a SoftReference - memory will be
> > released with enough memory pressure.
> >
> > Finally, is it necessary to keep 1000 (or more) decoders active?  Would
> it
> > be possible to keep an object pool of encoders and check them in and out
> as
> > needed?  I admit I have not done much homework to determine if this is
> > viable.
> >
> > -Jon
> >
> >
> > On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >
> > > Copying Jon here since he worked on the lzf library at Ning.
> > >
> > > Jon - any comments on this topic?
> > >
> > >
> > > On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia <
> matei.zaharia@gmail.com>
> > > wrote:
> > >
> > >> You can actually turn off shuffle compression by setting
> > >> spark.shuffle.compress to false. Try that out, there will still be
> some
> > >> buffers for the various OutputStreams, but they should be smaller.
> > >>
> > >> Matei
> > >>
> > >> On Jul 14, 2014, at 3:30 PM, Stephen Haberman <
> > stephen.haberman@gmail.com>
> > >> wrote:
> > >>
> > >> >
> > >> > Just a comment from the peanut gallery, but these buffers are a real
> > >> > PITA for us as well. Probably 75% of our non-user-error job failures
> > >> > are related to them.
> > >> >
> > >> > Just naively, what about not doing compression on the fly? E.g.
> during
> > >> > the shuffle just write straight to disk, uncompressed?
> > >> >
> > >> > For us, we always have plenty of disk space, and if you're concerned
> > >> > about network transmission, you could add a separate compress step
> > >> > after the blocks have been written to disk, but before being sent
> over
> > >> > the wire.
> > >> >
> > >> > Granted, IANAE, so perhaps this is a bad idea; either way, awesome
> to
> > >> > see work in this area!
> > >> >
> > >> > - Stephen
> > >> >
> > >>
> > >>
> > >
> >
>

Re: better compression codecs for shuffle blocks?

Posted by Aaron Davidson <il...@gmail.com>.
One of the core problems here is the number of open streams we have, which
is (# cores * # reduce partitions), which can easily climb into the tens of
thousands for large jobs. This is a more general problem that we are
planning on fixing for our largest shuffles, as even moderate buffer sizes
can explode to use huge amounts of memory at that scale.
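
To make the scale concrete, plugging the numbers from earlier in this thread
into that formula (a back-of-envelope sketch only; the per-stream figure comes
from Reynold's measurement above):

    // (# cores) * (# reduce partitions) open streams, each holding its own buffer
    val cores = 32
    val reducePartitions = 10000
    val perStreamBytes = 198976424L / 1000            // ~199 KB per LZFOutputStream
    val openStreams = cores * reducePartitions        // 320,000 streams
    val totalBytes = openStreams * perStreamBytes     // ~64 GB, matching the ~60GB estimate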


On Mon, Jul 14, 2014 at 4:53 PM, Jon Hartlaub <jh...@gmail.com> wrote:

> Is the held memory due to just instantiating the LZFOutputStream?  If so,
> I'm surprised and I consider that a bug.
>
> I suspect the held memory may be due to a SoftReference - memory will be
> released with enough memory pressure.
>
> Finally, is it necessary to keep 1000 (or more) decoders active?  Would it
> be possible to keep an object pool of encoders and check them in and out as
> needed?  I admit I have not done much homework to determine if this is
> viable.
>
> -Jon
>
>
> On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> > Copying Jon here since he worked on the lzf library at Ning.
> >
> > Jon - any comments on this topic?
> >
> >
> > On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia <ma...@gmail.com>
> > wrote:
> >
> >> You can actually turn off shuffle compression by setting
> >> spark.shuffle.compress to false. Try that out, there will still be some
> >> buffers for the various OutputStreams, but they should be smaller.
> >>
> >> Matei
> >>
> >> On Jul 14, 2014, at 3:30 PM, Stephen Haberman <
> stephen.haberman@gmail.com>
> >> wrote:
> >>
> >> >
> >> > Just a comment from the peanut gallery, but these buffers are a real
> >> > PITA for us as well. Probably 75% of our non-user-error job failures
> >> > are related to them.
> >> >
> >> > Just naively, what about not doing compression on the fly? E.g. during
> >> > the shuffle just write straight to disk, uncompressed?
> >> >
> >> > For us, we always have plenty of disk space, and if you're concerned
> >> > about network transmission, you could add a separate compress step
> >> > after the blocks have been written to disk, but before being sent over
> >> > the wire.
> >> >
> >> > Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> >> > see work in this area!
> >> >
> >> > - Stephen
> >> >
> >>
> >>
> >
>

Re: better compression codecs for shuffle blocks?

Posted by Jon Hartlaub <jh...@gmail.com>.
Is the held memory due to just instantiating the LZFOutputStream?  If so,
I'm surprised and I consider that a bug.

I suspect the held memory may be due to a SoftReference - memory will be
released with enough memory pressure.

Finally, is it necessary to keep 1000 (or more) decoders active?  Would it
be possible to keep an object pool of encoders and check them in and out as
needed?  I admit I have not done much homework to determine if this is
viable.
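
Along the lines of that last suggestion, a check-in/check-out pool is only a
few lines. A minimal sketch; `createEncoder` stands in for whatever reusable
encoder object the LZF library exposes and is purely illustrative, not an
actual API:

    import java.util.concurrent.ConcurrentLinkedQueue

    // Generic pool: borrow an idle instance if one exists, otherwise create a new one.
    class SimplePool[T](create: () => T) {
      private val idle = new ConcurrentLinkedQueue[T]()
      def checkOut(): T = Option(idle.poll()).getOrElse(create())
      def checkIn(obj: T): Unit = idle.offer(obj)
    }

    // Hypothetical usage: share a bounded set of encoders across shuffle writers.
    // val encoderPool = new SimplePool(() => createEncoder())
    // val enc = encoderPool.checkOut()
    // try { /* compress a block */ } finally { encoderPool.checkIn(enc) }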

-Jon


On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin <rx...@databricks.com> wrote:

> Copying Jon here since he worked on the lzf library at Ning.
>
> Jon - any comments on this topic?
>
>
> On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> You can actually turn off shuffle compression by setting
>> spark.shuffle.compress to false. Try that out, there will still be some
>> buffers for the various OutputStreams, but they should be smaller.
>>
>> Matei
>>
>> On Jul 14, 2014, at 3:30 PM, Stephen Haberman <st...@gmail.com>
>> wrote:
>>
>> >
>> > Just a comment from the peanut gallery, but these buffers are a real
>> > PITA for us as well. Probably 75% of our non-user-error job failures
>> > are related to them.
>> >
>> > Just naively, what about not doing compression on the fly? E.g. during
>> > the shuffle just write straight to disk, uncompressed?
>> >
>> > For us, we always have plenty of disk space, and if you're concerned
>> > about network transmission, you could add a separate compress step
>> > after the blocks have been written to disk, but before being sent over
>> > the wire.
>> >
>> > Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
>> > see work in this area!
>> >
>> > - Stephen
>> >
>>
>>
>

Re: better compression codecs for shuffle blocks?

Posted by Reynold Xin <rx...@databricks.com>.
Copying Jon here since he worked on the lzf library at Ning.

Jon - any comments on this topic?


On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> You can actually turn off shuffle compression by setting
> spark.shuffle.compress to false. Try that out, there will still be some
> buffers for the various OutputStreams, but they should be smaller.
>
> Matei
>
> On Jul 14, 2014, at 3:30 PM, Stephen Haberman <st...@gmail.com>
> wrote:
>
> >
> > Just a comment from the peanut gallery, but these buffers are a real
> > PITA for us as well. Probably 75% of our non-user-error job failures
> > are related to them.
> >
> > Just naively, what about not doing compression on the fly? E.g. during
> > the shuffle just write straight to disk, uncompressed?
> >
> > For us, we always have plenty of disk space, and if you're concerned
> > about network transmission, you could add a separate compress step
> > after the blocks have been written to disk, but before being sent over
> > the wire.
> >
> > Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> > see work in this area!
> >
> > - Stephen
> >
>
>

Re: better compression codecs for shuffle blocks?

Posted by Matei Zaharia <ma...@gmail.com>.
You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out; there will still be some buffers for the various OutputStreams, but they should be smaller.
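
For example, to try this out (a minimal sketch; spark.shuffle.compress and
spark.shuffle.spill.compress are existing settings, while the app name is
just illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("uncompressed-shuffle-test")
      .set("spark.shuffle.compress", "false")        // don't compress map outputs
      .set("spark.shuffle.spill.compress", "false")  // don't compress data spilled during shuffles
    val sc = new SparkContext(conf)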

Matei

On Jul 14, 2014, at 3:30 PM, Stephen Haberman <st...@gmail.com> wrote:

> 
> Just a comment from the peanut gallery, but these buffers are a real
> PITA for us as well. Probably 75% of our non-user-error job failures
> are related to them.
> 
> Just naively, what about not doing compression on the fly? E.g. during
> the shuffle just write straight to disk, uncompressed?
> 
> For us, we always have plenty of disk space, and if you're concerned
> about network transmission, you could add a separate compress step
> after the blocks have been written to disk, but before being sent over
> the wire.
> 
> Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> see work in this area!
> 
> - Stephen
> 


Re: better compression codecs for shuffle blocks?

Posted by Stephen Haberman <st...@gmail.com>.
Just a comment from the peanut gallery, but these buffers are a real
PITA for us as well. Probably 75% of our non-user-error job failures
are related to them.

Just naively, what about not doing compression on the fly? E.g. during
the shuffle just write straight to disk, uncompressed?

For us, we always have plenty of disk space, and if you're concerned
about network transmission, you could add a separate compress step
after the blocks have been written to disk, but before being sent over
the wire.

Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
see work in this area!

- Stephen


Re: better compression codecs for shuffle blocks?

Posted by Mridul Muralidharan <mr...@gmail.com>.
We tried a lower block size for LZF, but it barfed all over the place.
Snappy was the way to go for our jobs.


Regards,
Mridul


On Mon, Jul 14, 2014 at 12:31 PM, Reynold Xin <rx...@databricks.com> wrote:
> Hi Spark devs,
>
> I was looking into the memory usage of shuffle, and one annoying thing about
> the default compression codec (LZF) is that the implementation we use
> allocates buffers pretty generously. I did a simple experiment and found
> that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190MB). If
> we have a shuffle task that uses 10k reducers and 32 threads running
> concurrently, the memory used by the LZF streams alone would be ~60GB.
>
> In comparison, Snappy only allocates ~65MB for every
> 1k SnappyOutputStreams. However, Snappy's compression ratio is slightly lower
> than LZF's; in my experience, it leads to a 10 - 20% increase in size.
> Compression ratio does matter here because we are sending data across the network.
>
> In future releases we will likely change the shuffle implementation to open
> fewer streams. Until that happens, I'm looking for compression codec
> implementations that are fast, allocate small buffers, and have a decent
> compression ratio.
>
> Does anybody on this list have any suggestions? If not, I will submit a
> patch for 1.1 that replaces LZF with Snappy for the default compression
> codec to lower memory usage.
>
>
> allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567

Re: better compression codecs for shuffle blocks?

Posted by Davies Liu <da...@databricks.com>.
Maybe we could try LZ4 [1], which has better performance and a smaller
footprint than LZF and Snappy. In fast scan mode, its performance is 1.5 - 2x
higher than LZF's [2], while the memory used is about 10x smaller than LZF's
(16k vs 190k).

[1] https://github.com/jpountz/lz4-java
[2] http://ning.github.io/jvm-compressor-benchmark/results/calgary/roundtrip-2013-06-06/index.html
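
If we go that route, the block size is an explicit constructor argument in
lz4-java, so the per-stream buffer can be kept small. A minimal sketch (the
16KB figure mirrors the numbers above, and the ByteArrayOutputStream is just a
stand-in for the real shuffle file stream):

    import java.io.ByteArrayOutputStream
    import net.jpountz.lz4.LZ4BlockOutputStream

    val sink = new ByteArrayOutputStream()
    // The second argument is the block size; each stream buffers roughly this much.
    val out = new LZ4BlockOutputStream(sink, 16 * 1024)
    out.write("some shuffle bytes".getBytes("UTF-8"))
    out.close()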


On Mon, Jul 14, 2014 at 12:01 AM, Reynold Xin <rx...@databricks.com> wrote:
>
> Hi Spark devs,
>
> I was looking into the memory usage of shuffle, and one annoying thing about
> the default compression codec (LZF) is that the implementation we use
> allocates buffers pretty generously. I did a simple experiment and found
> that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190MB). If
> we have a shuffle task that uses 10k reducers and 32 threads running
> concurrently, the memory used by the LZF streams alone would be ~60GB.
>
> In comparison, Snappy only allocates ~65MB for every
> 1k SnappyOutputStreams. However, Snappy's compression ratio is slightly lower
> than LZF's; in my experience, it leads to a 10 - 20% increase in size.
> Compression ratio does matter here because we are sending data across the network.
>
> In future releases we will likely change the shuffle implementation to open
> fewer streams. Until that happens, I'm looking for compression codec
> implementations that are fast, allocate small buffers, and have a decent
> compression ratio.
>
> Does anybody on this list have any suggestions? If not, I will submit a
> patch for 1.1 that replaces LZF with Snappy for the default compression
> codec to lower memory usage.
>
>
> allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567