Posted to user@spark.apache.org by Surendranauth Hiraman <su...@velos.io> on 2014/04/03 14:27:36 UTC

Spark Disk Usage

Hi,

I know if we call persist with the right options, we can have Spark persist
an RDD's data on disk.

I am wondering what happens in intermediate operations that could
conceivably create large collections/Sequences, like GroupBy and shuffling.

Basically, one part of the question is when is disk used internally?

And is calling persist() on the RDD returned by such transformations what
lets it know to use disk in those situations? Trying to understand if
persist() is applied during the transformation or after it.

Thank you.


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io

Re: Spark Disk Usage

Posted by Surendranauth Hiraman <su...@velos.io>.
Andrew,

Thanks a lot for the pointer to the code! This has answered my question.

Looks like it tries to write it to memory first and then if it doesn't fit,
it spills to disk. I'll have to dig in more to figure out the details.
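
As a rough illustration of the "memory first, disk if it doesn't fit" behavior described above, here is a toy Python model. It is not Spark's BlockManager: the class name ToyBlockStore, the byte budget, and the use of pickle files are all assumptions made for this sketch. It also picks one policy arbitrarily (new blocks that don't fit go straight to disk), which may not match what Spark actually does.

```python
import os
import pickle
import tempfile


class ToyBlockStore:
    """Toy memory-first store. Hypothetical; not Spark's BlockManager."""

    def __init__(self, memory_budget_bytes):
        self.memory_budget_bytes = memory_budget_bytes
        self.memory_used = 0
        self.in_memory = {}  # block_id -> serialized bytes
        self.on_disk = {}    # block_id -> path of the spill file
        self.spill_dir = tempfile.mkdtemp(prefix="toy_blocks_")

    def put(self, block_id, values):
        """Store a block in memory if it fits in the budget, otherwise on disk."""
        data = pickle.dumps(values)
        if self.memory_used + len(data) <= self.memory_budget_bytes:
            self.in_memory[block_id] = data
            self.memory_used += len(data)
            return "memory"
        path = os.path.join(self.spill_dir, f"{block_id}.bin")
        with open(path, "wb") as f:
            f.write(data)
        self.on_disk[block_id] = path
        return "disk"

    def get(self, block_id):
        """Read a block back from wherever it was placed."""
        if block_id in self.in_memory:
            return pickle.loads(self.in_memory[block_id])
        with open(self.on_disk[block_id], "rb") as f:
            return pickle.loads(f.read())


# Budget sized to hold exactly two blocks, so the third one spills.
block = list(range(20))
block_size = len(pickle.dumps(block))
store = ToyBlockStore(memory_budget_bytes=2 * block_size)
placements = [store.put(f"block_{i}", block) for i in range(3)]
print(placements)  # ['memory', 'memory', 'disk']
```

A block reads back the same way regardless of where it landed, which is the property a MEMORY_AND_DISK-style level is meant to give the caller.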

-Suren



On Wed, Apr 9, 2014 at 12:46 PM, Andrew Ash <an...@andrewash.com> wrote:

> The groupByKey would be aware of the subsequent persist -- that's part of
> the reason why operations are lazy.  As for whether the data is materialized
> in memory first and then flushed to disk, or streamed to disk, I'm not sure
> of the exact behavior.
>
> What I'd expect to happen would be that the RDD is materialized in memory
> up until it fills up the BlockManager.  At that point it starts spilling
> blocks out to disk in order to keep from OOM'ing.  I'm not sure if new
> blocks go straight to disk or if the BlockManager pages already-existing
> blocks out in order to make room for new blocks.
>
> You can always read through source to figure it out though!
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L588

Re: Spark Disk Usage

Posted by Andrew Ash <an...@andrewash.com>.
The groupByKey would be aware of the subsequent persist -- that's part of
the reason why operations are lazy.  As for whether the data is materialized
in memory first and then flushed to disk, or streamed to disk, I'm not sure
of the exact behavior.
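
To make the laziness point concrete, here is a toy Python sketch. It assumes nothing about Spark internals: ToyLazyDataset, its methods, and the storage_level label are hypothetical names invented for illustration. The idea it shows is only that a transformation records how to build its result, so a persist setting made afterwards is already in place when the first action triggers evaluation.

```python
class ToyLazyDataset:
    """Toy lazily evaluated dataset. Hypothetical; not Spark's RDD class."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that builds the data
        self.storage_level = None  # label recorded by persist(), if any
        self._cached = None
        self.evaluations = 0       # how many times the data was actually computed

    def group_by_key(self):
        def grouped():
            groups = {}
            for key, value in self.collect():
                groups.setdefault(key, []).append(value)
            return sorted(groups.items())
        # Nothing runs here: we only record how to build the result.
        return ToyLazyDataset(grouped)

    def persist(self, storage_level):
        # Only records the setting; no data has been computed yet.
        self.storage_level = storage_level
        return self

    def collect(self):
        # The "action": evaluate now, and keep the result if persist was requested.
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self.storage_level is not None:
            self._cached = result
        return result


pairs = ToyLazyDataset(lambda: [("a", 1), ("b", 2), ("a", 3)])
grouped = pairs.group_by_key().persist("MEMORY_AND_DISK")
evaluations_before_action = grouped.evaluations  # still 0: nothing has run

first = grouped.collect()   # evaluates, with the storage level already known
second = grouped.collect()  # served from the cached value
print(first)                # [('a', [1, 3]), ('b', [2])]
```

In this toy the grouping runs once, on the first collect(), and the second call reuses the stored result.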

What I'd expect to happen would be that the RDD is materialized in memory
up until it fills up the BlockManager.  At that point it starts spilling
blocks out to disk in order to keep from OOM'ing.  I'm not sure if new
blocks go straight to disk or if the BlockManager pages already-existing
blocks out in order to make room for new blocks.

You can always read through source to figure it out though!

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L588




On Wed, Apr 9, 2014 at 6:52 AM, Surendranauth Hiraman <
suren.hiraman@velos.io> wrote:

> Yes, MEMORY_AND_DISK.
>
> We do a groupByKey and then call persist on the resulting RDD. So I'm
> wondering if groupByKey is aware of the subsequent persist setting to use
> disk or just creates the Seq[V] in memory and only uses disk after that
> data structure is fully realized in memory.
>
> -Suren

Re: Spark Disk Usage

Posted by Surendranauth Hiraman <su...@velos.io>.
Yes, MEMORY_AND_DISK.

We do a groupByKey and then call persist on the resulting RDD. So I'm
wondering if groupByKey is aware of the subsequent persist setting to use
disk or just creates the Seq[V] in memory and only uses disk after that
data structure is fully realized in memory.

-Suren



On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash <an...@andrewash.com> wrote:

> Which persistence level are you talking about? MEMORY_AND_DISK ?

Re: Spark Disk Usage

Posted by Andrew Ash <an...@andrewash.com>.
Which persistence level are you talking about? MEMORY_AND_DISK ?

Sent from my mobile phone
On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" <su...@velos.io>
wrote:

> Thanks, Andrew. That helps.
>
> For 1, it sounds like the data for the RDD is held in memory and then only
> written to disk after the entire RDD has been realized in memory. Is that
> correct?
>
> -Suren

Re: Spark Disk Usage

Posted by Surendranauth Hiraman <su...@velos.io>.
Thanks, Andrew. That helps.

For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after the entire RDD has been realized in memory. Is that
correct?

-Suren



On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash <an...@andrewash.com> wrote:

> For 1, persist can be used to save an RDD to disk using the various
> persistence levels.  When a persistence level is set on an RDD, the RDD is
> saved to memory/disk/elsewhere when it is evaluated so that it can be
> re-used.  The level is applied to that RDD, so subsequent uses of the RDD
> can use the cached value.
>
>
> https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
>
> 2. The other place disk is most commonly used is shuffles.  If you have
> data across the cluster that comes from a source, then you might not have
> to hold it all in memory at once.  But if you do a shuffle, which scatters
> the data across the cluster in a certain way, then you have to have the
> memory/disk available for that RDD all at once.  In that case, shuffles
> will sometimes need to spill over to disk for large RDDs, which can be
> controlled with the spark.shuffle.spill setting.
>
> Does that help clarify?

Re: Spark Disk Usage

Posted by Andrew Ash <an...@andrewash.com>.
For 1, persist can be used to save an RDD to disk using the various
persistence levels.  When a persistence level is set on an RDD, the RDD is
saved to memory/disk/elsewhere when it is evaluated so that it can be
re-used.  The level is applied to that RDD, so subsequent uses of the RDD
can use the cached value.

https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence

2. The other place disk is most commonly used is shuffles.  If you have
data across the cluster that comes from a source, then you might not have
to hold it all in memory at once.  But if you do a shuffle, which scatters
the data across the cluster in a certain way, then you have to have the
memory/disk available for that RDD all at once.  In that case, shuffles
will sometimes need to spill over to disk for large RDDs, which can be
controlled with the spark.shuffle.spill setting.
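
For intuition about why a shuffle-style grouping may need disk, here is a toy Python sketch of a group-by that spills. It uses a generic external-grouping technique (sorted runs written to temporary files, then merged) and is not a description of how Spark implements shuffles or what spark.shuffle.spill does internally. The function names and the max_in_memory parameter are hypothetical.

```python
import heapq
import itertools
import os
import pickle
import tempfile


def _write_run(records, spill_dir, run_index):
    """Write one sorted run of (key, value) records to a spill file."""
    path = os.path.join(spill_dir, f"run_{run_index}.bin")
    with open(path, "wb") as f:
        for record in records:
            pickle.dump(record, f)
    return path


def _read_run(path):
    """Stream records back from a spill file one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def group_by_key_with_spill(records, max_in_memory):
    """Group (key, value) pairs, spilling sorted runs to disk when the buffer fills."""
    spill_dir = tempfile.mkdtemp(prefix="toy_shuffle_")
    buffer, run_paths = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it by key and move it out to disk.
            buffer.sort(key=lambda kv: kv[0])
            run_paths.append(_write_run(buffer, spill_dir, len(run_paths)))
            buffer = []
    buffer.sort(key=lambda kv: kv[0])
    # Merge the on-disk runs with whatever is still in memory, by key only.
    runs = [_read_run(p) for p in run_paths] + [iter(buffer)]
    merged = heapq.merge(*runs, key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in group])
               for k, group in itertools.groupby(merged, key=lambda kv: kv[0])]
    return grouped, len(run_paths)


records = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]
grouped, spill_count = group_by_key_with_spill(records, max_in_memory=2)
print(grouped)      # [('a', [2, 4]), ('b', [1, 3]), ('c', [5])]
print(spill_count)  # 2
```

With max_in_memory larger than the input, no run is written and the grouping happens entirely in memory, which mirrors the "sometimes need to spill" wording above.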

Does that help clarify?


On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <
suren.hiraman@velos.io> wrote:

> It might help if I clarify my questions. :-)
>
> 1. Is persist() applied during the transformation right before the
> persist() call in the graph? Or is it applied after the transform's
> processing is complete? In the case of things like GroupBy, is the Seq
> backed by disk as it is being created? We're trying to get a sense of how
> the processing is handled behind the scenes with respect to disk.
>
> 2. When else is disk used internally?
>
> Any pointers are appreciated.
>
> -Suren

Re: Spark Disk Usage

Posted by Surendranauth Hiraman <su...@velos.io>.
It might help if I clarify my questions. :-)

1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're trying to get a sense of how
the processing is handled behind the scenes with respect to disk.

2. When else is disk used internally?

Any pointers are appreciated.

-Suren




On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <
suren.hiraman@velos.io> wrote:

> Hi,
>
> Any thoughts on this? Thanks.
>
> -Suren

Re: Spark Disk Usage

Posted by Surendranauth Hiraman <su...@velos.io>.
Hi,

Any thoughts on this? Thanks.

-Suren



On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
suren.hiraman@velos.io> wrote:

> Hi,
>
> I know if we call persist with the right options, we can have Spark
> persist an RDD's data on disk.
>
> I am wondering what happens in intermediate operations that could
> conceivably create large collections/Sequences, like GroupBy and shuffling.
>
> Basically, one part of the question is when is disk used internally?
>
> And is calling persist() on the RDD returned by such transformations what
> lets it know to use disk in those situations? Trying to understand if
> persist() is applied during the transformation or after it.
>
> Thank you.