Posted to user@spark.apache.org by Michal Čizmazia <mi...@gmail.com> on 2015/07/09 16:09:38 UTC

change default storage level

Is there a way to change the default storage level?

If not, how can I properly change the storage level wherever necessary, if
my input and intermediate results do not fit into memory?

In this example:

context.wholeTextFiles(...)
    .flatMap(s -> ...)
    .flatMap(s -> ...)

Does persist() need to be called after every transformation?

 context.wholeTextFiles(...)
    .persist(StorageLevel.MEMORY_AND_DISK)
    .flatMap(s -> ...)
    .persist(StorageLevel.MEMORY_AND_DISK)
    .flatMap(s -> ...)
    .persist(StorageLevel.MEMORY_AND_DISK)

 Thanks!

Re: change default storage level

Posted by Michal Čizmazia <mi...@gmail.com>.
Thanks Shixiong! Your response helped me understand the role of
persist(). Indeed, no persist() calls were required. I solved my problem by
pointing spark.local.dir at a location with more space for Spark's temporary
files; the spill to disk then happens automatically. I am now seeing logs like
this:

    Not enough space to cache rdd_0_1 in memory!
    Persisting partition rdd_0_1 to disk instead.

Before that, I was getting:

    No space left on device
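
For completeness, here is a minimal sketch of setting spark.local.dir
programmatically before the SparkContext is created (Scala API; the path below
is just an illustrative placeholder). Note that on standalone or YARN clusters
the SPARK_LOCAL_DIRS / LOCAL_DIRS environment variables set by the cluster
manager take precedence over this property; it can also be set in
conf/spark-defaults.conf or passed with --conf on spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative path on a volume with enough free space for spill files
    val conf = new SparkConf()
      .setAppName("wholeTextFiles-example")
      .set("spark.local.dir", "/mnt/big-volume/spark-tmp")

    val context = new SparkContext(conf)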


On 9 July 2015 at 11:57, Shixiong Zhu <zs...@gmail.com> wrote:

> Spark won't store RDDs in memory unless you use a memory StorageLevel. By
> default, your input and intermediate results won't be put into memory. You
> can call persist if you want to avoid duplicate computation or reading.
> E.g.,
>
> val r1 = context.wholeTextFiles(...)
> val r2 = r1.flatMap(s => ...)
> val r3 = r2.filter(...)...
> r3.saveAsTextFile(...)
> val r4 = r2.map(...)...
> r4.saveAsTextFile(...)
>
> In the above example, r2 is used twice. To speed up the computation,
> you can call r2.persist(StorageLevel.MEMORY_ONLY) to store r2 in memory. Then
> r4 will use the data of r2 in memory directly. E.g.,
>
> val r1 = context.wholeTextFiles(...)
> val r2 = r1.flatMap(s => ...)
> r2.persist(StorageLevel.MEMORY_ONLY)
> val r3 = r2.filter(...)...
> r3.saveAsTextFile(...)
> val r4 = r2.map(...)...
> r4.saveAsTextFile(...)
>
> See
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-07-09 22:09 GMT+08:00 Michal Čizmazia <mi...@gmail.com>:
>
>> Is there a way to change the default storage level?
>>
>> If not, how can I properly change the storage level wherever necessary,
>> if my input and intermediate results do not fit into memory?
>>
>> In this example:
>>
>> context.wholeTextFiles(...)
>>     .flatMap(s -> ...)
>>     .flatMap(s -> ...)
>>
>> Does persist() need to be called after every transformation?
>>
>>  context.wholeTextFiles(...)
>>     .persist(StorageLevel.MEMORY_AND_DISK)
>>     .flatMap(s -> ...)
>>     .persist(StorageLevel.MEMORY_AND_DISK)
>>     .flatMap(s -> ...)
>>     .persist(StorageLevel.MEMORY_AND_DISK)
>>
>>  Thanks!
>>
>>
>

Re: change default storage level

Posted by Shixiong Zhu <zs...@gmail.com>.
Spark won't store RDDs in memory unless you use a memory StorageLevel. By
default, your input and intermediate results won't be put into memory. You
can call persist if you want to avoid duplicate computation or reading.
E.g.,

val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)

In the above example, r2 is used twice. To speed up the computation,
you can call r2.persist(StorageLevel.MEMORY_ONLY) to store r2 in memory. Then
r4 will use the data of r2 in memory directly. E.g.,

val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
r2.persist(StorageLevel.MEMORY_ONLY)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)
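
If r2 does not fit entirely in memory, as in the original question,
StorageLevel.MEMORY_AND_DISK can be used instead of MEMORY_ONLY: partitions
that fit are kept in memory and the rest are spilled to disk. A minimal sketch
continuing the example above (the output paths are just placeholders), with
unpersist() releasing the cached data once both outputs are written:

import org.apache.spark.storage.StorageLevel

// Keep what fits in memory, spill the remaining partitions of r2 to disk
r2.persist(StorageLevel.MEMORY_AND_DISK)
r3.saveAsTextFile("/tmp/out-r3")
r4.saveAsTextFile("/tmp/out-r4")
r2.unpersist()  // drop the cached copies once they are no longer needed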

See
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence


Best Regards,
Shixiong Zhu

2015-07-09 22:09 GMT+08:00 Michal Čizmazia <mi...@gmail.com>:

> Is there a way to change the default storage level?
>
> If not, how can I properly change the storage level wherever necessary, if
> my input and intermediate results do not fit into memory?
>
> In this example:
>
> context.wholeTextFiles(...)
>     .flatMap(s -> ...)
>     .flatMap(s -> ...)
>
> Does persist() need to be called after every transformation?
>
>  context.wholeTextFiles(...)
>     .persist(StorageLevel.MEMORY_AND_DISK)
>     .flatMap(s -> ...)
>     .persist(StorageLevel.MEMORY_AND_DISK)
>     .flatMap(s -> ...)
>     .persist(StorageLevel.MEMORY_AND_DISK)
>
>  Thanks!
>
>