Posted to user@spark.apache.org by Tsai Li Ming <ma...@ltsai.com> on 2014/11/20 13:12:06 UTC
RDD memory and storage level option
Hi,
This is on version 1.1.0.
I did a simple test of the MEMORY_AND_DISK storage level.
> var file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
> file.count()
The file is 1.5 GB and there is only 1 worker. I have requested 1 GB of worker memory per node:
ID: app-20141120193912-0002
Name: Spark shell
Cores: 64
Memory per Node: 1024.0 MB
Submitted Time: 2014/11/20 19:39:12
User: root
State: RUNNING
Duration: 6.0 min
After doing a simple count, the job web UI indicates the entire file is saved on disk:
RDD Name: file:///path/to/file.txt
Storage Level: Disk Serialized 1x Replicated
Cached Partitions: 46
Fraction Cached: 100%
Size in Memory: 0.0 B
Size in Tachyon: 0.0 B
Size on Disk: 1476.5 MB
1. Shouldn’t some partitions be saved into memory?
2. If I run with the MEMORY_ONLY option, some partitions are cached in memory, but according to the executor page there is still space left (220.6 MB used out of 530.3 MB) that was not filled. Each partition is about 73 MB.
RDD Name: file:///path/to/file.txt
Storage Level: Memory Deserialized 1x Replicated
Cached Partitions: 3
Fraction Cached: 7%
Size in Memory: 220.6 MB
Size in Tachyon: 0.0 B
Size on Disk: 0.0 B
Executor ID: 0
Address: foo.co:48660
RDD Blocks: 3
Memory Used: 220.6 MB / 530.3 MB
Disk Used: 0.0 B
Active Tasks: 0
Failed Tasks: 0
Complete Tasks: 46
Total Tasks: 46
Task Time: 14.2 m
Input: 1457.4 MB
Shuffle Read: 0.0 B
Shuffle Write: 0.0 B
14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at <console>:16) finished in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/20 19:53:24 INFO SparkContext: Job finished: count at <console>:16, took 31.926585742 s
res0: Long = 10000000
Is this correct?
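To make question 2 concrete: my understanding (an assumption on my part) is that MEMORY_ONLY caches whole partitions and simply drops, rather than spills, any partition that does not fit in the remaining budget. Under that greedy whole-partition packing, I would naively expect about 7 of the ~73.6 MB partitions to fit in 530.3 MB, not 3:

```python
# Naive expectation for MEMORY_ONLY caching (assumed semantics: partitions
# are cached whole; a partition that does not fit in the remaining budget
# is recomputed later rather than spilled to disk).
budget_mb = 530.3      # storage memory shown on the executor page
partition_mb = 73.6    # per-partition size from the BlockManagerInfo log
total_partitions = 46

cached = 0
used_mb = 0.0
for _ in range(total_partitions):
    if used_mb + partition_mb <= budget_mb:
        used_mb += partition_mb
        cached += 1

print(cached, round(used_mb, 1))  # expectation: 7 partitions, 515.2 MB
```

That is quite different from the 3 partitions / 220.6 MB the UI reports, which is what puzzles me.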
3. I can’t seem to work out the math to derive the 530 MB made available to the executor: 1024 MB * memoryFraction (0.6) = 614.4 MB.
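For what it’s worth, my current guess at the arithmetic (an assumption I have not confirmed against the 1.1.0 source): on top of spark.storage.memoryFraction (0.6) there is also spark.storage.safetyFraction (default 0.9), and the base is the JVM’s Runtime.maxMemory, which comes in somewhat below -Xmx because one survivor space is excluded:

```python
# Hypothetical reconstruction of the 530.3 MB figure (assumed formula:
# Runtime.maxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction).
memory_fraction = 0.6      # spark.storage.memoryFraction default
safety_fraction = 0.9      # spark.storage.safetyFraction default (assumption)
jvm_max_memory_mb = 982.0  # approx. Runtime.maxMemory for -Xmx1024m; below
                           # 1024 because one survivor space is excluded

storage_mb = jvm_max_memory_mb * memory_fraction * safety_fraction
print(round(storage_mb, 1))  # 530.3, matching the executor page
```

If that is right, the 614.4 MB figure is missing the safety fraction and the JVM overhead.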
Thanks!
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org