Posted to user@spark.apache.org by Tsai Li Ming <ma...@ltsai.com> on 2014/11/20 13:12:06 UTC

RDD memory and storage level option

Hi,

This is on version 1.1.0.

I did a simple test of the MEMORY_AND_DISK storage level.

> import org.apache.spark.storage.StorageLevel
> val file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
> file.count()
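
For reference, here is a small sketch (run in the same spark-shell session) that reads the cached-size numbers back programmatically instead of from the web UI. getRDDStorageInfo is a developer API on SparkContext in 1.1.0, and the exact RDDInfo field names below are my assumption from the API docs:

    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
              s"mem=${info.memSize / (1024 * 1024)} MB, disk=${info.diskSize / (1024 * 1024)} MB")
    }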

The file is 1.5 GB and there is only one worker. I requested 1 GB of worker memory per node:
                                                                                                                              
   ID                       Name         Cores  Memory per Node  Submitted Time       User  State    Duration
   app-20141120193912-0002  Spark shell  64     1024.0 MB        2014/11/20 19:39:12  root  RUNNING  6.0 min
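
For completeness, the shell was launched roughly as follows; the master URL is a placeholder and the exact flags are my assumption, but --executor-memory is what maps to the 1024.0 MB shown above:

    ./bin/spark-shell --master spark://<master-host>:7077 --executor-memory 1g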


After doing a simple count, the job web UI indicates that the entire file was stored on disk:

   RDD Name                  Storage Level                  Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
   file:///path/to/file.txt  Disk Serialized 1x Replicated  46                 100%             0.0 B           0.0 B            1476.5 MB

1. Shouldn’t some of the partitions have been stored in memory?
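
For comparison, a variant of the same test with the serialized on-heap level would look like this (just a sketch I have not run; same placeholder path, and fileSer is only an illustrative name):

> import org.apache.spark.storage.StorageLevel
> val fileSer = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK_SER)
> fileSer.count()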




2. If I run with the MEMORY_ONLY option, some partitions are cached in memory, but according to the executor page only 220.6 MB of the 530.3 MB is used, so why doesn’t Spark fill up the remaining space? Each partition is about 73 MB.

   RDD Name                  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
   file:///path/to/file.txt  Memory Deserialized 1x Replicated  3                  7%               220.6 MB        0.0 B            0.0 B
                                              
   Executor ID  Address       RDD Blocks  Memory Used          Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Input      Shuffle Read  Shuffle Write
   0            foo.co:48660  3           220.6 MB / 530.3 MB  0.0 B      0             0             46              46           14.2 m     1457.4 MB  0.0 B         0.0 B
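
A quick arithmetic check against the numbers above (all values copied from the UI; it just confirms the per-partition size and the 7% fraction):

    // ~73.5 MB per cached partition; 3 of 46 partitions is ~6.5%, which the UI shows as 7%
    val perPartitionMB = 220.6 / 3   // ≈ 73.5
    val fractionCached = 3.0 / 46    // ≈ 0.065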

14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at <console>:16) finished in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/11/20 19:53:24 INFO SparkContext: Job finished: count at <console>:16, took 31.926585742 s
res0: Long = 10000000

Is this correct?



3. I can’t seem to work out the math that derives the 530 MB made available to the executor: 1024 MB * spark.storage.memoryFraction (0.6) = 614.4 MB, which doesn’t match.
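
For what it’s worth, here is the arithmetic I have been trying, with one extra assumption on my part: the 1.1.0 block manager seems to size its storage pool from the JVM’s reported max heap (which is a bit less than the configured 1024 MB) and to apply spark.storage.safetyFraction (default 0.9) on top of spark.storage.memoryFraction (0.6). A sketch, with those factors treated as assumptions to verify:

    // Run inside the executor JVM; the driver's reported heap may differ.
    val jvmMaxHeapMB   = Runtime.getRuntime.maxMemory / (1024.0 * 1024.0) // < 1024 even with -Xmx1024m
    val memoryFraction = 0.6  // spark.storage.memoryFraction (default)
    val safetyFraction = 0.9  // spark.storage.safetyFraction (default) -- assumption
    println(jvmMaxHeapMB * memoryFraction * safetyFraction)  // storage pool estimate in MB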

Thanks!





---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org