Posted to user@spark.apache.org by Ashok Kumar <as...@yahoo.com.INVALID> on 2017/06/25 07:13:15 UTC

RDD and DataFrame persistent memory usage

Gurus,
I understand that when we create an RDD in Spark it is immutable.
So I have a few questions, please:
   
   - When an RDD is created, is it just a pointer (a lineage)? Since most Spark operations are lazy, is the RDD not materialised until an action such as a collect is run against it?
   - When a DataFrame is created from an RDD, does that consume additional memory for the DataFrame? And does an action then materialise both the RDD and the DataFrame built from it?
   - There are some references suggesting that as you chain operations and create new DataFrames, you consume more and more memory without releasing it. Is that correct?
   - What happens if I call df.unpersist()? My understanding is that it moves the DataFrame from memory (cache) to disk. Will that reduce memory overhead?
   - Is it a good idea to unpersist to reduce memory overhead? (See the sketch after this list.)
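
To make these questions concrete, here is a minimal Scala sketch of what I mean. The SparkSession builder, the local[*] master and the column name "value" are just placeholders for my real job, not the actual code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// In spark-shell a SparkSession called `spark` already exists; the builder
// below is only here to make the sketch self-contained.
val spark = SparkSession.builder()
  .appName("rdd-df-memory-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// 1. Creating an RDD only records a lineage; nothing is computed yet.
val rdd = spark.sparkContext.parallelize(1 to 1000000)
val doubled = rdd.map(_ * 2)                 // still lazy, no data held in memory

// 2. Building a DataFrame from the RDD is also lazy; no extra copy of the
//    data is materialised at this point.
val df = doubled.toDF("value")
val filtered = df.filter($"value" % 4 === 0) // still just a plan

// 3. An action (count, collect, write, ...) runs the whole lineage;
//    intermediate results are discarded unless explicitly cached.
println(filtered.count())

// 4. persist()/cache() keep the computed data after the next action, at the
//    chosen storage level (MEMORY_AND_DISK is the default for DataFrames).
filtered.persist(StorageLevel.MEMORY_AND_DISK)
println(filtered.count())                    // this action fills the cache

// 5. As I understand it, unpersist() releases the cached blocks and frees
//    executor storage memory; the DataFrame (its lineage) stays usable and
//    is simply recomputed if referenced again. Is this the right way to
//    keep memory overhead down?
filtered.unpersist()

spark.stop()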


Thanking you