Posted to user@spark.apache.org by Lyx <11...@qq.com> on 2020/04/20 03:02:50 UTC

[Spark SQL] Issue about difference in memory size between DataFrame and RDD

Hello,

I have been using Spark for a project recently, and I noticed that when I load data stored in Hadoop HDFS, there is a huge difference in JVM memory usage between caching it as a DataFrame and caching it as an RDD. My spark-shell script is listed below. The original files (testData) are ordinary text files, about 11GB in total on disk. Each line has the format "Id1,Id2", where Id1 and Id2 are random int32 numbers.

/* code segment 

import java.io.DataOutputStream
import java.util
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.collection.mutable.ArrayBuffer

// These text files total 11GB on disk
val filePath = "hdfs://10.10.23.105:9000/testData"

// Two integer columns, named col0 and col1
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)

val df: Dataset[Row] = spark.read.format("csv").schema(schema).load(filePath)

// Case 1: the cached DataFrame, which turns out to be 5.5GB in memory
df.cache()
df.count()

// Case 2: the RDD obtained from the DataFrame, which turns out to be 95GB in memory
df.rdd.cache()
df.rdd.count()

// Case 3: a plain RDD of text lines, which turns out to be 88GB in memory
val pureRDD = spark.sparkContext.textFile(filePath)
pureRDD.cache()
pureRDD.count()

// The line below goes wrong when I call collect(), even though the driver has 200GB and the executor has 300GB of memory allocated
df.collect()

*/
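
The in-memory sizes quoted in the comments above can also be read programmatically. A minimal sketch, assuming it is pasted into the same spark-shell session (so `spark` is the session that spark-shell provides) after the `count()` calls have materialized the caches. `cachedSizes` is a name made up for this illustration, and `getRDDStorageInfo` is a developer API, so its output may differ between Spark versions:

```scala
// Assumes a running spark-shell session: `spark` is the SparkSession it creates.
// `cachedSizes` is a hypothetical name used only in this sketch.
val cachedSizes: Array[(String, Long)] =
  spark.sparkContext.getRDDStorageInfo.map(info => (info.name, info.memSize))

// Print one line per persisted RDD, with its in-memory size
// converted from bytes to GB.
cachedSizes.foreach { case (name, bytes) =>
  println(f"$name%-40s ${bytes / 1e9}%.1f GB")
}
```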




So I have run into two problems:

Q1: I loaded and cached the very same raw files in three formats, as shown above: DataFrame, DataFrame.rdd, and RDD. I found that the DataFrame used just 5.5GB of JVM memory, whereas df.rdd used nearly 95GB and the RDD used about 69GB. So I am wondering why the RDD and DataFrame.rdd take so much memory when the original files are so small?
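
To make the comparison in Q1 concrete at the level of a single record, here is a rough sketch that estimates the JVM heap footprint of one line as a `String` (an element of the text-file RDD) and as a `Row` of two integers (roughly an element of `df.rdd`). The sample line and the names `line`, `ids`, `onDisk`, `asString` and `asRow` are made up for illustration. `SizeEstimator` gives an estimate only, and a `Row` coming out of `df.rdd` may also carry a schema reference that a bare `Row(...)` does not:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

// Hypothetical sample line in the "Id1,Id2" format described above
val line = "123456789,987654321"
val ids = line.split(",").map(_.toInt)

// Bytes this record occupies in the text file, including the newline
val onDisk = line.length + 1

// Estimated heap footprint of the same record as JVM objects
val asString = SizeEstimator.estimate(line)                // one element of the text RDD
val asRow = SizeEstimator.estimate(Row(ids(0), ids(1)))    // a Row holding two boxed ints

println(s"on disk: $onDisk B, as String: $asString B, as Row: $asRow B")
```

Multiplying the per-record estimates by the number of lines gives a figure that can be compared against the cached sizes reported above.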




Q2: I also noticed that when I call df.collect(), it blocks indefinitely without any exception or further information, whereas RDD.collect() does not have this problem and returns the result successfully.

(P.S. My driver is allocated 200GB of JVM heap and the executor 300GB, which should be more than enough for such a collect action.)

I would appreciate your attention and help.

Best regards, and thanks!




Department of Engineering Mechanics
Zhejiang University
Hangzhou 310027, P.R. China
Mobile: (+86)15158859317
E-mail: lyx_zane@zju.edu.cn



Sent from my iPhone