Posted to user@spark.apache.org by swastik mittal <sm...@ncsu.edu> on 2019/03/14 16:53:57 UTC

How does Spark operate internally for an individual task?

I am running a grep application on Spark 2.3.4 with Scala 2.11. I have an
813MB input text file stored via HDFS on a remote source (not part of the
Spark infrastructure). My application reads the text file line by line from
the HDFS server, filters each line for a given keyword, and prints the
matches, like grep in Linux. HDFS divides the file into 128MB chunks, so my
application is distributed into 7 tasks in 1 stage (stage 0).
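
For reference, the application is essentially the following (a minimal
sketch; the HDFS URI, the keyword "error", and the object name are
placeholders, not my real values):

    import org.apache.spark.sql.SparkSession

    object SparkGrep {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SparkGrep").getOrCreate()
        val sc = spark.sparkContext

        // an 813MB file with 128MB HDFS blocks gives 7 input splits,
        // hence 7 tasks in a single stage (stage 0)
        val lines = sc.textFile("hdfs://remote-host:9000/data/input.txt")

        // filter() adds a MapPartitionsRDD on top of the Hadoop-backed RDD
        val matches = lines.filter(_.contains("error"))

        matches.foreach(println) // printed on executor stdout, like grep
        spark.stop()
      }
    }
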
I want to analyze the time Spark spends on a task in the compute function of
HadoopRDD. For that I record and log every time HadoopRDD's compute, read,
updateRecords, or updateBytesRead is called, and also every time the filter
RDD's (MapPartitionsRDD's) compute is called and the filter function Spark
builds is invoked.
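
To illustrate the kind of timing I am after without patching the Spark
source, here is a user-level wrapper I could use instead (a sketch only;
timed and the labels are my own names, not Spark API):

    import scala.reflect.ClassTag
    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.RDD

    // Wraps an RDD so the total time spent pulling records through this
    // point is logged per partition when the task completes. Both hasNext
    // and next are timed, because Scala's filtered iterators do their
    // work in hasNext.
    def timed[T: ClassTag](rdd: RDD[T], label: String): RDD[T] =
      rdd.mapPartitions { iter =>
        val ctx = TaskContext.get()
        val pid = ctx.partitionId()
        var total = 0L // ns spent in everything upstream of this wrapper
        ctx.addTaskCompletionListener { _ =>
          println(s"[$label] partition $pid: $total ns spent upstream")
        }
        new Iterator[T] {
          def hasNext: Boolean = {
            val t0 = System.nanoTime(); val h = iter.hasNext
            total += System.nanoTime() - t0; h
          }
          def next(): T = {
            val t0 = System.nanoTime(); val r = iter.next()
            total += System.nanoTime() - t0; r
          }
        }
      }

Wrapping both RDDs, e.g. val lines = timed(sc.textFile(path), "read") and
val matches = timed(lines.filter(_.contains(keyword)), "filter"), the
difference between the two logged totals should approximate the time spent
in the filter itself, since the "filter" wrapper pulls through the "read"
wrapper.
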
What I observe is that the MapPartitionsRDD, which is the child RDD, has its
compute and filter function called first, and once HadoopRDD's compute is
called, the log never again shows a compute or filter operation of
MapPartitionsRDD. But Spark cannot filter data before reading it, so the
filter computation has to happen after a read operation. Does this filter
operation run on every record as it is read, or only once the whole text
file chunk has been read? Also, how can I separate the timing information
for the two, i.e., know exactly when the first MapPartitionsRDD operation
finished?
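
To make my mental model concrete, here is a plain-Scala toy sketch (no
Spark) of the per-record pipelining I suspect is happening, since, as far
as I can tell from the source, RDD.filter just wraps the parent's iterator:

    // nothing below runs until the final iterator is consumed;
    // then "read" and "filter" interleave once per record
    val source = Iterator("a line", "error line", "last line").map { s =>
      println(s"read:   $s"); s
    }
    val filtered = source.filter { s =>
      println(s"filter: $s"); s.contains("error")
    }
    filtered.foreach(s => println(s"match:  $s"))

If Spark's compute chain behaves like this, the child RDD's compute would
indeed be entered first, and the filtering would then run record by record
as the HadoopRDD iterator is consumed.
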
Any help is appreciated.

Thanks


