Posted to user@spark.apache.org by raggy <ra...@gmail.com> on 2015/03/14 05:56:15 UTC

Aggregation of distributed datasets

I am a PhD student trying to understand the internals of Spark so that I can
make some modifications to it. I am trying to understand how the aggregation
of distributed datasets (over the network) onto the driver node works.
I would very much appreciate it if someone could point me towards the source
code involved in the aggregation over the network. An explanation of how it
works would also be appreciated.
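To make the question concrete, here is my current mental model of the driver-side merge, as a self-contained sketch (the names runJob/resultHandler are only loosely modeled on what I have seen in SparkContext; this is my own simplified code, not actual Spark code):

```scala
object DriverMergeSketch {
  // Simplified model: each "partition" is reduced locally (as a task on an
  // executor would do), and the driver merges the per-partition results as
  // they arrive via a result-handler callback.
  def runJob[T](partitions: Seq[Seq[T]],
                localReduce: Seq[T] => T,
                resultHandler: (Int, T) => Unit): Unit = {
    partitions.zipWithIndex.foreach { case (part, idx) =>
      // In real Spark this callback would fire on task completion,
      // after the result travels back over the network.
      resultHandler(idx, localReduce(part))
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    var total = 0
    runJob[Int](partitions, _.sum, (_, partial) => total += partial)
    println(total) // 21
  }
}
```

Is this roughly the shape of what happens when an action like reduce() runs, and if so, where in the source does the network transfer of each partial result take place?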

So far, I have followed the code and identified that the handleJobSubmitted()
function in DAGScheduler.scala is invoked when a job is scheduled. Then,
since I am running it on a cluster, I reach

listenerBus.post(SparkListenerJobStart(job.jobId, jobSubmissionTime,
stageInfos, properties)) on line 759 in DAGScheduler.scala. I am not sure
where to go from here.
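In case it helps to frame what I am looking for: my guess is that after the job is submitted, some driver-side object waits for task results and hands each one to a callback. Below is a toy version of that pattern (SimpleJobWaiter is my own name, loosely inspired by seeing "JobWaiter" mentioned in the scheduler code; this is not Spark's actual implementation):

```scala
import java.util.concurrent.CountDownLatch

object JobWaiterSketch {
  // Toy waiter: the driver blocks until every partition's result has been
  // reported, with results possibly arriving out of order from executors.
  class SimpleJobWaiter[T](numTasks: Int, resultHandler: (Int, T) => Unit) {
    private val latch = new CountDownLatch(numTasks)

    // Called once per finishing task (in my mental model, from wherever
    // the scheduler handles task completion events).
    def taskSucceeded(index: Int, result: T): Unit = synchronized {
      resultHandler(index, result)
      latch.countDown()
    }

    // The action on the driver blocks here until all partitions report.
    def awaitResult(): Unit = latch.await()
  }

  def main(args: Array[String]): Unit = {
    val results = new Array[Int](3)
    val waiter = new SimpleJobWaiter[Int](3, (i, r) => results(i) = r)
    // Simulate tasks completing on separate threads, out of order.
    val threads = Seq(2 -> 30, 0 -> 10, 1 -> 20).map { case (i, r) =>
      new Thread(() => waiter.taskSucceeded(i, r))
    }
    threads.foreach(_.start())
    waiter.awaitResult()
    println(results.mkString(",")) // 10,20,30
  }
}
```

Is this the right shape, and if so, which classes implement the completion path between the executors and this waiting step on the driver?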



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Aggregation-of-distributed-datasets-tp22048.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org