Posted to user@spark.apache.org by Or Raz <ra...@post.bgu.ac.il> on 2016/10/23 11:00:01 UTC

Dataflow of Spark/Hadoop in steps

I would like to know, step by step, what is going on in my cluster (say, a
master node and 6 workers) if I have 100 GB of data and want to find the most
common word. (1)
What does the master do (2)? Does it start the MapReduce job, monitor the
traffic, and return the result? The same goes for the mappers and reducers:
what do they do, and can they run on different nodes/workers? (3)
Do the reducers always wait for all the mappers to finish before they
start? (4) And who combines/assembles the final output?
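To make the questions concrete, here is a minimal sketch of the map -> shuffle
-> reduce dataflow for a word count. It simulates all three phases in one
Python process; the function names (`map_phase`, `shuffle`, `reduce_phase`)
are my own illustration, not the Hadoop or Spark API:

```python
from collections import defaultdict

def map_phase(split):
    # Mapper: emit a (word, 1) pair for every word in its input split.
    return [(word, 1) for word in split.split()]

def shuffle(mapped, num_reducers):
    # Shuffle: partition by hash(key) so every occurrence of a given
    # word ends up at the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in mapped:
        partitions[hash(word) % num_reducers][word].append(count)
    return partitions

def reduce_phase(partition):
    # Reducer: sum the counts for each key it received.
    return {word: sum(counts) for word, counts in partition.items()}

# Two "splits" standing in for blocks of the 100 GB input.
splits = ["dog banana oreo oreo", "oreo dog oreo"]
mapped = [kv for s in splits for kv in map_phase(s)]
counts = {}
for part in shuffle(mapped, num_reducers=2):
    counts.update(reduce_phase(part))
print(counts["oreo"])  # 4
```

In the real framework each phase runs on different workers and the shuffle
moves data over the network, but the key-to-reducer partitioning works the
same way.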

For example:

This is the input for the reducers, as tuples (I chose an easy example in
which each word is unique to one reducer, which means the shuffle step has
been done correctly):

reducer 1: {dog,1}, {banana,1}, {oreo,6}
reducer 2: {peach,1}, {mesut,5}, {ozil,10}
reducer 3: {I,4}, {witch,2}
reducer 4: {fear,1}, {goal,6}, {arsenal,3}

The output of each reducer should be:

reducer 1: {oreo,6} 
reducer 2: {ozil,10} 
reducer 3: {I,4} 
reducer 4: {goal,6}
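In code, this per-reducer step is just taking the maximum over each reducer's
local tuples. A minimal Python sketch using the example data above (the
variable names are my own, not a framework API):

```python
# Each inner list is the partition one reducer receives after the shuffle.
reducer_inputs = [
    [("dog", 1), ("banana", 1), ("oreo", 6)],
    [("peach", 1), ("mesut", 5), ("ozil", 10)],
    [("I", 4), ("witch", 2)],
    [("fear", 1), ("goal", 6), ("arsenal", 3)],
]

# A reducer only sees its own partition, so it can only emit its
# *local* maximum; some other step must compare across reducers.
local_maxima = [max(part, key=lambda kv: kv[1]) for part in reducer_inputs]
print(local_maxima)
# [('oreo', 6), ('ozil', 10), ('I', 4), ('goal', 6)]
```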

Now we need to combine the results: to whom do we send each reducer's output,
and who does the final sort and aggregation (the master?) (5)? And in these
steps and the ones before, where do the I/O calls happen (6), i.e. when is
the data stored on local disk and when on HDFS?
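Whatever process does the final combine (for instance a driver/client, or a
second job with a single reducer) only has to compare one tuple per reducer,
which is why this step is cheap. A sketch, assuming the four local maxima
from the example:

```python
# One (word, count) tuple per reducer -- tiny compared to the 100 GB input.
local_maxima = [("oreo", 6), ("ozil", 10), ("I", 4), ("goal", 6)]

# The global answer is one more max over the per-reducer maxima.
word, count = max(local_maxima, key=lambda kv: kv[1])
print(word, count)  # ozil 10
```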

In addition, in Hadoop, as far as I know, we need to deploy the Map and
Reduce functions to the matching workers. Can we change the functions at
run time? (7) And if we have done a MapReduce job and want to do that again
(go over our results), do we have to split the data again and send it to
each worker? (8)

P.S. I have numbered the questions so it is clear where each question is.

Any more comments or notions would be appreciated :)



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataflow-of-Spark-Hadoop-in-steps-tp27946.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org