Posted to user@spark.apache.org by Carter <gy...@hotmail.com> on 2014/04/23 08:42:12 UTC

Need help about how hadoop works.

Hi, I am a beginner with Hadoop and Spark, and I would like some help in
understanding how Hadoop works.

Suppose we have a cluster of 5 computers, and we install Spark on the cluster
WITHOUT Hadoop. Then we run this code on one computer:
val doc = sc.textFile("/home/scalatest.txt",5)
doc.count
Can the "count" task be distributed to all 5 computers, or is it only
run by 5 parallel threads on the current computer?

On the other hand, if we install Hadoop on the cluster and upload the data
into HDFS, when we run the same code, will this "count" task be done by 25
threads?

Thank you very much for your help. 




RE: Need help about how hadoop works.

Posted by Carter <gy...@hotmail.com>.
Thank you very much Prashant.
 

Re: Need help about how hadoop works.

Posted by Prashant Sharma <sc...@gmail.com>.
It is the same file; the Hadoop library that we use for splitting takes care
of assigning the right split to each node.
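
A minimal sketch of inspecting those splits (the HDFS URL below is
hypothetical, and the output depends on your cluster):

     val doc = sc.textFile("hdfs://namenode:9000/scalatest.txt", 5)
     // how many input splits the Hadoop library produced
     println(doc.partitions.size)
     // the hosts Spark would prefer for each split (data locality)
     doc.partitions.foreach(p => println(doc.preferredLocations(p)))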

Prashant Sharma


On Thu, Apr 24, 2014 at 1:36 PM, Carter <gy...@hotmail.com> wrote:

> Thank you very much for your help Prashant.
>
> Sorry I still have another question about your answer: "however if the
> file("/home/scalatest.txt") is present on the same path on all systems, it
> will be processed on all nodes."
>
> When presenting the file at the same path on all nodes, do we simply
> copy the same file to all nodes, or do we need to split the original file
> into different parts (each part still with the same file name
> "scalatest.txt") and copy each part to a different node for
> parallelization?
>
> Thank you very much.
>

Re: Need help about how hadoop works.

Posted by Carter <gy...@hotmail.com>.
Thank you very much for your help Prashant.

Sorry I still have another question about your answer: "however if the
file("/home/scalatest.txt") is present on the same path on all systems, it
will be processed on all nodes."

When presenting the file at the same path on all nodes, do we simply
copy the same file to all nodes, or do we need to split the original file
into different parts (each part still with the same file name
"scalatest.txt") and copy each part to a different node for
parallelization?

Thank you very much.




Re: Need help about how hadoop works.

Posted by Prashant Sharma <sc...@gmail.com>.
Prashant Sharma


On Thu, Apr 24, 2014 at 12:15 PM, Carter <gy...@hotmail.com> wrote:

> Thanks Mayur.
>
> So without Hadoop and any other distributed file systems, by running:
>      val doc = sc.textFile("/home/scalatest.txt",5)
>      doc.count
> we can only get parallelization within the computer where the file is
> loaded, but not parallelization across the computers in the cluster
> (Spark cannot automatically duplicate the file to the other computers in
> the cluster). Is this understanding correct? Thank you.
>
>
Spark will not distribute that file for you to other systems; however, if
the file ("/home/scalatest.txt") is present on the same path on all systems,
it will be processed on all nodes. We generally use HDFS, which takes care
of this distribution.
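
A rough sketch of the two setups (the paths and the namenode URL are
hypothetical):

     // same absolute path on every node, e.g. copied there by hand
     val local = sc.textFile("/home/scalatest.txt", 5)
     println(local.count())
     // HDFS distributes the blocks across the cluster for you
     val onHdfs = sc.textFile("hdfs://namenode:9000/scalatest.txt", 5)
     println(onHdfs.count())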



Re: Need help about how hadoop works.

Posted by Carter <gy...@hotmail.com>.
Thanks Mayur.

So without Hadoop or any other distributed file system, by running:
     val doc = sc.textFile("/home/scalatest.txt",5)
     doc.count
we can only get parallelization within the computer where the file is
loaded, but not parallelization across the computers in the cluster
(Spark cannot automatically duplicate the file to the other computers in
the cluster). Is this understanding correct? Thank you.
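
For illustration, a sketch of that single-machine case (the local[5] master
setting is an assumption about how the shell was started):

     // e.g. started with MASTER=local[5], i.e. 5 worker threads on one machine
     val doc = sc.textFile("/home/scalatest.txt", 5)
     // the 5 tasks run as parallel threads on this computer only
     println(doc.count())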

 




Re: Need help about how hadoop works.

Posted by Mayur Rustagi <ma...@gmail.com>.
As long as the path is present and available on all machines, you should be
able to leverage distribution. HDFS is one way to make that happen; NFS is
another, and simple replication is a third.
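
A sketch of those three options (all paths and URLs below are hypothetical):

     // 1. HDFS: the filesystem spreads the blocks across the cluster
     val fromHdfs = sc.textFile("hdfs://namenode:9000/scalatest.txt")
     // 2. NFS: one copy, mounted at the same path on every node
     val fromNfs = sc.textFile("/mnt/nfs/scalatest.txt")
     // 3. simple replication: the same file copied to each node's local disk
     val fromCopy = sc.textFile("/home/scalatest.txt")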


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Apr 23, 2014 at 12:12 PM, Carter <gy...@hotmail.com> wrote:

> Hi, I am a beginner with Hadoop and Spark, and I would like some help in
> understanding how Hadoop works.
>
> Suppose we have a cluster of 5 computers, and we install Spark on the
> cluster WITHOUT Hadoop. Then we run this code on one computer:
> val doc = sc.textFile("/home/scalatest.txt",5)
> doc.count
> Can the "count" task be distributed to all 5 computers, or is it only
> run by 5 parallel threads on the current computer?
>
> On the other hand, if we install Hadoop on the cluster and upload the data
> into HDFS, when we run the same code, will this "count" task be done by 25
> threads?
>
> Thank you very much for your help.
>