Posted to user@spark.apache.org by Mutahir Ali <so...@outlook.com> on 2018/01/24 20:50:11 UTC

Apache Hadoop and Spark

Hello All,
Cordial Greetings,

I am trying to familiarize myself with Apache Hadoop and its various software components, and with how they can be deployed on physical or virtual infrastructure.

I have a few questions:

Q1) Can we use MapReduce and Apache Spark in the same cluster?
Q2) Is it mandatory to use GPUs for Apache Spark?
Q3) I read that Apache Spark is in-memory; will it benefit from SSD/flash for caching or persistent storage?
Q4) If we intend to deploy a Hadoop cluster with six servers, can we have GPUs in only two and restrict Apache Spark to those servers only?
Q5) Is it possible to virtualize Spark with GPU pass-through?
Q6) Which GPUs are recommended/compatible with Apache Spark (NVIDIA M10/M60)?

I will be grateful for your suggestions and answers - please accept my apologies for the totally noob questions 😊

Have a good day / evening ahead
Best




Re: Apache Hadoop and Spark

Posted by "jamison.bennett" <ja...@gmail.com>.
Hi Mutahir,

I will try to answer some of your questions.

Q1) Can we use MapReduce and Apache Spark in the same cluster?
Yes. I run a cluster with both MapReduce2 and Spark, using YARN as the
resource manager.
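As a sketch of what this looks like in practice, a Spark application can be submitted to the same YARN cluster that runs MapReduce2 jobs; the queue name, resource sizes, and jar path below are placeholder values for illustration:

```shell
# Submit a Spark application to YARN alongside MapReduce jobs.
# Queue name, executor sizing, and jar path are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --num-executors 4 \
  --executor-memory 4g \
  /path/to/your-app.jar
```

YARN then arbitrates cluster resources between the Spark executors and any concurrently running MapReduce containers.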

Q2) Is it mandatory to use GPUs for Apache Spark?
No. My cluster runs Spark and does not have any GPUs.

Q3) I read that Apache Spark is in-memory; will it benefit from SSD/flash
for caching or persistent storage?
As you noted, Spark is in-memory, but there are a few places where faster
storage may help:
- Reading input data into Spark from HDFS DataNodes
- RDD persistence
(https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)
when the chosen storage level includes one of the disk options
- The Spark shuffle service - between stages that process data in-memory,
intermediate results from Spark executors are written to storage and served
to the next stage by the shuffle service.
I don't have any benchmark results for these, but it may be something you
want to look into.

Thanks,
Jamison




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org