Posted to user@spark.apache.org by puneetloya <pu...@gmail.com> on 2018/01/27 17:07:25 UTC

Spark Streaming Cluster queries

Hi All,

A cluster with one Spark driver and multiple executors (5) is set up, with
Redis storing the data Spark has processed and S3 used for checkpointing. I
have a couple of queries about this setup.

1) How can I analyze which parts of the code execute on the Spark driver and
which parts execute on the executors?
2) Since the Spark driver gets results from the executors, should the
executors have any access to the Redis storage? (My guess is yes, because
executors may need data from Redis for further calculations.)
3) Should the Spark executors have access to the S3 checkpoint storage?
4) Can anyone share their checkpoint recovery strategy for S3?

Thanks,
Puneet



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: Spark Streaming Cluster queries

Posted by "vijay.bvp" <bv...@gmail.com>.

Assuming you are talking about Spark Streaming:

1) How can I analyze which parts of the code execute on the Spark driver and
which parts execute on the executors?

RDDs can be understood as a set of data transformations, or a set of jobs.
Your understanding deepens as you do more programming with Spark. Here are
some good resources; some are a bit outdated but still cover the fundamentals
well.

A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
<https://www.youtube.com/watch?v=dmL0N3qfSc8>

Advanced Apache Spark Training - Sameer Farooqui (Databricks)
<https://www.youtube.com/watch?v=7ooZ4S7Ay6Y>

Debugging streaming applications (Databricks docs)
<https://docs.databricks.com/spark/latest/rdd-streaming/debugging-streaming-applications.html>

Look at the Task Details page in the Spark UI for streaming jobs.
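
As a rough rule of thumb, the functions you pass to RDD or DStream operations
(map, filter, foreachPartition and so on) run on the executors, and the code
around them runs on the driver. One visible consequence is that executors work
on a serialized copy of whatever the function captures, so changes made there
do not come back to the driver. Below is a minimal plain-Python sketch of that
behaviour. It does not use Spark at all: `run_on_executors` and
`CountAndDouble` are hypothetical names, and `copy.deepcopy` only stands in
for the serialization Spark would do.

```python
import copy


class CountAndDouble:
    """Hypothetical task: doubles each record and counts what it has seen."""

    def __init__(self):
        self.seen = 0  # state captured on the "driver"

    def __call__(self, partition):
        self.seen += len(partition)  # updates the executor's copy only
        return [x * 2 for x in partition]


def run_on_executors(task, partitions):
    """Mimic shipping a task: every partition gets its own copy of the task."""
    return [copy.deepcopy(task)(partition) for partition in partitions]


# "Driver" side: build the task and describe the work.
task = CountAndDouble()
partitions = [[1, 2], [3]]

# "Executor" side: the copies do the work, one per partition.
results = run_on_executors(task, partitions)

print(results)    # the doubled records come back as results
print(task.seen)  # the driver's own counter was never touched
```

If the driver really needs such a count in Spark, an accumulator is the usual
tool rather than a captured variable.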

2) Since the Spark driver gets results from the executors, should the
executors have any access to the Redis storage? (My guess is yes, because
executors may need data from Redis for further calculations.)

Avoid the pattern of getting results back to the driver and then processing
them or sending them to some store. The driver quickly becomes a bottleneck
and you lose the benefits of parallel programming. Having said that, there is
nothing preventing you from doing it. For instance, if you want to do a
complex calculation whose output is a single value, you could do the
calculation on the cluster workers, collect it in the driver, and then send it
to some store. Still, if possible, avoid using the driver for this; the driver
is for scheduling parallel RDD jobs.

This means each of your worker nodes needs access to the Redis store. Also,
connections can't be serialized from the driver to the worker nodes. Please
look at the documentation for the correct design pattern for storing data to
external stores and creating connections on the workers.

Design Patterns for using foreachRDD
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams>
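
The idea behind that pattern can be sketched in plain Python so it runs
without a cluster: open one connection per partition, on the worker, instead
of creating it on the driver and trying to ship it. `FakeRedisConnection` is a
hypothetical stand-in, not a real Redis client, and the lock inside it only
plays the role of the socket a real client would hold. In PySpark, as far as I
know, the same function would be passed along the lines of
`dstream.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))`.

```python
import pickle
import threading


class FakeRedisConnection:
    """Hypothetical stand-in for a Redis client (not a real library class)."""

    opened = 0  # how many connections were created, for illustration only

    def __init__(self):
        # The lock stands in for a socket: like a socket, it cannot be pickled.
        self._lock = threading.Lock()
        self.store = {}
        FakeRedisConnection.opened += 1

    def set(self, key, value):
        with self._lock:
            self.store[key] = value

    def close(self):
        pass  # a real client would release its socket here


def connection_is_unpicklable():
    """Show why a driver-side connection cannot be shipped to workers."""
    try:
        pickle.dumps(FakeRedisConnection())
    except TypeError:
        return True
    return False


def save_partition(records):
    """Runs once per partition: one connection for all records in it."""
    conn = FakeRedisConnection()
    try:
        for key, value in records:
            conn.set(key, value)
        # Returned only so the sketch can be inspected; Spark's
        # foreachPartition ignores the return value.
        return dict(conn.store)
    finally:
        conn.close()


serialization_failed = connection_is_unpicklable()

# Two "partitions" of key/value records, as a worker would see them.
partitions = [[("a", 1), ("b", 2)], [("c", 3)]]
written = [save_partition(p) for p in partitions]
```

One connection per partition keeps the cost of connecting away from the
per-record path; a connection pool on each worker is a further refinement that
the guide also describes.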

3) Should the Spark executors have access to the S3 checkpoint storage?

Yes. I believe the storage should ideally be mapped to the cluster, to avoid
logging in each time.

4) Can anyone share their checkpoint recovery strategy for S3?

Not specific to S3, but in general, checkpointing is a time-consuming process
and can hurt performance if the checkpoint storage is slow. An alternative is
to use local storage, which would be faster than S3, but that means fault
tolerance is not 100% guaranteed.
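
On the recovery side, the usual shape is "recover from the checkpoint if one
exists, otherwise build everything fresh", which in Spark Streaming is what
`StreamingContext.getOrCreate(checkpointPath, setupFunc)` is for. Below is a
minimal plain-Python sketch of that shape only. It assumes a local temporary
directory instead of S3, and `get_or_create`, `create_state`,
`write_checkpoint` and `process_batch` are hypothetical names, not Spark APIs.

```python
import json
import os
import tempfile


def create_state():
    """Hypothetical setup function: called only when no checkpoint exists."""
    return {"batches_processed": 0, "running_total": 0}


def get_or_create(checkpoint_path, setup_func):
    """Load state from the checkpoint if present, otherwise build it fresh."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            return json.load(f)
    return setup_func()


def write_checkpoint(checkpoint_path, state):
    """Persist the current state; slow storage makes this step slow."""
    with open(checkpoint_path, "w") as f:
        json.dump(state, f)


def process_batch(state, batch):
    state["batches_processed"] += 1
    state["running_total"] += sum(batch)
    return state


# Assumption: a local temp directory stands in for the checkpoint store.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "state.json")

# First run: no checkpoint yet, so the state is created fresh.
state = get_or_create(checkpoint_path, create_state)
for batch in [[1, 2], [3]]:
    state = process_batch(state, batch)
    write_checkpoint(checkpoint_path, state)

# Simulated restart: the state is recovered from the checkpoint.
recovered = get_or_create(checkpoint_path, create_state)
recovered = process_batch(recovered, [4])
```

Checkpointing after every batch, as here, is the safest and slowest choice;
checkpointing less often trades some recovery work for throughput.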

thanks
Vijay
