Posted to user@spark.apache.org by "abhilash.kr" <ab...@gmail.com> on 2021/05/13 14:13:53 UTC

Understanding what happens when a job is submitted to a cluster

Hello,

       What happens when a job is submitted to a cluster? I know the 10,000-foot
overview of the Spark architecture, but I need the minute details of how Spark
estimates the resources to ask YARN for, what YARN's response is, etc. I need a
*step by step* understanding of the complete process. I searched the net but
couldn't find any good material on this. Can anyone help me here?

Thanks,
Abhilash





Re: Understanding what happens when a job is submitted to a cluster

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

At the risk of stating the obvious, with Spark you are dealing with what is
known as a parallel architecture. A parallel architecture comes into play when
the data is too large to be handled on a single machine, which is where Spark
becomes meaningful. In cases where the (generated) data size is very large
(often the norm rather than the exception these days), the data cannot be
processed and stored in single-node Python tools like Pandas dataframes,
because those dataframes hold data in RAM. Likewise, you cannot collect the
whole dataset from storage like HDFS or cloud storage onto one machine, because
that would take significant time and space and would probably not fit in a
single machine's RAM. So the key with Spark is *distributed parallel
processing.*
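
For instance, a minimal PySpark sketch (the path and column name are made up):
the data stays partitioned across the executors, and only a small aggregate is
brought back to the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# The dataframe is split into partitions across the executors;
# nothing is loaded into the driver's memory at this point.
df = spark.read.parquet("hdfs:///data/events")   # hypothetical path

# The grouping runs in parallel on the executors; only the small
# per-day aggregate is collected back to the driver.
daily_counts = df.groupBy("event_date").count().collect()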

Therefore, I suggest you start here, assuming you are somewhat familiar with
the fundamentals of Spark.

Running Spark on YARN - Spark 3.1.1 Documentation (apache.org)
<https://spark.apache.org/docs/latest/running-on-yarn.html>
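
As a hedged sketch of the knobs involved (these are standard Spark-on-YARN
settings, but the values are placeholders and assume HADOOP_CONF_DIR points at
your cluster), the resources Spark asks YARN for are essentially whatever you
configure for the driver and executors:

from pyspark.sql import SparkSession

# YARN containers are sized from these settings (executor memory plus
# spark.executor.memoryOverhead per container), and spark.executor.instances
# controls how many containers are requested.
spark = (SparkSession.builder
         .appName("yarn-example")
         .master("yarn")
         .config("spark.executor.instances", "4")   # placeholder values
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .config("spark.driver.memory", "2g")
         .getOrCreate())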

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Understanding what happens when a job is submitted to a cluster

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
    1. How does spark know the data size is 5 million?
Depends on the source. Some sources (database, Parquet) tell you; some sources (CSV, JSON) have to be guesstimated (see the sketch below).
    2. Are there any books or documentation that takes one simple job and goes
    deeper in terms of understanding what happens under the hood?
Jacek Laskowski has a good web book
(https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/). Most people who understand what's going on under the hood have dived into the code.
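
As a hedged illustration of question 1 (the path and bucket name are made up),
you can see what Spark thinks the input size is by printing the optimized plan
with statistics; a sketch in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet carries schema and row counts in its file footers, so Parquet-like
# sources can report sizes directly; for CSV/JSON Spark largely falls back to
# the total file size on disk (and sampling for schema inference).
df = spark.read.parquet("s3://some-bucket/events")   # hypothetical path
df.explain(mode="cost")   # Spark 3.x: prints sizeInBytes estimates per plan node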




Re: Understanding what happens when a job is submitted to a cluster

Posted by "abhilash.kr" <ab...@gmail.com>.
Thank you. This was helpful. I have follow-up questions.

1. How does Spark know the data size is 5 million?
2. Are there any books or documentation that take one simple job and go
deeper in terms of understanding what happens under the hood?






Re: Understanding what happens when a job is submitted to a cluster

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
The specifics depend on what's going on underneath. At the 10,000 foot level, you probably know that Spark builds a logical execution plan as you call transformations, and converts it into a physical execution plan when you call an action. The execution plan has stages that are run sequentially. Stages are broken up into tasks that are run in parallel.
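
To make that concrete, here is a minimal PySpark sketch (made-up data): the
transformations only build the plan, and the job with its stages and tasks is
created when an action runs:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)                       # no job yet, just a plan
doubled = df.withColumn("x2", F.col("id") * 2)    # still a transformation, still lazy

doubled.explain()         # prints the physical plan Spark has built so far
print(doubled.count())    # an action: now a job with stages and tasks actually runs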

So, there are 2 questions to be answered: How does Spark determine stages? How does Spark break up stages into tasks?

How does Spark determine stages? At a 5,000 foot view, Spark will break up a job into stages at points where a shuffle is required. This usually happens when you do a join, an aggregation, or a repartition. The basic idea is that Spark is trying to minimize data movement: it looks at the logical plan and divides it into stages such that data movement is minimized. The specifics depend on what the job is doing.
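
As a sketch (made-up data), a groupBy forces a shuffle, which shows up as an
Exchange in the physical plan and as a stage boundary in the Spark UI:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(10_000_000).withColumn("key", F.col("id") % 100)

# The aggregation needs all rows for a key on the same executor, so Spark
# inserts an Exchange (shuffle) there and splits the work into two stages.
agg = df.groupBy("key").agg(F.sum("id").alias("total"))
agg.explain()     # look for "Exchange hashpartitioning(key, ...)" in the plan
agg.collect()     # running it shows the stage boundary in the Spark UI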

How does Spark break up a stage into tasks? At a 5,000 foot view, it is trying to break up the stage into bite-sized pieces that can be processed independently. The specifics depend on what is being done in the stage. For example, let's say a stage is reading a file with 5 million rows, transforming it and writing it out. Spark will check how much memory it needs for one row by looking at the dataframe schema. Let's say each row is 10K. Then it looks at how much memory it has per executor. Let's say 100M. From this it determines that one executor can process 10K rows at a time, so each task can have 10K rows. 5M / 10K = 500, so it will create 500 tasks. Each task will read 10K rows, transform them and write them to the output.
Again, this is an example. How tasks get divided depends a lot on what's happening in the stage. Spark also optimizes the code and tries to push down predicates, which complicates things for us users.
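
One hedged way to see how a stage was sliced up (the path is hypothetical; note
that for file sources the split is also driven by settings such as
spark.sql.files.maxPartitionBytes rather than memory alone):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("hdfs:///data/large.csv", header=True)   # hypothetical path

# Each partition of the dataframe becomes one task in the reading stage.
print(df.rdd.getNumPartitions())

# For file sources the per-task split size is capped by this setting
# (128 MB by default), so e.g. a 64 GB input would come out to roughly 500 tasks.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))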
