Posted to user@spark.apache.org by Junfeng Chen <da...@gmail.com> on 2018/04/09 05:07:02 UTC

spark application running in yarn client mode is slower than in local mode.

I have written a Spark Streaming application that reads Kafka data, converts
the JSON data to Parquet, and saves it to HDFS.
What puzzles me is that the processing time of the app in YARN mode is 20%
to 50% longer than in local mode. My cluster has three nodes with three
node managers, and all three hosts have the same hardware: 40 cores and
256 GB of memory.

Why? How to solve it?

Regard,
Junfeng Chen

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Junfeng Chen <da...@gmail.com>.
I read the JSON string values from Kafka, then transform them into a DataFrame:

Dataset<Row> df = spark.read().json(stringjavaRDD);


Then I add some new data to each row:

JavaRDD<Row> rowJavaRDD = df.javaRDD().map(...)
StructType type = df.schema().add()....
Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type);


...

Finally, I write the dataset to Parquet files:

newdf.write().mode(SaveMode.Append).partitionBy("stream","appname","year","month","day","hour").parquet(savePath);


How can I determine whether it is caused by shuffle or broadcast?


Regard,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke <jo...@gmail.com> wrote:

> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details what you do and some timings?

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Junfeng Chen <da...@gmail.com>.
But I still have one question. I see that the number of tasks in the stage is 3.
Where does this 3 come from? And how can I increase the parallelism?


Regard,
Junfeng Chen
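
A note on where the 3 likely comes from, offered as a sketch rather than a confirmed diagnosis: a Spark stage runs one task per partition, and with the direct Kafka stream each Kafka topic partition becomes one RDD partition, so a topic with three partitions (as reported elsewhere in this thread) would give three tasks. The small model below is plain Java with no Spark dependency; the class name, the method name, and the 3 executors x 4 cores figures are made up for illustration. It shows that extra cores stay idle until the data has more partitions, for example after calling repartition(n) on the RDD or Dataset, which costs a shuffle.

```java
public class ParallelismSketch {
    /**
     * Tasks that can actually run at the same time in one stage.
     * Limited by the number of partitions (one task each) and by the
     * number of task slots (executors x cores, assuming 1 core per task).
     */
    public static int concurrentTasks(int numPartitions, int numExecutors, int executorCores) {
        int slots = numExecutors * executorCores;
        return Math.min(numPartitions, slots);
    }

    public static void main(String[] args) {
        // 3 partitions (the Kafka topic in this thread), hypothetical 3 executors x 4 cores
        System.out.println(concurrentTasks(3, 3, 4));  // prints 3: nine slots stay idle
        // same cluster after a hypothetical repartition(12)
        System.out.println(concurrentTasks(12, 3, 4)); // prints 12: every slot is busy
    }
}
```

Whether repartitioning pays off depends on how the shuffle cost compares with the time saved, so it is worth measuring both ways.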

On Tue, Apr 10, 2018 at 11:31 AM, Junfeng Chen <da...@gmail.com> wrote:

> Yeah, I have increased the number of executors and executor cores, and it
> runs normally now. HDP Spark 2 has only 2 executors and 1 executor core by
> default.
>
>
> Regard,
> Junfeng Chen

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Junfeng Chen <da...@gmail.com>.
Yeah, I have increased the number of executors and executor cores, and it
runs normally now. HDP Spark 2 has only 2 executors and 1 executor core by
default.


Regard,
Junfeng Chen

On Tue, Apr 10, 2018 at 10:19 AM, Saisai Shao <sa...@gmail.com>
wrote:

>> In YARN mode, only two executors are assigned to process the tasks, and
>> since one executor can process only one task at a time, they need 6 min
>> in total.
>>
>
> This is not true. You should set --executor-cores/--num-executors to
> increase the task parallelism of the executors. To be fair, a Spark
> application should have the same resources (CPU/memory) when comparing
> local and YARN mode.

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Saisai Shao <sa...@gmail.com>.
>
> In YARN mode, only two executors are assigned to process the tasks, and
> since one executor can process only one task at a time, they need 6 min
> in total.
>

This is not true. You should set --executor-cores/--num-executors to
increase the task parallelism of the executors. To be fair, a Spark
application should have the same resources (CPU/memory) when comparing
local and YARN mode.
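
To make the point about task parallelism concrete, here is a rough model in plain Java (no Spark dependency). It assumes every task takes the same time and uses one core, so the number of tasks that can run at once is num-executors times executor-cores; the class and method names are invented for this sketch. With the numbers from this thread (3 tasks of 3 min each, 2 executors with 1 core each), the tasks need two waves, which gives the 6 min from the example quoted above.

```java
public class StageTimeEstimate {
    /** Number of tasks that can run at the same time across all executors. */
    public static int taskSlots(int numExecutors, int executorCores) {
        return numExecutors * executorCores;
    }

    /** Rough stage duration: tasks run in "waves" of at most `slots` tasks. */
    public static int stageMinutes(int numTasks, int slots, int minutesPerTask) {
        int waves = (numTasks + slots - 1) / slots; // ceiling division
        return waves * minutesPerTask;
    }

    public static void main(String[] args) {
        int tasks = 3;
        int minutesPerTask = 3;
        // 2 executors x 1 core: the HDP default reported in this thread
        System.out.println(stageMinutes(tasks, taskSlots(2, 1), minutesPerTask)); // prints 6
        // 2 executors x 2 cores (hypothetical setting): all tasks fit in one wave
        System.out.println(stageMinutes(tasks, taskSlots(2, 2), minutesPerTask)); // prints 3
    }
}
```

Raising either flag until the slot count reaches the number of tasks brings the estimate down to a single wave; beyond that, more slots do not help unless the stage has more tasks.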

2018-04-10 10:05 GMT+08:00 Junfeng Chen <da...@gmail.com>:

> I found the potential reason.
>
> In local mode, all tasks in one stage run concurrently, while tasks in
> YARN mode run in sequence.
>
> For example, in one stage, each task takes 3 min. In local mode, they
> run together and take 3 min in total. In YARN mode, only two executors
> are assigned to process the tasks, and since one executor can process only
> one task at a time, they need 6 min in total.
>
>
> Regard,
> Junfeng Chen

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Junfeng Chen <da...@gmail.com>.
I found the potential reason.

In local mode, all tasks in one stage run concurrently, while tasks in
YARN mode run in sequence.

For example, in one stage, each task takes 3 min. In local mode, they
run together and take 3 min in total. In YARN mode, only two executors
are assigned to process the tasks, and since one executor can process only
one task at a time, they need 6 min in total.


Regard,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke <jo...@gmail.com> wrote:

> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details what you do and some timings?

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Junfeng Chen <da...@gmail.com>.
Hi,

My Kafka topic has three partitions. By time cost I mean that each
streaming loop takes longer in YARN client mode. For example, YARN mode
takes 300 seconds to process some data, while local mode takes just 200
seconds to process a similar amount of data.


Regard,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:20 PM, Gopala Krishna Manchukonda <
gopala_krishna_manchukonda@apple.com> wrote:

> Hi Junfeng ,
>
> Is your kafka topic partitioned?
>
> Are you referring to the duration or the CPU time spent by the job as
> being 20% - 50% higher than running in local?
>
> Thanks & Regards
> Gopal

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Gopala Krishna Manchukonda <go...@apple.com>.
Hi Junfeng ,

Is your kafka topic partitioned? 

Are you referring to the duration or the CPU time spent by the job as being 20% - 50% higher than running in local? 

Thanks & Regards
Gopal 


> On 09-Apr-2018, at 11:42 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> Probably network / shuffling cost? Or broadcast variables? Can you provide more details what you do and some timings?


Re: spark application running in yarn client mode is slower than in local mode.

Posted by Junfeng Chen <da...@gmail.com>.
Hi Jorn,

I checked the log info of my application:
ResultStage 3 (Parquet writing) takes a very long time, nearly 300 s, while
the total processing time of this loop is 6 min.


Regard,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke <jo...@gmail.com> wrote:

> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details what you do and some timings?

Re: spark application running in yarn client mode is slower than in local mode.

Posted by Jörn Franke <jo...@gmail.com>.
Probably network / shuffling cost? Or broadcast variables? Can you provide more details what you do and some timings?

> On 9. Apr 2018, at 07:07, Junfeng Chen <da...@gmail.com> wrote:
> 
> I have written a Spark Streaming application that reads Kafka data, converts the JSON data to Parquet, and saves it to HDFS.
> What puzzles me is that the processing time of the app in YARN mode is 20% to 50% longer than in local mode. My cluster has three nodes with three node managers, and all three hosts have the same hardware: 40 cores and 256 GB of memory.
> 
> Why? How to solve it? 
> 
> Regard,
> Junfeng Chen
