Posted to user@spark.apache.org by Renxia Wang <re...@gmail.com> on 2016/07/14 17:49:12 UTC

Spark Streaming Kinesis Performance Decrease When Cluster Scales Up with More Executors

Hi all,

I am running a Spark Streaming application with Kinesis on EMR 4.7.1. The
application runs on YARN in client mode. There are 17 worker nodes
(c3.8xlarge) with 100 executors and 100 receivers. This setup works fine.
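For context, a setup like this normally creates one Kinesis DStream per
receiver and unions them. A minimal sketch with placeholder app/stream/region
names (not my actual code):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val conf = new SparkConf().setAppName("Main")
val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches

val numReceivers = 100
// One DStream per receiver; Spark schedules each receiver on an executor.
val kinesisStreams = (1 to numReceivers).map { _ =>
  KinesisUtils.createStream(
    ssc, "my-kinesis-app", "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(60),
    StorageLevel.MEMORY_AND_DISK_2)
}
val unified = ssc.union(kinesisStreams) // single DStream for the rest of the job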

But when I increase the number of worker nodes to 50 and the number of
executors to 250, with 250 receivers, the processing time of batches
increases from ~50s to 2.3min, and the scheduler delay for tasks increases
from ~0.2s max to 20s max (while the 75th percentile is about 2-3s).

I also tried increasing only the number of executors while keeping the
number of receivers, but I still see processing time degrade from ~50s to
1.1min, and the scheduler delay for tasks increase from ~0.2s max to 4s max
(while the 75th percentile is about 1s).

The spark-submit command is as follows. The only parameter I changed here
is num-executors.

spark-submit
--deploy-mode client
--verbose
--master yarn
--jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar
--driver-memory 20g --driver-cores 20
--num-executors 250
--executor-cores 5
--executor-memory 8g
--conf spark.yarn.executor.memoryOverhead=1600
--conf spark.driver.maxResultSize=0
--conf spark.dynamicAllocation.enabled=false
--conf spark.rdd.compress=true
--conf spark.streaming.stopGracefullyOnShutdown=true
--conf spark.streaming.backpressure.enabled=true
--conf spark.speculation=true
--conf spark.task.maxFailures=15
--conf spark.ui.retainedJobs=100
--conf spark.ui.retainedStages=100
--conf spark.executor.logs.rolling.maxRetainedFiles=1
--conf spark.executor.logs.rolling.strategy=time
--conf spark.executor.logs.rolling.time.interval=hourly
--conf spark.scheduler.mode=FAIR
--conf spark.scheduler.allocation.file=/home/hadoop/fairscheduler.xml
--conf spark.metrics.conf=/home/hadoop/spark-metrics.properties
--class Main /home/hadoop/Main-1.0.jar

I found this issue, which seems relevant:
https://issues.apache.org/jira/browse/SPARK-14327

Any suggestions for troubleshooting this issue?

Thanks,

Renxia

Re: Spark Streaming Kinesis Performance Decrease When Cluster Scales Up with More Executors

Posted by Renxia Wang <re...@gmail.com>.
Hi Daniel,

I didn't re-shard. I have many more shards than receivers.

I finally got the cluster working by tuning locality and blockInterval,
reducing the number of output files, and disabling speculation.

Speculation in particular: I had to turn it on for my 17-node cluster,
using the default settings. But with it turned on on the 50-host cluster,
the scheduler delay went up to 20s. After I turned it off, the scheduler
delay dropped to 5-6s.
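
For reference, those changes map to settings along these lines
(illustrative values only, not the exact numbers I settled on):

--conf spark.locality.wait=500ms
--conf spark.streaming.blockInterval=1000ms
--conf spark.speculation=false

plus a coalesce() on the output RDD inside foreachRDD to cut the number of
files written per batch.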

The cluster is working now; however, I see some weird behavior in memory
usage: memory usage jumps up regularly. This happens on all hosts.

Renxia



2016-07-14 12:59 GMT-07:00 Daniel Santana <da...@everymundo.com>:

> Are you re-sharding your kinesis stream as well?
>
> I had a similar problem and increasing the number of kinesis stream shards
> solved it.

Re: Spark Streaming Kinesis Performance Decrease When Cluster Scales Up with More Executors

Posted by Daniel Santana <da...@everymundo.com>.
Are you re-sharding your Kinesis stream as well?

I had a similar problem and increasing the number of Kinesis stream shards
solved it.
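
For example, with the AWS CLI you can check the current shard count and
split a hot shard (stream name and shard id below are placeholders):

aws kinesis describe-stream --stream-name my-stream

aws kinesis split-shard \
  --stream-name my-stream \
  --shard-to-split shardId-000000000000 \
  --new-starting-hash-key 170141183460469231731687303715884105728

The new starting hash key just has to fall inside the hash-key range of the
shard being split; the value above is the midpoint of the full range.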

-- 
*Daniel Santana*
Senior Software Engineer



Re: Spark Streaming Kinesis Performance Decrease When Cluster Scales Up with More Executors

Posted by Renxia Wang <re...@gmail.com>.
Additional information: the batch duration in my app is 1 minute. In the
Spark UI, for each batch, the difference between Output Op Duration and Job
Duration is big, e.g. Output Op Duration is 1min while Job Duration is 19s.
