Posted to user@spark.apache.org by satishjohn <sa...@gmail.com> on 2017/06/06 16:02:57 UTC

Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

The time taken to complete a Spark job on YARN is about 4x slower than in Spark
standalone mode. However, in standalone mode the jobs often fail with
executor-lost errors.

Hardware configuration


 3 machines (1 master and 2 workers), each with 32 GB RAM, 8 cores (16 threads) and a 1 TB HDD

Spark configuration:

spark.executor.memory                  7g
spark.cores.max                        96
spark.driver.memory                    5g
spark.driver.maxResultSize             2g
spark.sql.autoBroadcastJoinThreshold   -1   (without this key the job fails, or takes about 50x longer)
Number of executor instances: 4 per machine

With the above Spark configuration, the job for the business flow of 17
million records completes in 8 minutes.
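
For illustration, here is a rough sketch of how these values could be passed to
spark-submit for the standalone run (the master URL, application class and jar
are placeholders, not taken from this thread):

spark-submit --master spark://satish-NS1:7077 \
  --conf spark.executor.memory=7g \
  --conf spark.cores.max=96 \
  --conf spark.driver.memory=5g \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --class com.example.BusinessFlow \
  business-flow.jar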

Problem Area:


When the same flow is run in yarn-client mode with the configuration below, it
takes 33 to 42 minutes. Here is the yarn-site.xml configuration:

<configuration>
  <property><name>yarn.label.enabled</name><value>true</value></property>
  <property><name>yarn.log-aggregation.enable-local-cleanup</name><value>false</value></property>
  <property><name>yarn.resourcemanager.scheduler.client.thread-count</name><value>64</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>satish-NS1:8031</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>satish-NS1:8030</value></property>
  <property><name>yarn.dispatcher.exit-on-error</name><value>true</value></property>
  <property><name>yarn.nodemanager.container-manager.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/home/satish/yarn</value></property>
  <property><name>yarn.nodemanager.localizer.fetch.thread-count</name><value>20</value></property>
  <property><name>yarn.resourcemanager.address</name><value>satish-NS1:8032</value></property>
  <property><name>yarn.scheduler.increment-allocation-mb</name><value>512</value></property>
  <property><name>yarn.log.server.url</name><value>http://satish-NS1:19888/jobhistory/logs</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>28000</value></property>
  <property><name>yarn.nodemanager.labels</name><value>MASTER</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>48</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.localizer.client.thread-count</name><value>20</value></property>
  <property><name>yarn.app.mapreduce.am.labels</name><value>CORE</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>172800</value></property>
  <property><name>yarn.nodemanager.address</name><value>${yarn.nodemanager.hostname}:8041</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>satish-NS1</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/home/satish/satish/hadoop-yarn/apps</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.client.thread-count</name><value>64</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle,</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.resourcemanager.client.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.container-metrics.enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/home/satish/hadoop-yarn/containers</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>spark_shuffle,mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.nodemanager.aux-services.spark_shuffle.class</name><value>org.apache.spark.network.yarn.YarnShuffleService</value></property>
  <property><name>yarn.scheduler.minimum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.increment-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value></property>
  <property><name>yarn.scheduler.fair.preemption</name><value>true</value></property>
</configuration>
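
For comparison, on YARN the executor count and size have to be requested
explicitly; if they are left unset, Spark 1.6 typically falls back to very small
defaults (2 executors with 1 core each, unless dynamic allocation is enabled).
A sketch of an explicit yarn-client submission, with the executor split (8
executors of 4 cores / 7 GB) assumed from the 4-instances-per-machine standalone
setup above, and placeholder class/jar:

spark-submit --master yarn --deploy-mode client \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 7g \
  --driver-memory 5g \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --class com.example.BusinessFlow \
  business-flow.jar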

I am also using the DominantResourceCalculator with the capacity scheduler, and
I have tried the fair and default schedulers as well.
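
For reference, the dominant-resource setting mentioned here is the one normally
switched on in capacity-scheduler.xml (a sketch of the relevant property, not
taken from the posted configuration):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>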

To keep the test simple, I ran a sort on the same cluster in yarn-client mode
and in Spark standalone mode. I can share the data for your comparative
analysis as well.

136 seconds - Yarn-client mode
40 seconds  - Spark Standalone mode

To conclude: I am looking for the reason behind the yarn-client mode performance
issue, and for the best possible configuration to get comparable performance
out of YARN.

When I set spark.sql.autoBroadcastJoinThreshold to -1, jobs that otherwise take
a long time complete on time and fail far less often; I have a history of
problems when running jobs without this option.
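
For context, a sketch of how the setting is applied in Spark 1.6; setting it to
-1 disables automatic broadcast joins, so every join is planned as a shuffle
join instead:

// Equivalent entry in spark-defaults.conf:
//   spark.sql.autoBroadcastJoinThreshold   -1
// or programmatically on the SQLContext:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")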

Let me know how to get similar performance from yarn-client mode as from Spark
standalone.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-when-running-Spark-1-6-1-in-yarn-client-mode-with-Hadoop-2-6-0-tp28747.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

Posted by Satish John Bosco <sa...@gmail.com>.
I have tried the configuration calculator sheet provided by Cloudera as well,
but saw no improvement. However, let us set aside the 17-million-record
operation for now.

Let us consider instead the simple sort on YARN versus standalone, where the
difference is dramatic.

The operation is simple: a selected numeric column is sorted in ascending
order. The results are below, followed by a sketch of the operation.

> 136 seconds - Yarn-client mode
> 40 seconds  - Spark Standalone mode
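
For reference, a minimal sketch of the kind of sort that was timed (Spark 1.6
API; the input/output paths and column name are assumptions):

// Read the test data, sort one numeric column ascending, and force execution.
import org.apache.spark.sql.functions.asc

val df = sqlContext.read.parquet("/data/sort-test")     // input path assumed
val sorted = df.sort(asc("numeric_col"))                // ascending sort on the selected column
sorted.write.parquet("/data/sort-test-out")             // output path assumed; forces the sort to run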

Can you guide me to a simple yarn-site.xml that would be the bare minimum for
the hardware below, so that I can check whether I have missed or overlooked any
key configuration? Also, for Spark standalone mode, how should spark-env.sh and
spark-defaults.conf be set, i.e. how many worker instances to choose and with
how much memory and how many cores?

3 machines (1 master and 2 workers), each with 32 GB RAM, 8 cores (16 threads) and a 1 TB HDD
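
For what it is worth, a sketch of spark-env.sh entries that would describe the
standalone layout from the first mail (4 executor instances per machine, 7 GB
each; these values are assumptions carried over from that mail, not a
recommendation):

# spark-env.sh on each 32 GB / 16-thread worker (values assumed from the first mail)
SPARK_WORKER_INSTANCES=4   # 4 worker instances per machine
SPARK_WORKER_CORES=4       # cores offered by each worker instance
SPARK_WORKER_MEMORY=7g     # memory offered by each worker instance (matches spark.executor.memory)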

Finally, it is still mystifying why spark.sql.autoBroadcastJoinThreshold=-1
makes such a performance difference in Spark 1.6.1.





On Wed, Jun 7, 2017 at 11:16 AM, Jörn Franke <jo...@gmail.com> wrote:

> What does your Spark job do? Have you tried standard configurations and
> changing them gradually?
>
> Have you checked the logfiles/ui which tasks  take long?
>
> 17 Mio records does not sound much, but it depends what you do with it.
>
> I do not think that for such a small "cluster" it makes sense to have a
> special scheduling configuration.
>

Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

Posted by Jörn Franke <jo...@gmail.com>.
What does your Spark job do? Have you tried standard configurations and changing them gradually?

Have you checked in the log files / UI which tasks take long?

17 million records does not sound like much, but it depends on what you do with them.

I do not think that for such a small "cluster" it makes sense to have a special scheduling configuration.
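
As an aside on the log-file question: the aggregated YARN container logs for a
finished application can be pulled with (the application id below is a
placeholder):

yarn logs -applicationId application_1496764800000_0001

Per-stage and per-task timings are also visible in the Spark UI on the driver
host (port 4040 by default) while the job runs.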

