Posted to user@spark.apache.org by abhiguruvayya <sh...@gmail.com> on 2014/06/17 18:36:59 UTC

Executors not utilized properly.

I am creating around 10 executors with 12 cores and 7g memory each, but when I
launch a job not all of the executors are being used. For example, if my job has 9
tasks, only 3 executors are used, with 3 tasks each, and I believe this is
making my app slower than a MapReduce program for the same use case. Can
anyone throw some light on executor configuration, if any? How can I use all
the executors? I am running Spark on YARN and Hadoop 2.4.0.
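
For reference, here is a minimal Scala sketch of how such an executor layout
might be requested (the values 10/12/7g are just the numbers mentioned above;
spark.executor.instances, spark.executor.cores and spark.executor.memory are
the usual Spark-on-YARN settings, so treat this as an illustration rather than
the original job's configuration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed sizing: 10 executors, 12 cores and 7g of memory each.
    val conf = new SparkConf()
      .setAppName("ExecutorUtilizationExample")
      .set("spark.executor.instances", "10")  // number of executors (YARN)
      .set("spark.executor.cores", "12")      // cores per executor
      .set("spark.executor.memory", "7g")     // heap per executor
    val sc = new SparkContext(conf)

The equivalent spark-submit flags on YARN are --num-executors,
--executor-cores and --executor-memory.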



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executors not utilized properly.

Posted by abhiguruvayya <sh...@gmail.com>.
Can someone help me with this? Any help is appreciated.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7753.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executors not utilized properly.

Posted by abhiguruvayya <sh...@gmail.com>.
Perfect!! That makes so much sense to me now. Thanks a ton!!!!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7793.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executors not utilized properly.

Posted by Aaron Davidson <il...@gmail.com>.
repartition() is actually just an alias of coalesce(), but with the
shuffle flag set to true. This shuffle is probably what you're seeing as
taking longer, but it is required when you go from a smaller number of
partitions to a larger one.

When actually decreasing the number of partitions, coalesce(shuffle =
false) will be fully pipelined, but it is limited in how it can redistribute
data, as it can only combine whole partitions into larger partitions. For
example, if you have an RDD with 101 partitions and you do
rdd.coalesce(100, shuffle = false), then the resulting RDD will have 99 of
the original partitions, and 1 partition will just be 2 original partitions
combined. This can lead to increased data skew, but requires no shuffle to
create.

On the other hand, if you do rdd.coalesce(100, shuffle = true), then all of
the data will actually be reshuffled into 100 new evenly-sized partitions,
eliminating any data skew at the cost of actually moving all data around.
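
To make the difference concrete, here is a small Scala sketch (assuming an
existing SparkContext sc; the partition counts are made up for illustration):

    val rdd = sc.parallelize(1 to 10000, 101)    // 101 initial partitions

    // No shuffle: whole partitions are merged and the work stays pipelined,
    // but the result can be skewed (here, one partition is two originals).
    val merged = rdd.coalesce(100, shuffle = false)

    // With shuffle: data is redistributed into 100 evenly sized partitions.
    val reshuffled = rdd.coalesce(100, shuffle = true)

    // repartition(100) is shorthand for coalesce(100, shuffle = true).
    val same = rdd.repartition(100)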


On Tue, Jun 17, 2014 at 4:52 PM, abhiguruvayya <sh...@gmail.com>
wrote:

> I found the main reason to be that I was using coalesce instead of
> repartition. coalesce was shrinking the partitioning, so there were too few
> tasks to be executed by all of the executors. Can you help me understand
> when to use coalesce and when to use repartition? In my application coalesce
> is processed faster than repartition, which is unusual.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7787.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: Executors not utilized properly.

Posted by abhiguruvayya <sh...@gmail.com>.
I found the main reason to be that I was using coalesce instead of
repartition. coalesce was shrinking the partitioning, so there were too few
tasks to be executed by all of the executors. Can you help me understand when
to use coalesce and when to use repartition? In my application coalesce is
processed faster than repartition, which is unusual.
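
A quick way to see this (a sketch, assuming an existing SparkContext sc and an
input RDD named inputRdd) is to print the partition counts, since each
partition becomes one task in the stage that processes it:

    println(inputRdd.partitions.size)                    // e.g. one per HDFS split
    println(inputRdd.coalesce(3).partitions.size)        // at most 3 downstream tasks
    println(inputRdd.repartition(120).partitions.size)   // 120 tasks, at the cost of a shuffle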



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executors not utilized properly.

Posted by abhiguruvayya <sh...@gmail.com>.
My use case was to read 3000 files from 3000 different HDFS directories, so I
was reading each file, creating an RDD for it, and adding it to an array of
JavaRDDs, then doing a union(rdd...). Because of this my program was very slow
(5 minutes). After I replaced this logic with textFile(path1,path2,path3) it
works super fast (56 sec). So union() was the overhead.
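
A sketch of the two approaches (assuming an existing SparkContext sc and a
placeholder list of directory paths named dirs):

    // Slow variant: one RDD per directory, then a union over all of them.
    val perDir = dirs.map(dir => sc.textFile(dir))   // dirs: Seq[String]
    val unioned = sc.union(perDir)

    // Fast variant: textFile accepts a comma-separated list of paths,
    // so all directories can be read as a single RDD.
    val all = sc.textFile(dirs.mkString(","))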



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7785.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executors not utilized properly.

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
Hi Abhishek,

> Where MapReduce is taking 2 mins, Spark is taking 5 min to complete the
> job.

Interesting. Could you tell us more about your program? A "code skeleton"
would certainly be helpful.

Thanks!

-Jey


On Tue, Jun 17, 2014 at 3:21 PM, abhiguruvayya <sh...@gmail.com>
wrote:

> I did try creating more partitions by overriding the default number of
> partitions determined by the HDFS splits. The problem is that in this case
> the program runs forever. I have the same set of inputs for MapReduce and
> Spark. Where MapReduce takes 2 mins, Spark takes 5 min to complete the job. I
> thought my Spark program was running slower than MapReduce because not all of
> the executors were being utilized properly. I can provide you my code
> skeleton for your reference. Please help me with this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7759.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: Executors not utilized properly.

Posted by abhiguruvayya <sh...@gmail.com>.
I did try creating more partitions by overriding the default number of
partitions determined by the HDFS splits. The problem is that in this case the
program runs forever. I have the same set of inputs for MapReduce and Spark.
Where MapReduce takes 2 mins, Spark takes 5 min to complete the job. I
thought my Spark program was running slower than MapReduce because not all of
the executors were being utilized properly. I can provide you my code
skeleton for your reference. Please help me with this.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Executors not utilized properly.

Posted by Sean Owen <so...@cloudera.com>.
It sounds like your job has 9 tasks and all are executing simultaneously in
parallel. This is as good as it gets, right? Are you asking how to break the
work into more tasks, like 120 to match your 10*12 cores? Make your RDD
have more partitions. For example, the textFile method can override the
default number of partitions determined by HDFS splits.
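
For example (a sketch, assuming an existing SparkContext sc and a placeholder
HDFS path), the second argument to textFile is the minimum number of
partitions to create:

    val byDefault = sc.textFile("hdfs:///data/input")       // one partition per HDFS split
    val wider     = sc.textFile("hdfs:///data/input", 120)  // ask for at least 120 partitions
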
On Jun 17, 2014 5:37 PM, "abhiguruvayya" <sh...@gmail.com> wrote:

> I am creating around 10 executors with 12 cores and 7g memory each, but when I
> launch a job not all of the executors are being used. For example, if my job has 9
> tasks, only 3 executors are used, with 3 tasks each, and I believe this is
> making my app slower than a MapReduce program for the same use case. Can
> anyone throw some light on executor configuration, if any? How can I use all
> the executors? I am running Spark on YARN and Hadoop 2.4.0.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>