Posted to user@spark.apache.org by prateek arora <pr...@gmail.com> on 2016/05/10 04:58:43 UTC

spark 1.6 : RDD Partitions not distributed evenly to executors

Hi

My Spark Streaming application receives data from one Kafka topic (one
partition), and the resulting RDD has 30 partitions (see the sketch
below).

However, the scheduler assigns all tasks to executors running on the
same host (the one where the Kafka topic partition lives), with the
NODE_LOCAL locality level.
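
For context, the job is set up roughly like this (a minimal sketch; the
ZooKeeper quorum, consumer group, topic name, and batch interval are
placeholders for my real values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaStreamingApp")
val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-based stream; the Map gives topic -> number of consumer threads.
// The topic itself has only one partition.
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))

// Spread the work: 1 Kafka partition -> 30 RDD partitions
val repartitioned = stream.repartition(30)

repartitioned.foreachRDD { rdd => /* process each batch */ }
ssc.start()
ssc.awaitTermination()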

Below are the logs:

16/05/06 11:21:38 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:21:38 DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1
16/05/06 11:21:38 DEBUG TaskSetManager: Valid locality levels for TaskSet
1.0: NODE_LOCAL, RACK_LOCAL, ANY
16/05/06 11:21:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID
1, ivcp-m04.novalocal, partition 0,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID
2, ivcp-m04.novalocal, partition 1,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID
3, ivcp-m04.novalocal, partition 2,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID
4, ivcp-m04.novalocal, partition 3,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID
5, ivcp-m04.novalocal, partition 4,NODE_LOCAL, 2248 bytes)



I started seeing this behavior after upgrading Spark from 1.5 to 1.6;
the same application distributed RDD partitions evenly across executors
in Spark 1.5.

As suggested on some Spark developer blogs, I tried setting
spark.shuffle.reduceLocality.enabled=false, and after that the RDD
partitions are distributed across executors on all hosts with the
PROCESS_LOCAL locality level (see the snippet below).
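
For reference, this is roughly how I set the flag (a sketch on the same
placeholder SparkConf as above; it can equivalently be passed with
--conf on spark-submit):

// spark.shuffle.reduceLocality.enabled is an internal flag and does not
// appear in the documented configuration list
val conf = new SparkConf()
  .setAppName("KafkaStreamingApp")
  .set("spark.shuffle.reduceLocality.enabled", "false")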

Below are the logs:

16/05/06 11:24:46 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:24:46 DEBUG TaskSetManager: Valid locality levels for TaskSet
1.0: NO_PREF, ANY

16/05/06 11:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID
1, ivcp-m02.novalocal, partition 0,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID
2, ivcp-m01.novalocal, partition 1,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID
3, ivcp-m06.novalocal, partition 2,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID
4, ivcp-m04.novalocal, partition 3,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID
5, ivcp-m04.novalocal, partition 4,PROCESS_LOCAL, 2248 bytes)
--------
--------
--------


Is the above configuration the correct solution to this problem? And why
is spark.shuffle.reduceLocality.enabled not mentioned in the Spark
configuration documentation?



Regards
Prateek



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-RDD-Partitions-not-distributed-evenly-to-executors-tp26911.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org