Posted to user@spark.apache.org by htailor <he...@live.co.uk> on 2015/03/10 16:30:46 UTC
Pyspark not using all cores
Hi All,
I need some help with a problem in pyspark which is causing a major issue.
Recently I've noticed that the behaviour of the pyspark.daemon processes on
the worker nodes has changed for compute-intensive tasks: instead of using
all the available cores, they now use only a single core. On each worker
node, 8 pyspark.daemon processes exist, but they all seem to run on the same
core while the remaining 7 cores sit idle.
Our hardware consists of 9 hosts (1x driver node and 8x worker nodes), each
with 8 cores and 64 GB RAM. We are using Spark 1.2.0-SNAPSHOT (Python) in
standalone mode via Cloudera CDH 5.3.2.
To give a better understanding of the problem, I have made a quick script,
adapted from one of the bundled examples, which replicates the problem.
When I run the calculate_pies.py script (listing below) with the command
"spark-submit calculate_pies.py", this is what I typically see in htop on
all my worker nodes:
1 [|||||||||||||||||||||||||||100.0%]   5 [|||                      3.9%]
2 [                              0.0%]   6 [|||                      2.6%]
3 [||                            1.3%]   7 [|||||                    8.4%]
4 [||                            1.3%]   8 [||||||                   8.7%]
  PID USER   PRI  NI  VIRT  RES  SHR S CPU% MEM%   TIME+  Command
30672 spark   20   0  225M 112M 1156 R 13.0  0.2 0:03.10  python -m pyspark.daemon
30681 spark   20   0  225M 112M 1152 R 13.0  0.2 0:03.10  python -m pyspark.daemon
30687 spark   20   0  225M 112M 1152 R 13.0  0.2 0:03.10  python -m pyspark.daemon
30678 spark   20   0  225M 112M 1152 R 12.0  0.2 0:03.10  python -m pyspark.daemon
30693 spark   20   0  225M 112M 1152 R 12.0  0.2 0:03.08  python -m pyspark.daemon
30674 spark   20   0  225M 112M 1152 R 12.0  0.2 0:03.10  python -m pyspark.daemon
30688 spark   20   0  225M 112M 1152 R 12.0  0.2 0:03.08  python -m pyspark.daemon
30684 spark   20   0  225M 112M 1152 R 12.0  0.2 0:03.10  python -m pyspark.daemon
Through the Spark UI I do see 8 executor IDs with 8 active tasks on each. I
also see the same behaviour if I pass the flag --total-executor-cores 64 to
spark-submit.
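
The pattern of 8 daemons together using roughly one core's worth of CPU
made me wonder whether the worker processes have somehow been pinned to a
single core. To check that (a guess on my part, not confirmed), something
like the sketch below can be run on a worker node. It assumes Python 2.7
and that pgrep and taskset are installed; a single-core mask like "1"
instead of "ff" would point to pinning.

============== check_affinity.py (hypothetical sketch) ==============

#!/usr/bin/python

# Print the CPU affinity mask of every pyspark.daemon process on this
# node. With 8 cores a healthy mask is "ff"; a mask of "1" means the
# daemons are all confined to CPU 0.
import subprocess

pids = subprocess.check_output(["pgrep", "-f", "pyspark.daemon"]).split()
for pid in pids:
    # "taskset -p <pid>" prints e.g. "pid 30672's current affinity mask: ff"
    print subprocess.check_output(["taskset", "-p", pid]),

=====================================================================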
Strangely, if I run the same script in local mode, everything seems to run
fine. This is what I see:
1 [||||||||||||||||||||||||||100.0%]   5 [||||||||||||||||||||||||100.0%]
2 [||||||||||||||||||||||||||100.0%]   6 [||||||||||||||||||||||||100.0%]
3 [||||||||||||||||||||||||||100.0%]   7 [||||||||||||||||||||||||100.0%]
4 [|||||||||||||||||||||||||| 99.3%]   8 [|||||||||||||||||||||||| 99.4%]
  PID USER   PRI  NI  VIRT  RES  SHR S CPU% MEM%   TIME+  Command
22519 data    20   0  225M 106M 1368 R 99.0  0.2 0:10.97  python -m pyspark.daemon
22508 data    20   0  225M 106M 1368 R 99.0  0.2 0:10.92  python -m pyspark.daemon
22513 data    20   0  225M 106M 1368 R 99.0  0.2 0:11.02  python -m pyspark.daemon
22526 data    20   0  225M 106M 1368 R 99.0  0.2 0:10.84  python -m pyspark.daemon
22522 data    20   0  225M 106M 1368 R 98.0  0.2 0:10.95  python -m pyspark.daemon
22523 data    20   0  225M 106M 1368 R 97.0  0.2 0:10.92  python -m pyspark.daemon
22507 data    20   0  225M 106M 1368 R 97.0  0.2 0:10.83  python -m pyspark.daemon
22516 data    20   0  225M 106M 1368 R 93.0  0.2 0:10.88  python -m pyspark.daemon
============== calculate_pies.py ==================

#!/usr/bin/pyspark

import random
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.storagelevel import StorageLevel

def pi(NUM_SAMPLE=1000000):
    # Monte Carlo estimate of pi: the fraction of random points in the
    # unit square that fall inside the unit quarter-circle is pi/4.
    count = 0.
    for i in xrange(NUM_SAMPLE):
        x, y = random.random(), random.random()
        if (x * x) + (y * y) < 1:
            count += 1
    return 4.0 * (count / NUM_SAMPLE)

if __name__ == "__main__":
    sconf = (SparkConf()
             .set('spark.default.parallelism', '256')
             .set('spark.app.name', 'Calculating PI'))

    # local
    # sc = SparkContext(conf=sconf)

    # standalone
    sc = SparkContext("spark://<driver_host>:7077", conf=sconf)

    # yarn
    # sc = SparkContext("yarn-client", conf=sconf)

    # 10000 CPU-bound tasks spread over 1000 partitions
    rdd_pies = sc.parallelize(range(10000), 1000)
    rdd_pies.map(lambda x: pi()).collect()
    sc.stop()

=====================================================
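
If pinned affinity does turn out to be the cause (a known quirk elsewhere:
some native libraries, e.g. certain OpenBLAS builds, call sched_setaffinity
when imported, and child processes inherit the mask), one hypothetical
workaround would be to reset the mask at the start of each task. An
untested sketch, assuming Linux, taskset, and our 8 cores (mask 0xff):

import os

def reset_affinity():
    # 0xff = CPUs 0-7; adjust the mask to match your core count.
    os.system("taskset -p 0xff %d" % os.getpid())

# e.g. call reset_affinity() at the top of pi(), before the sampling loop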
Does anyone have any suggestions, or know of any config we should be
looking at, that could solve this problem? Does anyone else see the same
behaviour?

Any help is appreciated. Thanks.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-not-using-all-cores-tp21989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Pyspark not using all cores
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you paste your complete spark-submit command? Also, did you try
specifying *--worker-cores*?
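
That is, the full command line, something along the lines of (placeholders,
not your actual values):

spark-submit --master spark://<driver_host>:7077 --total-executor-cores 64 calculate_pies.py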
Thanks
Best Regards
On Tue, Mar 10, 2015 at 9:00 PM, htailor <he...@live.co.uk> wrote:
> Hi All,
>
> I need some help with a problem in pyspark which is causing a major issue.
> [...]