Posted to user@spark.apache.org by htailor <he...@live.co.uk> on 2015/03/10 16:30:46 UTC

Pyspark not using all cores

Hi All,

I need some help with a problem in pyspark which is causing a major issue. 

Recently I've noticed that the behaviour of the pyspark.daemon processes on the
worker nodes for compute-intensive tasks has changed from using all the available
cores to using only a single core. On each worker node, 8 pyspark.daemon processes
exist, but they all seem to run on a single core while the remaining 7 cores sit idle.

Our hardware consists of 9 hosts (1x driver node and 8x worker nodes), each
with 8 cores and 64 GB RAM. We are using Spark 1.2.0-SNAPSHOT (Python) in
standalone mode via Cloudera 5.3.2.

To give a better understanding of the problem, I have made a quick script
based on one of the bundled examples which replicates the problem
(calculating_pies.py, included below).

When I run the script with "spark-submit calculating_pies.py", this is what
I typically see on all my worker nodes:

  1  [||||||||||||||||||||||||||100.0%]    5  [|||                     3.9%]
  2  [                             0.0%]    6  [|||                     2.6%]
  3  [||                           1.3%]    7  [|||||                   8.4%]
  4  [||                           1.3%]    8  [||||||                  8.7%]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
30672 spark     20   0  225M  112M  1156 R 13.0  0.2  0:03.10 python -m pyspark.daemon
30681 spark     20   0  225M  112M  1152 R 13.0  0.2  0:03.10 python -m pyspark.daemon
30687 spark     20   0  225M  112M  1152 R 13.0  0.2  0:03.10 python -m pyspark.daemon
30678 spark     20   0  225M  112M  1152 R 12.0  0.2  0:03.10 python -m pyspark.daemon
30693 spark     20   0  225M  112M  1152 R 12.0  0.2  0:03.08 python -m pyspark.daemon
30674 spark     20   0  225M  112M  1152 R 12.0  0.2  0:03.10 python -m pyspark.daemon
30688 spark     20   0  225M  112M  1152 R 12.0  0.2  0:03.08 python -m pyspark.daemon
30684 spark     20   0  225M  112M  1152 R 12.0  0.2  0:03.10 python -m pyspark.daemon

Through the Spark UI I do see 8 executor IDs with 8 active tasks on each. I
also see the same behaviour if I use the flag --total-executor-cores 64 with
spark-submit.
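
For reference, in standalone mode --total-executor-cores maps to the
spark.cores.max property, so an equivalent way to apply the cap from inside
the script would be something like the following sketch (the master host
below is a placeholder, as in the full script further down):

```python
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# spark.cores.max caps the total cores the application may claim across the
# cluster in standalone mode; it is the property --total-executor-cores sets.
sconf = (SparkConf()
         .set('spark.cores.max', '64')
         .set('spark.default.parallelism', '256')
         .set('spark.app.name', 'Calculating PI'))

sc = SparkContext("spark://<driver_host>:7077", conf=sconf)  # placeholder host
```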

Strangely, if I run the same script in local mode, everything seems to run
fine. This is what I see:

  1  [|||||||||||||||||||||||||100.0%]    5  [|||||||||||||||||||||||||100.0%]
  2  [|||||||||||||||||||||||||100.0%]    6  [|||||||||||||||||||||||||100.0%]
  3  [|||||||||||||||||||||||||100.0%]    7  [|||||||||||||||||||||||||100.0%]
  4  [|||||||||||||||||||||||||	99.3%]    8  [|||||||||||||||||||||||||	99.4%]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
22519 data      20   0  225M  106M  1368 R 99.0  0.2  0:10.97 python -m pyspark.daemon
22508 data      20   0  225M  106M  1368 R 99.0  0.2  0:10.92 python -m pyspark.daemon
22513 data      20   0  225M  106M  1368 R 99.0  0.2  0:11.02 python -m pyspark.daemon
22526 data      20   0  225M  106M  1368 R 99.0  0.2  0:10.84 python -m pyspark.daemon
22522 data      20   0  225M  106M  1368 R 98.0  0.2  0:10.95 python -m pyspark.daemon
22523 data      20   0  225M  106M  1368 R 97.0  0.2  0:10.92 python -m pyspark.daemon
22507 data      20   0  225M  106M  1368 R 97.0  0.2  0:10.83 python -m pyspark.daemon
22516 data      20   0  225M  106M  1368 R 93.0  0.2  0:10.88 python -m pyspark.daemon


============== calculating_pies.py ==================

#!/usr/bin/pyspark

import random

from pyspark.conf import SparkConf
from pyspark.context import SparkContext


def pi(num_samples=1000000):
    # Monte Carlo estimate of pi: fraction of random points in the unit
    # square that land inside the quarter circle, scaled by 4.
    count = 0.0
    for _ in xrange(num_samples):
        x, y = random.random(), random.random()
        if (x * x) + (y * y) < 1:
            count += 1
    return 4.0 * (count / num_samples)


if __name__ == "__main__":

    sconf = (SparkConf()
             .set('spark.default.parallelism', '256')
             .set('spark.app.name', 'Calculating PI'))

    # local
    # sc = SparkContext(conf=sconf)

    # standalone
    sc = SparkContext("spark://<driver_host>:7077", conf=sconf)

    # yarn
    # sc = SparkContext("yarn-client", conf=sconf)

    # 10000 elements in 1000 partitions; each task runs one CPU-bound estimate
    rdd_pies = sc.parallelize(range(10000), 1000)
    rdd_pies.map(lambda x: pi()).collect()
    sc.stop()

=====================================================
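
As a further sanity check that the pi() workload itself can spread across
cores when Spark is not involved, here is a small standalone diagnostic using
plain multiprocessing (just an illustrative sketch, not part of the Spark
job). Running it directly on a worker node should peg every local core:

```python
import random
from multiprocessing import Pool, cpu_count

def pi(num_samples=100000):
    # Same Monte Carlo estimate of pi as in the Spark script above.
    count = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y < 1:
            count += 1
    return 4.0 * count / num_samples

if __name__ == "__main__":
    # Launch one CPU-bound task per core. If these do not fan out across all
    # cores, the scheduling problem lies with the machine, not with Spark.
    pool = Pool(cpu_count())
    estimates = pool.map(pi, [100000] * cpu_count())
    pool.close()
    pool.join()
    print("mean estimate: %.4f" % (sum(estimates) / len(estimates)))
```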

Does anyone have any suggestions, or know of any config we should be looking
at that could solve this problem? Does anyone else see the same problem?

Any help is appreciated. Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-not-using-all-cores-tp21989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Pyspark not using all cores

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you paste your complete spark-submit command? Also did you try
specifying *--worker-cores*?

Thanks
Best Regards

On Tue, Mar 10, 2015 at 9:00 PM, htailor <he...@live.co.uk> wrote:

> Hi All,
>
> I need some help with a problem in pyspark which is causing a major issue.
> [...]