Posted to issues@spark.apache.org by "Maciej Bryński (JIRA)" <ji...@apache.org> on 2016/07/18 11:47:20 UTC
[jira] [Comment Edited] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382113#comment-15382113 ]
Maciej Bryński edited comment on SPARK-16321 at 7/18/16 11:46 AM:
------------------------------------------------------------------
OK. I ran tests with VisualVM and the Python profiler enabled.
I set the following options:
{code}
spark.driver.memory 8g
spark.memory.fraction 0.6
spark.executor.memory 30g
spark.executor.cores 15
spark.executor.extraJavaOptions -XX:NewRatio=4 -XX:+UseParallelOldGC -XX:-UseCompressedOops
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=4048 -Dcom.sun.management.jmxremote.rmi.port=4047
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
{code}
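(For completeness: the Python-side profiles further down come from PySpark's BasicProfiler, which, if I remember correctly, is switched on by one more setting; shown here as a config fragment in the same format as above.)
{code}
spark.python.profile true
{code}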
I think those GC options eliminated the GC overhead problem.
*Total run time:*
Spark 2.0: 3.3 min
Spark 1.6: 2.3 min
I attached VisualVM screenshots from one of the executors (for both the 1.6 and the 2.0 version).
As you can see, there are two main problems:
1) PythonRunner time
2) Parquet reader time
The PythonRunner time suggests a problem on the Python side, so I also pasted the profiles from Python's BasicProfiler.
And this is where I don't understand the results:
From the JVM point of view, Spark 2.0 is 50% slower (in the org.apache.spark.api.python.PythonRunner$$anon$1.read() method).
From the Python point of view, Spark 2.0 is slightly faster.
So where could the problem be?
Or maybe I'm wrong, and this method only measures the time spent reading data from Python, not Python code execution?
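For anyone reading the tables below: they are standard Python cProfile output, sorted the same way shown in the header (internal time, then cumulative time). tottime is time spent inside the function itself; cumtime also includes all callees, which is why serializers.py dump_stream shows a small tottime but a cumtime close to the whole run. A minimal, Spark-free sketch (toy functions, not from this workload) that produces the same kind of table:

```python
import cProfile
import io
import pstats

# Toy workload: "inner" plays the role of a hot leaf function,
# "outer" the role of a caller such as dump_stream.
def inner():
    return sum(i * i for i in range(1000))

def outer():
    return [inner() for _ in range(200)]

prof = cProfile.Profile()
prof.enable()
outer()
prof.disable()

buf = io.StringIO()
# Same sort order as the tables below: internal time, then cumulative time.
stats = pstats.Stats(prof, stream=buf).sort_stats("tottime", "cumtime")
stats.print_stats(10)
report = buf.getvalue()
print(report)
```

Both functions appear in the report, with outer's cumtime covering inner's work even though its own tottime is tiny.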
Spark 1.6 Profiler
{code}
============================================================
Profile of RDD<id=11>
============================================================
3860611667 function calls (3529336059 primitive calls) in 4360.151 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
29503 480.019 0.016 3968.411 0.135 {built-in method loads}
1588051 420.956 0.000 420.956 0.000 decoder.py:349(raw_decode)
14602755 362.611 0.000 912.538 0.000 types.py:565(<listcomp>)
59406 319.737 0.005 319.737 0.005 {method 'read' of '_io.BufferedReader' objects}
143753067/43088753 310.714 0.000 1263.782 0.000 types.py:424(fromJson)
689592988 309.769 0.000 549.927 0.000 types.py:437(fromInternal)
143753067 303.711 0.000 366.807 0.000 types.py:394(__init__)
71409609 268.747 0.000 480.577 0.000 types.py:1162(_create_row)
146639250/1588051 181.296 0.000 1376.390 0.001 types.py:736(_parse_datatype_json_value)
723603860 179.262 0.000 179.262 0.000 {built-in method isinstance}
149238722/71409609 146.782 0.000 1547.714 0.000 types.py:558(fromInternal)
71409609 99.052 0.000 99.052 0.000 {built-in method __new__ of type object at 0x9d1c40}
5453542/1588051 95.142 0.000 1288.681 0.001 types.py:527(<listcomp>)
139887576 67.573 0.000 67.573 0.000 {method 'keys' of 'dict' objects}
615371111 64.788 0.000 64.788 0.000 types.py:87(fromInternal)
71409609 64.314 0.000 64.314 0.000 types.py:1280(__setattr__)
400 63.731 0.159 4360.136 10.900 serializers.py:259(dump_stream)
71409609 53.644 0.000 1601.358 0.000 types.py:1159(<lambda>)
120860802 51.438 0.000 115.739 0.000 types.py:466(<genexpr>)
149206609 50.887 0.000 64.772 0.000 types.py:464(<genexpr>)
115407260 50.515 0.000 64.300 0.000 types.py:431(needConversion)
71409609 48.464 0.000 147.516 0.000 types.py:1194(__new__)
1588051 39.248 0.000 1859.901 0.001 types.py:684(_parse_datatype_json_string)
8915977 29.611 0.000 51.700 0.000 types.py:322(<listcomp>)
71409609 27.133 0.000 27.133 0.000 types.py:1158(_create_row_inbound_converter)
5453542/1588051 26.184 0.000 1371.215 0.001 types.py:525(fromJson)
5453542 23.354 0.000 264.314 0.000 types.py:446(__init__)
5453542 22.546 0.000 22.546 0.000 types.py:463(<listcomp>)
2354475 22.168 0.000 22.168 0.000 {built-in method fromtimestamp}
5453542 21.537 0.000 86.309 0.000 {built-in method all}
19161994 19.031 0.000 88.270 0.000 types.py:319(fromInternal)
5453542 16.367 0.000 131.358 0.000 {built-in method any}
19815964 15.244 0.000 29.920 0.000 types.py:174(fromInternal)
19250503 14.890 0.000 17.730 0.000 types.py:311(needConversion)
12500825 14.676 0.000 14.676 0.000 {built-in method fromordinal}
114485251 13.383 0.000 13.383 0.000 types.py:73(needConversion)
2538382 9.649 0.000 40.856 0.000 types.py:194(fromInternal)
2354475 9.039 0.000 9.039 0.000 {method 'replace' of 'datetime.datetime' objects}
4352388 8.644 0.000 8.644 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
1588051 7.681 0.000 436.389 0.000 decoder.py:338(decode)
1588051 5.644 0.000 444.263 0.000 __init__.py:271(loads)
19293583 2.852 0.000 2.852 0.000 types.py:529(needConversion)
1298132 2.834 0.000 413.293 0.000 types.py:306(fromJson)
873033 2.422 0.000 6.821 0.000 <ipython-input-12-1ecd485d1a19>:2(<lambda>)
2461092 2.322 0.000 2.322 0.000 {method 'startswith' of 'str' objects}
873041 1.813 0.000 4.399 0.000 types.py:1267(__getattr__)
1298132 1.751 0.000 2.307 0.000 types.py:283(__init__)
873041 1.671 0.000 1.821 0.000 types.py:1254(__getitem__)
588143 1.171 0.000 1.171 0.000 types.py:216(__init__)
3176102 1.014 0.000 1.014 0.000 {method 'end' of '_sre.SRE_Match' objects}
1176286 0.700 0.000 0.700 0.000 {method 'group' of '_sre.SRE_Match' objects}
29903 0.457 0.000 4289.348 0.143 serializers.py:155(_read_with_length)
1617562 0.447 0.000 0.447 0.000 {built-in method len}
29903 0.371 0.000 300.548 0.010 serializers.py:542(read_int)
873041 0.340 0.000 0.340 0.000 {method 'index' of 'list' objects}
29903 0.236 0.000 4289.584 0.143 serializers.py:136(load_stream)
29503 0.227 0.000 3968.639 0.135 serializers.py:418(loads)
29903 0.130 0.000 0.130 0.000 {built-in method unpack}
407054 0.104 0.000 0.104 0.000 types.py:185(needConversion)
383366 0.093 0.000 0.093 0.000 types.py:167(needConversion)
400 0.006 0.000 4360.150 10.900 worker.py:104(process)
400 0.005 0.000 0.007 0.000 serializers.py:217(load_stream)
400 0.002 0.000 0.002 0.000 rdd.py:303(func)
400 0.001 0.000 0.001 0.000 serializers.py:220(_load_stream_without_unbatching)
800 0.000 0.000 0.000 0.000 {built-in method from_iterable}
4 0.000 0.000 0.000 0.000 {built-in method dumps}
400 0.000 0.000 0.000 0.000 {built-in method iter}
400 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
4 0.000 0.000 0.000 0.000 serializers.py:414(dumps)
4 0.000 0.000 0.000 0.000 serializers.py:549(write_int)
8 0.000 0.000 0.000 0.000 {method 'write' of '_io.BufferedWriter' objects}
4 0.000 0.000 0.000 0.000 {built-in method pack}
{code}
Spark 2.0 Profiler
{code}
============================================================
Profile of RDD<id=9>
============================================================
4015274806 function calls (3683999000 primitive calls) in 3824.303 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
29503 447.179 0.015 3581.490 0.121 {built-in method loads}
1588069 343.160 0.000 343.160 0.000 decoder.py:349(raw_decode)
14602755 331.047 0.000 837.768 0.000 types.py:600(<listcomp>)
689592988 285.243 0.000 506.721 0.000 types.py:438(fromInternal)
143753265 276.530 0.000 348.146 0.000 types.py:394(__init__)
143753265/43088951 261.310 0.000 1137.452 0.000 types.py:425(fromJson)
71409609 238.990 0.000 430.747 0.000 types.py:1354(_create_row)
867357953 181.193 0.000 181.193 0.000 {built-in method isinstance}
59406 176.652 0.003 176.652 0.003 {method 'read' of '_io.BufferedReader' objects}
146639466/1588069 168.921 0.000 1243.338 0.001 types.py:891(_parse_datatype_json_value)
149238722/71409609 136.504 0.000 1413.990 0.000 types.py:593(fromInternal)
71409609 84.433 0.000 84.433 0.000 {built-in method __new__ of type object at 0x9d1c40}
5453560/1588069 81.277 0.000 1160.041 0.001 types.py:562(<listcomp>)
615371111 61.250 0.000 61.250 0.000 types.py:87(fromInternal)
71409609 61.204 0.000 61.204 0.000 types.py:1483(__setattr__)
400 58.462 0.146 3824.286 9.561 serializers.py:259(dump_stream)
139887774 51.188 0.000 51.188 0.000 {method 'keys' of 'dict' objects}
71409609 51.071 0.000 1465.061 0.000 types.py:1351(<lambda>)
120861018 47.346 0.000 106.776 0.000 types.py:476(<genexpr>)
149206825 46.725 0.000 59.287 0.000 types.py:474(<genexpr>)
115407458 46.443 0.000 59.430 0.000 types.py:432(needConversion)
71409609 46.120 0.000 130.553 0.000 types.py:1400(__new__)
1588069 35.083 0.000 1643.525 0.001 types.py:842(_parse_datatype_json_string)
8915977 26.167 0.000 46.662 0.000 types.py:322(<listcomp>)
71409609 25.725 0.000 25.725 0.000 types.py:1350(_create_row_inbound_converter)
5453560 25.075 0.000 251.842 0.000 types.py:456(__init__)
2354475 21.215 0.000 21.215 0.000 {built-in method fromtimestamp}
5453560/1588069 20.432 0.000 1238.601 0.001 types.py:560(fromJson)
5453560 20.328 0.000 20.328 0.000 types.py:473(<listcomp>)
5453560 19.526 0.000 78.813 0.000 {built-in method all}
19161994 16.877 0.000 78.520 0.000 types.py:319(fromInternal)
5453560 15.870 0.000 121.936 0.000 {built-in method any}
19815964 13.901 0.000 27.726 0.000 types.py:174(fromInternal)
12500825 13.825 0.000 13.825 0.000 {built-in method fromordinal}
114485449 12.640 0.000 12.640 0.000 types.py:73(needConversion)
19250503 12.552 0.000 15.145 0.000 types.py:311(needConversion)
2538382 8.971 0.000 38.776 0.000 types.py:194(fromInternal)
2354475 8.590 0.000 8.590 0.000 {method 'replace' of 'datetime.datetime' objects}
4352424 8.459 0.000 8.459 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
1588069 7.168 0.000 357.679 0.000 decoder.py:338(decode)
1588069 5.234 0.000 365.104 0.000 __init__.py:271(loads)
5453560 3.650 0.000 4.979 0.000 types.py:524(__iter__)
19293583 2.599 0.000 2.599 0.000 types.py:564(needConversion)
1298132 2.476 0.000 373.483 0.000 types.py:306(fromJson)
873033 2.318 0.000 6.430 0.000 <ipython-input-8-1ecd485d1a19>:2(<lambda>)
2461110 2.294 0.000 2.294 0.000 {method 'startswith' of 'str' objects}
873041 1.695 0.000 4.112 0.000 types.py:1470(__getattr__)
1298132 1.655 0.000 2.191 0.000 types.py:283(__init__)
873041 1.534 0.000 1.678 0.000 types.py:1457(__getitem__)
5453960 1.329 0.000 1.329 0.000 {built-in method iter}
588143 1.073 0.000 1.073 0.000 types.py:216(__init__)
3176138 0.943 0.000 0.943 0.000 {method 'end' of '_sre.SRE_Match' objects}
1176286 0.673 0.000 0.673 0.000 {method 'group' of '_sre.SRE_Match' objects}
1617580 0.420 0.000 0.420 0.000 {built-in method len}
29903 0.398 0.000 3759.130 0.126 serializers.py:155(_read_with_length)
873041 0.327 0.000 0.327 0.000 {method 'index' of 'list' objects}
29903 0.274 0.000 157.542 0.005 serializers.py:542(read_int)
29903 0.263 0.000 3759.393 0.126 serializers.py:136(load_stream)
29503 0.198 0.000 3581.688 0.121 serializers.py:418(loads)
29903 0.104 0.000 0.104 0.000 {built-in method unpack}
407054 0.097 0.000 0.097 0.000 types.py:185(needConversion)
383366 0.080 0.000 0.080 0.000 types.py:167(needConversion)
400 0.007 0.000 3824.302 9.561 worker.py:165(process)
400 0.006 0.000 0.008 0.000 serializers.py:217(load_stream)
400 0.002 0.000 0.002 0.000 rdd.py:303(func)
400 0.001 0.000 0.001 0.000 serializers.py:220(_load_stream_without_unbatching)
800 0.001 0.000 0.001 0.000 {built-in method from_iterable}
400 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
4 0.000 0.000 0.000 0.000 {built-in method dumps}
4 0.000 0.000 0.000 0.000 serializers.py:414(dumps)
4 0.000 0.000 0.000 0.000 serializers.py:549(write_int)
8 0.000 0.000 0.000 0.000 {method 'write' of '_io.BufferedWriter' objects}
4 0.000 0.000 0.000 0.000 {built-in method pack}
{code}
> Pyspark 2.0 performance drop vs pyspark 1.6
> -------------------------------------------
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Maciej Bryński
> Attachments: visualvm_spark16.png, visualvm_spark2.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)