Posted to user@spark.apache.org by "Rao, Abhishek (Nokia - IN/Bangalore)" <ab...@nokia.com> on 2021/05/18 05:42:09 UTC

RE: Why is Spark 3.0.x faster than Spark 3.1.x

Hi Maziyar, Mich

Do we have any ticket to track this? Any idea if this is going to be fixed in 3.1.2?

Thanks and Regards,
Abhishek

From: Mich Talebzadeh <mi...@gmail.com>
Sent: Friday, April 9, 2021 2:11 PM
To: Maziyar Panahi <ma...@iscpif.fr>
Cc: User <us...@spark.apache.org>
Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x


Hi,

Regarding your point:

.... I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest major release ...

With the benefit of hindsight, version 3.1.1 was released only recently, so the definition of stable (from a practical point of view) does not apply to it yet. That is perhaps the reason why some vendors like Cloudera stay a few releases behind the latest version. In production, what matter most are predictability and stability. You are not doing anything wrong by rolling it back and awaiting further clarification and resolution of the error.

HTH




view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.




On Fri, 9 Apr 2021 at 08:58, Maziyar Panahi <ma...@iscpif.fr> wrote:
Thanks Mich, I will ask all of our users to use pyspark 3.0.x and will change all the notebooks/scripts to switch back from 3.1.1 to 3.0.2.

That being said, I won't be able to defend this request by telling Spark users that the previous major release was, and still is, more stable than the latest major release, which is now the default everywhere (pyspark, downloads, etc.).

I'll see if I can open a ticket for this as well.


On 8 Apr 2021, at 17:27, Mich Talebzadeh <mi...@gmail.com> wrote:

Well, the normal course of action (considering the law of diminishing returns) is that your mileage varies:

Spark 3.0.1 is pretty stable and good enough. Unless there is an overriding reason why you have to use 3.1.1, you can set it aside and try it when you have other use cases. For now I guess you can carry on with 3.0.1 as BAU.

HTH


view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.




On Thu, 8 Apr 2021 at 16:19, Maziyar Panahi <ma...@iscpif.fr> wrote:
I personally added the following to my SparkSession in 3.1.1 and the result was exactly the same as before (local master). 3.1.1 is still 4-5 times slower than 3.0.2, at least for that piece of code. I will do more investigation to see how it behaves with other workloads, especially anything without .transform or Spark ML related functions, but the small code I provided, on any dataset big enough to take a minute to finish, will show you the difference going from 3.0.2 to 3.1.1 by a factor of 4-5:


.config("spark.sql.adaptive.coalescePartitions.enabled", "false")
.config("spark.sql.adaptive.enabled", "false")

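(For context, a minimal PySpark sketch of a full builder with both settings turned off; the app name and the local master are illustrative, not taken from the thread:)

from pyspark.sql import SparkSession

# Minimal session with AQE and its partition coalescing disabled
spark = (
    SparkSession.builder
    .appName("aqe-off-test")          # illustrative name
    .master("local[*]")               # local master, as in the test above
    .config("spark.sql.adaptive.enabled", "false")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
    .getOrCreate()
)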


On 8 Apr 2021, at 16:47, Mich Talebzadeh <mi...@gmail.com> wrote:

spark 3.1.1

I enabled the parameter

spark_session.conf.set("spark.sql.adaptive.enabled", "true")

to see its effects

on YARN in client mode, i.e. spark-submit --master yarn --deploy-mode client

with 4 executors it crashed the cluster.

I then reduced the number of executors to 2 and this time it ran OK, but the performance was worse.

I assume it adds some overhead?
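(For reference, the same setting can also be passed straight to spark-submit; a rough sketch, where the script name and the executor count are only illustrative:)

spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --conf spark.sql.adaptive.enabled=true \
  my_job.py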



view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.




On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi <ma...@iscpif.fr> wrote:
Thanks Sean,

I have already tried adding that and the result is absolutely the same.

The reason that config cannot be the cause (at least not alone) is that my comparison is between Spark 3.0.2 and Spark 3.1.1. This config has been set to true since the beginning of 3.0.0 and hasn't changed:

- https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution
- https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution
- https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution
- https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution

So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1; unfortunately, the issue is somewhere else.


On 8 Apr 2021, at 15:54, Sean Owen <sr...@gmail.com> wrote:

Right, you already established a few times that the difference is the number of partitions. Russell answered with what is almost surely the correct answer, that it's AQE. In toy cases it isn't always a win.
Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up more realistic workloads in general.
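(A quick way to check whether AQE's partition coalescing is what changed the task/partition count, sketched in PySpark; df and the groupBy column are illustrative, not from the thread:)

# Is AQE on for this session?
print(spark.conf.get("spark.sql.adaptive.enabled"))

# Compare the post-shuffle partition count with spark.sql.shuffle.partitions
shuffled = df.groupBy("some_key").count()      # "some_key" is illustrative
print(shuffled.rdd.getNumPartitions())

# Turn AQE off for the current session only and re-run the comparison
spark.conf.set("spark.sql.adaptive.enabled", "false")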

On Thu, Apr 8, 2021 at 8:52 AM maziyar <ma...@iscpif.fr> wrote:
So this is what I have in my Spark UI for 3.0.2 and 3.1.1:

For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): [http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png] Finished in 10 seconds.

For pyspark==3.1.1 (same stage "showString at NativeMethodAccessorImpl.java:0"): [http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png] Finished the same stage in 39 seconds.

As you can see, everything is literally the same between 3.0.2 and 3.1.1 (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle Write), except that 3.0.2 runs all 12 tasks together while 3.1.1 finishes 10/12 and the other 2 are the processing of the actual task, which I shared previously:

3.1.1 [http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
3.0.2 [http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]

PS: I have just made the same test in Databricks with 1 worker:

8.1 (includes Apache Spark 3.1.1, Scala 2.12): [http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png]
7.6 (includes Apache Spark 3.0.1, Scala 2.12): [http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png]

There is still a difference of over 20 seconds, which, when the whole process finishes within a minute, is a big bump. Not sure what it is, but until further notice I will advise our users not to use Spark/PySpark 3.1.1 locally or in Databricks. (There are other optimizations where it may not be noticeable, but this is such simple code and it can become a bottleneck quickly in larger pipelines.)
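(For anyone who wants to reproduce the wall-clock comparison outside the Spark UI, a minimal timing sketch; df and the show() call merely stand in for the code shared earlier in the thread:)

import time

start = time.time()
df.show(20, truncate=False)  # triggers the "showString" stage measured above; df is illustrative
print(f"Elapsed: {time.time() - start:.1f} s")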
Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com<http://nabble.com/>.




Re: Why is Spark 3.0.x faster than Spark 3.1.x

Posted by Maziyar Panahi <ma...@iscpif.fr>.
Hi Rao,


Yes, I have created this ticket: https://issues.apache.org/jira/browse/SPARK-35066

It's not assigned to anybody, so I don't have an ETA on the fix or possible workarounds.

Best
Maziyar
