You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "ankurdave (via GitHub)" <gi...@apache.org> on 2023/10/20 17:51:18 UTC

[PR] [SPARK-45616][CORE] Avoid ParVector, which does not propagate ThreadLocals or SparkSession [spark]

ankurdave opened a new pull request, #43466:
URL: https://github.com/apache/spark/pull/43466

   ### What changes were proposed in this pull request?
   `CastSuiteBase` and `ExpressionInfoSuite` use `ParVector.foreach()` to run Spark SQL queries in parallel. They incorrectly assume that each parallel operation will inherit the main thread’s active SparkSession. This is only true when these parallel operations run in freshly-created threads. However, when other code has already run some parallel operations before Spark was started, then there may be existing threads that do not have an active SparkSession. In that case, these tests fail with NullPointerExceptions when creating SparkPlans or running SQL queries.
   
   The fix is to use the existing method `ThreadUtils.parmap()`. This method creates fresh threads that inherit the current active SparkSession, and it propagates the Spark ThreadLocals.
   
   This PR also adds a scalastyle warning against use of ParVector.
   
   
   ### Why are the changes needed?
   This change makes `CastSuiteBase` and `ExpressionInfoSuite` less brittle to future changes that may run parallel operations during test startup.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Reproduced the test failures by running a ParVector operation before Spark starts. Verified that this PR fixes the test failures in this condition.
   
   ```scala
     protected override def beforeAll(): Unit = {
       // Run a ParVector operation before initializing the SparkSession. This starts some Scala
       // execution context threads that have no active SparkSession. These threads will be reused for
       // later ParVector operations, reproducing SPARK-45616.
       new ParVector((0 until 100).toVector).foreach { _ => }
   
       super.beforeAll()
     }
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45616][CORE] Avoid ParVector, which does not propagate ThreadLocals or SparkSession [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43466:
URL: https://github.com/apache/spark/pull/43466#issuecomment-1774303618

   cc @MaxGekk FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45616][CORE] Avoid ParVector, which does not propagate ThreadLocals or SparkSession [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43466:
URL: https://github.com/apache/spark/pull/43466#issuecomment-1774348902

   The streaming test failure is unrelated, merging to master/3.5, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45616][CORE] Avoid ParVector, which does not propagate ThreadLocals or SparkSession [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #43466: [SPARK-45616][CORE] Avoid ParVector, which does not propagate ThreadLocals or SparkSession
URL: https://github.com/apache/spark/pull/43466


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45616][CORE] Avoid ParVector, which does not propagate ThreadLocals or SparkSession [spark]

Posted by "ankurdave (via GitHub)" <gi...@apache.org>.
ankurdave commented on PR #43466:
URL: https://github.com/apache/spark/pull/43466#issuecomment-1773155509

   cc @cloud-fan @gatorsmile


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org