You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2024/01/11 04:55:12 UTC

[PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager isolated and self clone-able [spark]

HyukjinKwon opened a new pull request, #44681:
URL: https://github.com/apache/spark/pull/44681

   ### What changes were proposed in this pull request?
   
   This PR proposes to make DataSourceManager isolated and self clone-able without actual lookup.
   
   ### Why are the changes needed?
   
   For better maintenance of the code. Now, we triggers Python execution that actually initializes `SparkSession` via `SQLConf`. There are too many side effects.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unittest was added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #44681:
URL: https://github.com/apache/spark/pull/44681#discussion_r1448366750


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##########
@@ -34,23 +32,29 @@ import org.apache.spark.util.Utils
  * A manager for user-defined data sources. It is used to register and lookup data sources by
  * their short names or fully qualified names.
  */
-class DataSourceManager extends Logging {
+class DataSourceManager(
+    initDataSourceBuilders: => Option[
+      Map[String, UserDefinedPythonDataSource]] = None
+   ) extends Logging {
+  import DataSourceManager._
   // Lazy to avoid being invoked during Session initialization.
   // Otherwise, it goes infinite loop, session -> Python runner -> SQLConf -> session.
-  private lazy val dataSourceBuilders = {
-    val builders = new ConcurrentHashMap[String, UserDefinedPythonDataSource]()
-    builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
-    builders
+  private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
+    initialDataSourceBuilders

Review Comment:
   I think the DataSourceManager is session-level and should not be the one to initialize static data sources. When we initialize the DataSourceManager for each spark session, we can pass in the static ones.
   
   So it might make more sense to have an API in SparkContext for static data sources?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager isolated and self clone-able [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #44681:
URL: https://github.com/apache/spark/pull/44681#issuecomment-1886242890

   Build: https://github.com/HyukjinKwon/spark/actions/runs/7484413766


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #44681:
URL: https://github.com/apache/spark/pull/44681#discussion_r1448305887


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##########
@@ -63,7 +67,9 @@ class DataSourceManager extends Logging {
    */
   def lookupDataSource(name: String): UserDefinedPythonDataSource = {
     if (dataSourceExists(name)) {
-      dataSourceBuilders.get(normalize(name))
+      val normalizedName = normalize(name)
+      staticDataSourceBuilders.getOrElse(
+        normalizedName, dataSourceBuilders.get(normalize(name)))

Review Comment:
   ```suggestion
           normalizedName, dataSourceBuilders.get(normalizedName))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #44681:
URL: https://github.com/apache/spark/pull/44681#issuecomment-1888328014

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager isolated and self clone-able [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #44681:
URL: https://github.com/apache/spark/pull/44681#issuecomment-1886242979

   cc @cloud-fan and @allisonwang-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #44681:
URL: https://github.com/apache/spark/pull/44681#discussion_r1448379257


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##########
@@ -34,23 +32,29 @@ import org.apache.spark.util.Utils
  * A manager for user-defined data sources. It is used to register and lookup data sources by
  * their short names or fully qualified names.
  */
-class DataSourceManager extends Logging {
+class DataSourceManager(
+    initDataSourceBuilders: => Option[
+      Map[String, UserDefinedPythonDataSource]] = None
+   ) extends Logging {
+  import DataSourceManager._
   // Lazy to avoid being invoked during Session initialization.
   // Otherwise, it goes infinite loop, session -> Python runner -> SQLConf -> session.
-  private lazy val dataSourceBuilders = {
-    val builders = new ConcurrentHashMap[String, UserDefinedPythonDataSource]()
-    builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
-    builders
+  private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
+    initialDataSourceBuilders

Review Comment:
   Yeah I agree .. but the problem is that `UserDefinedPythonDataSourceLookupRunner.runInPython` requires `SQLConf.get` that requires `SparkSession` initialization.
   
   So, this initialization of static datasources must happen at least when a session is created. So, I here put the static initialization logic into the first call of `DataSourceManager` in any session for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #44681:
URL: https://github.com/apache/spark/pull/44681#discussion_r1448305887


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##########
@@ -63,7 +67,9 @@ class DataSourceManager extends Logging {
    */
   def lookupDataSource(name: String): UserDefinedPythonDataSource = {
     if (dataSourceExists(name)) {
-      dataSourceBuilders.get(normalize(name))
+      val normalizedName = normalize(name)
+      staticDataSourceBuilders.getOrElse(
+        normalizedName, dataSourceBuilders.get(normalize(name)))

Review Comment:
   ```suggestion
           normalizedName, dataSourceBuilders.get(normalizedName))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #44681:
URL: https://github.com/apache/spark/pull/44681#discussion_r1449709548


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##########
@@ -34,23 +32,29 @@ import org.apache.spark.util.Utils
  * A manager for user-defined data sources. It is used to register and lookup data sources by
  * their short names or fully qualified names.
  */
-class DataSourceManager extends Logging {
+class DataSourceManager(
+    initDataSourceBuilders: => Option[
+      Map[String, UserDefinedPythonDataSource]] = None
+   ) extends Logging {
+  import DataSourceManager._
   // Lazy to avoid being invoked during Session initialization.
   // Otherwise, it goes infinite loop, session -> Python runner -> SQLConf -> session.
-  private lazy val dataSourceBuilders = {
-    val builders = new ConcurrentHashMap[String, UserDefinedPythonDataSource]()
-    builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
-    builders
+  private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
+    initialDataSourceBuilders

Review Comment:
   Ah because of these two configs:
   ```
       val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback
       val workerMemoryMb = SQLConf.get.pythonPlannerExecMemory
   ```
   I think instead of accessing the SQLConf here, we should pass them as parameters to this method `runInPython` to avoid this initialization issue. Maybe we can add a TODO for a follow up PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #44681: [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources
URL: https://github.com/apache/spark/pull/44681


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org