You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/12/15 20:13:49 UTC

[PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

HyukjinKwon opened a new pull request, #44374:
URL: https://github.com/apache/spark/pull/44374

   ### What changes were proposed in this pull request?
   
   This PR is a kind of a followup of https://github.com/apache/spark/pull/44305 that proposes to create Python Data Source instance at `DataSource.lookupDataSourceV2`
   
   ### Why are the changes needed?
   
   Semantically the instance has to be ready at `DataSource.lookupDataSourceV2` level instead of after that. It's more consistent as well.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests should cover.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44374:
URL: https://github.com/apache/spark/pull/44374#issuecomment-1858435446

   cc @cloud-fan and @allisonwang-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44374:
URL: https://github.com/apache/spark/pull/44374#issuecomment-1858455867

   Build: https://github.com/HyukjinKwon/spark/actions/runs/7226599432


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #44374:
URL: https://github.com/apache/spark/pull/44374#discussion_r1429403762


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala:
##########
@@ -732,18 +722,21 @@ object DataSource extends Logging {
   def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {

Review Comment:
   Should we remove the changes in the `lookupDataSource` method above? Since a Python data source can only be a V2 source.
   ```scala
   else if (isUserDefinedDataSource) {
     classOf[PythonTableProvider]
   } 
   ```
   and 
   ```scala
   case head :: Nil =>
     // there is exactly one registered alias
     // TODO(SPARK-45600): should be session-based.
     val isUserDefinedDataSource = SparkSession.getActiveSession.exists(
       _.sessionState.dataSourceManager.dataSourceExists(provider))
     // The source can be successfully loaded as either a V1 or a V2 data source.
     // Check if it is also a user-defined data source.
     if (isUserDefinedDataSource) {
       throw QueryCompilationErrors.foundMultipleDataSources(provider)
     }
     head.getClass
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44374:
URL: https://github.com/apache/spark/pull/44374#issuecomment-1858599523

   documentation build will be fixed at https://github.com/apache/spark/pull/44376.
   
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44374:
URL: https://github.com/apache/spark/pull/44374#discussion_r1429440307


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala:
##########
@@ -732,18 +722,21 @@ object DataSource extends Logging {
   def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {

Review Comment:
   It depends on whether we will call `lookupDataSource` to detect python data sources. Seems no?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44374:
URL: https://github.com/apache/spark/pull/44374#discussion_r1430826366


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala:
##########
@@ -732,18 +722,21 @@ object DataSource extends Logging {
   def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {

Review Comment:
   `lookupDataSource` detects both V1 and V2 classes. `lookupDataSourceV2` just creates the instance from V2 classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44374:
URL: https://github.com/apache/spark/pull/44374#discussion_r1430834599


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala:
##########
@@ -732,18 +722,21 @@ object DataSource extends Logging {
   def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {

Review Comment:
   Yes, but technically python data source is not v1 or v2. So it seems ok if `lookupDataSource` doesn't return python data sources.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44374:
URL: https://github.com/apache/spark/pull/44374#discussion_r1430850046


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala:
##########
@@ -732,18 +722,21 @@ object DataSource extends Logging {
   def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {

Review Comment:
   But when we extend the Python Data Source features, we will use v2 code path, for example, `ExternalCommandRunner`. Same as `DataStreamReader.loadInternal` code path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44374: [SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2
URL: https://github.com/apache/spark/pull/44374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org