You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/07/13 10:36:31 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

zhengruifeng opened a new pull request, #41986:
URL: https://github.com/apache/spark/pull/41986

   ### What changes were proposed in this pull request?
   
   `SparkSession.sql` in Connect actually only return a unresolved plan, however it may reference temp view, it should hold the analyzed plan in the server side in this case, since the temp view maybe dropped somewhere.
   
   ### Why are the changes needed?
   
   it is a common pattern:
   
   1. create temp views;
   2. reference temp views in `SparkSession.sql` and get a result dataframe;
   3. drop temp views;
   4. return the dataframe;
   
   
   for example:
   1, in sql_formatter: https://github.com/apache/spark/blob/9aa42a970c4bd8e54603b1795a0f449bd556b11b/python/pyspark/sql/sql_formatter.py#L67-L85
   2, in mllib:
   https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L70-L75
   
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   before:
   ```
   In [1]: df = spark.range(0, 10)
   
   In [2]: df.createOrReplaceTempView("t1")
   
   In [3]: df2 = spark.sql("select * from t1")
   
   In [4]: spark.catalog.dropTempView("t1")
   Out[4]: True
   
   In [5]: df2.show()
   23/07/13 18:33:08 ERROR SparkConnectService: Error during: execute. UserId: ruifeng.zheng. SessionId: 3ce414b0-c99e-450d-ba11-c0f5fcb2daf3.
   org.apache.spark.sql.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `t1` cannot be found. Verify the spelling and correctness of the schema and catalog.
   If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
   To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 14;
   'Project [*]
   +- 'UnresolvedRelation [t1], [], false
   ```
   
   after:
   ```
   In [1]: df = spark.range(0, 10)
   
   In [2]: df.createOrReplaceTempView("t1")
   
   In [3]: df2 = spark.sql("select * from t1")
   
   In [4]: spark.catalog.dropTempView("t1")
   Out[4]: True
   
   In [5]: df2.show()
   +---+
   | id|
   +---+
   |  0|
   |  1|
   |  2|
   |  3|
   |  4|
   |  5|
   |  6|
   |  7|
   |  8|
   |  9|
   +---+
   ```
   
   
   ### How was this patch tested?
   added ut


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1635078602

   > This is a behavioral change. The first version of the SQL() was using a similar cached DF approach but we decided against it because it is now you're creating a lot of state on the server side for no good reason.
   > 
   > Now simply by using the API the DF becomes stored on the server and it's not easily retriable.
   
   the problem I think, is that the `spark.sql` becomes nondeterministic. After user get a DF from  `spark.sql`, if user drop/replace the view, the query output can be changed.
   
   spark still have to store the plans in catalog for temp views, which can not be dropped and are exposed to the end users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view
URL: https://github.com/apache/spark/pull/41986


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1634799865

   Please don't do this. Now the API becomes more inconsistent. E.g. so when you reference a deleted temp view from sql you will get a result, however when you reference the same temp view using `spark.table(...)` you get an error?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1635061532

   In that case, let's probably just revert https://github.com/apache/spark/pull/41980 out. That feature itself is experimental so we don't necessarily have to support it in Spark Connect.
   
   Another way is just try to cache specifically in `sql` with kwargs. If that can be easily done, we could go ahead


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1635089887

   > Branchcut is soon so let's revert #41980 first.
   
   ok, let's revert that one, since it creates temp view if `kwargs` contains DF variables. I can not find an alternative solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1635087027

   Branchcut is soon so let's revert https://github.com/apache/spark/pull/41980 first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1633992306

   cc @HyukjinKwon @cloud-fan @grundprinzip @hvanhovell 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1774245092

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41986:
URL: https://github.com/apache/spark/pull/41986#issuecomment-1635076670

   > Please don't do this. Now the API becomes more inconsistent. E.g. so when you reference a deleted temp view from sql you will get a result, however when you reference the same temp view using `spark.table(...)` you get an error?
   
   good point. Do it mean we should also make such changes for `spark.table`?  since the vanilla spark works in this way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #41986: [SPARK-44406][CONNECT] Make `SparkSession.sql` work properly with dropped temp view
URL: https://github.com/apache/spark/pull/41986


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org