You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/26 05:34:52 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request, #39214: [WIP][SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

HyukjinKwon opened a new pull request, #39214:
URL: https://github.com/apache/spark/pull/39214

   ### What changes were proposed in this pull request?
   
   This PR proposes to implement all catalog API in Spark Connect.
   
   TBD.
   
   ### Why are the changes needed?
   
   For feature parity.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it adds the support of Spark Connect in `spark.catalog`.
   
   ### How was this patch tested?
   
   Reused the unitests of the existing PySpark Catalog tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39214: [WIP][SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057150277


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -99,6 +100,47 @@ class SparkConnectPlanner(session: SparkSession) {
       case proto.Relation.RelTypeCase.UNPIVOT => transformUnpivot(rel.getUnpivot)
       case proto.Relation.RelTypeCase.RELTYPE_NOT_SET =>
         throw new IndexOutOfBoundsException("Expected Relation to be set, but is empty.")
+
+      // Catalog API (internal-only)

Review Comment:
   For unittests in `SparkConnectPlanTest`, I will add them separately in another PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057211943


##########
python/pyspark/sql/connect/session.py:
##########
@@ -120,6 +121,11 @@ def __init__(self, connectionString: str, userId: Optional[str] = None):
         # Parse the connection string.
         self._client = SparkConnectClient(connectionString)
 
+    def table(self, tableName: str) -> DataFrame:

Review Comment:
   This PR piggyback an API `SparkSession.table` together.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057212456


##########
python/pyspark/sql/tests/test_catalog.py:
##########
@@ -16,20 +16,21 @@
 #

Review Comment:
   The changes made in reused PySpark tests are two:
   
   1. Do not check the specific type of exception. We should fix it to check properly (SPARK-41715)
   2. Explicit `collect` on `sql` because `sql` in Spark Connect is lazy (which actually I think is an issue).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057211943


##########
python/pyspark/sql/connect/session.py:
##########
@@ -120,6 +121,11 @@ def __init__(self, connectionString: str, userId: Optional[str] = None):
         # Parse the connection string.
         self._client = SparkConnectClient(connectionString)
 
+    def table(self, tableName: str) -> DataFrame:

Review Comment:
   This PR piggybacks an API `SparkSession.table` together.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39214:
URL: https://github.com/apache/spark/pull/39214#issuecomment-1365495742

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39214:
URL: https://github.com/apache/spark/pull/39214#issuecomment-1365124966

   Should be ready for a look. cc @amaliujia @zhengruifeng @hvanhovell @grundprinzip 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057439608


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -68,6 +69,37 @@ message Relation {
     StatCrosstab crosstab = 101;
     StatDescribe describe = 102;
 
+    // Catalog API (internal-only)

Review Comment:
   The better way of dealing with this is to wrap all of the catalog messages into a parent message. This makes it easier to reason about this and eventually remove if we want.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39214: [WIP][SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057150277


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -99,6 +100,47 @@ class SparkConnectPlanner(session: SparkSession) {
       case proto.Relation.RelTypeCase.UNPIVOT => transformUnpivot(rel.getUnpivot)
       case proto.Relation.RelTypeCase.RELTYPE_NOT_SET =>
         throw new IndexOutOfBoundsException("Expected Relation to be set, but is empty.")
+
+      // Catalog API (internal-only)

Review Comment:
   For unittests in `SparkConnectPlanTest`, I will add them separately.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39214:
URL: https://github.com/apache/spark/pull/39214#discussion_r1057627227


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -68,6 +69,37 @@ message Relation {
     StatCrosstab crosstab = 101;
     StatDescribe describe = 102;
 
+    // Catalog API (internal-only)

Review Comment:
   Thanks Martin. Let me try to take a stab



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39214:
URL: https://github.com/apache/spark/pull/39214#issuecomment-1365200657

   This PR supports all as the same as the current PySpark does for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #39214: [SPARK-41707][CONNECT] Implement Catalog API in Spark Connect
URL: https://github.com/apache/spark/pull/39214


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org