You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "hvanhovell (via GitHub)" <gi...@apache.org> on 2023/03/01 19:09:09 UTC

[GitHub] [spark] hvanhovell opened a new pull request, #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

hvanhovell opened a new pull request, #40238:
URL: https://github.com/apache/spark/pull/40238

   ### What changes were proposed in this pull request?
   This allows LocalRelation to use an actual datatype as its schema.
   
   ### Why are the changes needed?
   Passing the actual schema has a higher fidelity than using strings. Metadata for example is not passed when you a DDL string.
   
   ### Does this PR introduce _any_ user-facing change?
   No. It is an addition to the connect protocol.
   
   ### How was this patch tested?
   Updated a test, and this works. I will do the PlanGenerationTestSuite in a follow-up.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1122565685


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   My motivation was to avoid roundtrips, but let's discuss.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #40238:
URL: https://github.com/apache/spark/pull/40238#issuecomment-1595904327

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1130385693


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   On second thoughts, if `UnparsedDataType` at #40260 is good to go, we might not need to use DDL string anymore, and just use `DataType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1122619185


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   in Python Client, there are a few edge cases in `createDataFrame`:
   
   1, `spark.createDataFrame([(100, None)], "age INT, name STRING")` [SPARK-42458](https://issues.apache.org/jira/browse/SPARK-42458)
   can not infer the schema since the second column only contains `None`, https://github.com/apache/spark/pull/40240 fixed by parse the ddl string via `self.createDataFrame([], schema=_schema_str).schema`;
   
   2, `self.spark.createDataFrame([["a", "b"]], ["col1"])` [SPARK-42022](https://issues.apache.org/jira/browse/SPARK-42022)
   the `createDataFrame` should be able to autogenerate missing column names, if we don't do this in client, we may need to do it in `LocalRelation`
   
   There are still some other behavior differences, If we can parse the DDL string in client, I guess we can reuse more functions in PySpark's `createDataFrame`.
   
   cc @HyukjinKwon @ueshin 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123920584


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   In some cases, we need to parse it beforehand anyway, for the case of #40240, we can't create converter from the local python object to Arrow table without the schema as `DataType` object.
   
   Another example is UDFs. It needs to pickle the function, and its return type as `DataType` object. We can't generate Python `DataType` object and pickle it in the `command` field in server side anymore. So we need to parse the DDL string beforehand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1130353503


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   Thanks @zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123930392


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   Good point. IIRC, there was a decision not to implement it in Python.
   cc @HyukjinKwon @zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1122497859


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   since we had introduced `DDLParse` in https://github.com/apache/spark/commit/0c19ea4773b8eaeedc4031cb1c5c3e97f2c2d3b9 
   
   what about always parsing the ddl string in client and only use `DataType dataType` in protos? I feel it will make client simpler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123536612


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -338,7 +338,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
             plan.local_relation.data = sink.getvalue().to_pybytes()
 
         if self._schema is not None:
-            plan.local_relation.schema = self._schema
+            plan.local_relation.schemaString = self._schema

Review Comment:
   ```suggestion
               plan.local_relation.schema_string = self._schema
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1122619185


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   in Python Client, there are a few edge cases in `createDataFrame`:
   
   `spark.createDataFrame([(100, None)], "age INT, name STRING")` [SPARK-42458](https://issues.apache.org/jira/browse/SPARK-42458)
   can not infer the schema since the second column only contains `None`, https://github.com/apache/spark/pull/40240 fixed by parse the ddl string via `self.createDataFrame([], schema=_schema_str).schema`;
   
   
   There are still some other behavior differences, If we can parse the DDL string in client, I guess we can reuse more functions in PySpark's `createDataFrame`.
   
   cc @HyukjinKwon @ueshin 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1130349889


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   > I'd rather like to change all places to accept both string and DataType and the choice is up to the client developers.
   
   I think it is a good idea. We may add a dedicated message for it and then use it in all places.
   
   Are we going to do this?
   
   also cc @LuciferYang 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema
URL: https://github.com/apache/spark/pull/40238


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on PR #40238:
URL: https://github.com/apache/spark/pull/40238#issuecomment-1450758457

   Need add a test for the new added field?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40238:
URL: https://github.com/apache/spark/pull/40238#issuecomment-1453013336

   test: https://github.com/hvanhovell/spark/actions/runs/4318951397


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123575136


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -338,7 +338,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
             plan.local_relation.data = sink.getvalue().to_pybytes()
 
         if self._schema is not None:
-            plan.local_relation.schema = self._schema
+            plan.local_relation.schemaString = self._schema

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1122498016


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {
+    // Either a DDL-formatted type string or a JSON string.
+    string schemaString = 2;

Review Comment:
   ```suggestion
       string schema_string = 2;
   ```



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {
+    // Either a DDL-formatted type string or a JSON string.
+    string schemaString = 2;
+
+    // DataType representing the schema.
+    DataType dataType = 3;

Review Comment:
   ```suggestion
       DataType data_type = 3;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123698953


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {
+    // Either a DDL-formatted type string or a JSON string.
+    string schemaString = 2;

Review Comment:
   thanks; it is been a while since I touched proto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123926302


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   @ueshin why not implement a parser on the python side? That shouldn't be too hard, and it saves you from doing a lot of expensive rpcs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1123923286


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   I think this change is fine. I'd rather like to change all places to accept both string and `DataType` and the choice is up to the client developers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org