You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/11 06:04:24 UTC

[GitHub] [spark] amaliujia opened a new pull request, #38200: [SPARK-40743][CONNECT]StructType should contain a list of StructField and each field should have a name

amaliujia opened a new pull request, #38200:
URL: https://github.com/apache/spark/pull/38200

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   This PR refactors proto's struct datatype with 1) each struct has a struct field list  2) each struct field has a name.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   In the past, connect's struct datatype does not have name for its struct fields.
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r993922735


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -55,134 +57,135 @@ message DataType {
     uint32 user_defined_type_reference = 31;
   }
 
-  enum Nullability {
-    NULLABILITY_UNSPECIFIED = 0;
-    NULLABILITY_NULLABLE = 1;
-    NULLABILITY_REQUIRED = 2;
-  }
-
   message Boolean {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I8 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I16 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message String {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Binary {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Timestamp {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Date {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Time {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message TimestampTZ {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalYear {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalDay {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message UUID {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   // Start compound types.
   message FixedChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message VarChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message FixedBinary {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message Decimal {
     int32 scale = 1;
     int32 precision = 2;
     uint32 type_variation_reference = 3;
-    Nullability nullability = 4;
+    bool nullability = 4;
+  }
+
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    bool nullability = 3;

Review Comment:
   For a pure type system design, for composable type in which there are sub-types, both parent and children can have nullability.  Use StructType as an example:
   
   1.  Struct(nullability=true, structFields=[(name="a", nullability=true)]):
   val x = Struct("a"=NULL) // **valid**
   val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // **valid**
   
   2. Struct(nullability=true, structFields=[(name="a", nullability=false)]):
   val x = Struct("a"=NULL) // **invalid**
   val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // **valid**
   
   3. Struct(nullability=false, structFields=[(name="a", nullability=false)]):
   val x = Struct("a"=NULL) // **invalid**
   val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // **invalid**
   
   Same for ARRAY, MAP, etc.
   
   This is very general. I am thinking you are thinking a data model that is a subset of it?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r991863876


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -167,8 +167,14 @@ message DataType {
     Nullability nullability = 4;
   }
 
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    Nullability nullability = 3;

Review Comment:
   isn't Nullability just a boolean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r991865381


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -39,6 +39,38 @@ package object dsl {
               .addAllParts(identifier.asJava)
               .build())
           .build()
+
+      def protoQualifiedAttr: proto.Expression.QualifiedAttribute =
+        proto.Expression.QualifiedAttribute.newBuilder()
+          .setName(identifier.mkString("."))

Review Comment:
   actually `a.b` should not directly map to `QualifiedAttribute`. It can be `GetStructField(QualifiedAttribute...)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r994802677


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -55,134 +57,135 @@ message DataType {
     uint32 user_defined_type_reference = 31;
   }
 
-  enum Nullability {
-    NULLABILITY_UNSPECIFIED = 0;
-    NULLABILITY_NULLABLE = 1;
-    NULLABILITY_REQUIRED = 2;
-  }
-
   message Boolean {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I8 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I16 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message String {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Binary {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Timestamp {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Date {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Time {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message TimestampTZ {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalYear {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalDay {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message UUID {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   // Start compound types.
   message FixedChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message VarChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message FixedBinary {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message Decimal {
     int32 scale = 1;
     int32 precision = 2;
     uint32 type_variation_reference = 3;
-    Nullability nullability = 4;
+    bool nullability = 4;
+  }
+
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    bool nullability = 3;

Review Comment:
   I believe the connect proto type system is a bit different on the container type. However I just refactor code here and this is existing code. We don't have tests for types neither.
   
   How about let me follow up on this topic later after having more discussions with people. I can add more tests for nullability at that time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38200:
URL: https://github.com/apache/spark/pull/38200#issuecomment-1275308239

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r993922735


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -55,134 +57,135 @@ message DataType {
     uint32 user_defined_type_reference = 31;
   }
 
-  enum Nullability {
-    NULLABILITY_UNSPECIFIED = 0;
-    NULLABILITY_NULLABLE = 1;
-    NULLABILITY_REQUIRED = 2;
-  }
-
   message Boolean {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I8 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I16 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message String {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Binary {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Timestamp {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Date {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Time {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message TimestampTZ {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalYear {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalDay {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message UUID {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   // Start compound types.
   message FixedChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message VarChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message FixedBinary {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message Decimal {
     int32 scale = 1;
     int32 precision = 2;
     uint32 type_variation_reference = 3;
-    Nullability nullability = 4;
+    bool nullability = 4;
+  }
+
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    bool nullability = 3;

Review Comment:
   For a pure type system design, for composable type in which there are sub-types, both parent and children can have nullability.  Use StructType as an example:
   
   1.  Struct(nullability=true, structFields=[(name="a", nullability=true)]):
   val x = Struct("a"=NULL) // **valid**
   val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // **valid**
   
   2. Struct(nullability=true, structFields=[(name="a", nullability=false)]):
   val x = Struct("a"=NULL) // **invalid**
   val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // **valid**
   
   3. Struct(nullability=false, structFields=[(name="a", nullability=false)]):
   val x = Struct("a"=NULL) // **invalid**
   val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // **invalid**
   
   
   This is very general. I am thinking you are thinking a data model that is a subset of it?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r993841919


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -167,8 +167,14 @@ message DataType {
     Nullability nullability = 4;
   }
 
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    Nullability nullability = 3;

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name
URL: https://github.com/apache/spark/pull/38200


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38200:
URL: https://github.com/apache/spark/pull/38200#issuecomment-1281720501

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r991868724


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -39,6 +39,38 @@ package object dsl {
               .addAllParts(identifier.asJava)
               .build())
           .build()
+
+      def protoQualifiedAttr: proto.Expression.QualifiedAttribute =
+        proto.Expression.QualifiedAttribute.newBuilder()
+          .setName(identifier.mkString("."))

Review Comment:
   ah let me take a look. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r997649989


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -114,7 +115,26 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
       import org.apache.spark.sql.connect.dsl.plans._
       transform(connectTestRelation.as("target_table"))
     }
+

Review Comment:
   unnecessary change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r991867688


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -39,6 +39,38 @@ package object dsl {
               .addAllParts(identifier.asJava)
               .build())
           .build()
+
+      def protoQualifiedAttr: proto.Expression.QualifiedAttribute =
+        proto.Expression.QualifiedAttribute.newBuilder()
+          .setName(identifier.mkString("."))

Review Comment:
   I think we can follow the catalyst dsl
   1. if we call `protoAttr`, then we parse the name (split by dot) and create `UnresolvedAttribute`
   2. if we call `bool`, then we directly use the name as attribute name and create `AttributeReference`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r993909088


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -55,134 +57,135 @@ message DataType {
     uint32 user_defined_type_reference = 31;
   }
 
-  enum Nullability {
-    NULLABILITY_UNSPECIFIED = 0;
-    NULLABILITY_NULLABLE = 1;
-    NULLABILITY_REQUIRED = 2;
-  }
-
   message Boolean {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I8 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I16 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message String {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Binary {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Timestamp {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Date {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Time {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message TimestampTZ {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalYear {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalDay {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message UUID {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   // Start compound types.
   message FixedChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message VarChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message FixedBinary {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message Decimal {
     int32 scale = 1;
     int32 precision = 2;
     uint32 type_variation_reference = 3;
-    Nullability nullability = 4;
+    bool nullability = 4;
+  }
+
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    bool nullability = 3;

Review Comment:
   The current data model assumes that the type itself expresses its nullability, so why do we need nullability here?
   
   I am not saying that the current approach of the datatype having nullability is great. I would also be good if we only set nullability in the StructField message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r994634084


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -55,134 +57,135 @@ message DataType {
     uint32 user_defined_type_reference = 31;
   }
 
-  enum Nullability {
-    NULLABILITY_UNSPECIFIED = 0;
-    NULLABILITY_NULLABLE = 1;
-    NULLABILITY_REQUIRED = 2;
-  }
-
   message Boolean {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I8 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I16 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message String {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Binary {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Timestamp {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Date {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Time {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message TimestampTZ {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalYear {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalDay {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message UUID {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   // Start compound types.
   message FixedChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message VarChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message FixedBinary {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message Decimal {
     int32 scale = 1;
     int32 precision = 2;
     uint32 type_variation_reference = 3;
-    Nullability nullability = 4;
+    bool nullability = 4;
+  }
+
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    bool nullability = 3;

Review Comment:
   This is different from Catalyst. The catalyst data model is:
   1. most data types are just enums: string, int, float, etc.
   2. some data types have parameters: decimal(3, 1), varchar(10), etc.
   3. there are 3 collection types: struct, array, map. These 3 types have nullability: struct field nullability, array element nullability, map value nullability.
   
   Why is proto data type different?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r994802677


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -55,134 +57,135 @@ message DataType {
     uint32 user_defined_type_reference = 31;
   }
 
-  enum Nullability {
-    NULLABILITY_UNSPECIFIED = 0;
-    NULLABILITY_NULLABLE = 1;
-    NULLABILITY_REQUIRED = 2;
-  }
-
   message Boolean {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I8 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I16 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message I64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP32 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message FP64 {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message String {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Binary {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Timestamp {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Date {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message Time {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message TimestampTZ {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalYear {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message IntervalDay {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   message UUID {
     uint32 type_variation_reference = 1;
-    Nullability nullability = 2;
+    bool nullability = 2;
   }
 
   // Start compound types.
   message FixedChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message VarChar {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message FixedBinary {
     int32 length = 1;
     uint32 type_variation_reference = 2;
-    Nullability nullability = 3;
+    bool nullability = 3;
   }
 
   message Decimal {
     int32 scale = 1;
     int32 precision = 2;
     uint32 type_variation_reference = 3;
-    Nullability nullability = 4;
+    bool nullability = 4;
+  }
+
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    bool nullability = 3;

Review Comment:
   I believe the connect proto type system is a bit different on the container type. However I just refactor code here and this is existing code. We don't have tests for types neither.
   
   How about let me follow on this topic later after having more discussions with people. I can add more tests for nullability at that time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38200: [SPARK-40743][CONNECT]StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38200:
URL: https://github.com/apache/spark/pull/38200#issuecomment-1274129998

   BTW struct type is a blocker for us to support the full version of `LocalRelation` which is a blocker for us to support `createDataFrame(in-memory data structure)`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38200: [SPARK-40743][CONNECT]StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38200:
URL: https://github.com/apache/spark/pull/38200#issuecomment-1274129303

   R: @cloud-fan 
   cc @grundprinzip @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38200: [SPARK-40743][CONNECT]StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r991849673


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -39,6 +39,38 @@ package object dsl {
               .addAllParts(identifier.asJava)
               .build())
           .build()
+
+      def protoQualifiedAttr: proto.Expression.QualifiedAttribute =
+        proto.Expression.QualifiedAttribute.newBuilder()
+          .setName(identifier.mkString("."))

Review Comment:
   For LocalRelation's attributes, how to convert `identifier: Seq[String]` to it? 
   
   Do we need to add an extra `qualifier: Seq[String]`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38200: [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38200:
URL: https://github.com/apache/spark/pull/38200#discussion_r991864071


##########
connector/connect/src/main/protobuf/spark/connect/types.proto:
##########
@@ -167,8 +167,14 @@ message DataType {
     Nullability nullability = 4;
   }
 
+  message StructField {
+    DataType type = 1;
+    string name = 2;
+    Nullability nullability = 3;

Review Comment:
   and shall we include metadata?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org