You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "rangadi (via GitHub)" <gi...@apache.org> on 2023/02/23 09:40:23 UTC

[GitHub] [spark] rangadi opened a new pull request, #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field

rangadi opened a new pull request, #40141:
URL: https://github.com/apache/spark/pull/40141

   ### What changes were proposed in this pull request?
   
   Protobuf deserializer (`from_protobuf()` function()) optionally supports recursive fields up to certain depth. Currently it uses `NullType` to terminate the recursion. But an `ArrayType` containing `NullType` is not really useful and  it does not work delta.
   
   This PR fixes this by removing the field to terminate recursion rather than using `NullType`. 
   The following example illustrates the difference. 
   
   E.g. Consider a recursive Protobuf like this:
   ```
   message Node {
       int value = 1;
       repeated Node children = 2  // recursive array
   }
   message Tree {
       Node root = 1
   }
   ```
   Catalyst schama with `from_protobuf()` of `Tree` with max recursive depth set to 2, would be:
    
      - **Before**:  _STRUCT<root: STRUCT<value: int, children: array<STRUCT<value: int, **children: array<void>**>>>>_
      - **After**: _STRUCT<root: STRUCT<value: int, children: array<STRUCT<value: int>>>>_
   
   Notice that at second level, the `children` array is dropped, rather than being defined as `array<void>`. 
   
   ### Why are the changes needed?
    - This improves how Protobuf connector handles recursive fields. It avoids using `void` fields which are problematic in many scenarios and do not add any information.
   
   ### Does this PR introduce _any_ user-facing change?
    - This changes the schema in a subtle manner while using with recursive support enabled. Since this only removes an optional field, it is backward compatible. 
   
   ### How was this patch tested?
    - Added multiple unit tests and updated existing one. Most of the changes for this PR are in the tests. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on pull request #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #40141:
URL: https://github.com/apache/spark/pull/40141#issuecomment-1441455325

   @SandishKumarHN, @gengliangwang : Please review when you get a chance. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #40141:
URL: https://github.com/apache/spark/pull/40141#issuecomment-1447567460

   Thanks, merging to master/3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #40141:
URL: https://github.com/apache/spark/pull/40141#issuecomment-1447567391

   +1, the new schema is more reasonable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang closed pull request #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field
URL: https://github.com/apache/spark/pull/40141


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on a diff in pull request #40141: [SPARK-42406] Terminate Protobuf recursive fields by dropping the field

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40141:
URL: https://github.com/apache/spark/pull/40141#discussion_r1115442173


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1101,166 +939,61 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
 
     val optionsZero = new java.util.HashMap[String, String]()
     optionsZero.put("recursive.fields.max.depth", "1")
-    val schemaZero = DataType.fromJson(
-        s"""{
-           |  "type" : "struct",
-           |  "fields" : [ {
-           |    "name" : "sample",
-           |    "type" : {
-           |      "type" : "struct",
-           |      "fields" : [ {
-           |        "name" : "name",
-           |        "type" : "string",
-           |        "nullable" : true
-           |      }, {
-           |        "name" : "bff",
-           |        "type" : "void",
-           |        "nullable" : true
-           |      } ]
-           |    },
-           |    "nullable" : true
-           |  } ]
-           |}""".stripMargin).asInstanceOf[StructType]
-    val expectedDfZero = spark.createDataFrame(
-      spark.sparkContext.parallelize(Seq(Row(Row("person0", null)))), schemaZero)
-
-    testFromProtobufWithOptions(df, expectedDfZero, optionsZero, "EventPerson")
-
-    val optionsOne = new java.util.HashMap[String, String]()
-    optionsOne.put("recursive.fields.max.depth", "2")
-    val schemaOne = DataType.fromJson(
-      s"""{
-         |  "type" : "struct",
-         |  "fields" : [ {
-         |    "name" : "sample",
-         |    "type" : {
-         |      "type" : "struct",
-         |      "fields" : [ {
-         |        "name" : "name",
-         |        "type" : "string",
-         |        "nullable" : true
-         |      }, {
-         |        "name" : "bff",
-         |        "type" : {
-         |          "type" : "struct",
-         |          "fields" : [ {
-         |            "name" : "name",
-         |            "type" : "string",
-         |            "nullable" : true
-         |          }, {
-         |            "name" : "bff",
-         |            "type" : "void",
-         |            "nullable" : true
-         |          } ]
-         |        },
-         |        "nullable" : true
-         |      } ]
-         |    },
-         |    "nullable" : true
-         |  } ]
-         |}""".stripMargin).asInstanceOf[StructType]
+    val schemaOne = structFromDDL(

Review Comment:
   As part of this PR, replaced many long json schemas with DDL which is lot more concise. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -157,9 +157,6 @@ private[sql] class ProtobufDeserializer(
     (protoType.getJavaType, catalystType) match {
 
       case (null, NullType) => (updater, ordinal, _) => updater.setNullAt(ordinal)
-      // It is possible that this will result in data being dropped, This is intentional,
-      // to catch recursive fields and drop them as necessary.
-      case (MESSAGE, NullType) => (updater, ordinal, _) => updater.setNullAt(ordinal)

Review Comment:
   `NullType` is no longer expected. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -59,6 +58,8 @@ object SchemaConverters {
   // existingRecordNames: Map[String, Int] used to track the depth of recursive fields and to
   // ensure that the conversion of the protobuf message to a Spark SQL StructType object does not
   // exceed the maximum recursive depth specified by the recursiveFieldMaxDepth option.
+  // A return of None implies the field has reached the maximum allowed recursive depth and

Review Comment:
   This file has the main fixes. A 'None' is returned when recursion depth is reached. And the field is dropped by the caller (i.e. outer struct). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org