You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "chaoqin-li1123 (via GitHub)" <gi...@apache.org> on 2024/01/09 22:10:40 UTC

[PR] [WIP] retain empty message field in protobuf connector [spark]

chaoqin-li1123 opened a new pull request, #44643:
URL: https://github.com/apache/spark/pull/44643

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Since Spark doesn't allow empty StructType, empty proto message type as field will be dropped by default. introduce an option to allow retaining an empty message field by inserting a dummy column.
   
   
   ### Why are the changes needed?
   In protobuf, it is common to have empty message type without any field as a place holder
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. The default behavior is still dropping an empty message field. The new option will enable customer to keep the empty message field though they will observe a dummy column. 
   
   
   ### How was this patch tested?
   Unit test and integration test.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1468253272


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,23 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow writing empty StructType, empty proto message type will be
+  // dropped by default. Setting this option to true will insert a dummy column to empty proto
+  // message so that the empty message will be retained.
+  // For example, an empty message is used as field in another message:
+  //
+  // ```
+  // message A {}
+  // Message B {A a = 1, string name = 2}
+  // ```
+  //
+  // By default, in the spark schema field a will be dropped, which result in schema
+  // b struct<name: string>
+  // If retain.empty.message.types=true, field a will be retained by inserting a dummy column.
+  // b struct<name: string, a struct<__dummy_field_in_empty_struct: string>>

Review Comment:
   `a` would be first and `name` would be second right? 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -230,4 +235,11 @@ object SchemaConverters extends Logging {
       case dt => StructField(fd.getName, dt, nullable = !fd.isRequired)
     }
   }
+
+  // Insert a dummy column to retain the empty message because
+  // spark doesn't allow empty struct type.
+  private def convertEmptyProtoToStructWithDummyField(desc: String): StructType = {
+    log.info(s"Keep $desc which is empty struct by inserting a dummy field.")
+    StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)

Review Comment:
   minor: Shall we add "nullable = true" though that is the default?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1136,7 +1208,8 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
           val df = emptyBinaryDF.select(
             from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
           )
-        assert(df.schema == structFromDDL("empty_proto struct<>"))
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<>"))

Review Comment:
   I thought we would not have empty struct after this PR.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1221,32 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val expectedSchema = StructType(

Review Comment:
   Could you add comment to show this schema? It is hard to see. 



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1221,32 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val expectedSchema = StructType(
+      StructField("empty_proto",
+        StructType(
+          StructField("recursive_field", emptyProtoSchema) ::
+            StructField("recursive_array", ArrayType(emptyProtoSchema, containsNull = false)
+          ) :: Nil
+        )
+      ) :: Nil
+    )
+
+    val options = Map("recursive.fields.max.depth" -> "2", "retain.empty.message.types" -> "true")

Review Comment:
   Could you change it to 3 or add a case with 3? I just want to see how the schema changes. Please add comment to make it easier to see the schema.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {

Review Comment:
    Is `EmptyRecursiveProto` used in this test?
        - If not,  lets remove this recursive setting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456634624


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Example added to the documentation, I feel it ok to not add dummy column to top level empty struct because I can't think of any use case for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1469066305


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {

Review Comment:
   Option removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1470393614


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,79 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<__dummy_field_in_empty_struct: string>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")

Review Comment:
   Removed.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,79 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<__dummy_field_in_empty_struct: string>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,

Review Comment:
   Fixed.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,23 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow writing empty StructType, empty proto message type will be
+  // dropped by default. Setting this option to true will insert a dummy column to empty proto
+  // message so that the empty message will be retained.
+  // For example, an empty message is used as field in another message:
+  //
+  // ```
+  // message A {}
+  // Message B {A a = 1, string name = 2}

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456452602


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Because this is top level empty struct, only nested level empty struct is not allowed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1469065953


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -230,4 +235,11 @@ object SchemaConverters extends Logging {
       case dt => StructField(fd.getName, dt, nullable = !fd.isRequired)
     }
   }
+
+  // Insert a dummy column to retain the empty message because
+  // spark doesn't allow empty struct type.
+  private def convertEmptyProtoToStructWithDummyField(desc: String): StructType = {
+    log.info(s"Keep $desc which is empty struct by inserting a dummy field.")
+    StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)

Review Comment:
   This will result in compiler warning. I believe they won't flip the default of this field, which is super risky.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457920251


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -212,11 +212,23 @@ object SchemaConverters extends Logging {
           ).toSeq
           fields match {
             case Nil =>
-              log.info(
-                s"Dropping ${fd.getFullName} as it does not have any fields left " +
-                "likely due to recursive depth limit."
-              )
-              None
+              if (protobufOptions.retainEmptyMessage) {
+                // Insert a dummy column to retain the empty message because
+                // spark doesn't allow empty struct type.
+                log.info(
+                  s"Keep ${fd.getFullName} which is empty struct by inserting a dummy field" +
+                    s" because retain.empty.message=true"

Review Comment:
   `because ...` could be removed.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,21 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow writing empty StructType, empty proto message type will be

Review Comment:
   Could you format this better? 



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1221,32 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val expectedSchema = StructType(

Review Comment:
   Could you add a comment here showing expected schema? It is hard to see here. 



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {

Review Comment:
   Could add the a test with recursive proto version of this test?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"

Review Comment:
   Btw, since `"recursive.fields.max.depth"` is 4, there should be 4 levels of struct here, right?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")

Review Comment:
   Remove `"recursive.fields.max.depth" -> "4"` ? This test does not use recursive protobufs.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1221,32 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val expectedSchema = StructType(
+      StructField("empty_proto",
+        StructType(
+          StructField("recursive_field", emptyProtoSchema) ::
+            StructField("recursive_array", ArrayType(emptyProtoSchema, containsNull = false)
+          ) :: Nil
+        )
+      ) :: Nil
+    )
+
+    val options = Map("recursive.fields.max.depth" -> "2", "retain.empty.message.types" -> "true")

Review Comment:
   Will the schema change if max depth is higher, say 3?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"
+      ))
+      // The dummy column of empty proto should have null value.
+      checkAnswer(resultDF, Seq(Row(Row("my_name", null))))
+    }
+
+    // Top level message can't be empty, otherwise AnalysisException will be thrown.

Review Comment:
   Can you add a case where top level struct is empty and we try to store (using `EmptyProto`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457991389


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   I don't have a strong opinion here, changed to add dummy column also for top level empty strcut.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1458090292


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"

Review Comment:
   Sorry, forgot to change the name when pasting the code, change to EmptyProtoWrapper.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456459115


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {

Review Comment:
   Description added.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -212,11 +212,19 @@ object SchemaConverters extends Logging {
           ).toSeq
           fields match {
             case Nil =>
-              log.info(
-                s"Dropping ${fd.getFullName} as it does not have any fields left " +
-                "likely due to recursive depth limit."
-              )
-              None
+              if (protobufOptions.retainEmptyMessage) {
+                // Insert a dummy column to retain the empty message because
+                // spark doesn't allow empty struct type.
+                Some(
+                  StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)

Review Comment:
   Log added.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -212,11 +212,19 @@ object SchemaConverters extends Logging {
           ).toSeq
           fields match {
             case Nil =>
-              log.info(
-                s"Dropping ${fd.getFullName} as it does not have any fields left " +
-                "likely due to recursive depth limit."
-              )
-              None
+              if (protobufOptions.retainEmptyMessage) {
+                // Insert a dummy column to retain the empty message because
+                // spark doesn't allow empty struct type.
+                Some(
+                  StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456458316


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,12 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow empty StructType, empty proto message type as field will be
+  // dropped by default. Setting this option to true will insert a dummy column to empty proto
+  // message so that the empty message will be retained.
+  val retainEmptyMessage: Boolean =
+    parameters.getOrElse("retain.empty.message", false.toString).toBoolean

Review Comment:
   Renamed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456492648


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Interesting. Is this tested in the parquet test below?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1469066080


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,23 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow writing empty StructType, empty proto message type will be
+  // dropped by default. Setting this option to true will insert a dummy column to empty proto
+  // message so that the empty message will be retained.
+  // For example, an empty message is used as field in another message:
+  //
+  // ```
+  // message A {}
+  // Message B {A a = 1, string name = 2}
+  // ```
+  //
+  // By default, in the spark schema field a will be dropped, which result in schema
+  // b struct<name: string>
+  // If retain.empty.message.types=true, field a will be retained by inserting a dummy column.
+  // b struct<name: string, a struct<__dummy_field_in_empty_struct: string>>

Review Comment:
   Nice catch, fixed.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1136,7 +1208,8 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
           val df = emptyBinaryDF.select(
             from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
           )
-        assert(df.schema == structFromDDL("empty_proto struct<>"))
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<>"))

Review Comment:
   This test don't configure the option explicitly, so the default behavior is the same as before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1458090556


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Yes, now the top level also get the dummy column if the option is true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "charlesy6 (via GitHub)" <gi...@apache.org>.
charlesy6 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1458380985


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   LGTM



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   LGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456327396


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,12 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow empty StructType, empty proto message type as field will be

Review Comment:
   Treat this as full official documentation. This should be aimed at end user. Could you expand it and add an example? PTAL documentation for the flags above. 



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.

Review Comment:
   Comment needs to be updated, we are using non-recursive proto below.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1198,34 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Corner case: retain empty recursive proto fields") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dumy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)
+    val expectedSchema = StructType(
+      // DDL: "proto STRUCT<name: string, groups: map<
+      //    struct<name: string, group: map<struct<name: string>>>>>"
+      StructField("empty_proto",
+        StructType(
+          StructField("recursive_field", emptyProtoSchema) ::
+            StructField("recursive_array", ArrayType(emptyProtoSchema, containsNull = false)

Review Comment:
   Where is the dummy field here?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Why do we have empty struct here even though we have set the option?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyRecursiveProto at inner level, because empty struct type is not allowed in Spark.,

Review Comment:
   Fix comment. Even better, could you move non-recursive proto test to another test?



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1198,34 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Corner case: retain empty recursive proto fields") {

Review Comment:
   update the name to say '... when <optio_name> is set'.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {

Review Comment:
   Add description for this test.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -212,11 +212,19 @@ object SchemaConverters extends Logging {
           ).toSeq
           fields match {
             case Nil =>
-              log.info(
-                s"Dropping ${fd.getFullName} as it does not have any fields left " +
-                "likely due to recursive depth limit."
-              )
-              None
+              if (protobufOptions.retainEmptyMessage) {
+                // Insert a dummy column to retain the empty message because
+                // spark doesn't allow empty struct type.
+                Some(
+                  StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)

Review Comment:
   Better to use double underscores `__`, also it is quite long. How about `__dummy_field_in_empty_struct` ? Even my suggestion is bit long :). 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -212,11 +212,19 @@ object SchemaConverters extends Logging {
           ).toSeq
           fields match {
             case Nil =>
-              log.info(
-                s"Dropping ${fd.getFullName} as it does not have any fields left " +
-                "likely due to recursive depth limit."
-              )
-              None
+              if (protobufOptions.retainEmptyMessage) {
+                // Insert a dummy column to retain the empty message because
+                // spark doesn't allow empty struct type.
+                Some(
+                  StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)

Review Comment:
   Add a log here as well.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,12 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow empty StructType, empty proto message type as field will be
+  // dropped by default. Setting this option to true will insert a dummy column to empty proto
+  // message so that the empty message will be retained.
+  val retainEmptyMessage: Boolean =
+    parameters.getOrElse("retain.empty.message", false.toString).toBoolean

Review Comment:
   Better to add 'type' to the name. How about `retain.empty.message.types` ? 
   Normally a 'message' refers to runtime actual message, rather than type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457985105


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"

Review Comment:
   This is EmptyProto, not EmptyRecursiveProto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457990399


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,21 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow writing empty StructType, empty proto message type will be

Review Comment:
   Reformated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456493111


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Please add this example in the documentation.
   Btw, is it better to be consistent with the dummy field?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457985327


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"
+      ))
+      // The dummy column of empty proto should have null value.
+      checkAnswer(resultDF, Seq(Row(Row("my_name", null))))
+    }
+
+    // Top level message can't be empty, otherwise AnalysisException will be thrown.

Review Comment:
   Test added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457988505


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1221,32 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val expectedSchema = StructType(
+      StructField("empty_proto",
+        StructType(
+          StructField("recursive_field", emptyProtoSchema) ::
+            StructField("recursive_array", ArrayType(emptyProtoSchema, containsNull = false)
+          ) :: Nil
+        )
+      ) :: Nil
+    )
+
+    val options = Map("recursive.fields.max.depth" -> "2", "retain.empty.message.types" -> "true")

Review Comment:
   yes, schema will change for recursive empty field if the option changed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "charlesy6 (via GitHub)" <gi...@apache.org>.
charlesy6 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1458381280


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"
+      ))
+      // The dummy column of empty proto should have null value.
+      checkAnswer(resultDF, Seq(Row(Row("my_name", null))))
+    }
+
+    // Top level message can't be empty, otherwise AnalysisException will be thrown.

Review Comment:
   > Test added.
   
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457918795


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Why is that? Isn't it better to be consistent? "If there is an empty protobuf, it will have a dummy column". 
   The users don't distinguish between top level vs nested.
   Btw, does parquet test below include empty top level struct? Could you point to it? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1470079760


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1221,32 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val expectedSchema = StructType(

Review Comment:
   Thanks. The comment above looks good. Could you update the test with recursive depth set to 4?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #44643:
URL: https://github.com/apache/spark/pull/44643#issuecomment-1915920052

   CI failure is unrelated to this change:
   
   * https://github.com/chaoqin-li1123/spark/actions/runs/7704149381/job/20995976857 - unrelated K8S integration test failure
   * https://github.com/chaoqin-li1123/spark/actions/runs/7704149381/job/20995976857 - lint, known Sphinx version issue (maybe need to bump the master branch in fork?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #44643:
URL: https://github.com/apache/spark/pull/44643#issuecomment-1915927188

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1458005773


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+    withTempDir { file =>
+      val binaryDF = Seq(
+        EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
+        .toDF("binary")
+      checkWithFileAndClassName("EmptyProtoWrapper") {
+        case (name, descFilePathOpt) =>
+          val df = binaryDF.select(
+            from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+          )
+          df.write.format("parquet").mode("overwrite").save(file.getAbsolutePath)
+      }
+      val resultDF = spark.read.format("parquet").load(file.getAbsolutePath)
+      assert(resultDF.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"

Review Comment:
   Why are we using recursive proto in the line above (1162) `EmptyRecursiveProtoWrapper.newBuilder.setName("my_name").build().toByteArray)
           .toDF("binary")` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457990911


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {

Review Comment:
   There is a similar test for EmptyRecursiveProto below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1457989900


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,78 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")

Review Comment:
   There is a EmptyProtoWrapper below which is nested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1470375085


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,79 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.

Review Comment:
   Fix this comment. Top level also has dummy column.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,79 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<__dummy_field_in_empty_struct: string>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,

Review Comment:
   Fix this comment.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -207,6 +207,23 @@ private[sql] class ProtobufOptions(
   //    nil => nil, Int32Value(0) => 0, Int32Value(100) => 100.
   val unwrapWellKnownTypes: Boolean =
     parameters.getOrElse("unwrap.primitive.wrapper.types", false.toString).toBoolean
+
+  // Since Spark doesn't allow writing empty StructType, empty proto message type will be
+  // dropped by default. Setting this option to true will insert a dummy column to empty proto
+  // message so that the empty message will be retained.
+  // For example, an empty message is used as field in another message:
+  //
+  // ```
+  // message A {}
+  // Message B {A a = 1, string name = 2}

Review Comment:
   message with small `m`. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala:
##########
@@ -230,4 +235,11 @@ object SchemaConverters extends Logging {
       case dt => StructField(fd.getName, dt, nullable = !fd.isRequired)
     }
   }
+
+  // Insert a dummy column to retain the empty message because
+  // spark doesn't allow empty struct type.
+  private def convertEmptyProtoToStructWithDummyField(desc: String): StructType = {
+    log.info(s"Keep $desc which is empty struct by inserting a dummy field.")
+    StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)

Review Comment:
   I see. We have set that in other areas. You can resolve this comment. 



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1136,7 +1208,8 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
           val df = emptyBinaryDF.select(
             from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
           )
-        assert(df.schema == structFromDDL("empty_proto struct<>"))
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<>"))

Review Comment:
   Could you remove this diff. Looks like spurious change. 



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1223,56 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Retain empty recursive proto fields when retain.empty.message.types=true") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dummy field.
+
+    val structWithDummyColumn =
+      StructType(StructField("__dummy_field_in_empty_struct", StringType) :: Nil)
+    val structWithRecursiveDepthEquals2 = StructType(
+      StructField("recursive_field", structWithDummyColumn)
+        :: StructField("recursive_array", ArrayType(structWithDummyColumn, containsNull = false))
+        :: Nil)
+    /*
+      The code below construct the expected schema with recursive depth set to 3.

Review Comment:
   Thanks for adding this comment.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,79 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("Retain empty proto fields when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained as
+    // a field by inserting a dummy column as sub column.
+    val options = Map("retain.empty.message.types" -> "true")
+
+    // EmptyProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema ==
+          structFromDDL("empty_proto struct<__dummy_field_in_empty_struct: string>"))
+    }
+
+    // EmptyProto at inner level, because empty struct type is not allowed in Spark.,
+    // a dummy column is inserted to retain the empty message.
+    checkWithFileAndClassName("EmptyProtoWrapper") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("wrapper")
+        )
+        // Nested empty message is retained by adding dummy column to the schema.
+        assert(df.schema == structFromDDL("wrapper struct" +
+          "<name: string, empty_proto struct<__dummy_field_in_empty_struct: string>>"))
+    }
+  }
+
+  test("Write empty proto to parquet when retain.empty.message.types=true") {
+    // When retain.empty.message.types=true, empty proto like 'message A {}' can be retained
+    // as a field by inserting a dummy column as sub column, such schema can be written to parquet.
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message.types" -> "true")

Review Comment:
   Remove 'recursive' setting. Not used here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR closed pull request #44643: [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector 
URL: https://github.com/apache/spark/pull/44643


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on PR #44643:
URL: https://github.com/apache/spark/pull/44643#issuecomment-1888140706

   @rangadi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "chaoqin-li1123 (via GitHub)" <gi...@apache.org>.
chaoqin-li1123 commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456458708


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1198,34 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Corner case: retain empty recursive proto fields") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dumy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)
+    val expectedSchema = StructType(
+      // DDL: "proto STRUCT<name: string, groups: map<
+      //    struct<name: string, group: map<struct<name: string>>>>>"
+      StructField("empty_proto",
+        StructType(
+          StructField("recursive_field", emptyProtoSchema) ::
+            StructField("recursive_array", ArrayType(emptyProtoSchema, containsNull = false)

Review Comment:
   Inside emptyProtoSchema



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1456494004


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1150,6 +1198,34 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("Corner case: retain empty recursive proto fields") {
+    // This verifies that a empty proto like 'message A { A a = 1}' can be retained by
+    // inserting a dumy field.
+
+    val emptyProtoSchema =
+      StructType(StructField("_dummy_field_to_retain_empty_message", StringType) :: Nil)
+    val expectedSchema = StructType(
+      // DDL: "proto STRUCT<name: string, groups: map<
+      //    struct<name: string, group: map<struct<name: string>>>>>"
+      StructField("empty_proto",
+        StructType(
+          StructField("recursive_field", emptyProtoSchema) ::
+            StructField("recursive_array", ArrayType(emptyProtoSchema, containsNull = false)

Review Comment:
   Could you update the DDL comment at 1208?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46736][PROTOBUF] retain empty message field in protobuf connector [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44643:
URL: https://github.com/apache/spark/pull/44643#discussion_r1458006485


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1124,7 +1124,55 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
-  test("Corner case: empty recursive proto fields should be dropped") {
+  test("retain empty proto fields") {
+    val options = Map("recursive.fields.max.depth" -> "4", "retain.empty.message" -> "true")
+
+    // EmptyRecursiveProto at the top level. It will be an empty struct.
+    checkWithFileAndClassName("EmptyProto") {
+      case (name, descFilePathOpt) =>
+        val df = emptyBinaryDF.select(
+          from_protobuf_wrapper($"binary", name, descFilePathOpt, options).as("empty_proto")
+        )
+        // Top level empty message is retained without adding dummy column to the schema.
+        assert(df.schema == structFromDDL("empty_proto struct<>"))

Review Comment:
   Is this changed merged?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org