Posted to reviews@spark.apache.org by "pang-wu (via GitHub)" <gi...@apache.org> on 2023/04/29 19:45:05 UTC

[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach.
   
   The goal of this PR is to give developers an option to make Spark's protobuf-to-struct deserialization comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior is one that "makes sense" (an unset/non-present field is set to null), but it does not comply with the proto3 spec (an unset/non-present field should be set to its default value).
   
   We think Spark should provide an option for developers who need the spec-compliant behavior, because it is the default in many other protobuf libraries (especially those implemented in Go); when users run Spark alongside those libraries in different stages of their infrastructure, the behavior stays consistent. I believe there are ways to make the tests pass without this change, but that would defeat the purpose.
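   To make the decision table in the diff concrete, here is a minimal, self-contained sketch of the same condition. The `Field` class and `should_materialize` function are hypothetical stand-ins (not Spark's or protobuf's actual classes) for `FieldDescriptor` and the check in `getFieldValue`:
   
   ```python
   # Hypothetical sketch of the presence check in the diff above.
   # `Field` stands in for FieldDescriptor; `set_in_message` stands in
   # for DynamicMessage.hasField(field).
   from dataclasses import dataclass
   
   @dataclass
   class Field:
       is_repeated: bool = False
       has_presence: bool = True      # e.g. proto2 fields and proto3 `optional` fields
       has_default_value: bool = False
       set_in_message: bool = False
   
   def should_materialize(field: Field, materialize_zero_values: bool) -> bool:
       """Return a value when the field is repeated, explicitly set,
       has a proto2 default, or lacks presence info and the flag is on."""
       return (field.is_repeated
               or field.set_in_message
               or field.has_default_value
               or (not field.has_presence and materialize_zero_values))
   
   # A plain proto3 scalar that was never set: null today,
   # but the zero value once the new option is enabled.
   plain_proto3 = Field(has_presence=False)
   print(should_materialize(plain_proto3, materialize_zero_values=False))  # False -> null
   print(should_materialize(plain_proto3, materialize_zero_values=True))   # True  -> 0 / "" / false
   ```
   
   Only the last clause changes with the new option, which is why it is scoped to fields without presence information: for those fields, "unset" and "set to the zero value" are indistinguishable on the wire.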



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

