You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "rangadi (via GitHub)" <gi...@apache.org> on 2023/04/27 23:34:46 UTC

[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179798280


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   I didn't quite follow the rationale for `!hasPresence() && flag`. What are couple of examples? Or you could point to where this is tested. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, we will deserialize both cases as null. However, this flag can
+  // choose to explicitly deserialize as the zero value for the type, as
+  // libraries in some other will languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(name="", age=0)`

Review Comment:
   What would be the output foir `Person(age=0)` (name is not set)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org