You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "pang-wu (via GitHub)" <gi...@apache.org> on 2023/04/06 22:01:26 UTC

[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160291713


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @viirya I don't think this is a bug, according to proto3 default value spec,
   
   > When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field. These defaults are type-specific:
   > 
   > For strings, the default value is the empty string.
   > For bytes, the default value is empty bytes.
   > For bools, the default value is false.
   > For numeric types, the default value is zero.
   > For [enums](https://protobuf.dev/programming-guides/proto3/#enum), the default value is the first defined enum value, which must be 0.
   > For message fields, the field is not set. Its exact value is language-dependent. See the [generated code guide](https://protobuf.dev/reference/) for details.
   > 
   
   What that mean is the absent optional field's value should be the default (0 for int, for example) if they are retrieved by the client -- this is very confusing (I know), it basically says there is by default (not using oneof keyword) there is no way to distinguish between absent field and fields populated with default value (like int64 field set to 0). Protobuf 3.15 somewhat bring this back, more details could be found [here](https://stackoverflow.com/questions/42622015/how-to-define-an-optional-field-in-protobuf-3)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org