Posted to reviews@spark.apache.org by "justaparth (via GitHub)" <gi...@apache.org> on 2023/04/06 15:03:32 UTC

[GitHub] [spark] justaparth opened a new pull request, #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

justaparth opened a new pull request, #40686:
URL: https://github.com/apache/spark/pull/40686

   ### What changes were proposed in this pull request?
   Currently, when deserializing protobufs using `from_protobuf`, fields that are not explicitly present in the serialized message are deserialized as null in the resulting struct. (In proto3, this also includes fields that have been explicitly set to their zero value, since that is indistinguishable from an unset field in the serialized format. https://protobuf.dev/programming-guides/field_presence/)
   
   For example, given a message format like
   
   ```
   syntax = "proto3";
   
   message SearchRequest {
     string query = 1;
     int32 page_number = 2;
     int32 result_per_page = 3;
   }
   ```
   
   and an example message like
   ```
   SearchRequest(query = "", page_number = 10)
   ```
   
   the result from calling from_protobuf on the serialized form of the above message would be
   
   ```
   {"query": null, "page_number": 10, "result_per_page": null}
   ```
   
   In proto3, all fields are considered optional and have default values (https://protobuf.dev/programming-guides/proto3/#default), and reader clients in some languages (e.g. Go, Scala) will fill in that default value when reading the protobuf. It would be useful to make this configurable so that zero values can be materialized when desired.
   
   Concretely, in the example above, we might want to deserialize it instead as
   ```
   {"query": "", "page_number": 10, "result_per_page": 0}
   ```
   
   In this PR, I implement this behavior by adding an option, `materializeZeroValues`, that can be passed in the options map for `from_protobuf` and controls whether to materialize these defaults.
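   
   As a rough sketch of the intended usage (assuming the option keeps the `materializeZeroValues` name proposed here, the `from_protobuf` overload that accepts an options map, and a hypothetical descriptor file path):
   
   ```
   import scala.collection.JavaConverters._
   import org.apache.spark.sql.functions.col
   import org.apache.spark.sql.protobuf.functions.from_protobuf
   
   // `df` is assumed to hold serialized SearchRequest messages in a binary column "value";
   // "/path/to/search_request.desc" is a hypothetical descriptor file for the proto above.
   val options = Map("materializeZeroValues" -> "true").asJava
   val parsed = df.select(
     from_protobuf(col("value"), "SearchRequest", "/path/to/search_request.desc", options)
       .as("request"))
   // With the option set, unset proto3 scalars come back as "" / 0 instead of null.
   ```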
   
   
   ### Why are the changes needed?
   Additional functionality to help control the deserialization behavior.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds a new option that is accepted by `from_protobuf`.
   
   
   
   ### How was this patch tested?
   I added a test case that explicitly asserts the existing behavior, and another test case that asserts the new behavior when the option is set.
   
   I've also built the protobuf jar locally and tried it out. Please let me know if there's any more testing or information about reproducibility that would be helpful.
   




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179236784


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   Great point, let me make sure there is a test case like this.
   
   Message fields have presence information (https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis), so we won't do anything special about materializing zero values for them,
   
   i.e.
   
   ```
   message Person {
     <some fields>
   }
   
   message Example {
     Person tom = 1;
   }
   ```
   
   will get deserialized like
   ```
   from_protobuf(Example()) ==> {tom: null}
   from_protobuf(Example(Person(...))) ==> {tom: {....}}
   ```





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181122536


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   Well, I am not proposing a change of behavior, but explicitly adding a check for Messages (though it is already covered by the `hasPresence()` check). At no point do we want to serialize default messages.
   
   We should also explicitly mention messages in the documentation. Most readers don't understand what 'has presence information' means. They shouldn't have to read the detailed protobuf spec to understand what this flag does.
   
   How about this: add the message check in addition to the `!hasPresence()` check? (It's a no-op).
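   
   For illustration, a sketch of what the combined condition could look like, based on the `getFieldValue` shown in the diff above (the explicit MESSAGE guard is redundant with `!field.hasPresence`, but spells out the intent; this is an illustration, not the PR's exact code):
   
   ```
   // Hypothetical variant of the condition in getFieldValue; the JavaType.MESSAGE
   // guard is a no-op given !field.hasPresence, but makes explicit that message
   // fields are never materialized.
   if (field.isRepeated
       || record.hasField(field)
       || field.hasDefaultValue
       || (!field.hasPresence
           && field.getJavaType != FieldDescriptor.JavaType.MESSAGE
           && materializeZeroValues)) {
     record.getField(field)
   } else {
     null
   }
   ```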





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181135882


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > lit(null).as("optional_int"), 
   
   This [line at 1229](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1229) ? 
   
   I am more confused about the feature now. I thought all scalar fields like 'int' would have a value with this option set. 
   
   What is the spec we are adhering to? What is the use case we are solving? Maybe we could update the flag documentation. 
   
   Is this essentially 'materialize protobuf V3 scalar fields that are not declared optional'? 
   
   It looks like we are giving too much weight to the `optional` keyword. Logically, all fields are optional in proto3 by default. 
   
   What happens to an `optional int` in a V2 proto? Will it behave like a proto3 `int` or a proto2 `optional int`?
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179249860


##########
connector/protobuf/src/test/resources/protobuf/proto3_optional.proto:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+// To compile the descriptor:
+// cd connector/protobuf/src/test/resources/protobuf
+// protoc --include_imports --descriptor_set_out=proto3_optional.desc proto3_optional.proto
+
+syntax = "proto3";
+
+package org.apache.spark.sql.protobuf.protos;
+
+option java_outer_classname = "Proto3Optional";
+
+message TestMessage {
+  int32 bar = 1;
+  optional int32 baz = 2;

Review Comment:
   Yeah, so `optional` (added in protobuf 3.15) has field presence information, so when we read a serialized proto with an optional field we can tell whether the field was written or not. I didn't see an example of this in the existing tests, so I wanted to cover it to make sure my modifications do the right thing for a field like this. I'll combine it, as you mentioned above, into a different file.
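   
   For illustration, a small sketch of that distinction (assuming protobuf-java >= 3.15 and the hypothetical generated `Proto3Optional.TestMessage` class for the proto above):
   
   ```
   // Hypothetical check against the descriptor of the TestMessage defined above.
   val desc = Proto3Optional.TestMessage.getDescriptor
   desc.findFieldByName("bar").hasPresence // false: unset vs. zero is indistinguishable on the wire
   desc.findFieldByName("baz").hasPresence // true: `optional` makes hasField meaningful
   ```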





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181141599


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   But in any case, I've actually restructured the comment a bit; I wonder if it's clearer now. Do you want to take a look?



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos

Review Comment:
   done!



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, this library deserializes both cases as null. However, this flag
+  // can be set to explicitly deserialize them as the zero value for the type, as
+  // libraries in some other languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;

Review Comment:
   👍 





[GitHub] [spark] viirya commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160298034


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   Oh, I meant the original behavior of protobuf deserializing here. Before this PR, based on what it describes, a field set to its zero value will not be present in the serialized message, so when deserializing it, it will be given null instead of the zero value?





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160316173


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   > Hmm, is it a possible bug? If a field is explicitly set to zero value, the serialized proto message won't contain it, so when deserializing the message we will get null instead of zero?
   >  a field set with zero value will not be present in serialized message, so when deserializing it, it will be given a null instead of zero value originally?
   
   @viirya I'm not sure; in some sense it could be considered a bug, and in another sense it could be considered a convention. I think the behavior of filling in defaults makes sense because of the proto3 spec Pang linked above.
   
   But because there is no way to distinguish between "field is set to its zero value" and "field is not present" in proto3 serialized messages, deserialization libraries must choose what to do (a minimal sketch of both choices follows below), either:
   
   a) if a field is not present in the serialized value, deserialize it as null [the default behavior]
   - in case the field was not set originally, this is "correct"
   - in case the field was set to the zero value, this is "incorrect"
   
   b) if a field is not present, deserialize it to its zero value [what I tried to add in this PR]
   - in case the field was not set originally, this is "incorrect"
   - in case the field was set to the zero value, this is "correct"
   
   So either way the library chooses, one of the cases is kind of strange (i.e. you set a value and it doesn't come out, or you didn't set a value and yet a default comes out). Given this, it does feel unclear which is the right "default". It also depends on whether the original message is proto2 or proto3, I think. Maybe what we would actually want is:
   
   - for proto3 serialized messages, make zero-value materialization the default behavior [link](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis)
   - for proto2 serialized messages, check for field presence since that info is available [link](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto2-apis)
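   
   A minimal sketch of the two choices for a single non-repeated field (assuming the `DynamicMessage`/`FieldDescriptor` types this connector already uses; illustration only, not the PR's exact code):
   
   ```
   import com.google.protobuf.DynamicMessage
   import com.google.protobuf.Descriptors.FieldDescriptor
   
   // (a) default behavior: a field absent from the wire becomes null in the row.
   // (Repeated fields are ignored here for brevity; they need separate handling.)
   def deserializeAsNull(record: DynamicMessage, field: FieldDescriptor): AnyRef =
     if (record.hasField(field)) record.getField(field) else null
   
   // (b) with the flag: an absent field falls back to the type's zero value, since
   // DynamicMessage.getField returns the default/zero value for unset singular fields.
   def deserializeWithZeroValue(record: DynamicMessage, field: FieldDescriptor): AnyRef =
     record.getField(field)
   ```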
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179798772


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   @justaparth this is an important comment. What is the behavior?





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181124838


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   @rangadi with respect to message serialization, what do you think about this [test](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1196-R1239)? In it, there is a message field that is never set originally, and when deserializing it with/without the flag it is null. Is that expected / unexpected to you? 
   
   > Exclude message types from serializing explicitly (no matter what our presence policy)
   > Give a rationale for using presence information. I.e. why should `optional string a` and `string a` behave differently with this flag.
   > In fact this check does not even matter for scalar types due to hasField() earlier. So there is no need to have it.
   
   Sorry, just to clarify, are you asking for a rationale here in the comments, or would you like me to include it in the documentation so that it's a bit more clear?





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181160094


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi the discussion of the spec behavior can be found [here](https://github.com/apache/spark/pull/40686#discussion_r1160291713)





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181488452


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   Maybe it would be helpful if I frame the goal in a different way.
   
   **Other official libraries that take serialized protobufs and put them into a different format (e.g. JSON) have an option to emit defaults. This PR implements that exact same functionality, with the same semantics.**
   
   Consider the following proto
   
   ```
   syntax = "proto3";
   
   message Person {
     string name = 1;
     optional int32 age = 2;
   }
   ```
   
   ## java
   If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called `includingDefaultValueFields()`. 
   
   code:
   ```
   Person p = Person.newBuilder().setName("").build();
   
   String basic = JsonFormat.printer().print(p);
   String withDefaults = JsonFormat.printer().includingDefaultValueFields().print(p);
   
   System.out.println("basic: " + basic);
   System.out.println("withDefaults: " + withDefaults);
   ```
   
   outputs
   
   ```
   basic: {
   }
   withDefaults: {
     "name": ""
   }
   ```
   
   Notice that even with `includingDefaultValueFields` set, it doesn't fill in a value for the unset optional field age. This makes a lot of sense; because optional has field presence, the library can know _for sure_ that it was never set so hallucinating a value doesn't really make sense.
   
   ## golang
   
   ### struct generation
   If you look at how Go generates structs for protobuf types ([docs](https://protobuf.dev/reference/go/go-generated/#singular-scalar-proto3)):
   
   ```
   type Person struct {  
    <some extraneous info removed>
    Name string   
    Age  *int32 
   }
   ```
   
   It will generate a `string` for the singular case, and a `*int32` for the `optional` case. This is because optional fields have field presence, so it is possible to know when they were never set.
   
   ### jsonification
   
   Exactly as in Java, there is an `EmitDefaults` option for jsonpb: https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   ```
   a := &Person{}  
   var basicMarshaler = &jsonpb.Marshaler{}  
   var defaultsMarshaler = &jsonpb.Marshaler{  
    EmitDefaults: true,  
   }  
     
   basic, _ := basicMarshaler.MarshalToString(a)  
   withDefaults, _ := defaultsMarshaler.MarshalToString(a)  
     
   fmt.Printf("basic: %s\n", basic)  
   fmt.Printf("withDefaults: %s\n", withDefaults)
   ```
   
   outputs
   
   ```
   basic: {}
   withDefaults: {"name":"","age":null}
   ```
   
   Notice the same behavior, age is not set to 0.
   
   ## summary
   So I'm just trying to implement this same, broadly supported functionality here. Would it help if I updated the PR to mention this more explicitly?





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181711036


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   > If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called includingDefaultValueFields().
   
   I see, thanks. Could you update the documentation to say so? Let's remove the 'ambiguity' argument.





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181124838


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   @rangadi with respect to message serialization, what do you think about this [test](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1196-R1239)? In it, there is a message field that is never set originally, and when deserializing it with/without the flag it is null. Is that expected / unexpected to you? 
   
   > Exclude message types from serializing explicitly (no matter what our presence policy)
   > Give a rationale for using presence information. I.e. why should `optional string a` and `string a` behave differently with this flag.
   
   Sorry, just to clarify, are you asking for a rationale here in the comments, or would you like me to include it in the documentation so that it's a bit more clear?
   
   > In fact this check does not even matter for scalar types due to hasField() earlier. So there is no need to have it.
   
   Ah, check the other thread! I commented there on how this breaks if we don't use hasPresence: https://github.com/apache/spark/pull/40686#discussion_r1181124590





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181138783


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,43 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // If the materializeZeroValues config is set, we will materialize zero values when the field

Review Comment:
   I only added the comment because you expressed some concerns about the original implementation :(. I'll remove it.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,43 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // If the materializeZeroValues config is set, we will materialize zero values when the field

Review Comment:
   I only added the comment because you expressed some concerns about the original lack of clarity :(. I'll remove it.





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181138746


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Plus, let's just add `field.getJavaType != message` in addition to `!hasPresence()` (though it is strictly not required). Better for readability.
   
   I feel like it's a little odd to introduce code just for one specific type like this, especially given that it is redundant with the previous check. Let's just leave a comment that indicates this behavior for future readers?





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181139613


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   I would prefer to have the check. The Message type is pretty important. You have a test covering it, which is good. 
   We don't want to serialize messages even if some future Java library changes the meaning of 'hasPresence' (though it likely will not).
   
   It is ok if you add a comment that it is not strictly required.





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160318761


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   One more thing to add, @justaparth: with this PR in, will proto2 and >= protobuf 3.15 messages with the optional keyword still work as expected? So nothing will break.





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179790306


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   OK, quick update: I took another pass at this based on some comments below, and I think the code is much clearer in its intent now!





[GitHub] [spark] justaparth commented on pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1526729416

   @rangadi thanks for the comments! I've updated the PR quite a bit; namely, I've split it up into two commits:
   - commit 1: add tests that show the current state
   - commit 2: add the flag + modify the tests to show what changed
   
   and I addressed many of your comments along the way 🙏 




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181126482


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   Ah, one thing here: the if statement is quite subtle as it's written, which I think is contributing to the confusion :(. Let me update the PR to make a small change here that I think will make this clearer.





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181128972


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   > Give a rationale for using presence information. I.e. why should `optional string a` and `string a` behave differently with this flag.
   
   In response to this, I've updated the comment for this piece of logic! I think it should be clearer now.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136226


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Is this essentially materialize Protobuf V3 scalar fields that are not declared optional?
   
   
   If this is the case, let's update the documentation to clearly say this. Better not to say anything about 'presence information'. We shouldn't require users to go hunting down Protobuf documentation to understand it.
   
   Plus, let's just add `field.getJavaType != message` in addition to `!hasPresence()` (though it is strictly not required, it is better for readability).
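
   (Editorial note: concretely, the suggestion above would amount to something like the following sketch over protobuf-java's API; the helper name is made up and the extra guard is, as noted, strictly redundant with `!hasPresence`.)

   ```
   import com.google.protobuf.Descriptors.FieldDescriptor
   import com.google.protobuf.Descriptors.FieldDescriptor.JavaType
   import com.google.protobuf.DynamicMessage

   // Same clauses as the PR's condition, plus an explicit MESSAGE guard for readability.
   // Clause order matters: hasField() cannot be called on repeated fields.
   def shouldWriteField(
       record: DynamicMessage,
       field: FieldDescriptor,
       materializeZeroValues: Boolean): Boolean = {
     field.isRepeated ||
       record.hasField(field) ||
       field.hasDefaultValue ||
       (!field.hasPresence &&
         field.getJavaType != JavaType.MESSAGE &&
         materializeZeroValues)
   }
   ```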





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181248194


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   > unset field set as 0/default value in this struct
   
   That is true for fields declared optional too, but we are leaving them as null when unset. This is the confusion I have. 
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181488452


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   Maybe it would be helpful if i frame the goal in a different way.
   
   **Other official libraries that take protobufs and put them into a different format (e.g. Json) have the option to emit defaults. This PR is implementing that exact same functionality, with the same semantics.**
   
   Consider the following proto
   
   ```
   syntax = "proto3";
   
   message Person {
     string name = 1;
     optional int32 age = 2;
   }
   ```
   
   ## java
   If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called `includingDefaultValueFields()`. 
   
   code:
   ```
   Person p = Person.newBuilder().build();
   
   String basic = JsonFormat.printer().print(p);
   String withDefaults = JsonFormat.printer().includingDefaultValueFields().print(p);
   
   System.out.println("basic: " + basic);
   System.out.println("withDefaults: " + withDefaults);
   ```
   
   outputs
   
   ```
   basic: {
   }
   withDefaults: {
     "name": ""
   }
   ```
   
   Notice that even with `includingDefaultValueFields` set, it doesn't fill in a value for the unset optional field age. This makes a lot of sense; because optional has field presence, the library can know _for sure_ that it was never set so hallucinating a value doesn't really make sense.
   
   ## golang
   
   ### struct generation
   If you look at how go generates structs for protobuf types https://protobuf.dev/reference/go/go-generated/#singular-scalar-proto3:
   
   ```
   type Person struct {  
    <some extraneous info removed>
    Name string   
    Age  *int32 
   }
   ```
   
   It will generate a `string` for the `singular` case, and a `*int32` for the `optional` case. This is because optional fields have field presence, so it is possible to know when they were never set.
   
   ### jsonification
   
   Exactly like in java, there is an `EmitDefaults` option for jsonpb: https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   ```
   a := &Person{}  
   var basicMarshaler = &jsonpb.Marshaler{}  
   var defaultsMarshaler = &jsonpb.Marshaler{  
    EmitDefaults: true,  
   }  
     
   basic, _ := basicMarshaler.MarshalToString(a)  
   withDefaults, _ := defaultsMarshaler.MarshalToString(a)  
     
   fmt.Printf("basic: %s\n", basic)  
   fmt.Printf("withDefaults: %s\n", withDefaults)
   ```
   
   outputs
   
   ```
   basic: {}
   withDefaults: {"name":"","age":null}
   ```
   
   Notice the same behavior, age is not set to 0.
   
   ## summary
   So, i'm just trying to implement this same broadly supported functionality here. Would it help if I update the PR to mention this more explicitly?
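
   (Editorial note: for readers coming from the Spark side, this is roughly how the option discussed in this PR would be passed to `from_protobuf`, sketched against the proto3 `Person` example above. The option key is the one proposed here and may not match what was eventually merged; `df`, the column name and the descriptor path are placeholders.)

   ```
   import org.apache.spark.sql.functions.col
   import org.apache.spark.sql.protobuf.functions.from_protobuf

   // `df` is assumed to have a binary column "value" holding serialized Person messages,
   // and "/path/to/person.desc" a descriptor file generated from the proto above.
   val options = java.util.Collections.singletonMap("materializeZeroValues", "true")

   val parsed = df.select(
     from_protobuf(col("value"), "Person", "/path/to/person.desc", options).as("person"))

   // For an input serialized from Person(age = 20), the struct would come back as
   // {"name": "", "age": 20} with the flag, versus {"name": null, "age": 20} without it.
   ```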





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160316173


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   > Hmm, is it a possible bug? If a field is explicitly set to zero value, the serialized proto message won't contain it, so when deserializing the message we will get null instead of zero?
   >  a field set with zero value will not be present in serialized message, so when deserializing it, it will be given a null instead of zero value originally?
   
   @viirya i'm not sure; in some sense it could be considered a bug and in another sense it could be considered a convention. I think the behavior of filling in defaults makes sense because of the proto3 spec Pang linked above.
   
   But because there is no way to distinguish between "field is set to zero value" and "field is not present" in proto3 serialized messages, deserialization libraries must choose what to do, either
   
   a) if a field is not present in serialized value, deserialize as null [the default behavior]
   - in case the field was not set originally, this is "correct"
   - in case the field was set to the zero value this is "incorrect"
   
   b) if a field is not present, deserialize to its zero value [what i tried to add in this PR]
   - in case the field was not set originally, this is "incorrect"
   - in case the field was set to the zero value, this is "correct"
   
   So either way the library chooses, one of the cases is kind of strange (i.e. you set a value and it doesn't come out, or you didn't set a value and yet a default comes out). Given this, it does feel like it's not clear which is the right "default". It also depends on whether the original message is proto2 or proto3, i think. Maybe what we would actually want is:
   
   - for proto3 this zero-value materialization to be the default behavior
   - for proto2 we'd want to look for field presence
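
   (Editorial note: the ambiguity described above is easy to reproduce. A minimal sketch, assuming a generated class for the proto3 message `message Person { string name = 1; int64 age = 2; }` used as an example elsewhere in this thread.)

   ```
   // Proto3 scalars without `optional` are omitted from the wire format whenever they
   // hold their zero value, so "never set" and "explicitly set to zero" are identical
   // once serialized.
   val unset = Person.newBuilder().build()
   val setToZero = Person.newBuilder().setName("").setAge(0L).build()

   // Both serialize to the same (empty) byte array, which is why a deserializer has to
   // pick a convention: null (the current default) or the zero value (this PR's option).
   assert(java.util.Arrays.equals(unset.toByteArray, setToZero.toByteArray))
   ```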
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1161754527


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @pang-wu thanks for the callout, i'll double check that we're testing those cases and if not add a new one.
   
   In the meantime, i've refactored the code here slightly. I _think_ it might be simpler/correct to define the behavior through the lens of [field presence](https://protobuf.dev/programming-guides/field_presence/), i.e.
   - if a field has presence info, we will deserialize it if present and otherwise fill in `null`
   - if a field **doesn't** have presence info, we only deserialize it if it is explicitly set, or is proto2 with an explicit default.
   
   And then, on top of that logic (which mirrored the logic as-it-was), i've added the `materializeZeroValues` option which will materialize the default if a field does not have presence information available.
   
   Does this make more sense?
   
   I think now it's easier to see how the current code is treating fields without presence info, i.e. it's choosing to deserialize to null instead of choosing a default. If we think that's the right behavior going forward, maybe we could consider changing the default behavior of the code (though i'm afraid to break any existing applications that may use this). 
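
   (Editorial note: a small sketch of how "presence info" can be inspected with protobuf-java, assuming a generated `Person` class with fields `string name; int64 age; optional string middle_name; optional int64 salary;` like the example later in this thread; `Person.getDescriptor` comes from the generated code.)

   ```
   import scala.collection.JavaConverters._

   // Plain proto3 scalars (name, age) report hasPresence = false, while `optional`
   // fields (middle_name, salary) report true -- the split the refactored logic keys off.
   Person.getDescriptor.getFields.asScala.foreach { f =>
     println(s"${f.getName}: hasPresence=${f.hasPresence}")
   }
   ```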





[GitHub] [spark] justaparth commented on pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1523054485

   adding @rangadi as well, as they seem to have been a recent committer on a lot of the protobuf parsing logic!




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179011020


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   I've done a quick update to this PR! I've added some additional tests and made the code more clear as to how field presence is the main factor to consider when deserializing





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179798280


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   I didn't quite follow the rationale for `!hasPresence() && flag`. What are a couple of examples? Or you could point to where this is tested. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, we will deserialize both cases as null. However, this flag can
+  // choose to explicitly deserialize as the zero value for the type, as
+  // libraries in some other languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(name="", age=0)`

Review Comment:
   What would be the output foir `Person(age=0)` (name is not set)





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181193524


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library
+  // deserializes this situation as `null`, but with this flag will deserialize them
+  // as the type-specific default value.
+  //
+  // Note that this won't affect fields with the optional keyword in proto3, or
+  // any fields in proto2, as they don't have the ambiguity described above because
+  // they have presence information.
+  //
+  // As an example:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  //   optional string middle_name = 3;
+  //   optional int64 salary = 4;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(age=20, middle_name="smith")

Review Comment:
   yeahh, thanks for catching this





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach. 
   
   The goal of this PR is to give an option for developers who want Spark's protobuf-to-struct deserialization behavior to comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior is something that "makes sense" (where unset/non-presence fields are set to null) but does not comply with proto3's spec (unset/non-presence fields should be set to their default value).
   
   We believe that Spark should provide an option for developers who need the spec behavior, because this is the default behavior for many other protobuf libraries (especially those implemented in Go), so when users use Spark in conjunction with those libraries in different stages of their infra the behavior is consistent -- yes, there are ways to make the test pass, but that defeats the purpose.





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181159725


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   You are right that this feature does not solve the ambiguity, but rather gives developers a chance to bring the behavior in line with the proto3 spec (unset fields are set to the 0/default value in the struct), as I mentioned [before](https://github.com/apache/spark/pull/40686#discussion_r1181120470). 
   
   As long as the message is not created using proto2 syntax or with the optional keyword supported since protobuf 3.15, there is no way the ambiguity can be resolved. 
   
   Maybe we can be explicit about that in the comment, I will work with @justaparth on that. 





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181160094


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi the discussion of the spec behavior could be found [here](https://github.com/apache/spark/pull/40686#discussion_r1160291713). The reference is from here: https://protobuf.dev/programming-guides/proto3/#default





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181490002


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   (also in case it wasn't clear, i actually ran the code samples above to verify the output)





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160316173


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   > Hmm, is it a possible bug? If a field is explicitly set to zero value, the serialized proto message won't contain it, so when deserializing the message we will get null instead of zero?
   >  a field set with zero value will not be present in serialized message, so when deserializing it, it will be given a null instead of zero value originally?
   
   @viirya i'm not sure; in some sense it could be considered a bug and in another sense it could be considered a convention. I think the behavior of filling in defaults makes sense because of the proto3 spec Pang linked above.
   
   But because there is no way to distinguish between "field is set to zero value" and "field is not present" in proto3 serialized messages, deserialization libraries must choose what to do, either
   
   a) if a field is not present in serialized value, deserialize as null [the default behavior in spark-protobuf]
   - in case the field was not set originally, this is "correct"
   - in case the field was set to the zero value this is "incorrect"
   
   b) if a field is not present, deserialize to its zero value [what i tried to add in this PR]
   - in case the field was not set originally, this is "incorrect"
   - in case the field was set to the zero value, this is "correct"
   
   So either way the library chooses, one of the cases is kind of strange (i.e. you set a value and it doesn't come out, or you didn't set a value and yet a default comes out). Given this, it does feel like it's not clear which is the right "default". It also depends on whether the original message is proto2 or proto3, i think. Maybe what we would actually want is:
   
   - for proto3 serialized messages, zero-value materialization is the default behavior since [field presence](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis) is not available
   - for proto2 serialized messages, check for field presence since that info is [available](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto2-apis)
   
   ?





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181124590


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > I.e. if we replace !field.hasPresence with field.getJavaType != MESSAGE, I think the tests pass.
   
   actually, the tests, as written, will not pass. specifically, if we change `!field.hasPresence` to `field.getJavaType != message` then proto3 optional fields that were never set on the initial proto will get deserialized as their default value. the test [here](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1196-R1239) checks that.
   
   just to make sure, i did try this locally and verified the tests failed in the way described above.
   
   > We should also explicitly mention in the documentation about messages. Most readers don't understand what 'has presence information' means. They shouldn't have to read the detailed protobuf spec to understand what this flag does.
   How about this: add the message check in addition to the !hasPresence() check? (It's a no-op).
   
   ah, if documentation / clarity is the issue i'm more than happy to write up some more comments or clear up the documentation!
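
   (Editorial note: the difference being described can be seen by comparing the two predicates directly; a sketch over protobuf-java's `FieldDescriptor`, not code from the PR.)

   ```
   import com.google.protobuf.Descriptors.FieldDescriptor
   import com.google.protobuf.Descriptors.FieldDescriptor.JavaType

   // Predicate used in the PR: only fields *without* presence information (plain proto3
   // scalars) are eligible for zero-value materialization.
   def eligibleByPresence(field: FieldDescriptor): Boolean = !field.hasPresence

   // Suggested alternative: any non-message field would be eligible, including proto3
   // `optional` scalars that were deliberately left unset -- which is the behavior
   // difference the linked test asserts.
   def eligibleByJavaType(field: FieldDescriptor): Boolean =
     field.getJavaType != JavaType.MESSAGE
   ```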
   





[GitHub] [spark] justaparth commented on pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1528873292

   Thanks a bunch @pang-wu and @rangadi for the thoughtful discussion, really appreciate all the feedback and input 🙏 




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181141585


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Message type is pretty important.
   > You have a test covering it, that is good.
   
   aren't all the types important? by this logic we should have a callout for every single type. i agree message is important and "not a scalar" so i thought a comment + test suite that explicitly asserts the behavior would help us prevent any problems
   
   > We don't want to serialize messages even if some future java library changes meaning of 'hasPresence'
   
   I don't think this would be possible, right? Because that would mean protobuf changes the semantics of their protocol, which doesn't seem possible (it would probably break proto parsing).
   
   
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136578


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos

Review Comment:
   Looks like the policy is `materialize Protobuf V3 scalar fields that are not explicitly declared optional`. Let's make it clear and simplify this. We shouldn't have to read any external documentation (we can still list the link at the bottom after the illustrative example). 
   
   Also update the example to include two optional fields and illustrate the difference between optional and regular fields. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136578


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos

Review Comment:
   Looks like the policy is `materialize Protobuf V3 scalar fields that are not explicitly declared optional`. Let's make it clear and simplify this. We shouldn't have to read anything more. 
   
   Also update the example to include two optional fields and illustrate the difference between optional and regular fields. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181155237


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > aren't all the types important? by this logic we should have a callout for every single type.
   
   What? You want to exclude int etc.? Message is different because it can contain other messages and easily blow up the struct. Remember you could not explain whether Messages would be serialized or not in an earlier comment? Why? Because it was not easy to see why they would or would not be included. You tested it, but didn't have an explanation. 
   
   > The goal of this PR is to give an option for developer who want Spark's protobuf to struct deserialization behavior comply to proto3 specs. 
   
   @pang-wu what spec is this? It will still have null for fields. I am ok with the feature, but I don't understand the motivation.
   We are not providing any extra information in the Spark struct.
   Could you give an example of a problem this lets you solve? 
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1180874763


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   I.e. if we replace `!field.hasPresence`  with `field.getJavaType != MESSAGE`, I think the tests pass. 





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach. 
   
   The goal of this PR is to give an option for developers who want Spark's protobuf-to-struct deserialization behavior to comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior is something that "makes sense" (where unset/non-presence fields are set to null) but does not comply with proto3's spec (unset/non-presence fields should be set to their default value).
   
   We believe that Spark should provide an option for developers who need the spec behavior, because this is the default implementation for many other protobuf libraries (especially those implemented in Go), so when users use Spark in conjunction with those libraries in different stages of their infra the behavior is consistent -- yes, there are ways to make the test pass, but that defeats the purpose.





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181159725


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   You are right that this feature does not solve the ambiguity, but rather brings the behavior in line with the proto3 spec (unset fields are set to the 0/default value in the struct). 
   
   As long as the message is not created using proto2 syntax or with the optional keyword supported since protobuf 3.15, there is no way the ambiguity can be resolved. 
   
   Maybe we can be explicit about that in the comment, I will work with @justaparth on that. 





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160291713


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @viirya I don't think this is a bug, according to proto3 default value spec,
   
   > When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field. These defaults are type-specific:
   > 
   > For strings, the default value is the empty string.
   > For bytes, the default value is empty bytes.
   > For bools, the default value is false.
   > For numeric types, the default value is zero.
   > For [enums](https://protobuf.dev/programming-guides/proto3/#enum), the default value is the first defined enum value, which must be 0.
   > For message fields, the field is not set. Its exact value is language-dependent. See the [generated code guide](https://protobuf.dev/reference/) for details.
   > 
   
   What that means is that the value of an absent optional field in a proto3 message should be the default instead of null (0 for an int, for example) when retrieved by the client -- this is very confusing (I know); it basically says that by default (without using the oneof keyword) there is no way to distinguish between an absent field and a field populated with the default value (like an int64 field set to 0). Protobuf 3.15 somewhat brings this back; more details can be found [here](https://stackoverflow.com/questions/42622015/how-to-define-an-optional-field-in-protobuf-3)





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1161754527


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @pang-wu thanks for the callout, I'll double-check that we're testing those cases and, if not, add a new one.
   
   In the meantime, I've refactored the code here slightly. I _think_ it might be simpler/more correct to define the behavior through the lens of [field presence](https://protobuf.dev/programming-guides/field_presence/), i.e.
   - if a field has presence info, we deserialize it if present and otherwise fill in `null`
   - if a field **doesn't** have presence info, we only deserialize it if it is explicitly set, or if it is proto2 with an explicit default.
   
   And then, on top of that logic (which mirrors the logic as it was), I've added the `materializeZeroValues` option, which will materialize the default if a field does not have presence information available.
   
   Does this make more sense?
   
   I also think it's easier now to see how the current code treats fields without presence info, i.e. it chooses to deserialize to null instead of the default. If we think that's the right behavior going forward, maybe we could consider changing the default behavior of the code too (though I'm afraid of breaking any existing applications that may rely on this). 





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179803443


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   One example is this test: `test("test unset values - proto3") {` on line 1196 of `ProtobufFunctionsSuite.scala` (sorry, I'm not sure how to link to it).
   
   In that test, there are some proto3 `optional` fields, like `optional_int`, which are not set to any value. In this case, even with materialize zero values set, we don't want to deserialize them as 0. This is because proto3 `optional` fields have field presence, so there is no ambiguity between "unset" and "set to zero value", and it would actually be _incorrect_ of us to set them to 0.
   
   Basically, the materialize zero values flag is only intended to choose the behavior in cases where "unset" and "set to zero value" are not distinguishable. The default behavior is to deserialize both to null; with the flag, both are deserialized as the zero value.
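   
   To make the cases above concrete, here is a small helper over the protobuf-java descriptor API (illustrative only, not code from this PR; `record` is a parsed `DynamicMessage` and `field` one of its singular, non-repeated `FieldDescriptor`s):
   
   ```scala
   import com.google.protobuf.Descriptors.FieldDescriptor
   import com.google.protobuf.DynamicMessage
   
   // Sketch of the decision described above for a singular field.
   def describeField(record: DynamicMessage, field: FieldDescriptor): String = {
     if (record.hasField(field)) {
       "explicitly present on the wire -> always deserialized"
     } else if (field.hasPresence) {
       // e.g. proto3 `optional` scalars and message fields; proto2 explicit
       // defaults are handled separately via hasDefaultValue in the deserializer.
       "absent, but the field tracks presence -> known to be unset -> null"
     } else {
       "absent, no presence info (plain proto3 scalar) -> ambiguous -> the flag decides"
     }
   }
   ```
   
   Only the last branch is affected by the flag; `optional_int` in the test above falls into the second branch, which is why it stays null even with the flag enabled.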





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181124590


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > I.e. if we replace !field.hasPresence with field.getJavaType != MESSAGE, I think the tests pass.
   
   Actually, the tests as written will not pass. Specifically, if we change `!field.hasPresence` to `field.getJavaType != MESSAGE`, then proto3 `optional` fields that were never set on the initial proto will get deserialized as the default value for their type. The test that verifies this behavior is [here](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1196-R1239).
   
   Just to make sure, I did try this locally and verified the tests failed in the way described above.
   
   > We should also explicitly mention in the documentation about messages. Most readers don't understand what 'has presence information' means. They shouldn't have to read the detailed protobuf spec to understand what this flag does.
   > How about this: add the message check in addition to the !hasPresence() check? (It's a no-op.)
   
   Ah, if documentation/clarity is the issue, I'm more than happy to write up some more comments or clear up the documentation!
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181155237


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   
   > aren't all the types important? by this logic we should have a callout for every single type.
   
   What? You want to exclude int etc.? Message is different because it can contain other messages and easily blow up the struct. Remember you could not explain whether Messages would be serialized or not in an earlier comment? Why? Because it was not easy to see why they would or would not be included. You tested it, but didn't have an explanation. 
   
   > The goal of this PR is to give developers an option to make Spark's protobuf-to-struct deserialization behavior comply with the proto3 spec. 
   
   @pang-wu what spec is this? It will still have null for fields. I am ok with the feature; I just don't understand the motivation.
   We are not providing any extra information in the Spark struct.
   Could you give an example of a problem this lets you solve? 
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1180843562


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   And in addition, to respond to your comment about why we need hasPresence if we already have hasField, here: https://github.com/apache/spark/pull/40686#discussion_r1179432031
   
   On line 317 we know that `record.hasField(field)` is false, indicating that the field has no value in the serialized proto. However, what we don't know is _why_ it is false. 
   
   If `field.hasPresence()` is true, then we know for sure that the field was never set (i.e. optional proto3, proto2, etc.), and it would be incorrect for us to materialize the zero value even if the flag is set.
   
   If `field.hasPresence()` is false, then it's the situation where "not set" and "set to zero value" are not distinguishable, and we check the flag to see if we should materialize the zero value.





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181488452


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   Maybe it would be helpful if I frame the goal in a different way.
   
   **Other official libraries that take serialized protobufs and put them into a different format (e.g. JSON) have the option to emit defaults. This PR is implementing that exact same functionality, with the same semantics.**
   
   Consider the following proto
   
   ```
   syntax = "proto3";
   
   message Person {
     string name = 1;
     optional int32 age = 2;
   }
   ```
   
   ## java
   If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called `includingDefaultValueFields()`. 
   
   code:
   ```
   Person p = Person.newBuilder().build();
   
   String basic = JsonFormat.printer().print(p);
   String withDefaults = JsonFormat.printer().includingDefaultValueFields().print(p);
   
   System.out.println("basic: " + basic);
   System.out.println("withDefaults: " + withDefaults);
   ```
   
   outputs
   
   ```
   basic: {
   }
   withDefaults: {
     "name": ""
   }
   ```
   
   Notice that even with `includingDefaultValueFields` set, it doesn't fill in a value for the unset `optional` field `age`. This makes a lot of sense: because `optional` fields have field presence, the library knows _for sure_ that the field was never set, so hallucinating a value doesn't really make sense.
   
   ## golang
   
   ### struct generation
   If you look at how Go generates structs for protobuf types ([docs](https://protobuf.dev/reference/go/go-generated/#singular-scalar-proto3)):
   
   ```
   type Person struct {  
    <some extraneous info removed>
    Name string   
    Age  *int32 
   }
   ```
   
   It will generate a `string` for the singular case, and a `*int32` for the `optional` case. This is because `optional` fields have field presence, so it is possible to know when they were never set.
   
   ### jsonification
   
   Exactly like in java, there is an `EmitDefaults` option for jsonpb: https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   ```
   a := &Person{}  
   var basicMarshaler = &jsonpb.Marshaler{}  
   var defaultsMarshaler = &jsonpb.Marshaler{  
    EmitDefaults: true,  
   }  
     
   basic, _ := basicMarshaler.MarshalToString(a)  
   withDefaults, _ := defaultsMarshaler.MarshalToString(a)  
     
   fmt.Printf("basic: %s\n", basic)  
   fmt.Printf("withDefaults: %s\n", withDefaults)
   ```
   
   outputs
   
   ```
   basic: {}
   withDefaults: {"name":"","age":null}
   ```
   
   Notice the same behavior, age is not set to 0.
   
   ## summary
   So, I'm just trying to implement this same broadly supported functionality here. Would it help if I update the PR to mention this more explicitly?





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136862


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, this library deserializes both cases as null. However, this flag can
+  // choose to explicitly deserialize them as the zero value for the type, as
+  // libraries in some other languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;

Review Comment:
   Update this to add two more fields:
    ```
       optional string middle_name = 3;
       optional int32 id = 4;
   ```
   
   And update the setter to `Person(name="", id=0)` and list the resulting value with and without the flag. I think that covers all the cases. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136862


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, this library deserializes both cases as null. However, this flag can
+  // choose to explicitly deserialize them as the zero value for the type, as
+  // libraries in some other languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;

Review Comment:
   Update this to add two more fields:
    ```
       optional string middle_name = 3;
       optional int32 weight = 4;
   ```
   
   And update the setter to `Person(name="", weight=0)` and list the resulting value with and without the flag. I think that covers all the cases. 
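   
   For what it's worth, a sketch of the resulting values for `Person(name="", weight=0)` under the semantics discussed in this thread, written as plain Scala maps purely for illustration (this is not the committed documentation text):
   
   ```scala
   // name:        proto3 scalar set to "", absent on the wire, no presence info -> flag decides
   // age:         proto3 scalar never set, no presence info                     -> flag decides
   // middle_name: `optional`, never set, tracks presence                        -> null either way
   // weight:      `optional`, explicitly set to 0, present on the wire          -> 0 either way
   val withoutFlag = Map("name" -> null, "age" -> null, "middle_name" -> null, "weight" -> 0)
   val withFlag    = Map("name" -> "",   "age" -> 0,    "middle_name" -> null, "weight" -> 0)
   ```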





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181194005


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,36 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // See the docs for materialize.default.values in [[ProtobufOptions]].
+  // Decides whether we should get the zero values. We do this when
+  // the field is not present in the proto, and there's ambiguity about
+  // whether it was never set or if it was set to its zero value (i.e.
+  // the field lacks field presence information). This is basically
+  // proto3 scalar values only.
+  private def shouldGetZeroValue(record: DynamicMessage, field: FieldDescriptor): Boolean = {
+    !record.hasField(field) && !field.hasPresence && this.materializeZeroValues

Review Comment:
   okay moved it back





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179805430


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   Ah, I figured out how to link to changes: 
   
   I think this behavior is shown in this test: https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1123-R1194
   
   If you look at this test, you'll see that we [don't set a value](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1126-R1137) for the `message` field, and the deserialized [result](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1148) is null for that field, [regardless](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1168) of the flag.
   
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179422692


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Example(s="", i=0)`
+  //
+  // The result after calling from_protobuf without this flag set would be:
+  // `{"s": null, "i": null}`
+  //
+  // To explicitly materialize that default zero value, as readers in some other languages
+  // will do, this flag can be set to get a deserialized result like:
+  // `{"s": "", "i": 0}`
+  val materializeZeroValues: Boolean =
+    parameters.getOrElse(ProtobufOptions.materializeZeroValues, false.toString).toBoolean
 }
 
 private[sql] object ProtobufOptions {
+  val materializeZeroValues = "materializeZeroValues"

Review Comment:
   Also, 'materialize' might be a bit confusing (since it often refers to materializing intermediate DataFrames/RDDs in a Spark context). I don't have an alternate name. 
   We could go with `materialize.default.values` for now. I don't think we need `presenceless` in it.
   The goal is that the documentation should be detailed enough that users understand the behaviour. Naming can be fairly simple.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1183097908


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to render fields with zero values when deserializing protobufs to a spark struct.

Review Comment:
   Fix capitalization throughout this. E.g. `Spark` instead of `spark`, `Protobuf` instead of `protobuf`.  



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to render fields with zero values when deserializing protobufs to a spark struct.
+  // When a field is empty in the serialized protobuf, this library will deserialize them as
+  // null by default. However, this flag can control whether to render the type-specific zero value.
+  // This operates similarly to `includingDefaultValues` in java's jsonformat, or `emitDefaults` in

Review Comment:
   nit: protobuf-java-util's rather than java's? It is not Java's own library. 
   Also, `JsonFormat` (with a capital letter). 
   
   Alternatively, we could say `com.google.protobuf.util.JsonFormat` and avoid saying `java's jsonformat`. 





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160291713


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @viirya I don't think this is a bug, according to proto3 default value spec,
   
   > When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field. These defaults are type-specific:
   > 
   > For strings, the default value is the empty string.
   > For bytes, the default value is empty bytes.
   > For bools, the default value is false.
   > For numeric types, the default value is zero.
   > For [enums](https://protobuf.dev/programming-guides/proto3/#enum), the default value is the first defined enum value, which must be 0.
   > For message fields, the field is not set. Its exact value is language-dependent. See the [generated code guide](https://protobuf.dev/reference/) for details.
   > 
   
   What that means is that the value of an absent optional field in a proto3 message should be the default instead of null (0 for an int, for example) when retrieved by the client -- this is very confusing (I know); it basically says that by default (without using the oneof keyword) there is no way to distinguish between an absent field and a field populated with the default value (like an int64 field set to 0). Protobuf 3.15 somewhat brings this back; more details can be found [here](https://stackoverflow.com/questions/42622015/how-to-define-an-optional-field-in-protobuf-3)





[GitHub] [spark] dbtsai commented on pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "dbtsai (via GitHub)" <gi...@apache.org>.
dbtsai commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1499428796

   cc @dongjoon-hyun @huaxingao @viirya to take a look. Thanks for the contribution.




[GitHub] [spark] justaparth commented on pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1532749008

   Thanks @rangadi!
   
   cc @viirya or @gengliangwang, the tests have passed! Do you mind taking a look and helping us merge this? Thanks a bunch!




[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160291713


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   @viirya I don't think this is a bug, according to proto3 default value spec,
   
   > When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field. These defaults are type-specific:
   > 
   > For strings, the default value is the empty string.
   > For bytes, the default value is empty bytes.
   > For bools, the default value is false.
   > For numeric types, the default value is zero.
   > For [enums](https://protobuf.dev/programming-guides/proto3/#enum), the default value is the first defined enum value, which must be 0.
   > For message fields, the field is not set. Its exact value is language-dependent. See the [generated code guide](https://protobuf.dev/reference/) for details.
   > 
   
   What that means is that the value of an absent optional field should be the default (0 for an int, for example) when retrieved by the client -- this is very confusing (I know); it basically says that by default (without using the oneof keyword) there is no way to distinguish between an absent field and a field populated with the default value (like an int64 field set to 0). Protobuf 3.15 somewhat brings this back; more details can be found [here](https://stackoverflow.com/questions/42622015/how-to-define-an-optional-field-in-protobuf-3)





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160316173


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   > Hmm, is it a possible bug? If a field is explicitly set to zero value, the serialized proto message won't contain it, so when deserializing the message we will get null instead of zero?
   >  a field set with zero value will not be present in serialized message, so when deserializing it, it will be given a null instead of zero value originally?
   
   @viirya I'm not sure; in some sense it could be considered a bug, and in another sense it could be considered a convention. I think the behavior of filling in defaults makes sense because of the proto3 spec Pang linked above.
   
   But because there is no way to distinguish between "field is set to zero value" and "field is not present" in proto3 serialized messages, deserialization libraries must choose what to do, either
   
   a) if a field is not present in the serialized value, deserialize it as null [the default behavior in spark-protobuf]
   - in case the field was not set originally, this is "correct"
   - in case the field was set to the zero value this is "incorrect"
   
   b) if a field is not present, deserialize to its zero value [what i tried to add in this PR]
   - in case the field was not set originally, this is "incorrect"
   - in case the field was set to the zero value, this is "correct"
   
   So either way the library chooses, one of the cases is kind of strange (i.e. you set a value and it doesn't come out, or you didn't set a value and yet a default comes out). Given this, it does feel like it's not clear which is the right "default". It also depends on whether the original message is proto2 or proto3, i think. Maybe what we would actually want is:
   
   - for proto3 serialized messages, zero-value materialization is the default behavior since [field presence](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis) is not available
   - for proto2 serialized messages, check for field presence since that info is [available](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto2-apis)
   
   ?





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach. 
   
   The goal of this PR is to give developers an option to make Spark's protobuf-to-struct deserialization behavior comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior "makes sense" (unset/non-present fields are set to null) but does not comply with proto3's spec (unset/non-present fields should be set to their default values).
   
   We believe Spark should provide an option for developers who need the spec behavior, because this is the default behavior of many other protobuf libraries (especially those implemented in Go), so that when users use Spark in conjunction with those libraries in different stages of their infra the behavior is consistent -- yes, there are ways to make the test pass, but that defeats the purpose.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181122536


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   Well, I am not proposing a change of behavior, but explicitly adding a check for Messages (though it is already covered by the `hasPresence()` check). At no point do we want to serialize default messages. 
   
   We should also explicitly mention messages in the documentation. Most readers don't understand what 'has presence information' means. They shouldn't have to read the detailed protobuf spec to understand what this flag does. 
   
   How about this: add the message check in addition to the `!hasPresence()` check? (It's a no-op.) 
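   
   A sketch of what that combined check could look like (illustrative; behaviorally identical to `!field.hasPresence` on its own, since message fields always track presence, but it spells out the message exclusion for readers):
   
   ```scala
   import com.google.protobuf.Descriptors.FieldDescriptor
   
   // `materializeZeroValues` is the option added in this PR; the message-type check is
   // redundant with `hasPresence` and only makes the intent explicit.
   def materializeZeroFor(field: FieldDescriptor, materializeZeroValues: Boolean): Boolean =
     materializeZeroValues &&
       !field.hasPresence &&
       field.getJavaType != FieldDescriptor.JavaType.MESSAGE
   ```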





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181123353


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   In addition, note that this applies to proto2 as well, not just proto3. This is correct. 
   
   If we needed to go strictly by the "spec", we would need to materialize message types too, since protobuf values are never null. 





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181488452


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   Maybe it would be helpful if I frame the goal in a different way.
   
   **Other official libraries that take serialized protobufs and put them into a different format (e.g. JSON) have the option to emit defaults. This PR implements that exact same functionality, with the same semantics.**
   
   Consider the following proto
   
   ```
   syntax = "proto3";
   
   message Person {
     string name = 1;
     optional int32 age = 2;
   }
   ```
   
   ## java
   If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called `includingDefaultValueFields()`. 
   
   code:
   ```
   Person p = Person.newBuilder().setName("").build();
   
   String basic = JsonFormat.printer().print(p);
   String withDefaults = JsonFormat.printer().includingDefaultValueFields().print(p);
   
   System.out.println("basic: " + basic);
   System.out.println("withDefaults: " + withDefaults);
   ```
   
   outputs
   
   ```
   basic: {
   }
   withDefaults: {
     "name": ""
   }
   ```
   
   Notice that even with `includingDefaultValueFields` set, it doesn't fill in a value for the unset optional field age. This makes a lot of sense; because optional has field presence, the library can know _for sure_ that it was never set so hallucinating a value doesn't really make sense.
   
   ## golang
   
   ### struct generation
   If you look at how Go generates structs for protobuf types ([docs](https://protobuf.dev/reference/go/go-generated/#singular-scalar-proto3)):
   
   ```
   type Person struct {  
    <some extraneous info removed>
    Name string   
    Age  *int32 
   }
   ```
   
   It will generate a `string` for the `singular` case, and a `*int32` for the `optional` case. This is because optional fields have field presence, so it is possible to know when they were never set.
   
   ### jsonification
   
   Exactly like in java, there is an `EmitDefaults` option for jsonpb: https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   ```
   a := &Person{}  
   var basicMarshaler = &jsonpb.Marshaler{}  
   var defaultsMarshaler = &jsonpb.Marshaler{  
    EmitDefaults: true,  
   }  
     
   basic, _ := basicMarshaler.MarshalToString(a)  
   withDefaults, _ := defaultsMarshaler.MarshalToString(a)  
     
   fmt.Printf("basic: %s\n", basic)  
   fmt.Printf("withDefaults: %s\n", withDefaults)
   ```
   
   outputs
   
   ```
   basic: {}
   withDefaults: {"name":"","age":null}
   ```
   
   Notice the same behavior: age is not set to 0.
   
   ## summary
   So, I'm just trying to implement this same functionality here. Would it help if I update the PR to mention this more explicitly?
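   For reference, a hedged sketch of how this option is invoked from the Spark side at this point in the PR (the input DataFrame `df` and the descriptor path are placeholders, and the option key `materializeZeroValues` is the name used in this revision and may still change):
   
   ```
   import java.util.Collections
   import org.apache.spark.sql.protobuf.functions.from_protobuf
   
   // df is assumed to have a binary column "value" holding serialized Person
   // protos; "/tmp/person.desc" is a placeholder descriptor-set path.
   val parsed = df.select(
     from_protobuf(
       df("value"),
       "Person",
       "/tmp/person.desc",
       Collections.singletonMap("materializeZeroValues", "true")
     ).as("person"))
   ```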





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1180841979


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,9 +289,34 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
-          record.getField(field)
-        } else null
+
+        // In case field presence is defined, we can use it to figure out whether
+        // a field is present, and output a value (or null) based on that.
+        // If a field has no presence info, we'll populate it if:
+        // - It's explicitly available in the *serialized* proto.
+        // - its a proto2 field with an explicit default value
+        // - `materializeZeroValues` has been set. In this case
+        //    getField will return the default value for the field's type.
+        // - It's a repeated field, which gets populated as []
+        // Please see: https://protobuf.dev/programming-guides/field_presence
+        val value = if (field.hasPresence) {

Review Comment:
   resolving since the code has changed a bit and we have a new thread!





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1183133402


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to render fields with zero values when deserializing protobufs to a spark struct.
+  // When a field is empty in the serialized protobuf, this library will deserialize them as
+  // null by default. However, this flag can control whether to render the type-specific zero value.
+  // This operates similarly to `includingDefaultValues` in java's jsonformat, or `emitDefaults` in

Review Comment:
   done!



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to render fields with zero values when deserializing protobufs to a spark struct.

Review Comment:
   done





[GitHub] [spark] rangadi commented on pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1533676324

   @HeartSaVioR could you merge this? You don't need to review.




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181124590


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > I.e. if we replace !field.hasPresence with field.getJavaType != MESSAGE, I think the tests pass.
   
   Actually, the tests, as written, will not pass. Specifically, if we change `!field.hasPresence` to `field.getJavaType != MESSAGE`, then proto3 optional fields that were never set on the initial proto will get deserialized as the default value for their type. The test [here](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1196-R1239) checks that.
   
   Just to make sure, I did try this locally and verified that the tests failed in the way described above.
   
   > We should also explicitly mention messages in the documentation. Most readers don't understand what 'has presence information' means. They shouldn't have to read the detailed protobuf spec to understand what this flag does.
   > How about this: add the message check in addition to the !hasPresence() check? (It's a no-op.)
   
   Ah, if documentation / clarity is the issue, I'm more than happy to write up some more comments or clear up the documentation!
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179803443


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   One example is this test: `test("test unset values - proto3") {` on line 1196 of `ProtobufFunctionsSuite.scala` (sorry, I'm not sure how to link to it).
   
   In that test, there are some proto3 optional fields, like `optional_int`, which are not set to any value. In this case, even with materialize zero values set, we don't want to deserialize it as 0. This is because proto3 optional fields have field presence, so there is no ambiguity between "unset" and "set to zero value", and it'd actually be _incorrect_ of us to set it to 0.
   
   Basically, the materialize zero values flag is only intended to choose the behavior in cases where "unset" and "set to zero value" are not distinguishable. The default behavior is to deserialize both to null; with the flag, both are deserialized as the zero value. 
   
   The latter behavior is what the proto3 docs actually describe (and was the motivation for this PR, as we have other places in our codebase where default values get serialized like this: https://protobuf.dev/programming-guides/proto3/#default).
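   As a side note, the set of fields this flag can affect can be listed mechanically from any compiled descriptor; a small sketch, not part of the PR:
   
   ```
   import scala.jdk.CollectionConverters._ // scala.collection.JavaConverters on Scala 2.12
   import com.google.protobuf.Descriptors.Descriptor
   
   object PresenceSketch {
     // Fields where "unset" and "set to the zero value" look identical on the
     // wire, i.e. the only fields whose output the flag changes. Proto3
     // `optional` fields and message fields report presence and are excluded.
     def ambiguousFields(desc: Descriptor): Seq[String] =
       desc.getFields.asScala
         .filter(f => !f.isRepeated && !f.hasPresence)
         .map(_.getName)
         .toSeq
   }
   ```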





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181135882


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > lit(null).as("optional_int"), 
   
   This [line at 1229](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1229) ? 
   
   I am more confused about the feature now. I thought all scalar fields like 'int' would have a value with this option set. 
   
   What is the spec we are adhering to? What is the use case we are solving? Maybe we could update the flag documentation. 
   
   Is this essentially `materialize protobuf V3 scalar fields that are not declared optional`? 
   
   It looks like we are giving too much weight to the `optional` keyword. Logically, all fields are optional in Proto3 by default. 
   
   What happens to an `optional int` in a V2 proto? Will it behave like a proto3 `int` or a proto2 `optional int`?
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181141629


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,43 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // If the materializeZeroValues config is set, we will materialize zero values when the field

Review Comment:
   done





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181193391


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Remember you could not explain in an earlier comment whether Messages would be serialized or not? Why? Because it was not easy to see why they would or would not be included. You tested it, but didn't have an explanation.
   
   If you check the thread where you originally asked:
   https://github.com/apache/spark/pull/40686#discussion_r1179208375
   
   I stated that messages that weren't set in the original proto would remain null because message fields have presence information. I then linked a test asserting this exact behavior.
   
   I don't think there's any confusion here?
   
   > Message is different because it can contain other messages and easily blow up the struct. 
   
   Yeah, I agree that message is an important type. This is why having a test to explicitly assert the deserialization, and a comment explaining it, will cement the behavior. But I'm also trying to put myself in the shoes of a future reader of this code. If we add redundant clauses to the conditional, we are likely to confuse future readers who will inevitably ask "why is this code checking specifically for message types here", for which there isn't a fundamental answer, just that we had a desire in this PR to make sure message types were handled. But the tests / comments should already do that.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181137367


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,43 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // If the materializeZeroValues config is set, we will materialize zero values when the field

Review Comment:
   Ideally we shouldn't need a very long comment. A brief line pointing to the documentation of the flag should be enough. With that, we could move the check back into the `getFieldValue()` method. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,43 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // If the materializeZeroValues config is set, we will materialize zero values when the field

Review Comment:
   Ideally we shouldn't need a very long comment here either. A brief line pointing to the documentation of the flag should be enough. With that, we could move the check back into the `getFieldValue()` method. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136862


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, this library deserializes both cases as null. However, this flag can
+  // choose to explicitly deserialize them as the zero value for the type, as
+  // libraries in some other languages will do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;

Review Comment:
   Update this to add two more fields:
   ```
       optional string optional_middle_name = 3;
       optional int64 optional_weight = 4;
   ```
   
   And update the setter to `Person(name="", optional_weight=0)` and list the resulting values with and without the flag. I think that covers all the cases. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136226


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Is this essentially materialize Protobuf V3 scalar fields that are not declared optional?
   
   
   If this is the case, let's update the documentation to say this clearly. Better not to say anything about 'presence information'. We shouldn't require users to hunt through Protobuf documentation to understand it.
   
   Plus, let's just add `field.getJavaType != MESSAGE` in addition to `!hasPresence()` (though it is strictly not required). Better for readability. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181248194


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   > unset field set as 0/default value in this struct
   
   That is true for fields declared optional too, but we are setting those to null here. This is the confusion I have. 
   





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181122600


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   I added a general motivation in [this reply](https://github.com/apache/spark/pull/40686/files#r1181120470). Looking at the discussion, I feel there is quite a bit of confusion here, primarily on what "we" think vs. what the proto3 spec says the deserialization behavior should be, and on what the default behavior would look like. Shall we schedule a quick call to clear this up? We can invite everyone in the discussion.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181711867


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   > Other official libraries that take protobufs and put them into a different format (e.g. JSON) have the option to emit defaults. This PR is implementing that exact same functionality, with the same semantics.
   
   
   Let's add this to the documentation.





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181159895


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library
+  // deserializes this situation as `null`, but with this flag it will deserialize them
+  // as the type-specific default value.
+  //
+  // Note that this won't affect fields with the optional keyword in proto3, or
+  // any fields in proto2, as they don't have the ambiguity described above because
+  // they have presence information.
+  //
+  // As an example:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  //   optional string middle_name = 3;
+  //   optional int64 salary = 4;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(age=20, middle_name="smith")`

Review Comment:
   good callout, will change!





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179432031


##########
connector/protobuf/src/test/resources/protobuf/proto3_optional.proto:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+// To compile the descriptor:
+// cd connector/protobuf/src/test/resources/protobuf
+// protoc --include_imports --descriptor_set_out=proto3_optional.desc proto3_optional.proto
+
+syntax = "proto3";
+
+package org.apache.spark.sql.protobuf.protos;
+
+option java_outer_classname = "Proto3Optional";
+
+message TestMessage {
+  int32 bar = 1;
+  optional int32 baz = 2;

Review Comment:
   Yeah, we could keep this. But likely we don't need to call `hasPresence()`. It is likely implicit in `hasField()` (I have another comment about this).





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179804071


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, we will deserialize both cases as null. However, this flag can
+  // choose to explicitly deserialize as the zero value for the type, as
+  // libraries in some other will languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(name="", age=0)`

Review Comment:
   It would be `{"name": null, "age": null}` by default, and `{"name": "", "age": 0}` with the materialize zero values flag set. The tests from lines 1123-1254, I think, demonstrate this behavior.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1180870422


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   Messages are not getting serialized because of this check (those fields have `hasPresence()`). It would be better if we explicitly excluded them there. 
   
   This check in fact does not matter for scalar fields since `hasField()` returns true anyway, so the `hasPresence()` check has no effect. I.e., if you remove this check, the only thing that should change in your tests is the message fields. 
   
   Overall:
     - Exclude message types from serialization explicitly (no matter what our presence policy is).
     - Give a rationale for using presence information, i.e. why `optional string a` and `string a` should behave differently with this flag. 
         - In fact this check does not even matter for scalar types due to the earlier `hasField()` check, so there is no need to have it. 
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1180870422


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   Messages are not getting serialized because of this check (those fields have `hasPresence()`). It would be better if we explicitly excluded them there. 
   
   This check in fact does not matter for scalar fields since `hasField()` returns true anyway, so the `hasPresence()` check has no effect. I.e., if you remove this check, the only thing that should change in your tests is the message fields. 
   
   Overall:
     - Exclude message types from serialization explicitly (no matter what our presence policy is).
     - Give a rationale for using presence information, i.e. why `optional string a` and `string a` should behave differently with this flag. 
         - In fact this check does not even matter except for Message types, so there is no need to have it. 
   





[GitHub] [spark] viirya commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160186179


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   Hmm, is it a possible bug? If a field is explicitly set to its zero value, the serialized proto message won't contain it, so when deserializing the message we will get null instead of zero?
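   As a quick illustration of the wire-format behavior being discussed (a standalone sketch using the well-known `Duration` proto3 message shipped with protobuf-java; it is not code from this PR):
   
   ```
   import com.google.protobuf.Duration
   
   object ZeroValueWireSketch extends App {
     // A proto3 scalar explicitly set to its zero value is not written to the
     // wire, so the bytes are identical to a message that never set the field;
     // a reader cannot tell the two cases apart.
     val explicitZero = Duration.newBuilder().setSeconds(0L).build().toByteArray
     val neverSet = Duration.newBuilder().build().toByteArray
     println(explicitZero.sameElements(neverSet)) // prints: true (both are empty)
   }
   ```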





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1175212228


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   Hey @viirya, just following up to see if we could get another review on this? Thank you!





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179208347


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;

Review Comment:
   Use real words for field names, like `name`. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,9 +289,34 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
-          record.getField(field)
-        } else null
+
+        // In case field presence is defined, we can use it to figure out whether
+        // a field is present, and output a value (or null) based on that.
+        // If a field has no presence info, we'll populate it if:
+        // - It's explicitly available in the *serialized* proto.
+        // - its a proto2 field with an explicit default value
+        // - `materializeZeroValues` has been set. In this case
+        //    getField will return the default value for the field's type.
+        // - It's a repeated field, which gets populated as []
+        // Please see: https://protobuf.dev/programming-guides/field_presence
+        val value = if (field.hasPresence) {

Review Comment:
   What is an example that is covered by this check, but not covered by the existing check?



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -36,6 +36,7 @@ private[sql] class ProtobufOptions(
     this(CaseInsensitiveMap(parameters), conf)
   }
 
+

Review Comment:
   Remove.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -53,6 +57,15 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
         fn(s"$javaClassNamePrefix$messageName", None)
       }
   }
+  private def checkWithProto3FileAndClassName(messageName: String)(

Review Comment:
   Pretty much all the protos used are proto3 messages. I don't think this is required. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Example(s="", i=0)`
+  //
+  // The result after calling from_protobuf without this flag set would be:
+  // `{"s": null, "i": null}`
+  //
+  // To explicitly materialize that default zero value, as readers in some other languages
+  // will do, this flag can be set to get a deserialized result like:
+  // `{"s": "", "i": 0}`
+  val materializeZeroValues: Boolean =
+    parameters.getOrElse(ProtobufOptions.materializeZeroValues, false.toString).toBoolean
 }
 
 private[sql] object ProtobufOptions {
+  val materializeZeroValues = "materializeZeroValues"

Review Comment:
   We can remove this and use actual string. Also use '.' separated names, similar to `recursive.fields.max.depth`. How about: "materialize.default.values" 
   



##########
connector/protobuf/src/test/resources/protobuf/proto3_optional.proto:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+// To compile the descriptor:
+// cd connector/protobuf/src/test/resources/protobuf
+// protoc --include_imports --descriptor_set_out=proto3_optional.desc proto3_optional.proto
+
+syntax = "proto3";
+
+package org.apache.spark.sql.protobuf.protos;
+
+option java_outer_classname = "Proto3Optional";
+
+message TestMessage {

Review Comment:
   Use better, more descriptive names. These are too generic. Move this into the existing proto file. It is just more maintenance to have a lot more files.
   
   Include other protobuf messages in this.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   What about a protobuf message? I think this PR will materialize those too. Is there a test case that has messages?  E.g.
   
   ```
       message Example {
           string name = 1;
           Person friend = 2;
      }
   ```
   
   



##########
connector/protobuf/src/test/resources/protobuf/proto3_optional.proto:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+// To compile the descriptor:
+// cd connector/protobuf/src/test/resources/protobuf
+// protoc --include_imports --descriptor_set_out=proto3_optional.desc proto3_optional.proto
+
+syntax = "proto3";
+
+package org.apache.spark.sql.protobuf.protos;
+
+option java_outer_classname = "Proto3Optional";
+
+message TestMessage {
+  int32 bar = 1;
+  optional int32 baz = 2;

Review Comment:
   Does `optional` make a difference? 





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179241756


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Example(s="", i=0)`
+  //
+  // The result after calling from_protobuf without this flag set would be:
+  // `{"s": null, "i": null}`
+  //
+  // To explicitly materialize that default zero value, as readers in some other languages
+  // will do, this flag can be set to get a deserialized result like:
+  // `{"s": "", "i": 0}`
+  val materializeZeroValues: Boolean =
+    parameters.getOrElse(ProtobufOptions.materializeZeroValues, false.toString).toBoolean
 }
 
 private[sql] object ProtobufOptions {
+  val materializeZeroValues = "materializeZeroValues"

Review Comment:
   > We can remove this and use actual string.
   
   sorry, could you explain more what you mean by this?
   
   >  Also use '.' separated names, similar to recursive.fields.max.depth. How about: "materialize.default.values"
   
   sure! i'll follow the convention in the file 👍
   
   i'm realizing the name is a bit odd, as the behavior is that we materialize zero values _for fields without presence information_. so maybe the right thing to call it would be `materialize.presenceless.default.values`? though that's kind of a mouthful.. would love it if you have any thoughts on a good name (i'll keep thinking while i'm updating the PR as well)
   
   





[GitHub] [spark] rangadi commented on pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1526548109

   FYI: I sent another improvement for Protobuf here: https://github.com/apache/spark/pull/40983




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181128845


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   okay! i've updated this if statement. I factored out the check for "should we materialize" into a function and added a good comment there! PTAL 🙏 





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181159725


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   You are right that this feature does not solve the ambiguity, but rather gives developers a chance to bring the behavior in line with the proto3 spec (an unset field is read as its 0/default value in this struct), as I mentioned [before](https://github.com/apache/spark/pull/40686#discussion_r1181120470).
   
   As long as the message is not created using proto2 syntax or with the `optional` keyword (supported since protobuf 3.15), there is no way the ambiguity can be resolved.
   
   Maybe we can be explicit about that in the comment; I will work with @justaparth on that.






[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179805430


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   ah, i figured out how to link changes: 
   
   i think this behavior is shown in this test: [test](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1123-R1159)
   
   If you look at this test, you'll see that we [don't set a value](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1126-R1137) for the `message` field. And the deserialized [result](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1148) is null for that field, [regardless](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1168) of the flag.
   
   





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181138667


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > Is this essentially materialize Protobuf V3 scalar fields that are not declared optional?
   
   yes, this is the basic effect of this PR. this follows the proto3 spec https://protobuf.dev/programming-guides/proto3/#default. Maybe i could have made this more clear earlier
   
   from the spec:
   ```
   When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field.
   ```
   
   It's more precise to say that in cases where "field was never set" and "field was explicitly set to the zero value" are not distinguishable (i.e. _field presence is not available_), the library deserializes both cases as `null` by default, or to the type-specific zero value with this flag. This phrasing correctly covers every type in proto2 and proto3, and is consistent with the spec about field presence.
   
   However, I understand your reservations that field presence is not the most accessible concept, and i'll update this documentation to say what you said above and then link to field presence documentation for people who want to read further.
   
   > What happens to an `optional int` in a V2 proto? Will it behave like proto3 `int` or proto2 `optional int`?
   
   no, proto2 fields don't get default values materialized in this implementation, because there's no ambiguity for them.
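   
   For illustration, a rough sketch (not from the PR) of how that distinction can be inspected with protobuf-java's `FieldDescriptor.hasPresence()` (available since protobuf 3.15); the `presenceReport` helper name here is made up:
   
   ```
   import com.google.protobuf.Descriptors.Descriptor
   
   // Prints, for each field of a message descriptor, whether protobuf tracks
   // presence for it. Only fields where hasPresence is false (plain proto3
   // scalars) are candidates for zero-value materialization under this flag.
   def presenceReport(descriptor: Descriptor): Unit =
     descriptor.getFields.forEach { fd =>
       println(s"${fd.getName}: hasPresence=${fd.hasPresence}")
     }
   ```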



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > If this is the case, let's update the documentation to clearly say this. Better not to say anything about `presence information`. We shouldn't require users to go hunting down Protobuf documentation to understand it.
   
   yeah, i'll update it





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181248194


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   > unset field set as 0/default value in this struct
   
   That is true for optional fields too, but we are setting those to null here. This is the confusion I have.
   





[GitHub] [spark] HyukjinKwon commented on pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1534070113

   Merged to master.




[GitHub] [spark] rangadi commented on pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1532446807

   @justaparth thanks for the PR. Please ping @gengliangwang to merge this once the tests pass (or @viirya if he can merge).





[GitHub] [spark] justaparth commented on pull request #40686: [SPARK-43051][CONNECT] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #40686:
URL: https://github.com/apache/spark/pull/40686#issuecomment-1531258227

   okay @rangadi @pang-wu (and cc @viirya) i think i've addressed all comments; i've referenced the external libraries and also removed the mentions of "ambiguity" in the documentation. i've also renamed the option to `emit.default.values`, which i think is clearer than `materialize.default.values`. would y'all be able to take another look?




[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160316173


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   > Hmm, is it a possible bug? If a field is explicitly set to zero value, the serialized proto message won't contain it, so when deserializing the message we will get null instead of zero?
   >  a field set with a zero value will not be present in the serialized message, so when deserializing it, it will be given null instead of the zero value it was originally set to?
   
   @viirya i'm not sure; in some sense it could be considered a bug and in another sense it could be considered a convention. I think the behavior of filling in defaults makes sense because of the proto3 spec Pang linked above.
   
   But because there is no way to distinguish between "field is set to zero value" and "field is not present" in proto3 serialized messages, deserialization libraries must choose what to do, either:
   
   a) if a field is not present in the serialized value, deserialize it as null [the default behavior in spark-protobuf]
   - in case the field was not set originally, this is "correct"
   - in case the field was set to the zero value, this is "incorrect"
   
   b) if a field is not present, deserialize to its zero value [what i tried to add in this PR]
   - in case the field was not set originally, this is "incorrect"
   - in case the field was set to the zero value, this is "correct"
   
   So either way the library chooses, one of the cases is kind of strange (i.e. you set a value and it doesn't come out, or you didn't set a value and yet a default comes out). Given this, it does feel like it's not clear which is the right "default". It also depends on whether the original message is proto2 or proto3, i think. Maybe what we would actually want is:
   
   - for proto3 serialized messages, zero-value materialization is the default behavior [link](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis)
   - for proto2 serialized messages, check for field presence since that info is available [link](https://protobuf.dev/programming-guides/field_presence/#presence-in-proto2-apis)
   
   ?
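   
   To see the ambiguity concretely, here is a minimal sketch (not part of this PR) using the well-known `Int32Value` wrapper that ships with protobuf-java; its single `int32 value = 1` field is a plain proto3 scalar:
   
   ```
   import com.google.protobuf.Int32Value
   
   // "explicitly set to 0" and "never set" produce identical bytes on the wire,
   // so a reader cannot tell the two cases apart after serialization.
   val explicitZero = Int32Value.newBuilder().setValue(0).build().toByteArray
   val neverSet     = Int32Value.getDefaultInstance.toByteArray
   assert(explicitZero.sameElements(neverSet)) // both serialize to zero bytes
   ```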






[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051] Add option to materialize zero values when deserializing protobufs

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1160318761


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
+
+        // If `materializeZeroValues` is true, the written field will contain the
+        // default zero value for its type (e.g. 0 for int, "" for string, etc:
+        // https://protobuf.dev/programming-guides/proto3/#default).
+        // Otherwise, the field will be null unless the serialized proto explicitly
+        // contains the field, or it has an explicit default (
+        // proto2 only https://protobuf.dev/programming-guides/proto2/#optional)
+        // Note that in proto3, the serialized proto will not contain the field
+        // if the value was explicitly set to its zero value. See:
+        // https://protobuf.dev/programming-guides/field_presence/#presence-in-proto3-apis

Review Comment:
   One more thing to add, @justaparth: with this PR in, proto2 and proto 3.15 messages with the `optional` keyword will still work as expected? So nothing will break.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181155546


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   This feature does not solve the ambiguity, right? I still don't know what problem this is solving. What is an example use case?



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,36 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - we should materialize the zero value
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them.
+    if (
+      field.isRepeated
+        || record.hasField(field) // checks for the field explicitly being set
+        || field.hasDefaultValue
+        || shouldGetZeroValue(record, field)) {
+      record.getField(field)
+    } else {
+      null
+    }
+  }
+
+  // See the docs for materialize.default.values in [[ProtobufOptions]].
+  // Decides whether we should get the zero values. We do this when
+  // the field is not present in the proto, and theres ambiguity about
+  // whether it was never set or if it was set to its zero value (i.e.
+  // the field lacks field presence information). This is basically
+  // proto3 scalar values only.
+  private def shouldGetZeroValue(record: DynamicMessage, field: FieldDescriptor): Boolean = {
+    !record.hasField(field) && !field.hasPresence && this.materializeZeroValues

Review Comment:
   Let's move this to `getFieldValue()`. We can see `!record.hasField(field)` is not needed.
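   
   To make that concrete, a rough sketch of the merged condition (names mirror the PR's ProtobufDeserializer; the standalone signature is just for illustration). Because the clauses are OR'ed left to right, `record.hasField(field)` has already returned false by the time the last clause runs, so repeating it inside a separate helper adds nothing:
   
   ```
   import com.google.protobuf.DynamicMessage
   import com.google.protobuf.Descriptors.FieldDescriptor
   
   def getFieldValue(record: DynamicMessage,
                     field: FieldDescriptor,
                     materializeZeroValues: Boolean): AnyRef =
     if (field.isRepeated                 // hasField cannot be called on repeated fields
         || record.hasField(field)        // explicitly present on the wire
         || field.hasDefaultValue         // proto2 field with an explicit default
         || (!field.hasPresence && materializeZeroValues)) {
       record.getField(field)             // getField returns the type's zero value if unset
     } else {
       null
     }
   ```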



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library
+  // deserializes this situation as `null`, but with this flag will deserialize them
+  // as the type-specific default value.
+  //
+  // Note that this won't affect fields with the optional keyword in proto3, or
+  // any fields in proto2, as they don't have the ambiguity described above because
+  // they have presence information.
+  //
+  // As an example:
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;
+  //   optional string middle_name = 3;
+  //   optional int64 salary = 4;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Person(age=20, middle_name="smith")

Review Comment:
   You meant 'age=0'? 
   Use the default value for 'middle_name' as well. That is the only case we are talking about.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181248488


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   I suggest removing the ambiguity argument since we are not solving that. All we are doing is returning the default value for some types of fields. I still don't know what problem this is solving, but let's simplify the description.





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181248829


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   It would be useful if you could describe the problem you are solving. Originally I thought you didn't want to see nulls for fields like 'int'. But that is not what we are doing here. 





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181488452


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   Maybe it would be helpful if i frame the goal in a different way.
   
   **Other official libraries that take serialized protobufs and put them into a different format (e.g. Json) have the option to emit defaults. This PR is implementing that exact same functionality, with same semantics.**
   
   Consider the following proto
   
   ```
   syntax = "proto3";
   
   message Person {
     string name = 1;
     optional int32 age = 2;
   }
   ```
   
   ## java
   If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called `includingDefaultValueFields()`. 
   
   code:
   ```
   Person p = Person.newBuilder().build();
   
   String basic = JsonFormat.printer().print(p);
   String withDefaults = JsonFormat.printer().includingDefaultValueFields().print(p);
   
   System.out.println("basic: " + basic);
   System.out.println("withDefaults: " + withDefaults);
   ```
   
   outputs
   
   ```
   basic: {
   }
   withDefaults: {
     "name": ""
   }
   ```
   
   Notice that even with `includingDefaultValueFields` set, it doesn't fill in a value for the unset optional field age. This makes a lot of sense; because optional has field presence, the library can know _for sure_ that it was never set so hallucinating a value doesn't really make sense.
   
   ## golang
   
   ### struct generation
   If you look at how go generates structs for protobuf types https://protobuf.dev/reference/go/go-generated/#singular-scalar-proto3:
   
   ```
   type Person struct {  
    <some extraneous info removed>
    Name string   
    Age  *int32 
   }
   ```
   
   It will generate a `string` for the `singular` case, and a `*int32` for the `optional` case. This is because optional fields have field presence, so it is possible to know when they were never set.
   
   ### jsonification
   
   Exactly like in java, there is an `EmitDefaults` option for jsonpb: https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   ```
   a := &Person{}  
   var basicMarshaler = &jsonpb.Marshaler{}  
   var defaultsMarshaler = &jsonpb.Marshaler{  
    EmitDefaults: true,  
   }  
     
   basic, _ := basicMarshaler.MarshalToString(a)  
   withDefaults, _ := defaultsMarshaler.MarshalToString(a)  
     
   fmt.Printf("basic: %s\n", basic)  
   fmt.Printf("withDefaults: %s\n", withDefaults)
   ```
   
   outputs
   
   ```
   basic: {}
   withDefaults: {"name":"","age":null}
   ```
   
   Notice the same behavior: `age` is not set to 0.
   
   ## summary
   So, i'm just trying to implement this same broadly supported functionality here. Would it help if I updated the PR to mention this more explicitly?
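   
   For completeness, a minimal sketch of what the analogous call could look like with spark-protobuf, assuming a DataFrame `df` with a binary column `proto` holding serialized `Person` messages (the proto3 definition above) and a descriptor file at `descPath`; `df` and `descPath` are placeholders:
   
   ```
   import java.util.Collections
   import org.apache.spark.sql.functions.col
   import org.apache.spark.sql.protobuf.functions.from_protobuf
   
   // Pass the option alongside the compiled descriptor file.
   val options = Collections.singletonMap("emit.default.values", "true")
   
   val parsed = df.select(
     from_protobuf(col("proto"), "Person", descPath, options).as("person"))
   // For an empty Person this would yield {"name": "", "age": null}: name (a plain
   // proto3 string) gets its default, while age (declared optional) stays null.
   ```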





[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181124838


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.

Review Comment:
   @rangadi with respect to message serialization, what do you think about this [test](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1196-R1239)? in it, there is a message field that is never set originally, and when deserializing it with/without the flag it is null. is that expected / unexpected to you?
   
   > Exclude message types from serializing explicitly (no matter what our presense policy)
   Give a rational for using presence information. I.e. why should optional string a and string a behave differently with this flag.
   
   
   Sorry, just to clarify: are you asking for a rationale here in the comments, or would you like me to include it in the documentation so that it's a bit more clear?
   
   > In fact this check does not even matter for scala types due to hadField() earlier. So there is no need to have it.
   
   Ah, check the other thread! I commented there on how this breaks if we don't use hasPresence.
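   
   To make the hasPresence point concrete, here is a minimal sketch (not from the PR) using protobuf-java's descriptor API. It assumes a proto3 message with a plain `string name = 1;` and an `optional int32 id = 2;`, whose `Descriptor` is available as `personDescriptor`:
   
   ```
   import com.google.protobuf.Descriptors.Descriptor
   import com.google.protobuf.DynamicMessage
   
   def inspectPresence(personDescriptor: Descriptor): Unit = {
     val name = personDescriptor.findFieldByName("name") // plain proto3 scalar
     val id   = personDescriptor.findFieldByName("id")   // declared `optional`
   
     // A plain proto3 scalar cannot distinguish "never set" from "set to zero".
     println(name.hasPresence) // false
     // An `optional` field tracks presence, so the flag leaves it alone.
     println(id.hasPresence)   // true
   
     // A message that never had `id` set reports it as absent.
     val empty = DynamicMessage.newBuilder(personDescriptor).build()
     println(empty.hasField(id)) // false
   }
   ```
   
   That is the distinction the `!field.hasPresence && this.materializeZeroValues` branch keys off: fields that track presence keep the existing null behavior, and only the ambiguous no-presence fields are affected by the flag.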





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181135882


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   > lit(null).as("optional_int"), 
   
   This [line at 1229](https://github.com/apache/spark/pull/40686/files#diff-f5dc81b1d20dde97f9d1dd26e993f811fa096bf68936ab5e449b9937314585d8R1229) ? 
   
   I am more confused about the feature now. I thought all scalar fields like `int` would have a value with this option set. 
   
   What is the spec we are adhering to? What is the use case we are solving? Maybe we could update the flag documentation. 
   
   Is this essentially `materialize Protobuf V3 scalar fields that are not declared optional`? 
   
   It looks like we are giving too much weight to the `optional` keyword. Logically, all fields are optional in proto3 by default. 
   
   What happens to an `optional int` in a V2 proto? Will it behave like a proto3 `int` or a proto2 `optional int`?
   





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181136862


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,42 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // For fields without presence information, there is ambiguity in serialized protos
+  // as to whether the field was never written or was written with its zero value.
+  // This is because such fields are not serialized if they contain their zero value.
+  // This includes most fields in proto3.
+  // Ref: https://protobuf.dev/programming-guides/field_presence
+  // https://protobuf.dev/programming-guides/field_presence/
+  //  #presence-in-tag-value-stream-wire-format-serialization
+  //
+  // By default, this library deserializes both cases as null. However, this flag can
+  // choose to explicitly deserialize them as the zero value for the type, as
+  // libraries in some other languages do.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Person {
+  //   string name = 1;
+  //   int64 age = 2;

Review Comment:
   Update this to add two more fields:
    ```
       optional string middle_name = 3
       optional int id = 4;
   ```
   
   And update the setter to `Person(name="", id=0)` and list the resulting value with and without the flag. I think that covers all the cases. 





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181159725


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   You are right that this feature does not solve the ambiguity; rather, it gives developers a chance to bring the behavior in line with the proto3 spec (unset fields come back as 0/default values in the resulting struct), as I mentioned [before](https://github.com/apache/spark/pull/40686#discussion_r1181120470). 
   
   Unless the message is created using proto2 syntax or with the `optional` keyword (supported since protobuf 3.15), there is no way the ambiguity can be resolved. 
   
   Maybe we can be explicit about that in the comment; I will work with @justaparth on that. 





[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179415611


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   I don't think so. Try it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach. 
   
   The goal of this PR is to give an option for developers who want Spark's protobuf-to-struct deserialization behavior to comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior is something that "makes sense" (unset/non-present fields are set to null) but does not comply with the proto3 spec (unset/non-present fields should be set to their default values).
   
   We believe that Spark should provide an option for developers who need the spec behavior, because this is the default behavior for many other protobuf libraries (especially those implemented in Go), so that when users use Spark in conjunction with those libraries at different stages of their infra the behavior is consistent -- I believe there are ways to make the test pass, but that would defeat the purpose.





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach. 
   
   The goal of this PR is to give an option for developers who want Spark's protobuf-to-struct deserialization behavior to comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior is something that "makes sense" (unset/non-present fields are set to null) but does not comply with the proto3 spec (unset/non-present fields should be set to their default values).
   
   We think that Spark should provide an option for developers who need the spec behavior, because this is the default behavior for many other protobuf libraries (especially those implemented in Go), so that when users use Spark in conjunction with those libraries at different stages of their infra the behavior is consistent -- I believe there are ways to make the test pass, but that would defeat the purpose.





[GitHub] [spark] pang-wu commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information

Posted by "pang-wu (via GitHub)" <gi...@apache.org>.
pang-wu commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181120470


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -299,6 +298,29 @@ private[sql] class ProtobufDeserializer(
     }
   }
 
+  private def getFieldValue(record: DynamicMessage, field: FieldDescriptor): AnyRef = {
+    // We return a value if one of:
+    // - the field is repeated
+    // - the field is explicitly present in the serialized proto
+    // - the field is proto2 with a default
+    // - field presence is not available and materializeZeroValues is set
+    //
+    // Repeated fields have to be treated separately as they cannot have `hasField`
+    // called on them. And we only materialize zero values for fields without presence
+    // information because that flag controls the behavior for those fields in the ambiguous
+    // case of "unset" or "set to zero value". See the docs in [[ProtobufOptions]] for more
+    // details.
+    if (
+      field.isRepeated
+        || record.hasField(field)
+        || field.hasDefaultValue
+        || (!field.hasPresence && this.materializeZeroValues)) {

Review Comment:
   @rangadi I don't think this is the right approach. 
   
   The goal of this PR is to give an option for developers who want Spark's protobuf-to-struct deserialization behavior to comply with the proto3 spec. As @justaparth mentioned earlier in [a discussion](https://github.com/apache/spark/pull/40686#discussion_r1160316173) with @viirya, the current behavior is something that "makes sense" (unset/non-present fields are set to null) but does not comply with the proto3 spec (unset/non-present fields should be set to their default values).
   
   We think that Spark should provide an option for developers who need this behavior, because this is the default behavior for many other protobuf libraries (especially those implemented in Go), so that when users use Spark in conjunction with those libraries at different stages of their infra the behavior is consistent -- I believe there are ways to make the test pass, but that would defeat the purpose.





[GitHub] [spark] HyukjinKwon closed pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #40686: [SPARK-43051][CONNECT] Add option to emit default values
URL: https://github.com/apache/spark/pull/40686

