You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "rangadi (via GitHub)" <gi...@apache.org> on 2023/04/27 14:03:15 UTC

[GitHub] [spark] rangadi commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values for fields without presence information

rangadi commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1179208347


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;

Review Comment:
   Use words like `name`. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala:
##########
@@ -288,9 +289,34 @@ private[sql] class ProtobufDeserializer(
       var skipRow = false
       while (i < validFieldIndexes.length && !skipRow) {
         val field = validFieldIndexes(i)
-        val value = if (field.isRepeated || field.hasDefaultValue || record.hasField(field)) {
-          record.getField(field)
-        } else null
+
+        // In case field presence is defined, we can use it to figure out whether
+        // a field is present, and output a value (or null) based on that.
+        // If a field has no presence info, we'll populate it if:
+        // - It's explicitly available in the *serialized* proto.
+        // - its a proto2 field with an explicit default value
+        // - `materializeZeroValues` has been set. In this case
+        //    getField will return the default value for the field's type.
+        // - It's a repeated field, which gets populated as []
+        // Please see: https://protobuf.dev/programming-guides/field_presence
+        val value = if (field.hasPresence) {

Review Comment:
   What is an example that is covered by this check, but not covered by existing check?



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -36,6 +36,7 @@ private[sql] class ProtobufOptions(
     this(CaseInsensitiveMap(parameters), conf)
   }
 
+

Review Comment:
   Remove.



##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -53,6 +57,15 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
         fn(s"$javaClassNamePrefix$messageName", None)
       }
   }
+  private def checkWithProto3FileAndClassName(messageName: String)(

Review Comment:
   Pretty much all the protos used are proto3 messages. I don't think this is required. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;
+  // }
+  // ```
+  //
+  // And we have the serialized representation of the following proto:
+  // `Example(s="", i=0)`
+  //
+  // The result after calling from_protobuf without this flag set would be:
+  // `{"s": null, "i": null}`
+  //
+  // To explicitly materialize that default zero value, as readers in some other languages
+  // will do, this flag can be set to get a deserialized result like:
+  // `{"s": "", "i": 0}`
+  val materializeZeroValues: Boolean =
+    parameters.getOrElse(ProtobufOptions.materializeZeroValues, false.toString).toBoolean
 }
 
 private[sql] object ProtobufOptions {
+  val materializeZeroValues = "materializeZeroValues"

Review Comment:
   We can remove this and use actual string. Also use '.' separated names, similar to `recursive.fields.max.depth`. How about: "materialize.default.values" 
   



##########
connector/protobuf/src/test/resources/protobuf/proto3_optional.proto:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+// To compile the descriptor:
+// cd connector/protobuf/src/test/resources/protobuf
+// protoc --include_imports --descriptor_set_out=proto3_optional.desc proto3_optional.proto
+
+syntax = "proto3";
+
+package org.apache.spark.sql.protobuf.protos;
+
+option java_outer_classname = "Proto3Optional";
+
+message TestMessage {

Review Comment:
   Use better, more descriptive names. These are too generic. Move this into existing proto file. It is is just more maintenance to have lot more files. 
   
   Include other protobuf message in this. 



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,9 +47,35 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether or not to explicitly materialize the zero values for fields
+  // without field presence information https://protobuf.dev/programming-guides/field_presence/.
+  // This includes most fields in proto3.
+  //
+  // For example, if we have a proto like
+  // ```
+  // syntax = "proto3";
+  // message Example {
+  //   string s = 1;
+  //   int64 i = 2;

Review Comment:
   What about a protobuf message? I think this PR will materialize those too. Is there a test case that has messages?  E.g.
   
   ```
       message Example {
           string name = 1;
           Person friend = 2;
      }
   ```
   
   



##########
connector/protobuf/src/test/resources/protobuf/proto3_optional.proto:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+// To compile the descriptor:
+// cd connector/protobuf/src/test/resources/protobuf
+// protoc --include_imports --descriptor_set_out=proto3_optional.desc proto3_optional.proto
+
+syntax = "proto3";
+
+package org.apache.spark.sql.protobuf.protos;
+
+option java_outer_classname = "Proto3Optional";
+
+message TestMessage {
+  int32 bar = 1;
+  optional int32 baz = 2;

Review Comment:
   Does `optional` make a difference? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org