You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "justaparth (via GitHub)" <gi...@apache.org> on 2023/05/01 09:45:52 UTC

[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][CONNECTOR] Add option to materialize zero values for fields without presence information (proto3 scalars)

justaparth commented on code in PR #40686:
URL: https://github.com/apache/spark/pull/40686#discussion_r1181488452


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufOptions.scala:
##########
@@ -46,6 +46,41 @@ private[sql] class ProtobufOptions(
   // record has more depth than the allowed value for recursive fields, it will be truncated
   // and corresponding fields are ignored (dropped).
   val recursiveFieldMaxDepth: Int = parameters.getOrElse("recursive.fields.max.depth", "-1").toInt
+
+  // Whether to deserialize empty proto3 scalar fields as their default values.
+  // When a proto3 scalar field is empty, there is ambiguity as to whether the field
+  // was never set or if the field was explicitly set to zero. By default, the library

Review Comment:
   Maybe it would be helpful if i frame the goal in a different way.
   
   **Other official libraries that take serialized protobufs and put them into a different format (e.g. Json) have the option to emit defaults. This PR is implementing that exact same functionality, with same semantics.**
   
   Consider the following proto
   
   ```
   syntax = "proto3";
   
   message Person {
     string name = 1;
     optional int32 age = 2;
   }
   ```
   
   ## java
   If you use protobuf's [jsonformat](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat) library to convert protos to json, they have a [configuration](https://protobuf.dev/reference/java/api-docs/com/google/protobuf/util/JsonFormat.Printer.html#includingDefaultValueFields--) called `includingDefaultValueFields()`. 
   
   code:
   ```
   Person p = Person.newBuilder().setName("").build();
   
   String basic = JsonFormat.printer().print(p);
   String withDefaults = JsonFormat.printer().includingDefaultValueFields().print(p);
   
   System.out.println("basic: " + basic);
   System.out.println("withDefaults: " + withDefaults);
   ```
   
   outputs
   
   ```
   basic: {
   }
   withDefaults: {
     "name": ""
   }
   ```
   
   Notice that even with `includingDefaultValueFields` set, it doesn't fill in a value for the unset optional field age. This makes a lot of sense; because optional has field presence, the library can know _for sure_ that it was never set so hallucinating a value doesn't really make sense.
   
   ## golang
   
   ### struct generation
   If you look at how go generates structs for protobuf types (docs)[https://protobuf.dev/reference/go/go-generated/#singular-scalar-proto3] :
   
   ```
   type Person struct {  
    <some extraneous info removed>
    Name string   
    Age  *int32 
   }
   ```
   
   It will generate an `string` for the `singular` case, and `*int32` for the `optional` case. This is because optional fields have field presence, so it is possible to know when they were never set.
   
   ### jsonification
   
   Exactly like in java, there is an `EmitDefaults` option for jsonpb: https://pkg.go.dev/github.com/golang/protobuf/jsonpb#Marshaler
   
   ```
   a := &Person{}  
   var basicMarshaler = &jsonpb.Marshaler{}  
   var defaultsMarshaler = &jsonpb.Marshaler{  
    EmitDefaults: true,  
   }  
     
   basic, _ := basicMarshaler.MarshalToString(a)  
   withDefaults, _ := defaultsMarshaler.MarshalToString(a)  
     
   fmt.Printf("basic: %s\n", basic)  
   fmt.Printf("withDefaults: %s\n", withDefaults)
   ```
   
   outputs
   
   ```
   basic: {}
   withDefaults: {"name":"","age":null}
   ```
   
   Notice the same behavior, age is not set to 0.
   
   ## summary
   So, i'm just trying to implement this same functionality here. Would it help if I update the PR to mention this more explicitly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org