Posted to issues@spark.apache.org by "Nikita Awasthi (Jira)" <ji...@apache.org> on 2023/04/27 11:55:00 UTC

[jira] [Commented] (SPARK-43051) Allow materializing zero values when deserializing protobuf messages

    [ https://issues.apache.org/jira/browse/SPARK-43051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717147#comment-17717147 ] 

Nikita Awasthi commented on SPARK-43051:
----------------------------------------

User 'justaparth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40686

> Allow materializing zero values when deserializing protobuf messages
> --------------------------------------------------------------------
>
>                 Key: SPARK-43051
>                 URL: https://issues.apache.org/jira/browse/SPARK-43051
>             Project: Spark
>          Issue Type: Improvement
>          Components: Protobuf
>    Affects Versions: 3.4.0
>            Reporter: Parth Upadhyay
>            Priority: Major
>
> Currently, when deserializing protobufs using {{from_protobuf}}, fields that are not explicitly present in the serialized message are deserialized as {{null}} in the resulting struct. (In proto3, this also includes fields that have been explicitly set to their zero value, because such fields are indistinguishable from unset fields in the serialized format. [https://protobuf.dev/programming-guides/field_presence/])
> For example, given a message format like
> {code:java}
> syntax = "proto3";
> message SearchRequest {
>   string query = 1;
>   int32 page_number = 2;
>   int32 result_per_page = 3;
> }
> {code}
> and an example message like
> {code:python}
> SearchRequest(query = "", page_number = 10)
> {code}
> the result from calling {{from_protobuf}} on the serialized form of the above message would be
> {code:json}
> {"query": null, "page_number": 10, "result_per_page": null}
> {code}
> In proto3, all fields are considered optional and have default values ([https://protobuf.dev/programming-guides/proto3/#default]), and reader clients in some languages (e.g. Go, Scala) will fill in that default value when reading the protobuf. It could be useful to make this configurable so that zero values can optionally be materialized if desired.
> Concretely, in the example above, we might want to deserialize it instead as
> {code:json}
> {"query": "", "page_number": 10, "result_per_page": 0}
> {code}
> In this ticket I propose implementing a way to get the above functionality. In the linked PR, I've done it by adding an option, {{materializeZeroValues}}, that can be passed in the options map of the {{from_protobuf}} function to enable this behavior. However, I'd love any feedback on whether I've understood the problem correctly and whether the implementation makes sense.
>  
> PR: https://github.com/apache/spark/pull/40686
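
The behavior described in the quoted issue can be sketched in plain Python. This is only a toy model under stated assumptions: it uses neither Spark nor the protobuf library, a dict stands in for the serialized message, and the names SEARCH_REQUEST_DEFAULTS, serialize_proto3, deserialize and materialize_zero_values are hypothetical, chosen for illustration. It shows why an explicitly set zero value cannot be told apart from an unset field, and what materializing zero values would change in the result.

```python
# Toy model of proto3 implicit field presence; not the Spark implementation.
# Hypothetical stand-in for the SearchRequest schema: field name -> proto3 zero value.
SEARCH_REQUEST_DEFAULTS = {"query": "", "page_number": 0, "result_per_page": 0}


def serialize_proto3(message):
    """Drop fields equal to their zero value, as proto3 does on the wire."""
    return {
        name: value
        for name, value in message.items()
        if value != SEARCH_REQUEST_DEFAULTS[name]
    }


def deserialize(wire, materialize_zero_values=False):
    """Build a row; absent fields become None, or their zero value if requested."""
    row = {}
    for name, default in SEARCH_REQUEST_DEFAULTS.items():
        if name in wire:
            row[name] = wire[name]
        else:
            row[name] = default if materialize_zero_values else None
    return row


# query is explicitly set to "", result_per_page is never set.
wire = serialize_proto3({"query": "", "page_number": 10})

print(wire)
# {'page_number': 10}
print(deserialize(wire))
# {'query': None, 'page_number': 10, 'result_per_page': None}
print(deserialize(wire, materialize_zero_values=True))
# {'query': '', 'page_number': 10, 'result_per_page': 0}
```

The first result matches the current null-filled output in the description, and the second matches the output the ticket proposes to make available through an option.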


