Posted to dev@flink.apache.org by Adam Richardson <as...@stripe.com.INVALID> on 2023/06/27 17:23:52 UTC

Feature requests for Flink protobuf deserialization

Hi there,

My company is in the process of rebuilding some of our batch Spark-based
ETL pipelines in Flink. We use protobuf to define our schemas. One major
challenge is that Flink's protobuf deserialization has some semantic
differences from the ScalaPB encoders we use in our Spark systems. This
poses a serious barrier to adoption, as moving any given dataset from
Spark to Flink could break all downstream consumers. I have a long list
of feature requests in this area:

   1. Support for mapping protobuf optional wrapper types (StringValue,
   Int32Value, and friends) to nullable primitive types rather than RowTypes
   (see the sketch after this list)
   2. Support for mapping the protobuf Timestamp type to a real timestamp
   type rather than a RowType
   3. A way of defining custom mappings from specific proto types to custom
   Flink types (the previous two feature requests could be implemented on
   top of this lower-level feature)
   4. Support for nullability semantics for message types (as it stands, an
   unset message is treated as equivalent to a message with default values
   for all fields, which is a confusing user experience)
   5. Support for nullability semantics for primitive types (in many of our
   services, the default value of a primitive-typed field is treated as
   equivalent to unset or null, so it would be good to offer this as a
   capability in the data warehouse)
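
To make points 1 and 4 concrete, here is a minimal sketch of the
unwrapping boilerplate that point 1 would eliminate (untested; the table
and field names are made up for illustration):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class WrapperUnwrapSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Assume `orders` is an existing protobuf-format table whose
            // proto schema is:
            //   message Order { google.protobuf.StringValue note = 1; }
            // flink-protobuf currently derives `note` as ROW<`value` STRING>,
            // so every consumer unwraps it by hand; with request 1 the field
            // would surface directly as a nullable STRING. Request 4 is the
            // related question of whether an unset `note` message should read
            // as NULL rather than as a row of protobuf default values.
            tEnv.executeSql(
                "CREATE TEMPORARY VIEW orders_flat AS "
              + "SELECT note.`value` AS note FROM orders");
        }
    }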

Would Flink accept patches for any or all of these feature requests? We're
contemplating forking flink-protobuf internally, but it would be better if
we could just upstream the relevant changes. (To my mind, 1, 2, and 4 are
broadly applicable features that are definitely worthy of upstream support.
3 and 5 may be somewhat more specific to our use case.)

Thanks,
Adam Richardson

Re: Feature requests for Flink protobuf deserialization

Posted by Benchao Li <li...@apache.org>.
Thanks for starting the discussion.

1. I'm +1 for this.
2. This is already supported; see [1].
3. I'm not sure about this one; could you give more examples beyond cases
1 and 2?
4&5. I think we have already addressed this with the option
'protobuf.read-default-values' [2]; is this what you want? (A minimal
sketch of the option follows the links below.)

[1] https://issues.apache.org/jira/browse/FLINK-30093
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/table/formats/protobuf/#protobuf-read-default-values
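
For concreteness, here is a minimal sketch of a table definition that
sets this option (untested; the connector, topic, and message class are
placeholders):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class ReadDefaultValuesSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // My reading of [2]: with 'protobuf.read-default-values' =
            // 'false' (the default), fields that are not set in the binary
            // message are decoded as NULL; with 'true', they are decoded as
            // the default values declared in the .proto file (see [2] for
            // the proto2/proto3 caveats).
            tEnv.executeSql(
                "CREATE TABLE orders ("
              + "  id BIGINT,"
              + "  note STRING"
              + ") WITH ("
              + "  'connector' = 'kafka',"
              + "  'topic' = 'orders',"
              + "  'properties.bootstrap.servers' = 'localhost:9092',"
              + "  'scan.startup.mode' = 'earliest-offset',"
              + "  'format' = 'protobuf',"
              + "  'protobuf.message-class-name' = 'com.example.Order',"
              + "  'protobuf.read-default-values' = 'false'"
              + ")");
        }
    }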

-- 
Best,
Benchao Li