Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/14 22:19:28 UTC

[GitHub] [hudi] the-other-tim-brown opened a new pull request, #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

the-other-tim-brown opened a new pull request, #6111:
URL: https://github.com/apache/hudi/pull/6111

   ## What is the purpose of the pull request
   
   RFC for adding Protobuf support to the DeltaStreamer, along with a Protobuf Kafka source.
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
veenaypatil commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955032251


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   It would be helpful to support cases where the schema is present in a schema registry, so that users don't have to provide the class name every time, similar to what we do today for the Avro source.
   
   Or should this be deferred to the next cut? WDYT @codope @vinothchandar?





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r951731810


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - By default the wrapper classes will be treated like any other message and have a nested `value` field. When this is set to true, we do not have a nested `value` field and treat the field as nullable in the generated Schema
+
+### ProtoToAvroConverter
+
+A class will be created that can take in a Protobuf Message and convert it to an Avro GenericRecord. This will be used inside the SourceFormatAdapter to properly convert to an avro RDD. To convert to `Dataset<Row>` we will first convert to Avro and then to Row. This change will be adding a new `Source.SourceType` as well so other sources in the future can implement this source type, for example Protobuf messages on PubSub.

Review Comment:
   I mean that I will write a class that does this. Updated the wording to hopefully be more clear.
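
For readers following the `flattenWrappers` option quoted above, here is a minimal sketch of the two representations. It is not the Hudi implementation: plain dicts stand in for decoded Protobuf messages, and `WRAPPER_FIELDS` and `flatten_wrappers` are hypothetical names chosen for illustration.

```python
# Hypothetical sketch: plain dicts stand in for decoded Protobuf messages.
# WRAPPER_FIELDS and flatten_wrappers are illustrative names, not Hudi APIs.

# Fields assumed to be declared with a wrapper type such as Int32Value.
WRAPPER_FIELDS = {"my_int"}


def flatten_wrappers(record, wrapper_fields=WRAPPER_FIELDS):
    """Replace each wrapper sub-message {"value": x} with x.

    An unset wrapper (None) stays None, which is what makes the
    flattened field nullable.
    """
    flattened = {}
    for name, value in record.items():
        if name in wrapper_fields:
            flattened[name] = None if value is None else value["value"]
        else:
            flattened[name] = value
    return flattened


nested = {"id": "a1", "my_int": {"value": 7}}  # read as my_int.value
unset = {"id": "a2", "my_int": None}           # wrapper not set

print(flatten_wrappers(nested))
print(flatten_wrappers(unset))
```

With flattening off, a consumer reads `my_int.value`; with it on, the same data is read as a nullable `my_int`.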





[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
veenaypatil commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955021852


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.

Review Comment:
   Have we also considered nested Message types here? How would they map?
   The same question applies to repeated fields (for example, a repeated Message type), enums, etc.
   
   Maybe we should call that out explicitly as well.





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r956755229


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   Tickets for follow up items:
   https://issues.apache.org/jira/browse/HUDI-4727
   https://issues.apache.org/jira/browse/HUDI-4732





[GitHub] [hudi] codope commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r951401598


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.

Review Comment:
   Let's elaborate more on certain schema aspects. You have already covered nullable fields. Other things to consider:
   1. Unsigned types: Avro doesn't support unsigned types, so these would probably map to the Avro long type.
   2. Schema evolution: Protobuf and Avro handle schema evolution differently. I think adding and removing fields should be OK as long as we account for default values in Avro while converting.
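
As a quick arithmetic check on the unsigned-types point, the sketch below compares the unsigned ranges against Avro's signed `int` (32-bit) and `long` (64-bit). The helper name `fits_signed` is hypothetical. The sketch only shows which values fit and says nothing about how an implementation should treat values that do not.

```python
# Range check for unsigned Protobuf integers against signed Avro widths.
# fits_signed is an illustrative helper, not part of any library.

UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


def fits_signed(value, bits):
    """Return True if value is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


print(fits_signed(UINT32_MAX, 32))  # uint32 max in an Avro int
print(fits_signed(UINT32_MAX, 64))  # uint32 max in an Avro long
print(fits_signed(UINT64_MAX, 64))  # uint64 max in an Avro long
```

A uint32 always fits in a 64-bit long. A uint64 above 2^63 - 1 does not, so mapping uint64 to long is lossless only for values within the signed 64-bit range.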



##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - By default the wrapper classes will be treated like any other message and have a nested `value` field. When this is set to true, we do not have a nested `value` field and treat the field as nullable in the generated Schema
+
+### ProtoToAvroConverter
+
+A class will be created that can take in a Protobuf Message and convert it to an Avro GenericRecord. This will be used inside the SourceFormatAdapter to properly convert to an avro RDD. To convert to `Dataset<Row>` we will first convert to Avro and then to Row. This change will be adding a new `Source.SourceType` as well so other sources in the future can implement this source type, for example Protobuf messages on PubSub.

Review Comment:
   > A class will be created
   
   What do you mean by this? When will this class be created? Do users need to implement an interface provided by us?





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r951731388


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.

Review Comment:
   Added more details on both points to the RFC





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955348630


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.

Review Comment:
   I'll add a mapping of proto to avro types here
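
To make the idea of such a mapping concrete, here is a rough sketch of what a scalar mapping could look like. Only the unsigned-to-long rows reflect RFC text quoted elsewhere in this thread. Every other row, along with the names `PROTO_TO_AVRO` and `avro_type_for`, is an assumption made for illustration and not the mapping the RFC ends up with.

```python
# Hypothetical scalar mapping from Protobuf type names to Avro primitive names.
# Only the unsigned -> "long" rows come from the RFC text in this thread;
# the remaining rows are assumptions based on matching widths.

PROTO_TO_AVRO = {
    "double": "double",
    "float": "float",
    "int32": "int",
    "sint32": "int",
    "sfixed32": "int",
    "int64": "long",
    "sint64": "long",
    "sfixed64": "long",
    "uint32": "long",   # unsigned: widened to long per the RFC
    "fixed32": "long",  # unsigned 32-bit: assumed to follow uint32
    "uint64": "long",   # unsigned: long per the RFC
    "fixed64": "long",  # unsigned 64-bit: assumed to follow uint64
    "bool": "boolean",
    "string": "string",
    "bytes": "bytes",
}


def avro_type_for(proto_type):
    """Look up the Avro primitive for a Protobuf scalar; raises KeyError if unmapped."""
    return PROTO_TO_AVRO[proto_type]


print(avro_type_for("uint32"))
print(avro_type_for("string"))
```

Nested messages, repeated fields, and enums are not scalars and are deliberately left out of this sketch.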





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955436628


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+#### Handling of Unsigned Integers and Longs
+Protobuf provides support for unsigned integers and longs while Avro does not. The schema provider will convert unsigned integers and longs to Avro long type in the schema definition.
+
+#### Schema Evolution
+**Adding a Field:**
+Protobuf has a default value for all fields and the translation from proto to avro schema will carry over this default value so there are no errors when adding a new field to the proto definition.
+**Removing a Field:**
+If a user removes a field in the Protobuf schema, the schema provider will not be able to add this field to the avro schema it generates. To avoid issues when writing data, users must use `hoodie.datasource.write.reconcile.schema=true` to properly reconcile the schemas if a field is removed from the proto definition. Users can avoid this situation by using `deprecated` field option in proto instead of removing the field from the schema.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - By default the wrapper classes will be treated like any other message and have a nested `value` field. When this is set to true, we do not have a nested `value` field and treat the field as nullable in the generated Schema
+
+### ProtoToAvroConverter

Review Comment:
   I'll add this to the RFC but note that it likely won't be done in the first cut.





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r922332178


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   My understanding of the MultiDeltaStreamer is that the tables are configured individually. I would expect that means you can configure a schema provider and in our case, proto class, for each of the tables.
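
A minimal sketch of that per-table idea, under stated assumptions: the table names and `com.example.proto.*` class names are hypothetical, and the merge helper only illustrates "shared properties plus per-table overrides". It is not how the MultiDeltaStreamer actually loads its configuration. Only the two `hoodie.deltastreamer.schemaprovider.proto.*` keys come from the RFC.

```python
# Hypothetical shared properties applied to every table.
COMMON_PROPS = {
    "hoodie.deltastreamer.schemaprovider.proto.flattenWrappers": "false",
}

# Hypothetical per-table overrides; each table names its own proto class.
TABLE_PROPS = {
    "db.orders": {
        "hoodie.deltastreamer.schemaprovider.proto.className": "com.example.proto.Order",
    },
    "db.users": {
        "hoodie.deltastreamer.schemaprovider.proto.className": "com.example.proto.User",
        "hoodie.deltastreamer.schemaprovider.proto.flattenWrappers": "true",
    },
}


def effective_props(table: str) -> dict:
    """Merge shared properties with one table's overrides (the table wins)."""
    merged = dict(COMMON_PROPS)
    merged.update(TABLE_PROPS[table])
    return merged
```

With this layout each table resolves its own Protobuf class, so two topics with different message types would not need to share a schema provider setting.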





[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
veenaypatil commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955617721


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   Yes, let's take it up in the next cut.





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955348242


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+#### Handling of Unsigned Integers and Longs
+Protobuf provides support for unsigned integers and longs while Avro does not. The schema provider will convert unsigned integers and longs to Avro long type in the schema definition.
+
+#### Schema Evolution
+**Adding a Field:**
+Protobuf has a default value for all fields and the translation from proto to avro schema will carry over this default value so there are no errors when adding a new field to the proto definition.
+**Removing a Field:**
+If a user removes a field in the Protobuf schema, the schema provider will not be able to add this field to the avro schema it generates. To avoid issues when writing data, users must use `hoodie.datasource.write.reconcile.schema=true` to properly reconcile the schemas if a field is removed from the proto definition. Users can avoid this situation by using `deprecated` field option in proto instead of removing the field from the schema.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - By default the wrapper classes will be treated like any other message and have a nested `value` field. When this is set to true, we do not have a nested `value` field and treat the field as nullable in the generated Schema
+
+### ProtoToAvroConverter

Review Comment:
   This is done to reduce the scope of the initial changes. The converter utils for generating a schema and converting a message into an Avro record with that schema got a bit large, so I was trying to reuse the Avro-to-row logic for now. I can follow up with a direct converter. I think going from proto to Avro will have similar perf to the JSON-to-Avro transformation that exists today, though. Proto to Avro to row is definitely something to improve on.
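
One small piece of that proto-to-Avro conversion, the unsigned handling described in the quoted RFC text, can be sketched as follows. This assumes the Java Protobuf convention of storing unsigned values in their signed counterparts, with the top bit kept in the sign bit. The function names are hypothetical and are not taken from the PR.

```python
AVRO_LONG_MAX = 2**63 - 1  # Avro `long` is a 64-bit signed integer


def uint32_to_avro_long(signed_int32: int) -> int:
    """Reinterpret a signed 32-bit value as unsigned; the result always fits an Avro long."""
    return signed_int32 & 0xFFFFFFFF


def uint64_fits_avro_long(value: int) -> bool:
    """Report whether an unsigned 64-bit value is representable as an Avro long."""
    return 0 <= value <= AVRO_LONG_MAX
```

Every `uint32` value fits in an Avro long, but a `uint64` above 2^63 - 1 does not. How such values should be handled is not spelled out in the quoted text, so the sketch only detects that case.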





[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
veenaypatil commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r921799948


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   How would this work when we want to read from multiple Kafka topics, for example with the MultiDeltaStreamer?





[GitHub] [hudi] codope commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r956710654


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   Let's track it in the epic. Can you please add all tasks for this RFC in the epic?





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955436277


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   Do you have experience using that confluent value deserializer? We can add that in as an option but I don't have experience with it so may need your help.
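
For context on why a separate option would be needed, here is a rough sketch of the framing that Confluent serializers put in front of the payload, as I understand it: one magic byte (0) followed by a 4-byte big-endian schema ID. For Protobuf, further message-index bytes follow before the serialized message. This sketch does not decode them and leaves them in the returned remainder. The function name is hypothetical.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # assumed Confluent wire-format marker


def split_confluent_header(payload: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, remainder) for a Confluent-framed Kafka value."""
    if len(payload) < 5:
        raise ValueError("payload too short for a Confluent header")
    # ">BI" = big-endian, one unsigned byte, one unsigned 32-bit int (5 bytes total).
    magic, schema_id = struct.unpack(">BI", payload[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id, payload[5:]
```

A deserializer that passes the raw Kafka bytes straight to the generated Message parser would trip over this prefix, which is why Confluent-serialized topics would likely need their own code path.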





[GitHub] [hudi] vinothchandar commented on pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on PR #6111:
URL: https://github.com/apache/hudi/pull/6111#issuecomment-1185199244

   cc @veenaypatil to review as well




[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
veenaypatil commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955025011


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.
+
+#### Handling of Unsigned Integers and Longs
+Protobuf provides support for unsigned integers and longs while Avro does not. The schema provider will convert unsigned integers and longs to Avro long type in the schema definition.
+
+#### Schema Evolution
+**Adding a Field:**
+Protobuf has a default value for all fields and the translation from proto to avro schema will carry over this default value so there are no errors when adding a new field to the proto definition.
+**Removing a Field:**
+If a user removes a field in the Protobuf schema, the schema provider will not be able to add this field to the avro schema it generates. To avoid issues when writing data, users must use `hoodie.datasource.write.reconcile.schema=true` to properly reconcile the schemas if a field is removed from the proto definition. Users can avoid this situation by using `deprecated` field option in proto instead of removing the field from the schema.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - By default the wrapper classes will be treated like any other message and have a nested `value` field. When this is set to true, we do not have a nested `value` field and treat the field as nullable in the generated Schema
+
+### ProtoToAvroConverter

Review Comment:
   Is this done to reuse the existing code base?
   Do you foresee any performance issues, since this will be done for every message?





[GitHub] [hudi] codope commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r956710494


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,76 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-56: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system.

Review Comment:
   @the-other-tim-brown Let's track it in the epic?
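
For reference while tracking this, the wrapper handling in the quoted paragraph can be sketched as below. This is an illustration only: the mapping table, the function name, and the exact Avro schema shape (in particular, making the nested record nullable) are assumptions, not the schema provider's actual output.

```python
# Assumed mapping from a few well-known wrapper types to Avro primitives.
WRAPPER_TO_AVRO = {
    "Int32Value": "int",
    "Int64Value": "long",
    "StringValue": "string",
    "BoolValue": "boolean",
}


def avro_field_for_wrapper(name: str, wrapper: str, flatten_wrappers: bool) -> dict:
    """Build an Avro field definition for a Protobuf wrapper-typed field."""
    primitive = WRAPPER_TO_AVRO[wrapper]
    if flatten_wrappers:
        # Flattened: the field itself is a nullable primitive (`my_int`).
        return {"name": name, "type": ["null", primitive], "default": None}
    # Nested: the field is a record with a single `value` field (`my_int.value`).
    nested = {
        "type": "record",
        "name": wrapper,
        "fields": [{"name": "value", "type": primitive}],
    }
    return {"name": name, "type": ["null", nested], "default": None}
```

For `Int32Value my_int = 1;`, the flattened form yields a nullable `int` column named `my_int`, while the nested form keeps the value under `my_int.value`.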





[GitHub] [hudi] codope commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r956710654


##########
rfc/rfc-57/rfc-57.md:
##########
@@ -0,0 +1,85 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   @the-other-tim-brown Let's track it in the epic. Can you please add all tasks for this RFC in the epic?





[GitHub] [hudi] codope merged pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

Posted by GitBox <gi...@apache.org>.
codope merged PR #6111:
URL: https://github.com/apache/hudi/pull/6111

