You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/06/09 18:45:00 UTC

[jira] [Commented] (PARQUET-1020) Add support for Dynamic Messages in parquet-protobuf

    [ https://issues.apache.org/jira/browse/PARQUET-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552379#comment-17552379 ] 

ASF GitHub Bot commented on PARQUET-1020:
-----------------------------------------

dossett commented on code in PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#discussion_r893846424


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##########
@@ -115,27 +120,32 @@ public void prepareForWrite(RecordConsumer recordConsumer) {
   public WriteContext init(Configuration configuration) {
 
     // if no protobuf descriptor was given in constructor, load descriptor from configuration (set with setProtobufClass)
-    if (protoMessage == null) {
-      Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
-      if (pbClass != null) {
-        protoMessage = pbClass;
-      } else {
-        String msg = "Protocol buffer class not specified.";
-        String hint = " Please use method ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
-        throw new BadConfigurationException(msg + hint);
+    if (descriptor == null) {
+      if (protoMessage == null) {
+        Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
+        if (pbClass != null) {
+          protoMessage = pbClass;
+        } else {
+          String msg = "Protocol buffer class or descriptor not specified.";
+          String hint = " Please use method ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
+          throw new BadConfigurationException(msg + hint);
+        }
       }
+      descriptor = Protobufs.getMessageDescriptor(protoMessage);
+    } else {
+      //Assume no specific Message extending class, so use DynamicMessage
+      protoMessage = DynamicMessage.class;

Review Comment:
   Should this just be left null?  Having it set implies it usable, but it's only used to determine the descriptor and get a protoschemaconverter and then only in cases when we don't have a descriptor.
   
   I just worry that setting it could lead to assumptions later on that it's as usable as if it were a generated message and not a dynamic message.



##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##########
@@ -115,27 +120,32 @@ public void prepareForWrite(RecordConsumer recordConsumer) {
   public WriteContext init(Configuration configuration) {
 
     // if no protobuf descriptor was given in constructor, load descriptor from configuration (set with setProtobufClass)
-    if (protoMessage == null) {
-      Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
-      if (pbClass != null) {
-        protoMessage = pbClass;
-      } else {
-        String msg = "Protocol buffer class not specified.";
-        String hint = " Please use method ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
-        throw new BadConfigurationException(msg + hint);
+    if (descriptor == null) {
+      if (protoMessage == null) {
+        Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
+        if (pbClass != null) {
+          protoMessage = pbClass;
+        } else {
+          String msg = "Protocol buffer class or descriptor not specified.";
+          String hint = " Please use method ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
+          throw new BadConfigurationException(msg + hint);
+        }
       }
+      descriptor = Protobufs.getMessageDescriptor(protoMessage);
+    } else {
+      //Assume no specific Message extending class, so use DynamicMessage
+      protoMessage = DynamicMessage.class;

Review Comment:
   Should this just be left null?  Having it set implies it usable, but it's only used to determine the descriptor and get a protoschemaconverter and then only in cases when we don't have a descriptor.
   
   I just worry that setting it could lead to assumptions later on that it's as usable as if it were a generated message and not a dynamic message.





> Add support for Dynamic Messages in parquet-protobuf
> ----------------------------------------------------
>
>                 Key: PARQUET-1020
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1020
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Alex Buck
>            Assignee: Alex Buck
>            Priority: Major
>
> Hello. We would like to pass in a DynamicMessage rather than using the generated protobuf classes to allow us to make our job very generic. 
> I think this could be achieved by setting the descriptor upfront, similarly to how there is a ProtoParquetOutputFormat today.
> In ProtoWriteSupport in the init method it could then generate the parquet schema created by ProtoSchemaConverter using the passed in descriptor, rather than taking it from the generated proto class.
> Would there be interest in incorporating this change? If so does the approach above sound sensible? I am happy to do a pull request
> initial PR here: https://github.com/apache/parquet-mr/pull/414



--
This message was sent by Atlassian Jira
(v8.20.7#820007)