You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2022/10/31 15:48:42 UTC

[GitHub] [parquet-mr] jinyius opened a new pull request, #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

jinyius opened a new pull request, #995:
URL: https://github.com/apache/parquet-mr/pull/995

   - This is an alternative approach to supporting recursion to apache#445 and apache#988.
   - This approach could address the other recursion related issues (PARQUET-129, PARQUET-554).
   - TODO: ReadSupport
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-1711
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
     - `ProtoSchemaConverterTest#test*Recursion`
     - `ProtoWriteSupportTest#test*Recursion`
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1272447374

   Mostly looks reasonable, I'm not too familiar with parquet-mr @shangxinli can you recommend someone who might be able to give a better review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990727498


##########
parquet-protobuf/src/test/resources/Trees.proto:
##########
@@ -0,0 +1,37 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+
+syntax = "proto3";
+
+package Trees;
+
+import "google/protobuf/any.proto";
+
+option java_package = "org.apache.parquet.proto.test";
+
+message BinaryTree {
+    google.protobuf.Any value = 1;

Review Comment:
   it would be good to verify that something like:
   
   message WrappedTree {
      google.protobuf.Any non_recursive = 1;
      BinaryTree tree = 2;
   }
   
   Also gives expected results (non_recursive doesn't accidentally trigger any of the recursio logic). 
   }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990914303


##########
parquet-protobuf/src/test/resources/BinaryTree.par:
##########
@@ -0,0 +1,50 @@
+message Trees.BinaryTree {
+  optional group value = 1 {

Review Comment:
   this is parquet schema, not proto.  protos should/would have a .proto suffix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984237501


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -124,35 +164,61 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addR
         .named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder) {
-    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result =
-      builder
+  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result = builder
         .group(Type.Repetition.OPTIONAL).as(listType())
         .group(Type.Repetition.REPEATED)
         .group(Type.Repetition.OPTIONAL);
 
-    convertFields(result, descriptor.getMessageType().getFields());
+    convertFields(result, descriptor.getMessageType().getFields(), seen, depth);
 
     return result.named("element").named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    // Prevent recursion by terminating with optional proto bytes.
+    depth += 1;
+    String typeName = getInnerTypeName(descriptor);
+    LOG.trace("addMessageField: " + descriptor.getFullName() + " type: " + typeName + " depth: " + depth);
+    if (typeName != null) {
+      if (seen.get(typeName).size() > maxRecursion) {
+        return builder.primitive(BINARY, Type.Repetition.OPTIONAL).as((LogicalTypeAnnotation) null);
+      }
+    }
+
     if (descriptor.isMapField() && parquetSpecsCompliant) {
       // the old schema style did not include the MAP wrapper around map groups
-      return addMapField(descriptor, builder);
+      return addMapField(descriptor, builder, seen, depth);
     }
+
+    seen = ImmutableSetMultimap.<String, Integer>builder().putAll(seen).put(typeName, depth).build();

Review Comment:
   if this gets modified every time through this method, is immutability useful?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990726095


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
     }
   }
 
-  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {

Review Comment:
   right, I was thinking of encapsulating this logic into its own class, so they can be recorded and updated together, to  1.  Reduce additional parameters that have to be passed through.
   2.  Encapsulate the logic behind more mnemonic method names (e.g. AddRecursiveStep())



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1298742515

   yeah, i still don't see a button to merge.  it now shows everything approved, checks passed, and no conflicts.
   
   i think a committer needs to merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984838003


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -124,35 +164,61 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addR
         .named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder) {
-    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result =
-      builder
+  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result = builder
         .group(Type.Repetition.OPTIONAL).as(listType())
         .group(Type.Repetition.REPEATED)
         .group(Type.Repetition.OPTIONAL);
 
-    convertFields(result, descriptor.getMessageType().getFields());
+    convertFields(result, descriptor.getMessageType().getFields(), seen, depth);
 
     return result.named("element").named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    // Prevent recursion by terminating with optional proto bytes.
+    depth += 1;
+    String typeName = getInnerTypeName(descriptor);
+    LOG.trace("addMessageField: " + descriptor.getFullName() + " type: " + typeName + " depth: " + depth);
+    if (typeName != null) {
+      if (seen.get(typeName).size() > maxRecursion) {
+        return builder.primitive(BINARY, Type.Repetition.OPTIONAL).as((LogicalTypeAnnotation) null);
+      }
+    }
+
     if (descriptor.isMapField() && parquetSpecsCompliant) {
       // the old schema style did not include the MAP wrapper around map groups
-      return addMapField(descriptor, builder);
+      return addMapField(descriptor, builder, seen, depth);
     }
+
+    seen = ImmutableSetMultimap.<String, Integer>builder().putAll(seen).put(typeName, depth).build();

Review Comment:
   it's actually not as costly as you think.  guava's immutable structures are written to simply remove method access what not needed, and takes tries its best to avoid memory reallocations when using copyOf or builder patterns [[1](https://github.com/google/guava/wiki/ImmutableCollectionsExplained)][[2](https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/ImmutableSetMultimap.java#L365)][[3](https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/ImmutableSetMultimap.java#L306)][[4](https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/ImmutableSetMultimap.java#L291)].... [generally](https://stackoverflow.com/questions/1284727/mutable-or-immutable-class).  it's pretty [efficient](https://github.com/DimitrisAndreou/memory-measurer/blob/master/ElementCostInDataStructures.txt).  because of depth first traversal, we do want to "go back" and let the previous state of counts start again as the basis for other bra
 nch traversals.  this is exactly the benefit as it helps in avoiding defensive copying of mutable data structures or clearing of fields trying to use a single instance when traversing and going back up the stack.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1289372825

   can someone retry the github actions?  there seemed to have been a transient issue that caused one of the test/build targets to fail.  i'd like to get this change in this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] ggershinsky commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
ggershinsky commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1297193514

   yep, just the squash/merge button.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] ggershinsky commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
ggershinsky commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1298107615

   @shangxinli are you ok with this PR in its current form?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990725805


##########
parquet-protobuf/src/test/java/org/apache/parquet/proto/ProtoSchemaConverterTest.java:
##########
@@ -82,264 +93,447 @@ public void testConvertAllDatatypes() throws Exception {
    * Tests that all protocol buffer datatypes are converted to correct parquet datatypes.
    */
   @Test
-  public void testProto3ConvertAllDatatypes() throws Exception {
-    String expectedSchema =
-      "message TestProto3.SchemaConverterAllDatatypes {\n" +
-        "  optional double optionalDouble = 1;\n" +
-        "  optional float optionalFloat = 2;\n" +
-        "  optional int32 optionalInt32 = 3;\n" +
-        "  optional int64 optionalInt64 = 4;\n" +
-        "  optional int32 optionalUInt32 = 5;\n" +
-        "  optional int64 optionalUInt64 = 6;\n" +
-        "  optional int32 optionalSInt32 = 7;\n" +
-        "  optional int64 optionalSInt64 = 8;\n" +
-        "  optional int32 optionalFixed32 = 9;\n" +
-        "  optional int64 optionalFixed64 = 10;\n" +
-        "  optional int32 optionalSFixed32 = 11;\n" +
-        "  optional int64 optionalSFixed64 = 12;\n" +
-        "  optional boolean optionalBool = 13;\n" +
-        "  optional binary optionalString (UTF8) = 14;\n" +
-        "  optional binary optionalBytes = 15;\n" +
-        "  optional group optionalMessage = 16 {\n" +
-        "    optional int32 someId = 3;\n" +
-        "  }\n" +
-        "  optional binary optionalEnum (ENUM) = 18;" +
-        "  optional int32 someInt32 = 19;" +
-        "  optional binary someString (UTF8) = 20;" +
-        "  optional group optionalMap (MAP) = 21 {\n" +
-        "    repeated group key_value {\n" +
-        "      required int64 key;\n" +
-        "      optional group value {\n" +
-        "        optional int32 someId = 3;\n" +
-        "      }\n" +
-        "    }\n" +
-        "  }\n" +
-        "}";
+  public void testProto3ConvertAllDatatypes() {
+    String expectedSchema = JOINER.join(

Review Comment:
   is it possible to separate this tpe of code style cleanup from functional changes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990726272


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##########
@@ -559,7 +564,14 @@ final void writeRawValue(Object value) {
   class BinaryWriter extends FieldWriter {
     @Override
     final void writeRawValue(Object value) {
-      ByteString byteString = (ByteString) value;
+      // Non-ByteString values can happen when recursions gets truncated.
+      ByteString byteString = value instanceof ByteString
+          ? (ByteString) value
+          // TODO: figure out a way to use MessageOrBuilder
+          : value instanceof Message
+          ? ((Message) value).toByteString()
+          // Worst-case, just dump as plain java string.
+          : ByteString.copyFromUtf8(value.toString());

Review Comment:
   is this actually an intended state?  If not it is probably better to raise an exception then writing data that could possibly be hard to recover.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990913344


##########
parquet-protobuf/src/test/java/org/apache/parquet/proto/ProtoSchemaConverterTest.java:
##########
@@ -82,264 +93,447 @@ public void testConvertAllDatatypes() throws Exception {
    * Tests that all protocol buffer datatypes are converted to correct parquet datatypes.
    */
   @Test
-  public void testProto3ConvertAllDatatypes() throws Exception {
-    String expectedSchema =
-      "message TestProto3.SchemaConverterAllDatatypes {\n" +
-        "  optional double optionalDouble = 1;\n" +
-        "  optional float optionalFloat = 2;\n" +
-        "  optional int32 optionalInt32 = 3;\n" +
-        "  optional int64 optionalInt64 = 4;\n" +
-        "  optional int32 optionalUInt32 = 5;\n" +
-        "  optional int64 optionalUInt64 = 6;\n" +
-        "  optional int32 optionalSInt32 = 7;\n" +
-        "  optional int64 optionalSInt64 = 8;\n" +
-        "  optional int32 optionalFixed32 = 9;\n" +
-        "  optional int64 optionalFixed64 = 10;\n" +
-        "  optional int32 optionalSFixed32 = 11;\n" +
-        "  optional int64 optionalSFixed64 = 12;\n" +
-        "  optional boolean optionalBool = 13;\n" +
-        "  optional binary optionalString (UTF8) = 14;\n" +
-        "  optional binary optionalBytes = 15;\n" +
-        "  optional group optionalMessage = 16 {\n" +
-        "    optional int32 someId = 3;\n" +
-        "  }\n" +
-        "  optional binary optionalEnum (ENUM) = 18;" +
-        "  optional int32 someInt32 = 19;" +
-        "  optional binary someString (UTF8) = 20;" +
-        "  optional group optionalMap (MAP) = 21 {\n" +
-        "    repeated group key_value {\n" +
-        "      required int64 key;\n" +
-        "      optional group value {\n" +
-        "        optional int32 someId = 3;\n" +
-        "      }\n" +
-        "    }\n" +
-        "  }\n" +
-        "}";
+  public void testProto3ConvertAllDatatypes() {
+    String expectedSchema = JOINER.join(

Review Comment:
   wdym by "tpe"?
   
   if this isn't blocking, i'd rather avoid the busy-work to undo and redo in a different branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1284596663

   > I would also like to recommend adding @matthieun as a co-author to this PR, per the discussion in the parallel PR.
   
   how do you do this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990914702


##########
parquet-protobuf/src/test/resources/Trees.proto:
##########
@@ -0,0 +1,37 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+
+syntax = "proto3";
+
+package Trees;
+
+import "google/protobuf/any.proto";
+
+option java_package = "org.apache.parquet.proto.test";
+
+message BinaryTree {
+    google.protobuf.Any value = 1;

Review Comment:
   i think the existing non-recursive proto tests exercise the existing and newly added (the skipping behavior) code paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990915628


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##########
@@ -559,7 +564,14 @@ final void writeRawValue(Object value) {
   class BinaryWriter extends FieldWriter {
     @Override
     final void writeRawValue(Object value) {
-      ByteString byteString = (ByteString) value;
+      // Non-ByteString values can happen when recursions gets truncated.
+      ByteString byteString = value instanceof ByteString
+          ? (ByteString) value
+          // TODO: figure out a way to use MessageOrBuilder
+          : value instanceof Message
+          ? ((Message) value).toByteString()

Review Comment:
   no, afaict: https://www.javadoc.io/doc/org.apache.parquet/parquet-column/latest/org/apache/parquet/io/api/RecordConsumer.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990727282


##########
parquet-protobuf/src/test/resources/BinaryTree.par:
##########
@@ -0,0 +1,50 @@
+message Trees.BinaryTree {
+  optional group value = 1 {

Review Comment:
   or is par not proto?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1247345339

   ping


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984236009


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -124,35 +164,61 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addR
         .named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder) {
-    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result =
-      builder
+  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result = builder
         .group(Type.Repetition.OPTIONAL).as(listType())
         .group(Type.Repetition.REPEATED)
         .group(Type.Repetition.OPTIONAL);
 
-    convertFields(result, descriptor.getMessageType().getFields());
+    convertFields(result, descriptor.getMessageType().getFields(), seen, depth);
 
     return result.named("element").named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    // Prevent recursion by terminating with optional proto bytes.
+    depth += 1;
+    String typeName = getInnerTypeName(descriptor);
+    LOG.trace("addMessageField: " + descriptor.getFullName() + " type: " + typeName + " depth: " + depth);

Review Comment:
   its been a while since I've done java but doesn't the string concatenation as a parameter, incur overhead even if the log level is is above trace (i.e. doesn't there need to be some string formatting that takes the string arguments as separate parameters?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] shangxinli commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
shangxinli commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1281237456

   @ggershinsky Can you have a look? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984839210


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
     }
   }
 
-  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {

Review Comment:
   ?
   
   the `seen` map does encode the seen fields along with their depth as a single datastructure.  `depth` being a separate arg is important b/c it's the current depth in the traversal, and is used to update the seen data structure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1296519347

   @ggershinsky what is the process to merge this?  Does parquet-mr just use the github UI?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius closed pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius closed pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth
URL: https://github.com/apache/parquet-mr/pull/995


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1297292700

   > 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] shangxinli commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
shangxinli commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1300567095

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] shangxinli merged pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
shangxinli merged PR #995:
URL: https://github.com/apache/parquet-mr/pull/995


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1261287406

   fixed missing dep issue.  can someone approve the ci flow?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984809641


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -124,35 +164,61 @@ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addR
         .named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder) {
-    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result =
-      builder
+  private <T> GroupBuilder<GroupBuilder<T>> addRepeatedMessage(FieldDescriptor descriptor, GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result = builder
         .group(Type.Repetition.OPTIONAL).as(listType())
         .group(Type.Repetition.REPEATED)
         .group(Type.Repetition.OPTIONAL);
 
-    convertFields(result, descriptor.getMessageType().getFields());
+    convertFields(result, descriptor.getMessageType().getFields(), seen, depth);
 
     return result.named("element").named("list");
   }
 
-  private <T> GroupBuilder<GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addMessageField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
+    // Prevent recursion by terminating with optional proto bytes.
+    depth += 1;
+    String typeName = getInnerTypeName(descriptor);
+    LOG.trace("addMessageField: " + descriptor.getFullName() + " type: " + typeName + " depth: " + depth);

Review Comment:
   you're correct.  however, proto schema conversion shouldn't happen repeatedly in the greater flow of a processing job (ideally, just once), so this overhead isn't too bad.
   
   i'll move to the parameterized/formatted logging calls in the files i touch here.  i would suggest the rest of the codebase do the same to avoid this penalty as well, but it's beyond the scope of this pr.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] ggershinsky commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
ggershinsky commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1283566340

   I would also like to recommend adding @matthieun as a co-author to this PR, per the discussion in the parallel PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1271822621

   ping


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1297295385

   @ggershinsky 
   
   i'd love to just hit the button.  i don't see it.  the workflow for travis ci had a failure due to a transient connection issue, and so it wasn't giving me the option to merge.  the ui messaging also states that "Only those with [write access](https://docs.github.com/articles/what-are-the-different-access-permissions) to this repository can merge pull requests." 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1298913644

   @jinyius only committers can see the button.  I was asking because different repos have different commit procedures.  Should be able to merge this soon as long as @shangxinli doesn't express concerns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1281213675

   > Mostly looks reasonable, I'm not too familiar with parquet-mr @shangxinli can you recommend someone who might be able to give a better review?
   
   pinging @shangxinli :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#issuecomment-1263878341

   thanks for the review.  updated to handle the logging perf concern as well as fixing the javadoc errors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984238238


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
     }
   }
 
-  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {

Review Comment:
   would it make sense to consolidate seen and depth into a single data-structure that can be passed through and abstract some of the direct access to the multimap?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r984839210


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
     }
   }
 
-  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {

Review Comment:
   ?
   
   the seen map does encode the depth as a single datastructure.  the depth being a separate arg is important b/c it's the current depth in the traversal, and is used to update the seen data structure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990726331


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##########
@@ -559,7 +564,14 @@ final void writeRawValue(Object value) {
   class BinaryWriter extends FieldWriter {
     @Override
     final void writeRawValue(Object value) {
-      ByteString byteString = (ByteString) value;
+      // Non-ByteString values can happen when recursions gets truncated.
+      ByteString byteString = value instanceof ByteString
+          ? (ByteString) value
+          // TODO: figure out a way to use MessageOrBuilder
+          : value instanceof Message
+          ? ((Message) value).toByteString()

Review Comment:
   does recordconsumer offer a stream API or something else to avoid the additional array/bytestring copies?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] emkornfield commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990727030


##########
parquet-protobuf/src/test/resources/BinaryTree.par:
##########
@@ -0,0 +1,50 @@
+message Trees.BinaryTree {
+  optional group value = 1 {

Review Comment:
   Aren't groups [deprecated](https://developers.google.com/protocol-buffers/docs/proto#groups)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990914134


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##########
@@ -559,7 +564,14 @@ final void writeRawValue(Object value) {
   class BinaryWriter extends FieldWriter {
     @Override
     final void writeRawValue(Object value) {
-      ByteString byteString = (ByteString) value;
+      // Non-ByteString values can happen when recursions gets truncated.
+      ByteString byteString = value instanceof ByteString
+          ? (ByteString) value
+          // TODO: figure out a way to use MessageOrBuilder
+          : value instanceof Message
+          ? ((Message) value).toByteString()
+          // Worst-case, just dump as plain java string.
+          : ByteString.copyFromUtf8(value.toString());

Review Comment:
   this is intended.  for a real-time, production pipeline i'm working on, losing data as it passes through or killing the job b/c of an uncaught exception is problematic as it could lead to data loss and down time.  this way, there's some way to know what the problematic data was and fix it properly asap.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-mr] jinyius commented on a diff in pull request #995: PARQUET-1711: support recursive proto schemas by limiting recursion depth

Posted by GitBox <gi...@apache.org>.
jinyius commented on code in PR #995:
URL: https://github.com/apache/parquet-mr/pull/995#discussion_r990912850


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
     }
   }
 
-  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+  private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {

Review Comment:
   i'm not sure encapsulation helps with readability or protection in this case.  they are really tracking different things, and should be understood by readers of the traversal code to know how each piece of state is used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org