You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/08/24 12:54:02 UTC

[GitHub] [parquet-mr] qinghui-xu opened a new pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

qinghui-xu opened a new pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-679108339


   Trying to re-trigger Travis by close-reopen


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] Fokko commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
Fokko commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-678960902


   @qinghui-xu The CI is experiencing some connectivity issues, could you rebase against master to retrigger the build?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] qinghui-xu commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
qinghui-xu commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-670526172


   Hey, guys.
   Will be glad to work on it. Would try to push another patch at the beginning of next week.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] qinghui-xu commented on a change in pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
qinghui-xu commented on a change in pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#discussion_r467628479



##########
File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java
##########
@@ -130,13 +137,36 @@ public WriteContext init(Configuration configuration) {
 
     this.messageWriter = new MessageWriter(messageDescriptor, rootSchema);
 
-    Map<String, String> extraMetaData = new HashMap<String, String>();
+    Map<String, String> extraMetaData = new HashMap<>();
     extraMetaData.put(ProtoReadSupport.PB_CLASS, protoMessage.getName());
     extraMetaData.put(ProtoReadSupport.PB_DESCRIPTOR, serializeDescriptor(protoMessage));
     extraMetaData.put(PB_SPECS_COMPLIANT_WRITE, String.valueOf(writeSpecsCompliant));
     return new WriteContext(rootSchema, extraMetaData);
   }
 
+  @Override
+  public FinalizedWriteContext finalizeWrite() {
+    Map<String, String> protoMetadata = new HashMap<>();
+    for (Map.Entry<String, Map<String, Integer>> enumNameNumberMapping : protoEnumBookKeeper.entrySet()) {
+      StringBuilder nameNumberPairs = new StringBuilder();
+      if (enumNameNumberMapping.getValue().isEmpty()) {
+        // No enum is ever written to any column of this file, put an empty string as the value in the metadata
+        LOG.info("No enum is written for " + enumNameNumberMapping.getKey());
+      }
+      int idx = 0;
+      for (Map.Entry<String, Integer> nameNumberPair : enumNameNumberMapping.getValue().entrySet()) {

Review comment:
       Sorry for the late reply. @costimuraru
   I refactor out all the logic of building enum metadata in a method. I think this would be more clear.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky merged pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
gszadovszky merged pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] browles edited a comment on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
browles edited a comment on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-670219375


   Hey guys, what is the status on this PR? We are currently experiencing this issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-670462074


   Seems like the guys missed to push it after approving. 
   @qinghui-xu, are you willing to pick this up again and resolve the conflicts? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] browles commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
browles commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-670219375


   Hey guys, what is the status on this PR? We currently experiencing this issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky closed pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
gszadovszky closed pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#discussion_r467864366



##########
File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##########
@@ -222,11 +263,37 @@ public ProtoEnumConverter(ParentValueContainer parent, Descriptors.FieldDescript
       Descriptors.EnumValueDescriptor protoValue = enumLookup.get(binaryValue);
 
       if (protoValue == null) {
-        Set<Binary> knownValues = enumLookup.keySet();
-        String msg = "Illegal enum value \"" + binaryValue + "\""
-                + " in protocol buffer \"" + fieldType.getFullName() + "\""
-                + " legal values are: \"" + knownValues + "\"";
-        throw new InvalidRecordException(msg);
+        // in case of unknown enum value, protobuf is creating new EnumValueDescriptor with the unknown number
+        // and name as following "UNKNOWN_ENUM_VALUE_" + parent.getName() + "_" + number
+        // so the idea is to parse the name for data created by parquet-proto before this patch
+        String unknownLabel = new String(binaryValue.getBytes());

Review comment:
       I would suggest using `Binary.toStringUsingUTF8()` instead.

##########
File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##########
@@ -190,25 +211,45 @@ private Converter newScalarConverter(ParentValueContainer pvc, Message.Builder p
     private final Map<Binary, Descriptors.EnumValueDescriptor> enumLookup;
     private Descriptors.EnumValueDescriptor[] dict;
     private final ParentValueContainer parent;
+    private final Descriptors.EnumDescriptor enumType;
+    private final String unknownEnumPrefix;
+    private final boolean acceptUnknownEnum;
 
     public ProtoEnumConverter(ParentValueContainer parent, Descriptors.FieldDescriptor fieldType) {
       this.parent = parent;
       this.fieldType = fieldType;
-      this.enumLookup = makeLookupStructure(fieldType);
+      this.enumType = fieldType.getEnumType();
+      this.enumLookup = makeLookupStructure(enumType);
+      unknownEnumPrefix = "UNKNOWN_ENUM_VALUE_" + enumType.getName() + "_";
+      acceptUnknownEnum = conf.getBoolean(CONFIG_ACCEPT_UNKNOWN_ENUM, false);
     }
 
     /**
      * Fills lookup structure for translating between parquet enum values and Protocol buffer enum values.
      * */
-    private Map<Binary, Descriptors.EnumValueDescriptor> makeLookupStructure(Descriptors.FieldDescriptor enumFieldType) {
-      Descriptors.EnumDescriptor enumType = enumFieldType.getEnumType();
+    private Map<Binary, Descriptors.EnumValueDescriptor> makeLookupStructure(Descriptors.EnumDescriptor enumType) {
       Map<Binary, Descriptors.EnumValueDescriptor> lookupStructure = new HashMap<Binary, Descriptors.EnumValueDescriptor>();
 
-      List<Descriptors.EnumValueDescriptor> enumValues = enumType.getValues();
+      if (extraMetadata.containsKey(METADATA_ENUM_PREFIX + enumType.getFullName())) {
+        String enumNameNumberPairs = extraMetadata.get(METADATA_ENUM_PREFIX + enumType.getFullName());
+        if (StringUtils.isBlank(enumNameNumberPairs)) {
+          LOG.info("No enum is written for " + enumType.getFullName());

Review comment:
       As far as I understand this one will be logged quite often. We've already had some about logging too much. I would suggest having this on debug level instead.

##########
File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##########
@@ -22,7 +22,10 @@
 import com.google.protobuf.Descriptors;
 import com.google.protobuf.Message;
 import com.twitter.elephantbird.util.Protobufs;
+import org.apache.commons.lang.StringUtils;

Review comment:
       There is no direct dependency to `commons-lang` in `parquet-protobuf` so I would not start using it for such a simple functionality.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] qinghui-xu commented on pull request #561: PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value

Posted by GitBox <gi...@apache.org>.
qinghui-xu commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-679101925


   @Fokko It's already on top of the master branch. Could you retrigger the CI by hand somehow?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org