You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/07/31 20:33:08 UTC

[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-667344642


   > I think, directly using `ExtType` is not really user friendly. As the "metadata" for a type is currently used only for encryption I would not expose it directly to the user. There are a couple of ways for creating a parquet schema (e.g. converting from an avro schema, using the type builders, parsing it from a string etc.). So the user needs an easy way to mark the columns to be encrypted and set the related encryption key.
   > I would suggest having a utility method that allows to set only the required data for an addressed field in a schema. For example: `XXXUtility.encryptColumn(MessageType schema, ColumnPath column, String encKey)`. The logic behind this utility can be hidden from the user.
   > 
   > I don't really like the decorator pattern. First, it requires a lot of boilerplate code to have all the methods being delegated. Second, if the super class is modified by having a new method it will not generate any errors/warnings if the decorator would not be updated but may cause serious issues at runtime.
   > Why do you not want to extend `Type` (or more `PrimitiveType`) directly? If you not want to bother the users with this metadata that is not tightly connected to the schema you might define it _package private_ and let only the mentioned utility reach it.
   > BTW, do we need this _metadata_ for the primitive types or it is enough to have a specific object that contains only the required values for encryption?
   
   Thanks for the comments! The idea is to piggyback the existing WriteSupport's schema conversion from other type’s schema e,g avro to Parquet schema to pass down the crypto settings. By doing this way, when we build  FileEncryptionProperties object, we don't need RPC calls anymore to get it somewhere.  Here is an example of Apache Hudi's WriteSupport example https://github.com/shangxinli/parquet-write-supports/blob/master/src/main/java/com/uber/hoodie/avro/CryptoHoodieAvroWriteSupport.java.  It extends HoodieAvroWriteSupport to translate the crypto settings from avro schema's metadata to Parquet schema’s metadata. The CryptoPropertiesFactory objects can use it like this https://github.com/apache/parquet-mr/pull/808/commits/9adb75a2356147a204d76c82eb39f43e9ee72b58#diff-886dda84b09a5aaf0bf7c1e718c1be02R61
   
   We have built the extended WriteSupport classes for each system like Hudi, Spark, Hive into a library along with the CryptoPropertiesFactory implementation classes into a library. That library is added at cluster-wide and all the jobs can have it without developers changing their code. The table owner only need to change schema settings in metastore like HMS or other centralized metastore by tagging a column is to be encrypted. For a company that has a lot of existing tables to be onboarded to this feature, it would be much easier. We also got an agreement with ORC on defining the format of the encryption settings https://issues.apache.org/jira/browse/HIVE-21848. 
   
   What we need from the Parquet Schema is a place to keep the metadata. A schema definition usually has a place to do so like Avro schema, Spark Schema(StructField), but I don't find Parquet schema has it. So it could be used for other purposes. 
   
   I agreed with you on the opinion of the decoration pattern. But I think adding metadata to the class Type directly is risky. I am afraid of breaking existing functionality as it is used so widely. Your concern of changing the class Type and forgetting to change ExtType class is valid, but for this particular case, we didn’t just add or reference existing methods without changing. If the methods of Type is changed, ExtType shouldn’t need to be changed. If there is adding or deletion of methods in Type, there will be build error so that people know to fix it. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org