Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/08/06 06:15:47 UTC

[GitHub] [parquet-mr] ggershinsky commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

ggershinsky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-669725393


   > extend ParquetWriteSupport which converts crypto setting in schema..
   
   @shangxinli your previous pull request (already merged) added a very nice Encryption Factory interface that allows for transparent activation of encryption via existing parameters (such as schema and Hadoop configuration), without making any modifications in the frameworks, including their Parquet glue code (write support). The current pull request breaks this model, since it introduces new configuration channels and requires changes in the frameworks on top. These changes can also create the impression that schema config changes are the way to activate Parquet encryption - while in fact the frameworks can activate Parquet encryption via e.g. the existing Hadoop config (with any proper implementation of EncryptionPropertiesFactory, such as PropertiesDrivenCryptoFactory). For example, Spark just needs to update its Parquet version; no other changes are needed. Lastly, this PR might break future interoperability with C++-based frameworks (such as pyarrow and pandas) - we're adding high-level encryption support there, based on the standard EncryptionPropertiesFactory model.
   
   As mentioned above, this pull request does not implement schema-activated encryption. The encryption here is activated by a collection of custom parameter maps, piggy-backed on the schema elements. Since such custom maps don't exist today, this PR adds them.
   
   Fortunately, there is a simple solution that lets you support your organization's requirements without adding such maps and without breaking the Crypto Factory model. Currently, your crypto parameters are passed in custom <String,String> maps. But the Hadoop config is also a <String,String> map. You already use it for file-wide parameters (such as the encryption algorithm). You can also use it for column parameters - by adding them with column names, or simply by concatenating them into a single string parameter, as in the sketch below. See HIVE-21848 for an example (or PropertiesDrivenCryptoFactory, which implements HIVE-21848).
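   
   To make this concrete, here is a minimal Java sketch of the single-string-parameter approach. The property names follow the PropertiesDrivenCryptoFactory convention as I recall it - please verify them against your parquet-mr version; the key IDs and column names are made-up placeholders.
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   
   public class ParquetEncryptionConfExample {
   
     // Builds a Hadoop config that activates Parquet encryption with no
     // framework code changes.
     public static Configuration encryptionConf() {
       Configuration conf = new Configuration();
   
       // Plug in the crypto factory (pluggable, per the
       // EncryptionPropertiesFactory model).
       conf.set("parquet.crypto.factory.class",
           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
   
       // File-wide parameter: master key ID used for the footer.
       conf.set("parquet.encryption.footer.key", "footerKeyId");
   
       // Column parameters concatenated into a single string:
       // "<masterKeyId>:<colName>,<colName>;<masterKeyId>:<colName>".
       conf.set("parquet.encryption.column.keys",
           "keyA:ssn,credit_card;keyB:address");
   
       return conf;
     }
   }
   ```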
   
   > The way it works is SchemaCryptoPropertiesFactory should have RPC call with KMS ...
   > We don't want to release helper function like setKey() etc because that will require code changes in the existing pipelines. 
   
   You might go a step further and use PropertiesDrivenCryptoFactory directly in your deployment. This factory is designed for efficient KMS support, keeping RPC calls to the absolute minimum. It sets keys and metadata without requiring code changes in the existing pipelines. This factory also implements cryptography best practices, e.g. preventing mistakes like using the same data key for many files. Moreover, by using it you will tap into community support and future improvements in this area.
   But this is only a recommendation; crypto factories are pluggable and can be swapped with any custom implementation, as long as it follows the defined model - see the sketch below.
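   
   For illustration, a minimal custom factory only has to implement the EncryptionPropertiesFactory contract. The method signature below reflects the interface as I remember it (check it against your parquet-mr version), and the key-loading helper is purely hypothetical:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.crypto.EncryptionPropertiesFactory;
   import org.apache.parquet.crypto.FileEncryptionProperties;
   import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
   import org.apache.parquet.hadoop.api.WriteSupport;
   
   // Hypothetical custom factory, swapped in via the
   // "parquet.crypto.factory.class" Hadoop property.
   public class MyOrgCryptoFactory implements EncryptionPropertiesFactory {
   
     @Override
     public FileEncryptionProperties getFileEncryptionProperties(
         Configuration fileHadoopConfig, Path tempFilePath,
         WriteSupport.WriteContext fileWriteContext)
         throws ParquetCryptoRuntimeException {
       // Read the custom parameters from the Hadoop config and derive
       // the footer/column keys, e.g. via an org-specific KMS call.
       byte[] footerKey = fetchFooterKey(fileHadoopConfig); // hypothetical helper
       return FileEncryptionProperties.builder(footerKey).build();
     }
   
     private byte[] fetchFooterKey(Configuration conf) {
       // Placeholder only: a real implementation would fetch or unwrap
       // a key from a KMS, never hard-code or zero-fill key bytes.
       return new byte[16]; // 128-bit AES key length
     }
   }
   ```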

