Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/09 23:28:37 UTC

[GitHub] [iceberg] jackye1995 commented on pull request #2444: Core: add API for table metadata file encryption

jackye1995 commented on pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444#issuecomment-817024583


   > I was thinking one option would be along side the TableMetadata location within the external metastore for the active snapshot's TableMetadata, and then within the TableMetadata log for previous TableMetadatas
   
   @johnclara I was thinking something similar, and in fact the original API was:
   
   ```
   OutputFile encrypt(TableIdentifier tableId, OutputFile rawOutput);
   ```
   
   This lets you derive the key ID from information such as the table identifier, which would cover your use case. But the table operation does not have a consistent way to get the table ID; catalogs such as `NessieCatalog` do not have this information.
   
   So with the current `encrypt` API, I see 2 approaches to achieve the one-KEK-per-table use case:
   
   1. derive the table ID from `rawOutput.location()`. As long as the location follows the standard naming structure, we can get the namespace and table name, retrieve the KEK ID, and generate a new DEK to encrypt the new metadata file.
   2. take `TableIdentifier` as an input to the `TableMetadataEncryptionManager` implementation, so that the implementation can reference it in the `encrypt` method.
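
A minimal sketch of approach 1, for illustration only. The layout `<warehouse>/<namespace>/<table>/metadata/<file>` is an assumption about the "standard naming structure", and `TableIdFromLocation` and its nested `TableId` are hypothetical stand-ins rather than Iceberg classes.

```java
import java.net.URI;

// Hypothetical helper (not an Iceberg class): recovers the namespace and table
// name from a metadata file location. It ASSUMES the layout
// <warehouse>/<namespace>/<table>/metadata/<file>; other layouts are rejected.
final class TableIdFromLocation {

  // Minimal stand-in for Iceberg's TableIdentifier, used only by this sketch.
  static final class TableId {
    final String namespace;
    final String name;

    TableId(String namespace, String name) {
      this.namespace = namespace;
      this.name = name;
    }

    @Override
    public String toString() {
      return namespace + "." + name;
    }
  }

  static TableId parse(String metadataLocation) {
    String path = URI.create(metadataLocation).getPath();
    if (path == null) {
      throw new IllegalArgumentException("No path in location: " + metadataLocation);
    }
    String[] parts = path.split("/");
    int n = parts.length;
    // Expect: ... / <namespace> / <table> / "metadata" / <file>
    if (n < 4 || !"metadata".equals(parts[n - 2])
        || parts[n - 3].isEmpty() || parts[n - 4].isEmpty()) {
      throw new IllegalArgumentException("Unexpected metadata location: " + metadataLocation);
    }
    return new TableId(parts[n - 4], parts[n - 3]);
  }

  public static void main(String[] args) {
    // prints db.tbl
    System.out.println(parse("s3://bucket/warehouse/db/tbl/metadata/00001-abc.metadata.json"));
  }
}
```

The sketch only handles a single-level namespace and fails loudly on any location that does not match the assumed layout, which is the main fragility of deriving the table ID from a path.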
   
   The encrypted DEK can then be stored in 2 ways:
   1. in an external system. For example, you can have a very simple DynamoDB table that stores key-value pairs of (metadataLocation, encryptedDek). The `decrypt` method can then decrypt based on that mapping plus the table ID derived by either of the 2 approaches above.
   2. as you suggested, by adding it to the historical metadata log list.
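
The external-store option (storage approach 1) could look roughly like the sketch below. Everything here is an assumption made for illustration: an in-memory map stands in for the DynamoDB table, a locally generated AES key stands in for a KEK that would normally live in a KMS, and `ExternalDekStoreSketch` and its method names are hypothetical. Only standard JDK crypto classes are used.

```java
import java.security.GeneralSecurityException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: keeps (metadataLocation -> encryptedDek) outside the
// table metadata. The map stands in for an external store such as DynamoDB,
// and the local KEK stands in for a key held by a KMS.
final class ExternalDekStoreSketch {
  private final Map<String, byte[]> encryptedDekByLocation = new ConcurrentHashMap<>();
  private final SecretKey kek;

  ExternalDekStoreSketch(SecretKey kek) {
    this.kek = kek;
  }

  static SecretKey newAesKey() {
    try {
      KeyGenerator generator = KeyGenerator.getInstance("AES");
      generator.init(128);
      return generator.generateKey();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("AES key generation failed", e);
    }
  }

  // On write: create a fresh DEK, wrap it with the KEK, and record the mapping.
  SecretKey createDekFor(String metadataLocation) {
    try {
      SecretKey dek = newAesKey();
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.WRAP_MODE, kek);
      encryptedDekByLocation.put(metadataLocation, cipher.wrap(dek));
      return dek;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to wrap DEK", e);
    }
  }

  // On read: look up the wrapped DEK by metadata location and unwrap it with the KEK.
  SecretKey loadDekFor(String metadataLocation) {
    byte[] wrapped = encryptedDekByLocation.get(metadataLocation);
    if (wrapped == null) {
      throw new IllegalStateException("No DEK recorded for " + metadataLocation);
    }
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.UNWRAP_MODE, kek);
      return (SecretKey) cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to unwrap DEK", e);
    }
  }
}
```

In this shape the lookup needs only the metadata file location and the KEK, so decryption never has to open any table metadata to find the encrypted DEK.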
   
   However, I personally don't like the second approach, for the following reasons:
   1. this again places a dependency on the table metadata. For each decryption, we need to call get table, check whether the file is the latest table metadata, and then decide whether to open the table metadata to get the correct encrypted DEK.
   2. every write of new table metadata needs to get the externally managed KEK + encrypted DEK and store it as a historical log entry, which would require a lot of hooks in different places to support both single and double wrap encryption.
   
   What I hope for is a clean separation between the keys managed by Iceberg and the keys managed completely externally, so that `TableMetadataEncryptionManager` is completely independent of `TableMetadata`, and `EncryptionManager` can fully depend on `TableMetadata`.
   
   If you have a good suggestion for 2, please let me know.




