Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/09 02:56:16 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2444: Core: add API for table metadata file encryption

jackye1995 opened a new pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444


   This is a PR based on design doc https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit#
   
   It introduces a new set of encryption APIs for table metadata files, so that the entire Iceberg metadata tree can be encrypted. Because the table metadata file is the top level of the Iceberg metadata tree, the newly introduced `TableMetadataEncryptionManager` does not need to maintain encryption metadata internally; it directly produces decrypting and encrypting streams from input and output files.
   
   I will introduce an actual implementation for AWS in another PR.
   
   @rdblue @ggershinsky @shangxinli @andersonm-ibm @yyanyy
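   
   The stream-wrapping behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the interface from the PR: `MetadataStreamEncryptor` and `AesGcmMetadataEncryptor` are hypothetical names, plain `java.io` streams stand in for Iceberg's `InputFile` and `OutputFile`, and the choice of AES-GCM with the IV stored as a file prefix is an assumption made for the example.
   
   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.io.OutputStream;
   import java.nio.charset.StandardCharsets;
   import java.security.GeneralSecurityException;
   import java.security.SecureRandom;
   import javax.crypto.Cipher;
   import javax.crypto.CipherInputStream;
   import javax.crypto.CipherOutputStream;
   import javax.crypto.KeyGenerator;
   import javax.crypto.SecretKey;
   import javax.crypto.spec.GCMParameterSpec;

   // Hypothetical stand-in for the proposed manager; NOT the interface from the PR.
   interface MetadataStreamEncryptor {
     OutputStream encrypt(OutputStream rawOutput) throws IOException;

     InputStream decrypt(InputStream rawInput) throws IOException;
   }

   // The key is supplied from outside (for example by a KMS client);
   // no encryption metadata is kept in any Iceberg file.
   class AesGcmMetadataEncryptor implements MetadataStreamEncryptor {
     private static final int IV_LENGTH = 12;
     private static final int TAG_BITS = 128;
     private final SecretKey key;
     private final SecureRandom random = new SecureRandom();

     AesGcmMetadataEncryptor(SecretKey key) {
       this.key = key;
     }

     @Override
     public OutputStream encrypt(OutputStream rawOutput) throws IOException {
       byte[] iv = new byte[IV_LENGTH];
       random.nextBytes(iv);
       rawOutput.write(iv); // assumption: the IV is not secret and is stored as a file prefix
       return new CipherOutputStream(rawOutput, newCipher(Cipher.ENCRYPT_MODE, iv));
     }

     @Override
     public InputStream decrypt(InputStream rawInput) throws IOException {
       byte[] iv = new byte[IV_LENGTH];
       new DataInputStream(rawInput).readFully(iv); // read back the IV prefix
       return new CipherInputStream(rawInput, newCipher(Cipher.DECRYPT_MODE, iv));
     }

     private Cipher newCipher(int mode, byte[] iv) throws IOException {
       try {
         Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
         cipher.init(mode, key, new GCMParameterSpec(TAG_BITS, iv));
         return cipher;
       } catch (GeneralSecurityException e) {
         throw new IOException("Cannot initialize cipher", e);
       }
     }
   }

   class MetadataStreamDemo {
     // Encrypts the given text into an in-memory "file" and decrypts it again.
     static String roundTrip(String json) {
       try {
         KeyGenerator generator = KeyGenerator.getInstance("AES");
         generator.init(128);
         MetadataStreamEncryptor encryptor = new AesGcmMetadataEncryptor(generator.generateKey());

         ByteArrayOutputStream file = new ByteArrayOutputStream();
         try (OutputStream out = encryptor.encrypt(file)) {
           out.write(json.getBytes(StandardCharsets.UTF_8));
         }

         ByteArrayOutputStream plain = new ByteArrayOutputStream();
         try (InputStream in = encryptor.decrypt(new ByteArrayInputStream(file.toByteArray()))) {
           byte[] buffer = new byte[4096];
           for (int n = in.read(buffer); n != -1; n = in.read(buffer)) {
             plain.write(buffer, 0, n);
           }
         }
         return new String(plain.toByteArray(), StandardCharsets.UTF_8);
       } catch (IOException | GeneralSecurityException e) {
         throw new IllegalStateException(e);
       }
     }
   }
   ```
   
   In this sketch the caller never sees key metadata: whoever constructs the encryptor is responsible for finding the right key, which matches the "externally managed" assumption in the description.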


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2444: Core: add API for table metadata file encryption

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444#discussion_r624789764



##########
File path: api/src/main/java/org/apache/iceberg/encryption/TableMetadataEncryptionManager.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.encryption;
+
+import java.util.Map;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+
+/**
+ * An encryption manager to handle top level table metadata file encryption.
+ * <p>
+ * Unlike other Iceberg metadata such as manifest list and manifest,
+ * because table metadata is the top level Iceberg metadata,
+ * we do not leverage any other file to store its encryption information.
+ * Therefore, this encryption manager assumes all encryption metadata are externally managed,
+ * and directly transforms the file to its encrypting or decrypting form.
+ */
+public interface TableMetadataEncryptionManager {

Review comment:
       Rather than doing this, why not use the existing encryption APIs and create the encryption manager in the catalog? The catalog is where all of this should get plugged in, right?






[GitHub] [iceberg] jackye1995 commented on pull request #2444: Core: add API for table metadata file encryption

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444#issuecomment-817024583


   > I was thinking one option would be along side the TableMetadata location within the external metastore for the active snapshot's TableMetadata, and then within the TableMetadata log for previous TableMetadatas
   
   @johnclara I was thinking something similar, and in fact the original API was:
   
   ```
   OutputFile encrypt(TableIdentifier tableId, OutputFile rawOutput);
   ```
   
   The idea was that you could derive the key ID from information such as the table identifier, which would cover your use case. However, table operations have no consistent way to get the table ID; catalogs such as `NessieCatalog` do not have this information. 
   
   So with the current encrypt API, I see 2 approaches to the one-KEK-per-table use case:
   
   1. derive the table ID from `rawOutput.location()`: as long as the location follows the standard naming structure, we can extract the namespace and table name, retrieve the KEK ID, and generate a new DEK to encrypt the new metadata file.
   2. pass `TableIdentifier` as an input to the `TableMetadataEncryptionManager` implementation, so that the implementation can reference it in the encrypt method.
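   
   Approach 1 could be sketched as below. The layout `<warehouse>/<namespace>/<table>/metadata/<file>` with a single-level namespace is an assumption about what "standard naming structure" means here, and `TableLocationParser` is a hypothetical helper, not an Iceberg class.
   
   ```java
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical helper; assumes locations of the form
   // <warehouse>/<namespace>/<table>/metadata/<file> with a single-level namespace.
   class TableLocationParser {
     // Returns {namespace, table}, or throws if the location does not match the assumed layout.
     static String[] namespaceAndTable(String metadataLocation) {
       List<String> parts = Arrays.asList(metadataLocation.split("/"));
       int metadataDir = parts.lastIndexOf("metadata");
       // "metadata" must be the parent directory of the file and have two path segments before it
       if (metadataDir < 2 || metadataDir != parts.size() - 2) {
         throw new IllegalArgumentException("Not a standard metadata location: " + metadataLocation);
       }
       return new String[] {parts.get(metadataDir - 2), parts.get(metadataDir - 1)};
     }
   }
   ```
   
   A location that does not follow the assumed layout is rejected rather than guessed at, since a wrong table ID would select the wrong KEK.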
   
   The encrypted DEK can then be stored in 2 ways:
   1. in an external system: for example, a very simple DynamoDB table that stores key-value pairs of (metadataLocation, encryptedDek). The decrypt method can then look up the encrypted DEK from that mapping and use the table ID derived by either of the 2 approaches above.
   2. as you suggested, in the historical metadata log list. 
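   
   A minimal sketch of storage option 1, under loud assumptions: an in-memory `Map` stands in for the DynamoDB table, a locally held AES key stands in for the KEK (with a real KMS the wrap and unwrap calls would happen inside the KMS), and `ExternalDekStore` is a hypothetical name.
   
   ```java
   import java.security.GeneralSecurityException;
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.Map;
   import javax.crypto.Cipher;
   import javax.crypto.KeyGenerator;
   import javax.crypto.SecretKey;

   // Hypothetical store of (metadataLocation, encryptedDek) pairs.
   class ExternalDekStore {
     // Stand-in for the external key-value table described above.
     private final Map<String, byte[]> wrappedDeks = new HashMap<>();
     // Stand-in for the KEK; a real KMS would not hand this key to the client.
     private final SecretKey kek;

     ExternalDekStore(SecretKey kek) {
       this.kek = kek;
     }

     // Generates a fresh DEK for a new metadata file and records it wrapped by the KEK.
     SecretKey newDekFor(String metadataLocation) throws GeneralSecurityException {
       KeyGenerator generator = KeyGenerator.getInstance("AES");
       generator.init(128);
       SecretKey dek = generator.generateKey();
       Cipher wrap = Cipher.getInstance("AESWrap");
       wrap.init(Cipher.WRAP_MODE, kek);
       wrappedDeks.put(metadataLocation, wrap.wrap(dek));
       return dek;
     }

     // Looks up and unwraps the DEK recorded for an existing metadata file.
     SecretKey dekFor(String metadataLocation) throws GeneralSecurityException {
       byte[] wrapped = wrappedDeks.get(metadataLocation);
       if (wrapped == null) {
         throw new IllegalStateException("No DEK recorded for " + metadataLocation);
       }
       Cipher unwrap = Cipher.getInstance("AESWrap");
       unwrap.init(Cipher.UNWRAP_MODE, kek);
       return (SecretKey) unwrap.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
     }

     // Checks that the DEK read back for a location equals the one generated for it.
     static boolean roundTrips(String metadataLocation) {
       try {
         KeyGenerator generator = KeyGenerator.getInstance("AES");
         generator.init(128);
         ExternalDekStore store = new ExternalDekStore(generator.generateKey());
         SecretKey written = store.newDekFor(metadataLocation);
         SecretKey read = store.dekFor(metadataLocation);
         return Arrays.equals(written.getEncoded(), read.getEncoded());
       } catch (GeneralSecurityException e) {
         throw new IllegalStateException(e);
       }
     }
   }
   ```
   
   Because the mapping is keyed by metadata location, decryption needs nothing from the table metadata itself, which is the independence argued for below.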
   
   However, I personally don't like the second approach, for the following reasons:
   1. it again places a dependency on the table metadata. For each decryption, we need to load the table, check whether the file is the latest table metadata, and then decide whether to open the table metadata to get the correct encrypted DEK. 
   2. for every write of new table metadata, it needs to fetch the externally managed KEK and the encrypted DEK and store them as a historical log entry, which would require many hooks in different places to support both single and double wrap encryption.
   
   What I hope for is a clean cut between the keys managed by Iceberg and the keys managed entirely externally, so that `TableMetadataEncryptionManager` is completely independent of `TableMetadata`, while `EncryptionManager` can fully depend on `TableMetadata`. 
   
   If you have a good suggestion for approach 2 of storing the encrypted DEK, please let me know.




[GitHub] [iceberg] johnclara edited a comment on pull request #2444: Core: add API for table metadata file encryption

Posted by GitBox <gi...@apache.org>.
johnclara edited a comment on pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444#issuecomment-816982342


   Thank you for tackling this!
   
   How would you recommend storing the key materials associated with the metadata file (for example, the KMS ID of the key used to encrypt the TableMetadata file)?
   
   I was thinking one option would be to store them alongside the TableMetadata location in the external metastore for the active snapshot's TableMetadata, and in the TableMetadata log for previous TableMetadata files.
   
   For instance, my team uses DynamoDB as an external catalog with the schema:
   `icebergTableName, metadataLocation`
   
   We could add another column:
   `icebergTableName, metadataLocation, keyMaterials`
   
   To load the current snapshot of the table, we could use the key materials in the external metastore to read the TableMetadata file.
   
   To look at previous snapshots' TableMetadata files, the keyMaterials could be stored alongside the metadataLocation in the metadata-log section of the TableMetadata file.
   https://github.com/apache/iceberg/blob/master/core/src/test/resources/TableMetadataV2Valid.json#L87
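   
   The lookup this proposal implies might look like the sketch below. All three classes are hypothetical: `CatalogRow` mirrors the three-column schema above, `MetadataLogEntry` is an assumed metadata-log entry extended with `keyMaterials`, and the key materials are treated as an opaque string.
   
   ```java
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical catalog row: the existing two columns plus the proposed keyMaterials column.
   class CatalogRow {
     final String icebergTableName;
     final String metadataLocation;
     final String keyMaterials;

     CatalogRow(String icebergTableName, String metadataLocation, String keyMaterials) {
       this.icebergTableName = icebergTableName;
       this.metadataLocation = metadataLocation;
       this.keyMaterials = keyMaterials;
     }
   }

   // Hypothetical metadata-log entry extended with keyMaterials.
   class MetadataLogEntry {
     final String metadataLocation;
     final String keyMaterials;

     MetadataLogEntry(String metadataLocation, String keyMaterials) {
       this.metadataLocation = metadataLocation;
       this.keyMaterials = keyMaterials;
     }
   }

   class KeyMaterialLookup {
     // Current file: key materials come from the catalog row.
     // Older files: key materials come from the metadata log, which is only
     // readable after the current TableMetadata file has been decrypted.
     static String keyMaterialsFor(
         String location, CatalogRow current, List<MetadataLogEntry> metadataLog) {
       if (current.metadataLocation.equals(location)) {
         return current.keyMaterials;
       }
       for (MetadataLogEntry entry : metadataLog) {
         if (entry.metadataLocation.equals(location)) {
           return entry.keyMaterials;
         }
       }
       throw new IllegalArgumentException("Unknown metadata location: " + location);
     }
   }
   ```
   
   Note that reading any older file first requires decrypting the current one, so the catalog row is the only entry point.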












[GitHub] [iceberg] ggershinsky commented on pull request #2444: Core: add API for table metadata file encryption

Posted by GitBox <gi...@apache.org>.
ggershinsky commented on pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444#issuecomment-817487539


   Hi guys, regarding the table KEK (or MEK): I think we should always have an option (possibly the default) to keep the master keys in a KMS, so that they can be stored in secure HSMs, with access control managed by production-grade IAM systems, etc.
   
   Not all KMS systems support arbitrary key IDs. Some generate master keys with a system-specific ID, which we can then use for table encryption. In other words, we should have an option to take an external key ID as input (instead of generating the ID) and store it in the table's configuration.
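   
   As a rough illustration of that option, assuming the table's configuration is a plain string map: the property name `example.encryption.key-id` and the class `TableKeyConfig` are hypothetical and are not defined anywhere in this thread.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;

   class TableKeyConfig {
     // Hypothetical property name, chosen only for this example.
     static final String KEY_ID_PROPERTY = "example.encryption.key-id";

     // Records a key ID that was generated by the KMS, instead of generating one locally.
     static Map<String, String> withExternalKeyId(
         Map<String, String> tableProperties, String kmsKeyId) {
       if (kmsKeyId == null || kmsKeyId.isEmpty()) {
         throw new IllegalArgumentException("KMS key ID is required");
       }
       Map<String, String> updated = new HashMap<>(tableProperties);
       updated.put(KEY_ID_PROPERTY, kmsKeyId);
       return updated;
     }

     // Reads the key ID back; only the ID is stored, the master key itself stays in the KMS.
     static String keyId(Map<String, String> tableProperties) {
       String id = tableProperties.get(KEY_ID_PROPERTY);
       if (id == null) {
         throw new IllegalStateException("Table has no " + KEY_ID_PROPERTY + " property");
       }
       return id;
     }
   }
   ```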






[GitHub] [iceberg] jackye1995 commented on a change in pull request #2444: Core: add API for table metadata file encryption

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2444:
URL: https://github.com/apache/iceberg/pull/2444#discussion_r632881341



##########
File path: api/src/main/java/org/apache/iceberg/encryption/TableMetadataEncryptionManager.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.encryption;
+
+import java.util.Map;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+
+/**
+ * An encryption manager to handle top level table metadata file encryption.
+ * <p>
+ * Unlike other Iceberg metadata such as manifest list and manifest,
+ * because table metadata is the top level Iceberg metadata,
+ * we do not leverage any other file to store its encryption information.
+ * Therefore, this encryption manager assumes all encryption metadata are externally managed,
+ * and directly transforms the file to its encrypting or decrypting form.
+ */
+public interface TableMetadataEncryptionManager {

Review comment:
       @rdblue thanks for the review, I finally got some time to pick this up.
   
   > why not use the existing encryption APIs and create the encryption manager in the catalog
   
   Yes, that is another option. But if everything is kept in the encryption manager, the manager has to be "partially" initialized so that it can only encrypt and decrypt table metadata files, and then initialized again with the table metadata before it can encrypt and decrypt other files.
   
   Compared with that, I think adding another manager is cleaner, and also easier for users to extend for custom needs while still using the same encryption manager for other files.
   
   That said, I don't have a strong opinion on this; I am fine with the single-manager approach if we think it is important to keep everything in one place and accept the partial initialization.
   
   I have updated this PR to include all the changes needed for the current approach, which should give a better view of what is going on.







