Posted to dev@iceberg.apache.org by Jack Ye <ye...@gmail.com> on 2021/03/22 17:25:18 UTC

Extending Apache Iceberg Encryption Module

Hi everyone,

To continue the discussion in the last sync meeting about encryption in
Iceberg, here is the document for a proposal:

https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing

Any feedback would be much appreciated.

Best,
Jack Ye

Re: Extending Apache Iceberg Encryption Module

Posted by Gidon Gershinsky <gg...@gmail.com>.
I must say I'm impressed with the level of constructiveness and
technical quality in this discussion; we're off to a good start in this
project.

*For the POC, I think what you conclude is mostly correct. I am currently
implementing the encryption spec and the general encrypted file stream with
the KMS API, and I would expect the low-level file encryption integration to
take place separately, so we can meet in the middle. For key rotation and
AAD, I think we can discuss more details in the doc first before proceeding;
they are not blocking tasks anyway.*

Sounds good to me on all points. Let's indeed work on our respective parts
in the POC, coordinating the design (when needed) via the doc. Once we need
deeper coordination, we might set up a meeting, but I agree this is not
pressing and can wait a few weeks.

*“there is an intermediate approach, where (the many) DEKs are encrypted
with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
this can be immutable, as long as the KEKs are encrypted with MEKs and
stored in a mutable medium that can be replaced/updated upon MEK rotation.”*

*That is doable as we store the KEKs in the spec. In that case, a MEK
rotation would perform a spec update. But it implies the KEK is static, just
like the DEK, and we would only rotate the MEK and not the KEK. I thought we
also needed to rotate KEKs, which is why I did not consider this approach. I
do not have enough experience with double-wrap systems, but does the
security standard still hold in this case without KEK rotation? Or is there
a separate process to handle KEK rotation?*

It's one of those borderline areas... I have an opinion on this, but to be
on the safe side, I'll send a question to the community - maybe there are
folks who can shed additional light on the trade-offs in this situation, or
can point us to other contacts or sources.

Cheers, Gidon


On Wed, Mar 24, 2021 at 11:46 PM Ye, Jack <yz...@amazon.com.invalid>
wrote:

> Sounds good, let's continue the discussion in the doc. For the POC, I think
> what you conclude is mostly correct. I am currently implementing the
> encryption spec and the general encrypted file stream with the KMS API, and
> I would expect the low-level file encryption integration to take place
> separately, so we can meet in the middle. For key rotation and AAD, I think
> we can discuss more details in the doc first before proceeding; they are
> not blocking tasks anyway.
>
>
>
> “there is an intermediate approach, where (the many) DEKs are encrypted
> with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
> this can be immutable, as long as the KEKs are encrypted with MEKs and
> stored in a mutable medium that can be replaced/updated upon MEK rotation.”
>
>
>
> That is doable as we store the KEKs in the spec. In that case, a MEK
> rotation would perform a spec update. But it implies the KEK is static,
> just like the DEK, and we would only rotate the MEK and not the KEK. I
> thought we also needed to rotate KEKs, which is why I did not consider this
> approach. I do not have enough experience with double-wrap systems, but
> does the security standard still hold in this case without KEK rotation? Or
> is there a separate process to handle KEK rotation?
>
>
>
> “(1) is a direct DEK passing; we've considered it for Parquet, but decided
> against it, because it can lead to unsafe situations”
>
>
>
> Nice, I think I also mentioned in the doc that I am against using this
> scheme, so we can focus more on supporting the single and double wrapping
> use cases.
>
>
>
> -Jack
>
>
>
>
>
> *From: *Gidon Gershinsky <gg...@gmail.com>
> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>
> *Date: *Wednesday, March 24, 2021 at 05:19
> *To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>
> *Subject: *RE: [EXTERNAL] Extending Apache Iceberg Encryption Module
>
>
>
>
>
>
> Sounds good, thanks.
>
> Responding to the points below:
>
>
>
> *"we can choose to store the encrypted DEKs inside the manifest or as a
> separate instruction file with a pointer in key_metadata, and there are
> tradeoffs for those approaches"*
>
>
>
> For the latter, we are running a similar mechanism in Parquet encryption,
> where we keep the key material in separate JSON files, and a pointer to it
> inside the Parquet file footer key_metadata fields. This works, but for
> Iceberg integration, there are advantages in using the manifest files (or
> another managed medium) instead. The trade-offs (including size additions,
> consistency, management) are TBD.
>
> Btw, there is an intermediate approach, where (the many) DEKs are encrypted
> with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
> this can be immutable, as long as the KEKs are encrypted with MEKs and
> stored in a mutable medium that can be replaced/updated upon MEK rotation.
>
>
>
> *"3 common cases: (1) direct DEK ID, (2) KEK ID + encrypted DEK, (3) MEK
> ID + encrypted KEK + encrypted DEK, and that should be enough to cover most
> of the use cases with different types of KMS"*
>
>
>
> Yep, (2) and (3) are single and double wrapping, respectively, which
> cover our use cases. (1) is direct DEK passing; we've considered it for
> Parquet, but decided against it, because it can lead to unsafe situations
> where an inexperienced user passes the same DEK to many files (which can
> break the GCM cipher, even with one table). But we might try to enable it
> in Iceberg with strong preventive measures (if possible), TBD.
>
>
>
> *"DDL clauses for encryption and key rotation*
>
> *These definitely make sense to me. I will add a list of the DDL clauses I
> was thinking about to the doc.*
>
> *Cryptographic integrity of Data Tables*
>
> *Yes, I think in this doc at least the location and structure of AAD
> prefix should be discussed, so hopefully we can reach some general
> consensus for integrity support for Iceberg tables and make sure the right
> information is in place or can be added later."*
>
>
>
> SGTM.
>
>
>
> *"I am also working on a POC to flesh out some details for the aspects
> described in the doc; I will update this thread once I publish that."*
>
>
>
> We are also working on a POC of this technology. I guess we're working on
> different corners at the moment, as we're mostly focused on Parquet
> encryption integration, parts of key rotation, and GCM streams with AAD
> prefixes for table integrity, while you are probably working on the catalog
> metadata, general encrypted file streams, and the key management API. But
> since there is a high potential for overlap, I'd suggest we coordinate the
> POC work; what would be the best way of doing that?
>
>
>
> Cheers, Gidon
>
>
>
>
>
> On Tue, Mar 23, 2021 at 11:50 PM Jack Ye <ye...@gmail.com> wrote:
>
> Thanks for the feedback on the doc. We are also closely following the
> Parquet encryption work and would like to have that in Iceberg as soon as
> possible with the right architecture. Here are some brief thoughts on the
> points you mentioned in the email; I will add more details in the Google
> doc:
>
>
>
> *Key rotation*
>
> My initial thought was to treat key rotation as a separate process, with
> DEK rewrapping done by a Spark stored procedure, which is why I did not add
> any detail for it. But your point about the work needed to rewrite and
> clean up manifests is a really good one, and I should fully describe the
> details.
>
> For instance, we can choose to store the encrypted DEKs inside the
> manifest or as a separate instruction file with a pointer in key_metadata,
> and there are tradeoffs for those approaches. I will update the doc with
> these details.
>
>
>
> *Acceleration of KMS interactions*
>
> Thanks for bringing up double wrapping. I was hesitant to mention it in
> the initial version of the doc because it would make the overall
> architecture harder to understand. And for the use cases I have seen with
> AWS KMS, people are all using single wrapping, the service was able to
> handle generation of millions of DEKs, and there seemed to be no complaint
> about it.
>
> I think the right way to go is to support the 3 common cases: (1) direct
> DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID + encrypted KEK + encrypted
> DEK, and that should be enough to cover most of the use cases with
> different types of KMS. I will update the encryption spec with more details
> on that.
>
>
>
> *DDL clauses for encryption and key rotation*
>
> These definitely make sense to me. I will add a list of the DDL clauses I
> was thinking about to the doc.
>
>
>
> *Cryptographic integrity of Data Tables*
>
> Yes, I think in this doc at least the location and structure of AAD prefix
> should be discussed, so hopefully we can reach some general consensus for
> integrity support for Iceberg tables and make sure the right information is
> in place or can be added later.
>
>
>
> I am also working on a POC to flesh out some details for the aspects
> described in the doc; I will update this thread once I publish that.
>
>
>
> Best,
>
> Jack Ye
>
>
>
> On Tue, Mar 23, 2021 at 5:04 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> Hi Jack,
>
>
>
> We're working on Parquet encryption, which is about to be released in the
> upcoming parquet-mr-1.12 version. Recently, we've started to look into its
> integration in Iceberg. It became immediately clear we need to take a wider
> view that covers other types of encryption in Iceberg (file streams and
> ORC); otherwise, we'd end up with a number of silos.
>
> At the time, there was no top-down design for data encryption in Iceberg,
> so we started to tinker with it. But now we can base this on your
> document. I really liked it; it is a solid foundation.
>
>
>
> There are a number of high-level concepts I believe we'd need to add there:
>
>
>
> - Key rotation in Iceberg
>
> (Not just in KMS). The envelope encryption practice requires periodic (or
> on-demand) re-wrapping of DEKs with new versions of master keys. KMS
> generates the new versions, and keeps the master key history, but the
> re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is
> kept in manifest files, this means all manifest files must be deleted
> (because they keep DEKs wrapped with the previous master key version, which
> is not safe anymore), and created again with the updated key_metadata
> field. We've quickly discussed this with Anton; it seems feasible, but
> there are other alternatives. We need to decide if manifests are the right
> place to store all key_metadata; and to design a mechanism (potentially
> with a DDL clause) to perform the rotation operation.
>
>
>
> - Acceleration of KMS interactions
>
> KMSs can be very slow, especially when backed by HSMs. Per the doc, "The
> KEK is stored in a key management service (KMS) to control access and key
> rotation." We should not fetch secret keys from KMS, because this exposes
> them; instead, many KMSs can wrap/encrypt DEKs inside the KMS server,
> without ever exposing the master keys. But since we have to generate a DEK
> per file/column, we'll end up with many KMS wrap calls when writing the
> data (and many unwrap calls when reading the data). That's why Parquet
> encryption uses a concept of double wrapping, where DEKs are wrapped with
> KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are
> stored/managed inside KMS.
>
>
>
> - DDL clauses for encryption and key rotation, such as
>
> ALTER TABLE .. KEY_ROTATION (params)
>
> ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext
> files) - Russell's proposal
>
> CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES
>
> Btw, we can re-use the joint ORC/Parquet column encryption parameter
> format, defined in this jira discussion started by Xinli -
>
> https://issues.apache.org/jira/browse/HIVE-21848
>
>
>
> - Cryptographic integrity of Data Tables
>
> Besides protecting data confidentiality, we need to protect its integrity
> against tampering attacks. This one is a longer term work item, based on
> these tickets:
>
> https://github.com/apache/iceberg/issues/44,
> https://github.com/apache/iceberg/issues/2060,
> https://github.com/apache/iceberg/issues/2073
>
> We'll work on these at a later stage, after the confidentiality basis is
> ready; but we need to make sure the current work on confidentiality enables
> (or at least doesn't block) the future integrity work. For example, we can
> start using https://github.com/apache/iceberg/issues/2060 sooner rather
> than later, for encrypting the Iceberg metadata files and Avro data files.
>
>
>
> That was a high-level description; I'll add detailed comments inside the
> design Google doc.
>
>
>
> Cheers, Gidon
>
>
>
>
>
> On Mon, Mar 22, 2021 at 7:25 PM Jack Ye <ye...@gmail.com> wrote:
>
> Hi everyone,
>
>
>
> To continue the discussion in the last sync meeting about encryption in
> Iceberg, here is the document for a proposal:
>
>
>
>
> https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing
>
>
>
> Any feedback would be much appreciated.
>
>
>
> Best,
>
> Jack Ye
>
>

Re: Extending Apache Iceberg Encryption Module

Posted by "Ye, Jack" <yz...@amazon.com.INVALID>.
Sounds good, let's continue the discussion in the doc. For the POC, I think what you conclude is mostly correct. I am currently implementing the encryption spec and the general encrypted file stream with the KMS API, and I would expect the low-level file encryption integration to take place separately, so we can meet in the middle. For key rotation and AAD, I think we can discuss more details in the doc first before proceeding; they are not blocking tasks anyway.

“there is an intermediate approach, where (the many) DEKs are encrypted with (a few) KEKs, and stored inside manifest files (key_metadata fields) - this can be immutable, as long as the KEKs are encrypted with MEKs and stored in a mutable medium that can be replaced/updated upon MEK rotation.”

That is doable as we store the KEKs in the spec. In that case, a MEK rotation would perform a spec update. But it implies the KEK is static, just like the DEK, and we would only rotate the MEK and not the KEK. I thought we also needed to rotate KEKs, which is why I did not consider this approach. I do not have enough experience with double-wrap systems, but does the security standard still hold in this case without KEK rotation? Or is there a separate process to handle KEK rotation?

“(1) is a direct DEK passing; we've considered it for Parquet, but decided against it, because it can lead to unsafe situations”

Nice, I think I also mentioned in the doc that I am against using this scheme, so we can focus more on supporting the single and double wrapping use cases.
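
To make the single versus double wrapping trade-off concrete, here is a minimal Python sketch that only counts round trips to a KMS when writing many files. It is an illustration under stated assumptions, not an Iceberg or Parquet API: FakeKms, local_wrap, write_single_wrapped and write_double_wrapped are hypothetical names, and both "wrap" functions are placeholders that perform no real encryption.

```python
import os
from dataclasses import dataclass


@dataclass
class FakeKms:
    """In-memory stand-in for a KMS client; it only counts wrap calls."""
    wrap_calls: int = 0

    def wrap(self, key: bytes, master_key_id: str) -> bytes:
        # Placeholder: a real KMS would encrypt `key` server-side with the
        # master key. Here the bytes are only tagged so the sketch runs.
        self.wrap_calls += 1
        return master_key_id.encode() + b":" + key


def local_wrap(dek: bytes, kek: bytes) -> bytes:
    # Placeholder for a local key wrap of the DEK with a KEK (no KMS call).
    # XOR is NOT secure; a real writer would use an authenticated cipher.
    return bytes(a ^ b for a, b in zip(dek, kek))


def write_single_wrapped(kms: FakeKms, num_files: int, master_key_id: str):
    """Single wrapping: every per-file DEK is wrapped by the KMS directly."""
    return [kms.wrap(os.urandom(16), master_key_id) for _ in range(num_files)]


def write_double_wrapped(kms: FakeKms, num_files: int, master_key_id: str):
    """Double wrapping: one KEK is wrapped by the KMS, DEKs are wrapped locally."""
    kek = os.urandom(16)
    wrapped_kek = kms.wrap(kek, master_key_id)
    wrapped_deks = [local_wrap(os.urandom(16), kek) for _ in range(num_files)]
    return wrapped_kek, wrapped_deks


single_kms, double_kms = FakeKms(), FakeKms()
write_single_wrapped(single_kms, 1000, "mek-1")
write_double_wrapped(double_kms, 1000, "mek-1")
print(single_kms.wrap_calls, double_kms.wrap_calls)  # prints "1000 1"
```

The sketch uses a single KEK for all files purely for simplicity; how many files share a KEK is a design choice the thread leaves open.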

-Jack



Re: Extending Apache Iceberg Encryption Module

Posted by Gidon Gershinsky <gg...@gmail.com>.
Sounds good, thanks.
Responding to the points below:

*"we can choose to store the encrypted DEKs inside the manifest or as a
separate instruction file with a pointer in key_metadata, and there are
tradeoffs for those approaches"*

For the latter, we are running a similar mechanism in Parquet encryption,
where we keep the key material in separate JSON files, and a pointer to it
inside the Parquet file footer key_metadata fields. This works, but for
Iceberg integration, there are advantages in using the manifest files (or
another managed medium) instead. The trade-offs (including size additions,
consistency, management) are TBD.
Btw, there is an intermediate approach, where (the many) DEKs are encrypted
with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
this can be immutable, as long as the KEKs are encrypted with MEKs and
stored in a mutable medium that can be replaced/updated upon MEK rotation.
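
The intermediate approach can be sketched in a few lines of Python. This is a toy model under loud assumptions: toy_wrap is an insecure XOR placeholder standing in for real key wrapping, the KMS is simulated in-process (a real KMS would do the MEK operations server-side), and kek_store, manifest, rotate_mek and read_dek are hypothetical names, not Iceberg structures. It only shows that a MEK rotation touches the mutable KEK store while the manifest entries stay byte-for-byte unchanged.

```python
import hashlib
import os


def toy_wrap(key: bytes, wrapping_key: bytes) -> bytes:
    """Toy stand-in for key wrapping (NOT secure): XOR with a hash-derived pad.

    Applying it twice with the same wrapping key returns the original key,
    so the same function also serves as unwrap.
    """
    pad = hashlib.sha256(wrapping_key).digest()[: len(key)]
    return bytes(a ^ b for a, b in zip(key, pad))


# Master keys live only in the (simulated) KMS, indexed by version.
kms_master_keys = {"mek-v1": os.urandom(16)}

# Mutable medium (hypothetical): KEK id -> (MEK version, wrapped KEK).
kek = os.urandom(16)
kek_store = {"kek-1": ("mek-v1", toy_wrap(kek, kms_master_keys["mek-v1"]))}

# Immutable manifest entries (hypothetical layout): file -> key_metadata.
# The plaintext `deks` dict is kept only to verify the sketch afterwards.
deks = {f"data-{i}.parquet": os.urandom(16) for i in range(3)}
manifest = {
    path: {"kek_id": "kek-1", "wrapped_dek": toy_wrap(dek, kek)}
    for path, dek in deks.items()
}


def rotate_mek(new_version: str) -> None:
    """Re-wrap every KEK with a new MEK version; manifests are not touched."""
    kms_master_keys[new_version] = os.urandom(16)
    for kek_id, (old_version, wrapped_kek) in list(kek_store.items()):
        plain_kek = toy_wrap(wrapped_kek, kms_master_keys[old_version])
        kek_store[kek_id] = (
            new_version,
            toy_wrap(plain_kek, kms_master_keys[new_version]),
        )


def read_dek(path: str) -> bytes:
    """Unwrap the KEK with its recorded MEK version, then unwrap the DEK."""
    entry = manifest[path]
    version, wrapped_kek = kek_store[entry["kek_id"]]
    plain_kek = toy_wrap(wrapped_kek, kms_master_keys[version])
    return toy_wrap(entry["wrapped_dek"], plain_kek)


manifest_before = {path: dict(entry) for path, entry in manifest.items()}
rotate_mek("mek-v2")
assert manifest == manifest_before                 # manifest entries unchanged
assert all(read_dek(p) == deks[p] for p in deks)   # files remain readable
```

The sketch takes no position on whether the KEKs themselves should also be rotated.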

*"3 common cases: (1) direct DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID
+ encrypted KEK + encrypted DEK, and that should be enough to cover most of
the use cases with different types of KMS"*

Yep, (2) and (3) are single and double wrapping, respectively, which
cover our use cases. (1) is direct DEK passing; we've considered it for
Parquet, but decided against it, because it can lead to unsafe situations
where an inexperienced user passes the same DEK to many files (which can
break the GCM cipher, even with one table). But we might try to enable it
in Iceberg with strong preventive measures (if possible), TBD.
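
As a rough illustration of the three cases, the Python sketch below models a per-file key_metadata record and one possible preventive measure for case (1): refusing to let two files reference the same direct DEK. Everything here is an assumption made for illustration. KeyMetadata, its field names and DirectDekRegistry are hypothetical, nothing in this thread settles the actual layout, and an in-memory registry like this would not by itself protect distributed writers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class KeyMetadata:
    """Hypothetical per-file key_metadata covering the three cases."""
    dek_id: Optional[str] = None         # case (1): direct DEK ID
    kek_id: Optional[str] = None         # case (2): KEK ID + encrypted DEK
    mek_id: Optional[str] = None         # case (3): MEK ID + encrypted KEK + DEK
    wrapped_kek: Optional[bytes] = None
    wrapped_dek: Optional[bytes] = None

    def scheme(self) -> str:
        """Classify the record by which fields are populated."""
        present = {name for name, value in vars(self).items() if value is not None}
        if present == {"dek_id"}:
            return "direct"
        if present == {"kek_id", "wrapped_dek"}:
            return "single-wrap"
        if present == {"mek_id", "wrapped_kek", "wrapped_dek"}:
            return "double-wrap"
        raise ValueError(f"unsupported field combination: {sorted(present)}")


class DirectDekRegistry:
    """Rejects a second file that references an already-used direct DEK ID."""

    def __init__(self) -> None:
        self._file_by_dek: Dict[str, str] = {}

    def register(self, path: str, metadata: KeyMetadata) -> None:
        if metadata.scheme() != "direct":
            return  # wrapped schemes carry their own per-file DEK
        owner = self._file_by_dek.setdefault(metadata.dek_id, path)
        if owner != path:
            raise ValueError(
                f"DEK {metadata.dek_id!r} is already used by {owner}"
            )


registry = DirectDekRegistry()
registry.register("a.parquet", KeyMetadata(dek_id="dek-1"))
registry.register("b.parquet", KeyMetadata(kek_id="kek-1", wrapped_dek=b"..."))
try:
    registry.register("c.parquet", KeyMetadata(dek_id="dek-1"))
except ValueError as err:
    print(err)
```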

*"DDL clauses for encryption and key rotation*
*These definitely make sense to me. I will add a list of the DDL clauses I
was thinking about to the doc.*
*Cryptographic integrity of Data Tables*
*Yes, I think in this doc at least the location and structure of AAD prefix
should be discussed, so hopefully we can reach some general consensus for
integrity support for Iceberg tables and make sure the right information is
in place or can be added later."*

SGTM.

*"I am also working on a POC to flesh out some details for the aspects
described in the doc; I will update this thread once I publish that."*

We are also working on a POC of this technology. I guess we're working on
different corners at the moment, as we're mostly focused on Parquet
encryption integration, parts of key rotation, and GCM streams with AAD
prefixes for table integrity, while you are probably working on the catalog
metadata, general encrypted file streams, and the key management API. But
since there is a high potential for overlap, I'd suggest we coordinate the
POC work; what would be the best way of doing that?

Cheers, Gidon


On Tue, Mar 23, 2021 at 11:50 PM Jack Ye <ye...@gmail.com> wrote:

> Thanks for the feedback to the doc, we are also closely following the
> Parquet encryption work and would like to have that in Iceberg as soon as
> possible with the right architecture. Here are some brief thoughts for the
> points you mentioned in the email, I will add more details in the google
> doc:
>
>
>
> *Key rotation*
>
> My initial thought was to consider key rotation as a separated process and
> DEK rewrapping can be done with a Spark stored procedure, that's why I did
> not add any detail for it. But your point about the work needed to rewrite
> and clean up manifests is a really good point that I should fully describe
> the details.
>
> For instance, we can choose to store the encrypted DEKs inside the
> manifest or as a separated instruction file with a pointer in key_metadata,
> and there are tradeoffs for those approaches. I will update the doc for
> these details.
>
>
>
> *Acceleration of KMS interactions*
>
> Thanks for bringing up double wrapping, I was hesitant to mention that in
> the initial version of the doc because it would add complexity for
> understanding the overall architecture. And for the use cases I have seen
> with AWS KMS, people are all using single-wrapping and the service was able
> to handle generation of millions of DEKs, and it seems like there was no
> complaint about it.
>
> I think the right way to go is to support the 3 common cases: (1) direct
> DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID + encrypted KEK + encrypted
> DEK, and that should be enough to cover most of the use cases with
> different types of KMS. I will update the encryption spec with more
> details on that.
>
>
> *DDL clauses for encryption and key rotation*
>
> These definitely make sense to me. I will add a list of the DDL clauses I
> was thinking about to the doc.
>
>
> *Cryptographic integrity of Data Tables*
>
> Yes, I think in this doc at least the location and structure of AAD prefix
> should be discussed, so hopefully we can reach some general consensus for
> integrity support for Iceberg tables and make sure the right information is
> in place or can be added later.
>
>
>
> I am also working on a POC to flush out some details for the aspects
> described in the doc, I will update in this thread once I publish that.
>
>
> Best,
>
> Jack Ye
>
> On Tue, Mar 23, 2021 at 5:04 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
>> Hi Jack,
>>
>> We're working on Parquet encryption, which is about to be released in the
>> upcoming parquet-mr-1.12 version. Recently, we've started to look into its
>> integration in Iceberg. It became immediately clear we need to take a wider
>> view that covers other types of encryption in Iceberg (file streams and
>> ORC); otherwise, we'd end up with a number of silos.
>> At the time, there was no top-down design for data encryption in Iceberg,
>> so we've started to tinker with it. But now we can base this on your
>> document. I really liked it, a solid foundation.
>>
>> There are a number of high-level concepts I believe we'd need to add
>> there:
>>
>> - Key rotation in Iceberg
>> (Not just in KMS). The envelope encryption practice requires periodic (or
>> on-demand) re-wrapping of DEKs with new versions of master keys. KMS
>> generates the new versions, and keeps the master key history, but the
>> re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is
>> kept in manifest files, this means all manifest files must be deleted
>> (because they keep DEKs wrapped with the previous master key version, which
>> is not safe anymore), and created again with the updated key_metadata
>> field. We've quickly discussed this with Anton, seems to be feasible, but
>> there are other alternatives. We need to decide if manifests are the right
>> place to store all key_metadata; and to design a mechanism (potentially
>> with a DDL clause) to perform the rotation operation.
>>
>> - Acceleration of KMS interactions
>> KMSs can be very slow, especially when backed by HSMs. Per the doc, "The
>> KEK is stored in a key management service (KMS) to control access and key
>> rotation." We should not fetch secret keys from KMS, because this exposes
>> them; instead, many KMSs allow to wrap/encrypt DEKs inside the KMS server,
>> without ever exposing the master keys. But since we have to generate a DEK
>> per file/column, we'll end up with many KMS wrap calls when writing the
>> data (and many unwrap calls when reading the data). That's why Parquet
>> encryption uses a concept of double wrapping, where DEKs are wrapped with
>> KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are
>> stored/managed inside KMS.
>>
>> - DDL clauses for encryption and key rotation, such as
>> ALTER TABLE .. KEY_ROTATION (params)
>> ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext
>> files) - Russell's proposal
>> CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES
>> Btw, we can re-use the joint ORC/Parquet column encryption parameter
>> format, defined in this jira discussion started by Xinli -
>> https://issues.apache.org/jira/browse/HIVE-21848
>>
>> - Cryptographic integrity of Data Tables
>> Besides protecting data confidentiality, we need to protect its integrity
>> against tampering attacks. This one is a longer term work item, based on
>> these tickets:
>> https://github.com/apache/iceberg/issues/44,
>> https://github.com/apache/iceberg/issues/2060,
>> https://github.com/apache/iceberg/issues/2073
>> We'll work on these at a later stage, after the confidentiality basis is
>> ready; but we need to make sure the current work on confidentiality enables
>> (or at least doesn't block) the future integrity work. For example, we can
>> start using https://github.com/apache/iceberg/issues/2060 sooner rather
>> than later, for encrypting the Iceberg metadata files and Avro data files.
>>
>> That was a high level description, I'll add detailed comments inside the
>> design googledoc.
>>
>> Cheers, Gidon

Re: Extending Apache Iceberg Encryption Module

Posted by Jack Ye <ye...@gmail.com>.
Thanks for the feedback on the doc. We are also closely following the
Parquet encryption work and would like to have it in Iceberg as soon as
possible with the right architecture. Here are some brief thoughts on the
points you mentioned in the email; I will add more details in the Google
doc:



*Key rotation*

My initial thought was to treat key rotation as a separate process, with
DEK rewrapping done by a Spark stored procedure, which is why I did not add
any detail for it. But your point about the work needed to rewrite and
clean up manifests is a really good one, and I should describe the details
fully.

For instance, we can choose to store the encrypted DEKs inside the manifest
or in a separate instruction file with a pointer in key_metadata, and
there are tradeoffs between those approaches. I will update the doc with
these details.
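
As a rough illustration of the rewrapping step, here is a minimal Java
sketch. It is not the proposal's design: ToyKms, WrappedDek and rewrapAll
are hypothetical names, the KMS is an in-memory stand-in that keeps the
master key history, and wrapping uses the JDK's "AESWrap" cipher. It only
shows that rotation replaces the wrapped DEKs (wherever they are stored)
while the DEKs themselves, and so the data files, stay unchanged.

```java
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: re-wrap DEKs after a master key rotation.
final class RotationSketch {
  static SecretKey newAesKey() {
    try {
      KeyGenerator generator = KeyGenerator.getInstance("AES");
      generator.init(128);
      return generator.generateKey();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  // Re-wraps every DEK that is still wrapped with an older master key version.
  static List<WrappedDek> rewrapAll(ToyKms kms, List<WrappedDek> keyMetadata) {
    int latest = kms.latestVersion();
    List<WrappedDek> rotated = new ArrayList<>();
    for (WrappedDek old : keyMetadata) {
      if (old.masterKeyVersion == latest) {
        rotated.add(old); // already current, keep as is
      } else {
        SecretKey dek = kms.unwrap(old.wrappedKey, old.masterKeyVersion);
        rotated.add(new WrappedDek(latest, kms.wrap(dek, latest)));
      }
    }
    return rotated;
  }

  public static void main(String[] args) {
    ToyKms kms = new ToyKms();
    SecretKey dek = newAesKey();
    List<WrappedDek> metadata = new ArrayList<>();
    metadata.add(new WrappedDek(0, kms.wrap(dek, 0)));

    kms.rotate(); // master key version 1 now exists
    List<WrappedDek> rotated = rewrapAll(kms, metadata);
    SecretKey recovered = kms.unwrap(rotated.get(0).wrappedKey, 1);
    System.out.println("version after rotation: " + rotated.get(0).masterKeyVersion);
    System.out.println("same DEK: " + Arrays.equals(dek.getEncoded(), recovered.getEncoded()));
  }
}

// Hypothetical key metadata entry: a wrapped DEK plus the master key version used.
final class WrappedDek {
  final int masterKeyVersion;
  final byte[] wrappedKey;

  WrappedDek(int masterKeyVersion, byte[] wrappedKey) {
    this.masterKeyVersion = masterKeyVersion;
    this.wrappedKey = wrappedKey;
  }
}

// Hypothetical in-memory KMS stand-in; master keys never leave this object.
final class ToyKms {
  private final List<SecretKey> versions = new ArrayList<>();

  ToyKms() {
    rotate();
  }

  // Creates a new master key version and returns its index.
  int rotate() {
    versions.add(RotationSketch.newAesKey());
    return versions.size() - 1;
  }

  int latestVersion() {
    return versions.size() - 1;
  }

  byte[] wrap(SecretKey dek, int version) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.WRAP_MODE, versions.get(version));
      return cipher.wrap(dek);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  SecretKey unwrap(byte[] wrapped, int version) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.UNWRAP_MODE, versions.get(version));
      return (SecretKey) cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Whether the list of WrappedDek entries lives in manifests or in a separate
file only changes which files rewrapAll's output has to be written back to.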



*Acceleration of KMS interactions*

Thanks for bringing up double wrapping. I was hesitant to mention it in
the initial version of the doc because it would make the overall
architecture harder to understand. In the use cases I have seen with AWS
KMS, people all use single wrapping, the service has been able to handle
the generation of millions of DEKs, and there seem to have been no
complaints about it.

I think the right way to go is to support the 3 common cases: (1) direct
DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID + encrypted KEK + encrypted
DEK, and that should be enough to cover most of the use cases with
different types of KMS. I will update the encryption spec with more details
on that.
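
To make the three cases concrete, here is a minimal Java sketch of how a
reader might resolve a DEK from key metadata. All names (KeyMetadata, Mode,
ToyKms, resolveDek) are hypothetical and the field layout is an assumption,
not the encryption spec; the KMS is an in-memory stand-in and wrapping uses
the JDK's "AESWrap" cipher.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

final class KeyMetadataSketch {
  enum Mode { DIRECT_DEK, SINGLE_WRAP, DOUBLE_WRAP }

  // Hypothetical key metadata layout; fields a mode does not need are null.
  static final class KeyMetadata {
    final Mode mode;
    final String kmsKeyId;   // DEK ID, KEK ID or MEK ID, depending on mode
    final byte[] wrappedKek; // DOUBLE_WRAP only
    final byte[] wrappedDek; // SINGLE_WRAP and DOUBLE_WRAP

    KeyMetadata(Mode mode, String kmsKeyId, byte[] wrappedKek, byte[] wrappedDek) {
      this.mode = mode;
      this.kmsKeyId = kmsKeyId;
      this.wrappedKek = wrappedKek;
      this.wrappedDek = wrappedDek;
    }
  }

  // Hypothetical in-memory KMS stand-in.
  static final class ToyKms {
    private final Map<String, SecretKey> keys = new HashMap<>();

    void create(String id) { keys.put(id, newAesKey()); }
    // Case (1) only: the key itself leaves the KMS.
    SecretKey fetch(String id) { return keys.get(id); }
    byte[] wrap(String id, SecretKey key) { return aesWrap(keys.get(id), key); }
    SecretKey unwrap(String id, byte[] wrapped) { return aesUnwrap(keys.get(id), wrapped); }
  }

  static SecretKey resolveDek(ToyKms kms, KeyMetadata md) {
    switch (md.mode) {
      case DIRECT_DEK:  // (1) direct DEK ID
        return kms.fetch(md.kmsKeyId);
      case SINGLE_WRAP: // (2) KEK ID + encrypted DEK, unwrapped inside the KMS
        return kms.unwrap(md.kmsKeyId, md.wrappedDek);
      case DOUBLE_WRAP: // (3) MEK ID + encrypted KEK + encrypted DEK
        SecretKey kek = kms.unwrap(md.kmsKeyId, md.wrappedKek);
        return aesUnwrap(kek, md.wrappedDek); // local, no KMS call
      default:
        throw new IllegalArgumentException("unknown mode: " + md.mode);
    }
  }

  static SecretKey newAesKey() {
    try {
      KeyGenerator generator = KeyGenerator.getInstance("AES");
      generator.init(128);
      return generator.generateKey();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  static byte[] aesWrap(SecretKey wrappingKey, SecretKey key) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.WRAP_MODE, wrappingKey);
      return cipher.wrap(key);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  static SecretKey aesUnwrap(SecretKey wrappingKey, byte[] wrapped) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.UNWRAP_MODE, wrappingKey);
      return (SecretKey) cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    ToyKms kms = new ToyKms();
    kms.create("mek-1");
    SecretKey kek = newAesKey();
    SecretKey dek = newAesKey();
    KeyMetadata md =
        new KeyMetadata(Mode.DOUBLE_WRAP, "mek-1", kms.wrap("mek-1", kek), aesWrap(kek, dek));
    SecretKey resolved = resolveDek(kms, md);
    System.out.println("resolved: " + Arrays.equals(dek.getEncoded(), resolved.getEncoded()));
  }
}
```

A single mode field plus optional wrapped-key fields keeps the three cases
in one record; a real layout would also need versioning and algorithm IDs.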


*DDL clauses for encryption and key rotation*

These definitely make sense to me. I will add a list of the DDL clauses I
was thinking about to the doc.


*Cryptographic integrity of Data Tables*

Yes, I think this doc should at least discuss the location and structure of
the AAD prefix, so that we can hopefully reach some general consensus on
integrity support for Iceberg tables and make sure the right information is
in place or can be added later.
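
For readers less familiar with AAD, here is a minimal Java sketch of why an
AAD prefix helps integrity. The AAD layout used (a table-level prefix
followed by a file name) is only an assumption for illustration, and
AadSketch, encrypt and decryptOrNull are hypothetical names; the cipher is
the JDK's AES-GCM. A file encrypted under one AAD fails the tag check when
it is presented under another, which is how swapping files between tables
or paths can be detected.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class AadSketch {
  // Builds an AES-GCM cipher with a 128-bit tag and feeds it the AAD.
  private static Cipher gcm(int mode, byte[] key, byte[] nonce, String aad)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(mode, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, nonce));
    cipher.updateAAD(aad.getBytes(StandardCharsets.UTF_8));
    return cipher;
  }

  static byte[] encrypt(byte[] key, byte[] nonce, String aad, byte[] plaintext) {
    try {
      return gcm(Cipher.ENCRYPT_MODE, key, nonce, aad).doFinal(plaintext);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns the plaintext, or null when the authentication tag does not verify.
  static byte[] decryptOrNull(byte[] key, byte[] nonce, String aad, byte[] ciphertext) {
    try {
      return gcm(Cipher.DECRYPT_MODE, key, nonce, aad).doFinal(ciphertext);
    } catch (AEADBadTagException e) {
      return null; // wrong AAD, wrong key, or tampered ciphertext
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    SecureRandom random = new SecureRandom();
    byte[] key = new byte[16];
    byte[] nonce = new byte[12]; // must be unique per encryption under one key
    random.nextBytes(key);
    random.nextBytes(nonce);

    // Hypothetical AAD: table-level prefix plus file name.
    String aad = "db.events/data-00001.parquet";
    byte[] ciphertext = encrypt(key, nonce, aad, "rows".getBytes(StandardCharsets.UTF_8));

    System.out.println("same AAD verifies: "
        + (decryptOrNull(key, nonce, aad, ciphertext) != null));
    System.out.println("other AAD verifies: "
        + (decryptOrNull(key, nonce, "db.other/data-00001.parquet", ciphertext) != null));
  }
}
```

Where the prefix is stored matters because a reader has to obtain it from a
trusted place before it can verify anything.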



I am also working on a POC to flesh out some details of the aspects
described in the doc. I will post an update in this thread once I publish it.


Best,

Jack Ye

On Tue, Mar 23, 2021 at 5:04 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi Jack,
>
> We're working on Parquet encryption, which is about to be released in the
> upcoming parquet-mr-1.12 version. Recently, we've started to look into its
> integration in Iceberg. It became immediately clear we need to take a wider
> view that covers other types of encryption in Iceberg (file streams and
> ORC); otherwise, we'd end up with a number of silos.
> At the time, there was no top-down design for data encryption in Iceberg,
> so we've started to tinker with it. But now we can base this on your
> document. I really liked it, a solid foundation.
>
> There are a number of high-level concepts I believe we'd need to add there:
>
> - Key rotation in Iceberg
> (Not just in KMS). The envelope encryption practice requires periodic (or
> on-demand) re-wrapping of DEKs with new versions of master keys. KMS
> generates the new versions, and keeps the master key history, but the
> re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is
> kept in manifest files, this means all manifest files must be deleted
> (because they keep DEKs wrapped with the previous master key version, which
> is not safe anymore), and created again with the updated key_metadata
> field. We've quickly discussed this with Anton, seems to be feasible, but
> there are other alternatives. We need to decide if manifests are the right
> place to store all key_metadata; and to design a mechanism (potentially
> with a DDL clause) to perform the rotation operation.
>
> - Acceleration of KMS interactions
> KMSs can be very slow, especially when backed by HSMs. Per the doc, "The
> KEK is stored in a key management service (KMS) to control access and key
> rotation." We should not fetch secret keys from KMS, because this exposes
> them; instead, many KMSs allow to wrap/encrypt DEKs inside the KMS server,
> without ever exposing the master keys. But since we have to generate a DEK
> per file/column, we'll end up with many KMS wrap calls when writing the
> data (and many unwrap calls when reading the data). That's why Parquet
> encryption uses a concept of double wrapping, where DEKs are wrapped with
> KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are
> stored/managed inside KMS.
>
> - DDL clauses for encryption and key rotation, such as
> ALTER TABLE .. KEY_ROTATION (params)
> ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext
> files) - Russell's proposal
> CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES
> Btw, we can re-use the joint ORC/Parquet column encryption parameter
> format, defined in this jira discussion started by Xinli -
> https://issues.apache.org/jira/browse/HIVE-21848
>
> - Cryptographic integrity of Data Tables
> Besides protecting data confidentiality, we need to protect its integrity
> against tampering attacks. This one is a longer term work item, based on
> these tickets:
> https://github.com/apache/iceberg/issues/44,
> https://github.com/apache/iceberg/issues/2060,
> https://github.com/apache/iceberg/issues/2073
> We'll work on these at a later stage, after the confidentiality basis is
> ready; but we need to make sure the current work on confidentiality enables
> (or at least doesn't block) the future integrity work. For example, we can
> start using https://github.com/apache/iceberg/issues/2060 sooner rather
> than later, for encrypting the Iceberg metadata files and Avro data files.
>
> That was a high level description, I'll add detailed comments inside the
> design googledoc.
>
> Cheers, Gidon

Re: Extending Apache Iceberg Encryption Module

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Jack,

We're working on Parquet encryption, which is about to be released in the
upcoming parquet-mr-1.12 version. Recently, we've started to look into its
integration in Iceberg. It became immediately clear we need to take a wider
view that covers other types of encryption in Iceberg (file streams and
ORC); otherwise, we'd end up with a number of silos.
At the time, there was no top-down design for data encryption in Iceberg,
so we started to tinker with it. But now we can base this on your
document. I really liked it; it is a solid foundation.

There are a number of high-level concepts I believe we'd need to add there:

- Key rotation in Iceberg
(Not just in KMS). The envelope encryption practice requires periodic (or
on-demand) re-wrapping of DEKs with new versions of master keys. KMS
generates the new versions, and keeps the master key history, but the
re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is
kept in manifest files, this means all manifest files must be deleted
(because they keep DEKs wrapped with the previous master key version, which
is not safe anymore), and created again with the updated key_metadata
field. We've briefly discussed this with Anton; it seems to be feasible, but
there are other alternatives. We need to decide if manifests are the right
place to store all key_metadata; and to design a mechanism (potentially
with a DDL clause) to perform the rotation operation.

- Acceleration of KMS interactions
KMSs can be very slow, especially when backed by HSMs. Per the doc, "The
KEK is stored in a key management service (KMS) to control access and key
rotation." We should not fetch secret keys from KMS, because this exposes
them; instead, many KMSs allow wrapping/encrypting DEKs inside the KMS server,
without ever exposing the master keys. But since we have to generate a DEK
per file/column, we'll end up with many KMS wrap calls when writing the
data (and many unwrap calls when reading the data). That's why Parquet
encryption uses a concept of double wrapping, where DEKs are wrapped with
KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are
stored/managed inside KMS.

- DDL clauses for encryption and key rotation, such as
ALTER TABLE .. KEY_ROTATION (params)
ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext
files) - Russell's proposal
CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES
Btw, we can re-use the joint ORC/Parquet column encryption parameter
format, defined in this jira discussion started by Xinli -
https://issues.apache.org/jira/browse/HIVE-21848

- Cryptographic integrity of Data Tables
Besides protecting data confidentiality, we need to protect its integrity
against tampering attacks. This one is a longer-term work item, based on
these tickets:
https://github.com/apache/iceberg/issues/44,
https://github.com/apache/iceberg/issues/2060,
https://github.com/apache/iceberg/issues/2073
We'll work on these at a later stage, after the confidentiality basis is
ready; but we need to make sure the current work on confidentiality enables
(or at least doesn't block) the future integrity work. For example, we can
start using https://github.com/apache/iceberg/issues/2060 sooner rather
than later, for encrypting the Iceberg metadata files and Avro data files.
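
The double-wrapping idea from the "Acceleration of KMS interactions" item
above can be sketched in Java as follows. This is an illustration under
stated assumptions, not the Parquet implementation: CountingKms, singleWrap
and doubleWrap are hypothetical names, the KMS is an in-memory stand-in
whose master key never leaves it, and wrapping uses the JDK's "AESWrap"
cipher. The sketch only counts how many KMS wrap calls each scheme needs
when writing.

```java
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.List;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

final class DoubleWrapSketch {
  // Hypothetical KMS stand-in: wraps keys server-side and counts the calls.
  static final class CountingKms {
    private final SecretKey masterKey = newAesKey();
    int wrapCalls = 0;

    byte[] wrap(SecretKey key) {
      wrapCalls++;
      return aesWrap(masterKey, key);
    }
  }

  // Single wrapping: every DEK costs one KMS call.
  static List<byte[]> singleWrap(CountingKms kms, List<SecretKey> deks) {
    List<byte[]> wrapped = new ArrayList<>();
    for (SecretKey dek : deks) {
      wrapped.add(kms.wrap(dek));
    }
    return wrapped;
  }

  // Double wrapping: one KMS call wraps the KEK; DEKs are wrapped locally.
  // Returns the wrapped KEK followed by the wrapped DEKs.
  static List<byte[]> doubleWrap(CountingKms kms, List<SecretKey> deks) {
    SecretKey kek = newAesKey();
    List<byte[]> wrapped = new ArrayList<>();
    wrapped.add(kms.wrap(kek));
    for (SecretKey dek : deks) {
      wrapped.add(aesWrap(kek, dek));
    }
    return wrapped;
  }

  static SecretKey newAesKey() {
    try {
      KeyGenerator generator = KeyGenerator.getInstance("AES");
      generator.init(128);
      return generator.generateKey();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  static byte[] aesWrap(SecretKey wrappingKey, SecretKey key) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.WRAP_MODE, wrappingKey);
      return cipher.wrap(key);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    List<SecretKey> deks = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      deks.add(newAesKey()); // one DEK per file/column
    }
    CountingKms single = new CountingKms();
    singleWrap(single, deks);
    CountingKms dbl = new CountingKms();
    doubleWrap(dbl, deks);
    System.out.println("single wrapping KMS calls: " + single.wrapCalls);
    System.out.println("double wrapping KMS calls: " + dbl.wrapCalls);
  }
}
```

With 1000 DEKs the single-wrapping path makes 1000 KMS wrap calls and the
double-wrapping path makes one; the same ratio applies to unwrap calls on
the read side as long as the unwrapped KEK is cached.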

That was a high-level description; I'll add detailed comments inside the
design Google doc.

Cheers, Gidon