Posted to dev@iceberg.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2021/08/25 11:19:16 UTC

Data encryption in Iceberg

Hi all,

We briefly discussed this subject in a June sync and decided to
continue on the mailing list.
Jack and I have a number of pull requests that implement
separate pieces of the high-level design
<https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing>.
Some low-level details, such as generation and propagation of data keys,
are not covered in this document.
I have created a short (and hopefully simple) doc

https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing
 that focuses on these details and describes a bottom-up approach to the
generation of data keys, the encryption of data/delete files, and
options/phases for optimizing key management. The scope of the
document is intentionally narrow; it currently covers only the
simplest, minimal option. Reviews are very welcome. Later, this doc will be
merged into (or referenced from) the master design document.
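
To make the data-key idea above concrete, here is a minimal Java sketch of generating a random data key per file and wrapping it with a table key. It is an illustration only, not code from the design doc or the PRs: the class and method names are hypothetical, and the use of AES-GCM for wrapping and the "nonce followed by ciphertext" layout are assumptions.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: one random data key per file, wrapped with a table key.
public class DataKeySketch {
  private static final SecureRandom RANDOM = new SecureRandom();
  private static final int NONCE_LENGTH = 12; // bytes; the usual GCM nonce size
  private static final int TAG_BITS = 128;    // GCM authentication tag length

  /** Generates a fresh random data key for a single data or delete file. */
  static byte[] newDataKey(int lengthBytes) {
    byte[] key = new byte[lengthBytes];
    RANDOM.nextBytes(key);
    return key;
  }

  /** Encrypts the data key with the table key; returns nonce followed by ciphertext and tag. */
  static byte[] wrap(byte[] tableKey, byte[] dataKey) {
    try {
      byte[] nonce = new byte[NONCE_LENGTH];
      RANDOM.nextBytes(nonce);
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(tableKey, "AES"),
          new GCMParameterSpec(TAG_BITS, nonce));
      byte[] ciphertext = cipher.doFinal(dataKey);
      return ByteBuffer.allocate(nonce.length + ciphertext.length)
          .put(nonce).put(ciphertext).array();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to wrap data key", e);
    }
  }

  /** Reverses wrap(): reads the nonce prefix, then decrypts and verifies the rest. */
  static byte[] unwrap(byte[] tableKey, byte[] wrapped) {
    try {
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(tableKey, "AES"),
          new GCMParameterSpec(TAG_BITS, wrapped, 0, NONCE_LENGTH));
      return cipher.doFinal(wrapped, NONCE_LENGTH, wrapped.length - NONCE_LENGTH);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to unwrap data key", e);
    }
  }

  public static void main(String[] args) {
    byte[] tableKey = newDataKey(16);   // stand-in for a key obtained from a keyfile or KMS
    byte[] dataKey = newDataKey(16);    // used to encrypt one file
    byte[] keyMetadata = wrap(tableKey, dataKey); // what would be stored next to the file entry
    System.out.println("round trip ok: "
        + Arrays.equals(dataKey, unwrap(tableKey, keyMetadata)));
  }
}
```

In this sketch only the wrapped key travels with the file metadata; the table key itself never leaves the key source.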

Huaxin recently sent a PR with a basic encryption DDL; you can
find it here <https://github.com/apache/iceberg/pull/3013>. Next week, I'll
send a pull request with an implementation of the minimal encryption
option. That pull request collects the basics from my PRs 2639, 2638 and 2640
and Jack's PR 2443, and adds the key generation and other code needed for
an end-to-end implementation of the minimal design
<https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing>.
The PR comes with an example proposed by Ryan: using a table encryption
key from a keyfile (in "pkcs12" format, the closest thing to the "pem" format
for symmetric keys).
Besides the minimal version, I have a draft implementation of more advanced
data encryption options (including per-column keys, double wrapping and
two-tier key management, all described in the master design doc), but let's
take this one step at a time, starting with the simplest option.
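
As an illustration of the keyfile idea, the sketch below stores a symmetric AES key in a PKCS12 file and reads it back using the standard Java KeyStore API. It is not the example from the PR: the class name, method names, alias and password handling are all hypothetical, chosen only to show that a PKCS12 file can hold a secret key.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.Arrays;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: a table encryption key kept in a PKCS12 keyfile.
public class KeyfileSketch {

  /** Generates a random AES key of the given size in bits. */
  static SecretKey newAesKey(int bits) {
    try {
      KeyGenerator generator = KeyGenerator.getInstance("AES");
      generator.init(bits);
      return generator.generateKey();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("AES is not available", e);
    }
  }

  /** Writes the key into a new PKCS12 file under the given alias. */
  static void writeKeyfile(Path file, String alias, char[] password, SecretKey key) {
    try {
      KeyStore store = KeyStore.getInstance("PKCS12");
      store.load(null, password); // start from an empty keystore
      store.setEntry(alias, new KeyStore.SecretKeyEntry(key),
          new KeyStore.PasswordProtection(password));
      try (OutputStream out = Files.newOutputStream(file)) {
        store.store(out, password);
      }
    } catch (GeneralSecurityException | IOException e) {
      throw new IllegalStateException("Failed to write keyfile " + file, e);
    }
  }

  /** Loads the PKCS12 file and returns the secret key stored under the alias. */
  static SecretKey readKey(Path file, String alias, char[] password) {
    try {
      KeyStore store = KeyStore.getInstance("PKCS12");
      try (InputStream in = Files.newInputStream(file)) {
        store.load(in, password);
      }
      return (SecretKey) store.getKey(alias, password);
    } catch (GeneralSecurityException | IOException e) {
      throw new IllegalStateException("Failed to read keyfile " + file, e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("table-key", ".p12");
    char[] password = "change-me".toCharArray(); // placeholder password for the demo
    SecretKey original = newAesKey(128);
    writeKeyfile(file, "table-key", password, original);
    SecretKey loaded = readKey(file, "table-key", password);
    System.out.println("same key: "
        + Arrays.equals(original.getEncoded(), loaded.getEncoded()));
    Files.delete(file);
  }
}
```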

Cheers, Gidon

Fwd: Data encryption in Iceberg

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi all,

The encryption sync is set for next Tue, October 5, at 9am PDT.
If anyone else is interested in joining, let me know in a direct message.

Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky <gg...@gmail.com>
Date: Wed, Sep 1, 2021 at 9:10 PM
Subject: Fwd: Data encryption in Iceberg
To: <de...@iceberg.apache.org>


Hi all,

Per the sync this morning, we'll have a meeting on encryption-related
efforts in Iceberg. Before we discuss the day/time options, please let us
know who is interested in joining: respond here or send a direct message to
Ryan, Jack or me.

Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky <gg...@gmail.com>
Date: Mon, Aug 30, 2021 at 5:57 PM
Subject: Re: Data encryption in Iceberg
To: <de...@iceberg.apache.org>


Hi Jack,

Thank you. We have indeed been busy building the Iceberg data encryption
code, since we have considerable demand for this functionality (with timeline
requirements).
I've published an initial end-to-end implementation (PR 3053), composed of
new code that handles the generation of data keys and of existing
code (with some modifications) from the current PRs listed below. So this
is joint work, with contributions from both of us; I'm sure there are
ways to recognize PR co-authorship :).

As I mentioned, this is the simplest version (without double wrapping,
column-specific master keys and two-tier key management). I have a prototype
of these advanced data encryption features, but thought it best
to start with an MVP, which is easier for the community to digest and allows
a gradual, layer-by-layer implementation. In my understanding, the MVP can
start without key rotation, because rotation has two parts. The main one
(key rotation in the KMS) is totally transparent to Iceberg. The other part
(re-wrapping of key_metadata and re-writing of manifest files and manifest
lists) is required only in threat models that cover the risk of master keys
being compromised or leaked, so it is a less universal requirement and can be
added post-MVP. But if you hold a different view on this, or need the
second part of key rotation now, I'm sure this is doable; I just hope it
won't slow down the MVP work.
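
The re-wrapping part of key rotation mentioned above can be pictured with a small Java sketch: the wrapped data key is unwrapped with the old master key and wrapped again with the new one, while the data key itself (and therefore the encrypted data file) stays unchanged. This is only an illustration under stated assumptions: the class and method names are hypothetical, the standard "AESWrap" cipher is used as a stand-in for whatever wrapping the KMS or the design actually applies, and the real key_metadata layout is not modeled.

```java
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: re-wrapping a data key when the master key is replaced.
public class RewrapSketch {

  /** Wraps a data key with a master key using the JDK's AES key wrap cipher. */
  static byte[] wrap(byte[] masterKey, byte[] dataKey) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.WRAP_MODE, new SecretKeySpec(masterKey, "AES"));
      return cipher.wrap(new SecretKeySpec(dataKey, "AES"));
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to wrap data key", e);
    }
  }

  /** Recovers the data key; fails if the master key does not match. */
  static byte[] unwrap(byte[] masterKey, byte[] wrapped) {
    try {
      Cipher cipher = Cipher.getInstance("AESWrap");
      cipher.init(Cipher.UNWRAP_MODE, new SecretKeySpec(masterKey, "AES"));
      Key dataKey = cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
      return dataKey.getEncoded();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to unwrap data key", e);
    }
  }

  /** Rotation step for one entry: same data key, new wrapping. */
  static byte[] rewrap(byte[] oldMasterKey, byte[] newMasterKey, byte[] wrapped) {
    return wrap(newMasterKey, unwrap(oldMasterKey, wrapped));
  }

  public static void main(String[] args) {
    SecureRandom random = new SecureRandom();
    byte[] oldMaster = new byte[16];
    byte[] newMaster = new byte[16];
    byte[] dataKey = new byte[16];
    random.nextBytes(oldMaster);
    random.nextBytes(newMaster);
    random.nextBytes(dataKey);

    byte[] before = wrap(oldMaster, dataKey);
    byte[] after = rewrap(oldMaster, newMaster, before);
    // The data key is unchanged, so the data file would not need to be re-encrypted.
    System.out.println("data key preserved: "
        + Arrays.equals(dataKey, unwrap(newMaster, after)));
  }
}
```

In this picture, rotation touches only the metadata that carries wrapped keys, which is consistent with the need to rewrite manifest files and manifest lists rather than data files.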

Having said that, there is a feature I believe would be a really good
addition to the MVP: the encryption of manifests and manifest
lists. I presume you refer to it in your mail. If you have an internal
branch with its implementation, porting it to open source would be much
appreciated. We need this capability: the data is encrypted, but the
stats are not, which is not great even though they are highly
aggregated (a sort of range mask).

We can chat about this at the upcoming sync, but I support the suggestion
to set up a more detailed discussion to align the encryption-related
efforts.

Cheers, Gidon


On Sun, Aug 29, 2021 at 11:08 PM Jack Ye <ye...@gmail.com> wrote:

> Hi Gidon and Huaxin,
>
> Thanks for continuing the effort on Iceberg encryption support. I have
> not had enough time to work on this area since the design discussion; so
> far I have only managed to add key metadata for manifest files, and there are
> quite a few changes in our internal branch that I need to port to open
> source. I will start on that in the next few days.
>
> Regarding the design, I wonder if we should start by defining the
> actions API, with a Spark implementation, for file encryption key rotation,
> and then discuss the user experience.
>
> In the original design document, I think we did not reach consensus with
> the community on how to expose key rotation functionality.
> In Spark, we can either do it through a DDL extension or implement it as a
> procedure. Given that this is a long-running distributed operation, my
> feeling is that the community will lean towards a procedure call.
>
> We can continue this discussion while working on the
> detailed implementation. Let's set up a discussion so that we
> can align the efforts.
>
> Best,
> Jack Ye
>
>

Re: Data encryption in Iceberg

Posted by Yufei Gu <fl...@gmail.com>.
Hi Gidon, please add me to the meeting.

On Wed, Sep 1, 2021 at 11:34 AM Maya Anderson <ma...@gmail.com>
wrote:

> Hi Gidon,
>
> I would like to join the meeting.
>
> Thanks,
> Maya Anderson
>
>
> --
> Regards,
> Maya
>
-- 
Best,

Yufei

`This is not a contribution`

Re: Data encryption in Iceberg

Posted by Maya Anderson <ma...@gmail.com>.
Hi Gidon,

I would like to join the meeting.

Thanks,
Maya Anderson



-- 
Regards,
Maya

Fwd: Data encryption in Iceberg

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi all,

Per the sync this morning, we'll have a meeting on encryption-related
efforts in Iceberg. Before we discuss the day/time options, please let us
know who is interested in joining: respond here or send a direct message to
Ryan, Jack or me.

Cheers, Gidon



Re: Data encryption in Iceberg

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Jack,

Thank you. We have indeed been busy building the Iceberg data encryption
code, since we have considerable demand for this functionality (with timeline
requirements).
I've published an initial end-to-end implementation (PR 3053), composed of
new code that handles the generation of data keys and of existing
code (with some modifications) from the current PRs listed below. So this
is joint work, with contributions from both of us; I'm sure there are
ways to recognize PR co-authorship :).

As I mentioned, this is the simplest version (without double wrapping,
column-specific master keys and two-tier key management). I have a prototype
of these advanced data encryption features, but thought it best
to start with an MVP, which is easier for the community to digest and allows
a gradual, layer-by-layer implementation. In my understanding, the MVP can
start without key rotation, because rotation has two parts. The main one
(key rotation in the KMS) is totally transparent to Iceberg. The other part
(re-wrapping of key_metadata and re-writing of manifest files and manifest
lists) is required only in threat models that cover the risk of master keys
being compromised or leaked, so it is a less universal requirement and can be
added post-MVP. But if you hold a different view on this, or need the
second part of key rotation now, I'm sure this is doable; I just hope it
won't slow down the MVP work.

Having said that, there is a feature I believe would be a really good
addition to the MVP: the encryption of manifests and manifest
lists. I presume you refer to it in your mail. If you have an internal
branch with its implementation, porting it to open source would be much
appreciated. We need this capability: the data is encrypted, but the
stats are not, which is not great even though they are highly
aggregated (a sort of range mask).

We can chat about this at the upcoming sync, but I support the suggestion
to set up a more detailed discussion to align the encryption-related
efforts.

Cheers, Gidon



Re: Data encryption in Iceberg

Posted by Jack Ye <ye...@gmail.com>.
Hi Gidon and Huaxin,

Thanks for continuing the effort on Iceberg encryption support. I have not
had enough time to work on this area since the design discussion. So far I
have only managed to add key metadata for manifest files, and there are
quite a few changes in our internal branch that I need to port to open
source. I will start on that in the next few days.

Regarding the design, I wonder if we should first start with defining the
actions API with a Spark implementation for file encryption key rotation,
and then discuss the user experience.
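
As a rough illustration of what an actions-style API for key rotation could
look like, here is a minimal Java sketch. Everything in it is an assumption
made for illustration: the RotateFileKeys interface, the Result class and the
in-memory map of wrapped keys are hypothetical and are not existing Iceberg
APIs. It also assumes one possible reading of "rotation", in which each
file's data key is unwrapped with the current key encryption key and
re-wrapped with a new one, using the standard JCE "AESWrap" cipher. A real
Spark implementation would distribute this work and update table metadata.

```java
import java.security.GeneralSecurityException;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class KeyRotationSketch {

  // Hypothetical action shape (not an existing Iceberg interface):
  // configure the action, then call execute() to get a result.
  interface RotateFileKeys {
    RotateFileKeys newKeyEncryptionKey(SecretKey newKek);
    Result execute() throws GeneralSecurityException;
  }

  // Hypothetical result type: how many wrapped data keys were rewritten.
  static final class Result {
    final int rewrappedKeyCount;
    Result(int rewrappedKeyCount) { this.rewrappedKeyCount = rewrappedKeyCount; }
  }

  // In-memory stand-in for per-file key metadata: file path -> wrapped data key.
  static final class InMemoryRotation implements RotateFileKeys {
    private final Map<String, byte[]> wrappedKeys;
    private final SecretKey currentKek;
    private SecretKey newKek;

    InMemoryRotation(Map<String, byte[]> wrappedKeys, SecretKey currentKek) {
      this.wrappedKeys = wrappedKeys;
      this.currentKek = currentKek;
    }

    @Override
    public RotateFileKeys newKeyEncryptionKey(SecretKey newKek) {
      this.newKek = newKek;
      return this;
    }

    @Override
    public Result execute() throws GeneralSecurityException {
      if (newKek == null) {
        throw new IllegalStateException("new key encryption key not set");
      }
      int count = 0;
      for (Map.Entry<String, byte[]> entry : wrappedKeys.entrySet()) {
        // Unwrap with the current key, re-wrap with the new one.
        SecretKey dataKey = unwrap(currentKek, entry.getValue());
        entry.setValue(wrap(newKek, dataKey));
        count++;
      }
      return new Result(count);
    }
  }

  static SecretKey newAesKey() throws GeneralSecurityException {
    KeyGenerator generator = KeyGenerator.getInstance("AES");
    generator.init(128);
    return generator.generateKey();
  }

  static byte[] wrap(SecretKey kek, SecretKey dataKey) throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AESWrap");
    cipher.init(Cipher.WRAP_MODE, kek);
    return cipher.wrap(dataKey);
  }

  static SecretKey unwrap(SecretKey kek, byte[] wrapped) throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AESWrap");
    cipher.init(Cipher.UNWRAP_MODE, kek);
    return (SecretKey) cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
  }

  public static void main(String[] args) throws GeneralSecurityException {
    SecretKey oldKek = newAesKey();
    SecretKey newKek = newAesKey();
    Map<String, byte[]> wrappedKeys = new HashMap<>();
    wrappedKeys.put("data/file-a.parquet", wrap(oldKek, newAesKey()));

    Result result = new InMemoryRotation(wrappedKeys, oldKek)
        .newKeyEncryptionKey(newKek)
        .execute();
    System.out.println("re-wrapped keys: " + result.rewrappedKeyCount);
  }
}
```

Whether this ends up exposed through a DDL extension or a procedure, the
same action could sit underneath, so the user-facing syntax can be decided
separately.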

In the original design document, I think we did not reach a consensus with
the community on how to expose key rotation functionality. In Spark, we can
either do it through a DDL extension or implement it as a procedure. Given
that this is a long-running distributed operation, my feeling is that the
community will lean towards a procedure call.

We can continue this discussion while working on the detailed
implementation first. Let's set up a discussion so that we can align the
efforts.

Best,
Jack Ye


On Wed, Aug 25, 2021 at 4:19 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi all,
>
> We have briefly discussed this subject in a June sync, with a decision to
> continue via the mailing list.
> There are a number of pull requests from Jack and myself that implement a
> set of disjoint elements from the high-level design
> <https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing>.
> Some low-level details, such as generation and propagation of data keys,
> are not covered in this document.
> I have created a short (and hopefully simple) doc
>
> https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing
>  that focuses on these details and describes the bottom-up approach to
> generation of data keys, encryption of data/delete files, and
> options/phases for optimization of key management. The scope of the
> document is intentionally narrow, and currently focuses on the minimal
> simplest option. Reviews are very welcome. Later, this doc will be merged
> in (or referenced from) the master design document.
>
> A PR with a basic encryption DDL has been sent recently by Huaxin, you can
> find it here <https://github.com/apache/iceberg/pull/3013>. Next week,
> I'll send a pull request with an implementation of the minimal encryption
> option. This pull request collects the basics from my PRs 2639, 2638, 2640
> and Jack's PR 2443; adding the key generation and other code that creates
> an end-to-end implementation of the minimal design
> <https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing>.
> This PR comes with an example proposed by Ryan - using a table encryption
> key from a keyfile ("pkcs12" format - the closest thing to the "pem" format
> for symmetric keys).
> Besides the minimal version, I have a draft implementation of more
> advanced data encryption options (including per-column keys, double
> wrapping and two-tier management - all described in the master design doc)
> - but let's take this one step at a time, starting with the simplest option.
>
> Cheers, Gidon
>