Posted to dev@iceberg.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2021/09/01 18:10:40 UTC

Fwd: Data encryption in Iceberg

Hi all,

Per the sync this morning, we'll have a meeting on encryption-related
efforts in Iceberg. Before we discuss the day/time options, please let us
know who's interested in joining: respond here or send a direct message to
Ryan, Jack or myself.

Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky <gg...@gmail.com>
Date: Mon, Aug 30, 2021 at 5:57 PM
Subject: Re: Data encryption in Iceberg
To: <de...@iceberg.apache.org>


Hi Jack,

Thank you. We've indeed been busy building the Iceberg data encryption
code, since we have quite a demand for this functionality (with timeline
requirements..).
I've published an initial end-to-end implementation (PR 3053), comprising
new code that handles the generation of data keys, and the existing
code (with some modifications) from the current PRs listed below (so this
is joint work, with contributions from both of us; I'm sure there are
ways to recognize PR co-authorship :).

As I mentioned, this is the simplest version (without double wrapping,
column-specific master keys and two-tier key management). I have a prototype
for these advanced data encryption features, but thought it might be best
to start with an MVP: it is easier for the community to digest, and allows
for a gradual layer-by-layer implementation. In my understanding, the MVP
can start without key rotation, because the latter has two parts. The main
one (key rotation in the KMS) is totally transparent to Iceberg. The other
part (re-wrapping of key_metadata and re-writing of manifest files and
manifest lists) is required only in threat models that cover a risk of
master keys being compromised/leaked, so this is a less universal
requirement and can be added post-MVP. But if you hold a different view on
this, or need the second part of key rotation now, I'm sure this is doable;
I just hope it won't slow down the MVP work.
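
As an illustration of the second part of key rotation described above, here
is a minimal Java sketch of re-wrapping a data key under a new master key.
It is only a sketch under assumptions: the class and method names
(KeyRewrapSketch, wrap, unwrap, rewrap) are hypothetical and are not taken
from PR 3053, and local AES keys with the JDK's AESWrap cipher stand in for
the KMS calls that a real deployment would make. The point it tries to show
is that re-wrapping replaces only the wrapped key kept in key metadata; the
data key itself stays the same.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: local keys stand in for master keys held in a KMS.
class KeyRewrapSketch {

  // Generate a random 128-bit AES key (used here for master and data keys alike).
  static SecretKey newAesKey() throws Exception {
    KeyGenerator generator = KeyGenerator.getInstance("AES");
    generator.init(128, new SecureRandom());
    return generator.generateKey();
  }

  // Wrap (encrypt) a data key with a master key using the JDK AESWrap cipher.
  static byte[] wrap(SecretKey masterKey, SecretKey dataKey) throws Exception {
    Cipher cipher = Cipher.getInstance("AESWrap");
    cipher.init(Cipher.WRAP_MODE, masterKey);
    return cipher.wrap(dataKey);
  }

  // Unwrap (decrypt) a wrapped data key with the master key that wrapped it.
  static SecretKey unwrap(SecretKey masterKey, byte[] wrappedKey) throws Exception {
    Cipher cipher = Cipher.getInstance("AESWrap");
    cipher.init(Cipher.UNWRAP_MODE, masterKey);
    return (SecretKey) cipher.unwrap(wrappedKey, "AES", Cipher.SECRET_KEY);
  }

  // Re-wrap: recover the data key with the old master key, then wrap it with the new one.
  static byte[] rewrap(SecretKey oldMaster, SecretKey newMaster, byte[] wrappedKey)
      throws Exception {
    return wrap(newMaster, unwrap(oldMaster, wrappedKey));
  }

  public static void main(String[] args) throws Exception {
    SecretKey oldMaster = newAesKey();
    SecretKey newMaster = newAesKey();
    SecretKey dataKey = newAesKey();

    byte[] wrapped = wrap(oldMaster, dataKey);
    byte[] rewrapped = rewrap(oldMaster, newMaster, wrapped);

    // The data key recovered with the new master key matches the original one.
    boolean sameKey =
        Arrays.equals(unwrap(newMaster, rewrapped).getEncoded(), dataKey.getEncoded());
    System.out.println("data key preserved after re-wrap: " + sameKey);
  }
}
```

In this sketch only the wrapped bytes change, which is why rotation would
need the metadata that stores them to be rewritten.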

Having said that - there is a feature I believe would be a really good
addition to the MVP. This is the encryption of manifests and manifest
lists. I presume you refer to it in your mail. If you have an internal
branch with its implementation, porting this to open source would be much
appreciated. We need this capability (yes, the data is encrypted; but the
stats are not.. which is not great, even if they are actually highly
aggregated, a sort of range mask).

We can chat about this at the upcoming sync, but I support the suggestion
to set up a more detailed discussion to align the encryption-related
efforts.

Cheers, Gidon


On Sun, Aug 29, 2021 at 11:08 PM Jack Ye <ye...@gmail.com> wrote:

> Hi Gidon and Huaxin,
>
> Thanks for continuing with the effort in Iceberg encryption support. I have
> not had enough time to work on this area since the design discussion; so
> far I have only managed to add key metadata for manifest files, and there
> are quite a few changes in our internal branch that I need to port to open
> source. I will start to do it in the next few days.
>
> Regarding the design, I wonder if we should first start with defining the
> actions API with a Spark implementation for file encryption key rotation,
> and then discuss the user experience.
>
> In the original design document, I think we did not reach a consensus with
> the community around the actual way to expose key rotation functionality.
> In Spark, we can either do it through a DDL extension, or implement it as a
> procedure. Given that this is a long-running distributed procedure, my
> feeling is that the community will lean towards a procedure call.
>
> We can continue with the discussion around this while first doing the
> detailed implementation. Let's set up a discussion around this so that we
> can align the efforts.
>
> Best,
> Jack Ye
>
>
> On Wed, Aug 25, 2021 at 4:19 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
>> Hi all,
>>
>> We have briefly discussed this subject in a June sync, with a decision to
>> continue via the mailing list.
>> There are a number of pull requests from Jack and myself that implement a
>> set of disjoint elements from the high-level design
>> <https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing>.
>> Some low-level details, such as generation and propagation of data keys,
>> are not covered in this document.
>> I have created a short (and hopefully simple) doc
>> <https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing>
>> that focuses on these details and describes the bottom-up approach to
>> generation of data keys, encryption of data/delete files, and
>> options/phases for optimization of key management. The scope of the
>> document is intentionally narrow, and currently focuses on the simplest,
>> minimal option. Reviews are very welcome. Later, this doc will be merged
>> into (or referenced from) the master design document.
>>
>> Huaxin recently sent a PR with a basic encryption DDL; you
>> can find it here <https://github.com/apache/iceberg/pull/3013>. Next
>> week, I'll send a pull request with an implementation of the minimal
>> encryption option. This pull request collects the basics from my PRs 2639,
>> 2638, 2640 and Jack's PR 2443, adding the key generation and other code
>> that creates an end-to-end implementation of the minimal design
>> <https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing>.
>> This PR comes with an example proposed by Ryan: using a table encryption
>> key from a keyfile ("pkcs12" format - the closest thing to the "pem" format
>> for symmetric keys).
>> Besides the minimal version, I have a draft implementation of more
>> advanced data encryption options (including per-column keys, double
>> wrapping and two-tier management - all described in the master design doc)
>> - but let's take this one step at a time, starting with the simplest option.
>>
>> Cheers, Gidon
>>
>
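
The keyfile example mentioned in the quoted message (a table encryption key
kept in "pkcs12" format) could be approximated with the JDK KeyStore API
roughly as follows. This is a hedged sketch, not the code from the PR:
KeyfileSketch, writeKeyfile, readKey and the alias "table-key" are
hypothetical names, and the keystore is kept in memory rather than written
to a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.security.KeyStore;
import java.util.Arrays;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: store and load a symmetric key in a PKCS12 keystore.
class KeyfileSketch {

  // Serialize a PKCS12 keystore that holds one secret key under the given alias.
  static byte[] writeKeyfile(String alias, SecretKey key, char[] password) throws Exception {
    KeyStore store = KeyStore.getInstance("PKCS12");
    store.load(null, null); // start from an empty keystore
    store.setEntry(
        alias, new KeyStore.SecretKeyEntry(key), new KeyStore.PasswordProtection(password));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    store.store(out, password);
    return out.toByteArray();
  }

  // Load the keystore bytes and return the secret key stored under the alias.
  static SecretKey readKey(byte[] keyfile, String alias, char[] password) throws Exception {
    KeyStore store = KeyStore.getInstance("PKCS12");
    store.load(new ByteArrayInputStream(keyfile), password);
    return (SecretKey) store.getKey(alias, password);
  }

  public static void main(String[] args) throws Exception {
    KeyGenerator generator = KeyGenerator.getInstance("AES");
    generator.init(128);
    SecretKey tableKey = generator.generateKey();
    char[] password = "changeit".toCharArray(); // placeholder password for the sketch

    byte[] keyfile = writeKeyfile("table-key", tableKey, password);
    SecretKey loaded = readKey(keyfile, "table-key", password);

    // Compare raw key bytes of the original and the reloaded key.
    System.out.println(
        "key round-trips: " + Arrays.equals(tableKey.getEncoded(), loaded.getEncoded()));
  }
}
```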

Fwd: Data encryption in Iceberg

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi all,

The encryption sync is set for next Tue, October 5, at 9am PDT.
If anyone else is interested in joining, let me know in a direct message.

Cheers, Gidon



Re: Data encryption in Iceberg

Posted by Yufei Gu <fl...@gmail.com>.
Hi Gidon, please add me to the meeting.

On Wed, Sep 1, 2021 at 11:34 AM Maya Anderson <ma...@gmail.com>
wrote:

> Hi Gidon,
>
> I would like to join the meeting.
>
> Thanks,
> Maya Anderson
>
>
> --
> Regards,
> Maya
>
-- 
Best,

Yufei

`This is not a contribution`

Re: Data encryption in Iceberg

Posted by Maya Anderson <ma...@gmail.com>.
Hi Gidon,

I would like to join the meeting.

Thanks,
Maya Anderson



-- 
Regards,
Maya