Posted to issues@spark.apache.org by "Gidon Gershinsky (Jira)" <ji...@apache.org> on 2021/04/18 05:45:00 UTC
[jira] [Commented] (SPARK-33966) Two-tier encryption key management

    [ https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324421#comment-17324421 ] 

Gidon Gershinsky commented on SPARK-33966:
------------------------------------------

This Jira (and its subtasks) requires considerable changes and additions in Spark and in the underlying format libraries. We are working on the design, but neither the design nor the implementation will be ready in time for the 3.2.0 release.

> Two-tier encryption key management
> ----------------------------------
>
>                 Key: SPARK-33966
>                 URL: https://issues.apache.org/jira/browse/SPARK-33966
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Gidon Gershinsky
>            Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column encryption capability. The data protection follows the practice of envelope encryption, where the Data Encryption Key (DEK) is freshly generated for each file/column and is encrypted with a master key (or with an intermediate key, which is in turn encrypted with a master key). The master keys are kept in a centralized Key Management Service (KMS), meaning that each Spark worker needs to interact with a (typically slow) KMS server.
> This Jira (and its sub-tasks) introduces an alternative approach that, on the one hand, preserves the best practice of generating fresh encryption keys for each data file/column and, on the other hand, allows Spark clusters to have a scalable interaction with a KMS server by delegating it to the application driver. This is done via two-tier management of the keys: a random Key Encryption Key (KEK) is generated by the driver, encrypted by the master key in the KMS, and distributed by the driver to the workers, which use it to encrypt the DEKs generated there by the Parquet or ORC libraries. In the write path, the KEKs are distributed within each worker to the executors/threads. In the read path, the encrypted KEKs are fetched by the workers from file metadata, decrypted via interaction with the driver, and shared among the executors/threads.
> The KEK layer further improves the scalability of the key management, because neither the driver nor the workers need to interact with the KMS for each file/column.
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks (e.g., Presto, pandas) must be able to read/decrypt the files written/encrypted by this Spark-driven key management mechanism, and vice versa (of course, only if both sides have proper authorisation for using the master keys in the KMS).
> A link to a discussion/design doc is attached.
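
For illustration only, the two-tier flow in the description above can be sketched as below. This is not the proposed Spark, Parquet, or ORC API: every name (MockKms, toy_wrap, driver_create_kek, worker_write_file, read_dek, the "mk1" master key ID, the metadata field names) is hypothetical, and the HMAC-based XOR wrap is only a standard-library stand-in for real key wrapping such as AES-GCM. It must not be used to protect real keys. The sketch only shows that the KMS is contacted once per KEK rather than once per file/column.

```python
import hashlib
import hmac
import os


def toy_wrap(wrapping_key: bytes, key: bytes) -> bytes:
    """Toy stand-in for real key wrapping (e.g. AES-GCM). NOT secure."""
    nonce = os.urandom(16)
    pad = hmac.new(wrapping_key, nonce, hashlib.sha256).digest()[:len(key)]
    return nonce + bytes(a ^ b for a, b in zip(key, pad))


def toy_unwrap(wrapping_key: bytes, wrapped: bytes) -> bytes:
    """Inverse of toy_wrap: re-derive the pad from the nonce and XOR it off."""
    nonce, body = wrapped[:16], wrapped[16:]
    pad = hmac.new(wrapping_key, nonce, hashlib.sha256).digest()[:len(body)]
    return bytes(a ^ b for a, b in zip(body, pad))


class MockKms:
    """Hypothetical in-memory KMS that counts wrap/unwrap requests."""

    def __init__(self):
        self._master_keys = {"mk1": os.urandom(16)}
        self.calls = 0

    def wrap(self, master_key_id: str, key: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._master_keys[master_key_id], key)

    def unwrap(self, master_key_id: str, wrapped: bytes) -> bytes:
        self.calls += 1
        return toy_unwrap(self._master_keys[master_key_id], wrapped)


def driver_create_kek(kms: MockKms, master_key_id: str):
    """Driver side: generate a random KEK and wrap it with the master key."""
    kek = os.urandom(16)
    return kek, kms.wrap(master_key_id, kek)


def worker_write_file(kek: bytes, wrapped_kek: bytes):
    """Worker side: fresh DEK per file, wrapped locally with the KEK (no KMS call)."""
    dek = os.urandom(16)
    metadata = {"wrapped_kek": wrapped_kek, "wrapped_dek": toy_wrap(kek, dek)}
    return dek, metadata


def read_dek(kms: MockKms, master_key_id: str, metadata: dict, kek_cache: dict) -> bytes:
    """Read path: unwrap the KEK via the KMS once, cache it, then unwrap the DEK."""
    wrapped_kek = metadata["wrapped_kek"]
    if wrapped_kek not in kek_cache:
        kek_cache[wrapped_kek] = kms.unwrap(master_key_id, wrapped_kek)
    return toy_unwrap(kek_cache[wrapped_kek], metadata["wrapped_dek"])


kms = MockKms()
kek, wrapped_kek = driver_create_kek(kms, "mk1")

# Write path: 100 files, each with its own DEK, all sharing one KEK.
files = [worker_write_file(kek, wrapped_kek) for _ in range(100)]
calls_after_write = kms.calls

# Read path: the cache stands in for the driver sharing decrypted KEKs.
kek_cache = {}
recovered = [read_dek(kms, "mk1", meta, kek_cache) for _, meta in files]
print("KMS calls after write:", calls_after_write, "after read:", kms.calls)
```

In this sketch the per-file work (generating and wrapping a DEK) never touches the KMS. The only KMS requests are one wrap when the driver creates the KEK and one unwrap when the KEK is first needed on the read path.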


