Posted to common-issues@hadoop.apache.org by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org> on 2014/04/11 08:05:29 UTC

[jira] [Comment Edited] (HADOOP-10150) Hadoop cryptographic file system

    [ https://issues.apache.org/jira/browse/HADOOP-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966140#comment-13966140 ] 

Uma Maheswara Rao G edited comment on HADOOP-10150 at 4/11/14 6:03 AM:
-----------------------------------------------------------------------

Todd, thanks for your comments.

{quote}A few questions here...
First, let me confirm my understanding of the key structure and storage:
Client master key: this lives on the Key Management Server, and might be different from application to application. {quote} 
Yes.

{quote}In many cases there may be just one per cluster, though in a multitenant cluster, perhaps we could have one per tenant.{quote} 
It depends on the KeyProvider implementation; these kinds of details can be encapsulated in the KeyProvider implementation, which could be pluggable in CFS. Thus, users can choose their own strategy for deploying one master key or multiple master keys, per application, per user group, etc.
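To illustrate the pluggability described above, here is a minimal sketch (illustrative names only, not the actual Hadoop KeyProvider API) of how different master-key deployment strategies could hide behind a single interface that CFS consumes:

```python
from abc import ABC, abstractmethod

class KeyProvider(ABC):
    """CFS only sees this interface; the deployment strategy is hidden behind it."""
    @abstractmethod
    def get_master_key(self, key_id: str) -> bytes: ...

class SingleKeyProvider(KeyProvider):
    """One master key for the whole cluster."""
    def __init__(self, key: bytes):
        self._key = key
    def get_master_key(self, key_id: str) -> bytes:
        return self._key

class PerTenantKeyProvider(KeyProvider):
    """One master key per tenant (e.g. per application or user group)."""
    def __init__(self, keys: dict):
        self._keys = keys  # tenant id -> master key
    def get_master_key(self, key_id: str) -> bytes:
        return self._keys[key_id]
```

A deployment would wire in whichever implementation matches its policy; CFS itself never needs to know how many master keys exist.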


{quote}Data key: this is set per encrypted directory. This key is stored in the directory xattr on the NN, but encrypted by the client master key (which the NN doesn't know).{quote} 
Yes.

{quote}So, when a client wants to read a file, the following is the process:
  1) Notices that the file is in an encrypted directory. Fetches the encrypted data key from the NN's xattr on the directory.
  2) Somehow associates this encrypted data key with the master key that was used to encrypt it (perhaps it's tagged with some identifier). Fetches the appropriate master key from the key store.
  2a) The keystore somehow authenticates and authorizes the client's access to this key
  3) The client decrypts the data key using the master key, and is now able to set up a decrypting stream for the file itself. (I've ignored the IV here, but assume it's also stored in an xattr) {quote} 
Yes.
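The three steps above can be sketched end to end. This is a toy illustration only: XOR stands in for a real cipher such as AES-CTR, the NN xattrs and the KMS are plain dicts, and all names are invented for the example:

```python
def xor(data: bytes, key: bytes) -> bytes:
    """Toy stream 'cipher' standing in for AES-CTR."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def read_file(nn_xattrs: dict, kms: dict, path: str, ciphertext: bytes) -> bytes:
    # 1) Fetch the encrypted (wrapped) data key, its master-key tag,
    #    and the IV from the directory xattr on the NN.
    wrapped_key, master_key_id, iv = nn_xattrs[path]
    # 2/2a) Resolve the tagged master key from the key store; a real KMS
    #       would authenticate and authorize this access.
    master_key = kms[master_key_id]
    # 3) Unwrap the data key locally, then decrypt the stream with it and the IV.
    data_key = xor(wrapped_key, master_key)
    return xor(ciphertext, xor(data_key, iv))
```

Note that the NN only ever stores the wrapped form of the data key; the unwrap in step 3 happens on the client, which is the whole point of the design.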


{quote}In terms of attack vectors:
  let's say that the NN disk is stolen. The thief now has access to a bunch of keys, but they're all encrypted by various master keys. So we're OK.{quote} 
Yes.

{quote}let's say that a client is malicious. It can get whichever master keys it has access to from the KMS. If we only have one master key per cluster, then the combination of one malicious client plus stealing the fsimage will give up all the keys{quote} 
When a client gets access to both the master key and the fsimage, there is nothing we can do to protect that data. The separation of the data encryption key and the master key exists to support master key rotation, so that one does not need to decrypt every data file and re-encrypt it with a new encryption key; only the wrapped data keys need to be re-encrypted.
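The rotation argument can be made concrete: rotating the master key only rewraps the small per-directory data key, leaving the encrypted file bytes untouched. A toy sketch (XOR stands in for a real key-wrapping cipher; names are illustrative):

```python
def xor(data: bytes, key: bytes) -> bytes:
    """Toy cipher standing in for real key wrapping (e.g. AES key wrap)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def rotate_master_key(wrapped_data_key: bytes,
                      old_master: bytes, new_master: bytes) -> bytes:
    """Unwrap with the old master key, rewrap with the new one.
    The data key itself -- and therefore every file encrypted under it --
    is unchanged, so no bulk re-encryption of file data is needed."""
    data_key = xor(wrapped_data_key, old_master)
    return xor(data_key, new_master)
```

Rotation thus touches only a few bytes of xattr metadata per encrypted directory, regardless of how much file data sits beneath it.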

{quote}let's say that a client has escalated to root access on one of the slave nodes in the cluster, or otherwise has malicious access to a NodeManager process. By looking at a running MR task, it could steal whatever credentials the task is using to access the KMS, and/or dump the memory of the client process in order to give up the master key above.{quote} 
When a client has root access, all information can be dumped from any process, right? I remember Nicholas asked a similar question on HDFS-6134. If a client has escalated to root access on the slave nodes, how can we assume the namenode and the standby/secondary namenode in the same cluster are secure? On the other hand, as long as the data keys remain in encrypted form in the process memory of the NameNode and DataNodes, and those processes don't have access to the wrapping keys, there is no attack vector there.

{quote}How does the MR task in this context get the credentials to fetch keys from the KMS? If the KMS accepts the same authentication tokens as the NameNode, then is there any reason that this is more secure than having the NameNode supply the keys? Or is it just that decoupling the NameNode and the key server allows this approach to work for non-HDFS filesystems, at the expense of an additional daemon running a key distribution service?{quote}
It is a good question. Securely distributing secrets among the cluster nodes, as you mentioned, will always be a hard problem to solve. Without adequate hardware support, operations like unwrapping a key could be a weak point. We want to leave these options to the KeyProvider implementation, so as to decouple the key protection mechanism from the data encryption mechanism, and to make both work on top of any filesystem. It is possible to have a KeyProvider implementation that uses the NN as the KMS, as we already discussed, while leaving room for other parties to plug in their own solution.


was (Author: hitliuyi):
(previous revision of the comment omitted; it differed only in minor wording)

> Hadoop cryptographic file system
> --------------------------------
>
>                 Key: HADOOP-10150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10150
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: security
>    Affects Versions: 3.0.0
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>              Labels: rhino
>             Fix For: 3.0.0
>
>         Attachments: CryptographicFileSystem.patch, HADOOP cryptographic file system-V2.docx, HADOOP cryptographic file system.pdf, cfs.patch, extended information based on INode feature.patch
>
>
> There is an increasing need for securing data when Hadoop customers use various upper layer applications, such as Map-Reduce, Hive, Pig, HBase and so on.
> HADOOP CFS (HADOOP Cryptographic File System) is used to secure data, based on HADOOP “FilterFileSystem” decorating DFS or other file systems, and transparent to upper layer applications. It’s configurable, scalable and fast.
> High level requirements:
> 1.	Transparent to and no modification required for upper layer applications.
> 2.	“Seek”, “PositionedReadable” are supported for input stream of CFS if the wrapped file system supports them.
> 3.	Very high performance for encryption and decryption, they will not become bottleneck.
> 4.	Can decorate HDFS and all other file systems in Hadoop, and will not modify existing structure of file system, such as namenode and datanode structure if the wrapped file system is HDFS.
> 5.	Admin can configure encryption policies, such as which directory will be encrypted.
> 6.	A robust key management framework.
> 7.	Support Pread and append operations if the wrapped file system supports them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)