Posted to mapreduce-issues@hadoop.apache.org by "Benoy Antony (JIRA)" <ji...@apache.org> on 2012/07/27 03:25:34 UTC

[jira] [Created] (MAPREDUCE-4491) Encryption and Key Protection

Benoy Antony created MAPREDUCE-4491:
---------------------------------------

             Summary: Encryption and Key Protection
                 Key: MAPREDUCE-4491
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Components: documentation, security, task-controller, tasktracker
            Reporter: Benoy Antony
            Assignee: Benoy Antony


When dealing with sensitive data, the data must be kept encrypted wherever it is stored. A common use case is to pull encrypted data out of a datasource and store it in HDFS for analysis. The keys are stored in an external secure keystore machine. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, reading keys from keystores, and transporting keys from JobClient to Tasks.
The feature adds PGP encryption as a codec and additional utilities to perform encryption-related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this against 1.1 and will upload it soon as an initial version for further refinement. 








--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425320#comment-13425320 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

To Alejandro's questions:

1) If using compression codec for encryption, are you losing the compression capabilities when using encryption, or will it work as a composition?
What I have done is to compress first and then encrypt. The compression is currently hardcoded to ZIP. I can expose it as a configuration option with a choice of {UNCOMPRESSED, ZIP, ZLIB, BZIP2}; this is an enhancement that I can add.
I have also provided a DistributedSplitter so that files can be split into smaller files.
I am not aware of a way to chain multiple compression codecs, though it would have been a desirable capability in this case. 
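
The compress-then-encrypt ordering can be illustrated with a small stand-alone sketch. This is not the PGPCodec from the patch: it assumes plain JDK classes only, swapping in DEFLATE (java.util.zip) for the ZIP step and AES-GCM (javax.crypto) for PGP, and the class and method names are hypothetical. The order matters because ciphertext looks random and compresses poorly, so compression has to happen before encryption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: DEFLATE + AES-GCM stand in for the ZIP + PGP steps of the patch.
public class CompressThenEncryptSketch {
    private static final int IV_LEN = 12; // bytes, stored in front of the ciphertext

    // Write path: plaintext -> deflate -> encrypt -> sink.
    static byte[] compressThenEncrypt(byte[] plain, byte[] rawKey) {
        try {
            byte[] iv = new byte[IV_LEN];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rawKey, "AES"),
                    new GCMParameterSpec(128, iv));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            sink.write(iv);
            // The outer stream compresses; the inner stream encrypts the compressed bytes.
            try (OutputStream out = new DeflaterOutputStream(new CipherOutputStream(sink, cipher))) {
                out.write(plain);
            }
            return sink.toByteArray();
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Read path: stored bytes -> decrypt -> inflate -> plaintext.
    static byte[] decryptThenDecompress(byte[] stored, byte[] rawKey) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(rawKey, "AES"),
                    new GCMParameterSpec(128, stored, 0, IV_LEN));
            InputStream body = new ByteArrayInputStream(stored, IV_LEN, stored.length - IV_LEN);
            try (InputStream in = new InflaterInputStream(new CipherInputStream(body, cipher))) {
                return in.readAllBytes(); // requires Java 9 or later
            }
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        byte[] data = new byte[10_000]; // highly compressible input
        byte[] stored = compressThenEncrypt(data, key);
        System.out.println("stored bytes: " + stored.length);
        System.out.println("round trip ok: " + Arrays.equals(data, decryptThenDecompress(stored, key)));
    }
}
```

Making the compression step selectable, as proposed above, would amount to choosing the outer stream from configuration while leaving the encrypting stream unchanged.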

2) For the keystores, are you proposing to store them in HDFS and use file system permissions to protect them?

Actually, I am not proposing to store them in HDFS. The keystores themselves are encrypted and a password is required to read keys from them. 

In the use cases that I have encountered, the keystores were external to the cluster. They were either on the CLI machine from which the jobs were submitted or on a separate machine from which the keys were retrieved based on the user's credentials. (Alfredo was used in this regard to fetch keys via a web service.)
So there are two schemes that I have supported:
  1) reading keys from a Java keystore
  2) reading keys from a web service based keystore ("Safe")
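
For the first scheme, reading a secret key from a password-protected Java keystore could look roughly like the sketch below. It uses only the standard java.security.KeyStore API with the JCEKS store type; the class name, alias, and password are made up for illustration, and it does not reproduce the keystore framework in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.Arrays;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of scheme 1: a secret key kept in a password-protected Java keystore.
public class KeyStoreReadSketch {

    // Builds an in-memory JCEKS keystore holding one AES key (stand-in for a keystore file).
    static byte[] createKeyStore(String alias, byte[] rawKey, char[] password) {
        try {
            KeyStore ks = KeyStore.getInstance("JCEKS");
            ks.load(null, password); // initialise an empty keystore
            ks.setEntry(alias,
                    new KeyStore.SecretKeyEntry(new SecretKeySpec(rawKey, "AES")),
                    new KeyStore.PasswordProtection(password));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ks.store(out, password);
            return out.toByteArray();
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the key back; the load fails if the password does not match.
    static byte[] readKey(byte[] keyStoreBytes, String alias, char[] password) {
        try {
            KeyStore ks = KeyStore.getInstance("JCEKS");
            ks.load(new ByteArrayInputStream(keyStoreBytes), password);
            return ks.getKey(alias, password).getEncoded();
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        char[] password = "changeit".toCharArray(); // illustrative password only
        byte[] rawKey = new byte[16];
        byte[] store = createKeyStore("data-key", rawKey, password);
        System.out.println("key recovered: " + Arrays.equals(rawKey, readKey(store, "data-key", password)));
    }
}
```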




                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 


[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442987#comment-13442987 ] 

Konstantin Shvachko commented on MAPREDUCE-4491:
------------------------------------------------

I edited my previous comment. The suggested prefix was crypto.* and is now hadoop.crypto.*,
similar to hadoop.security.
                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement.
> Update: The patches are uploaded to subtasks. 


[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Attachment:     (was: MR_4491_trunk.patch)
    
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 


[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Attachment: Hadoop_Encryption.pdf
    
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 


[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442982#comment-13442982 ] 

Konstantin Shvachko commented on MAPREDUCE-4491:
------------------------------------------------

Benoy, I went over your design document. It is a pretty comprehensive description. 
I want to clarify a couple of things. 
# Do I understand correctly that your approach can be used to securely store (encrypt) data even on non-secure (security=simple) clusters?
# So JobClient uses the current user's credentials to obtain keys from the KeyStore, encrypts them with the cluster-public-key and sends them to the cluster along with the user credentials. JobTracker has nothing to do with the keys and passes the encrypted blob over to the TaskTrackers scheduled to execute the tasks. The TT decrypts the user keys using the cluster-private-key and hands them to the local tasks, which is secure as the keys don't travel over the wire. Is it right so far?
# Should the TT be using user credentials to decrypt the blob of keys somehow? Or does it authenticate the user and then decrypt if authentication passes? I did not find this in your document.
# How is the cluster-private-key delivered to TTs?
# I think the configuration parameter naming needs some changes. The names should not start with {{mapreduce.job}}. Based on your examples you can just encrypt an HDFS file without spawning any actual jobs. In this case seeing {{mapreduce.job.*}} is confusing.
My suggestion is to prefix all parameters with simply {{crypto.*}}. Then you can use e.g. the full word "keystore" instead of "ks".

I plan to get into reviewing the implementation soon.
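
The key hand-off described in point 2 might be sketched as follows. The design document's actual scheme is not reproduced in this thread, so this assumes RSA-OAEP key wrapping from the standard javax.crypto API, and all class and method names are hypothetical: the client wraps the user's data key with the cluster public key, and the TaskTracker side unwraps it with the cluster private key.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of the JobClient -> TaskTracker key hand-off using RSA-OAEP wrapping.
public class ClusterKeyWrapSketch {
    // Assumed algorithm choice; the design document may specify something different.
    private static final String WRAP_ALGORITHM = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    // Stand-in for the cluster key pair, which a real deployment would provision separately.
    static KeyPair newClusterKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // JobClient side: encrypt (wrap) the user's key with the cluster public key.
    static byte[] wrapForCluster(SecretKey userKey, PublicKey clusterPublicKey) {
        try {
            Cipher cipher = Cipher.getInstance(WRAP_ALGORITHM);
            cipher.init(Cipher.WRAP_MODE, clusterPublicKey);
            return cipher.wrap(userKey);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // TaskTracker side: recover the user's key with the cluster private key.
    static SecretKey unwrapOnTaskTracker(byte[] blob, PrivateKey clusterPrivateKey) {
        try {
            Cipher cipher = Cipher.getInstance(WRAP_ALGORITHM);
            cipher.init(Cipher.UNWRAP_MODE, clusterPrivateKey);
            return (SecretKey) cipher.unwrap(blob, "AES", Cipher.SECRET_KEY);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair cluster = newClusterKeyPair();
        byte[] raw = new byte[16];
        new SecureRandom().nextBytes(raw);
        SecretKey userKey = new SecretKeySpec(raw, "AES");
        byte[] blob = wrapForCluster(userKey, cluster.getPublic()); // opaque to the JobTracker
        SecretKey recovered = unwrapOnTaskTracker(blob, cluster.getPrivate());
        System.out.println("key recovered: " + Arrays.equals(raw, recovered.getEncoded()));
    }
}
```

In this sketch the JobTracker would only ever see the wrapped blob. How the private key reaches the TaskTrackers (point 4) is outside what the sketch covers.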
                

[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Attachment: Hadoop_Encryption.pdf
                MR_4491_1.1.patch
                MR_4491_trunk.patch

Attaching the initial patches for trunk and branch-1.1. Please review and let me know your comments. 

I also made minor updates to the design document.

One of the test cases in the patch depends on a test class which will be part of another jira (yet to be filed due to the ASF Jira problem).
                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf, MR_4491_1.1.patch, MR_4491_trunk.patch
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 


[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Attachment:     (was: MR_4491_1.1.patch)
    

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433237#comment-13433237 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

To make reviewing this patch easier, I am dividing it into smaller patches. I am opening subtasks under this jira issue and attaching the patches to those jiras.
                

[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Jerry Chen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerry Chen updated MAPREDUCE-4491:
----------------------------------

    Attachment: crypto_abstractions.zip

Proposed interfaces and structures for crypto abstraction for Hadoop Core and Map Reduce layer.
                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: crypto_abstractions.zip, Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement.
> Update: The patches are uploaded to subtasks. 


[jira] [Comment Edited] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442982#comment-13442982 ] 

Konstantin Shvachko edited comment on MAPREDUCE-4491 at 8/28/12 5:25 PM:
-------------------------------------------------------------------------

Benoy, I went over your design document. It is a pretty comprehensive description. 
I want to clarify a couple of things. 
# Do I understand correctly that your approach can be used to securely store (encrypt) data even on non-secure (security=simple) clusters?
# So JobClient uses the current user's credentials to obtain keys from the KeyStore, encrypts them with the cluster-public-key and sends them to the cluster along with the user credentials. JobTracker has nothing to do with the keys and passes the encrypted blob over to the TaskTrackers scheduled to execute the tasks. The TT decrypts the user keys using the cluster-private-key and hands them to the local tasks, which is secure as the keys don't travel over the wire. Is it right so far?
# Should the TT be using user credentials to decrypt the blob of keys somehow? Or does it authenticate the user and then decrypt if authentication passes? I did not find this in your document.
# How is the cluster-private-key delivered to TTs?
# I think the configuration parameter naming needs some changes. The names should not start with {{mapreduce.job}}. Based on your examples you can just encrypt an HDFS file without spawning any actual jobs. In this case seeing {{mapreduce.job.*}} is confusing.
My suggestion is to prefix all parameters with simply {{hadoop.crypto.*}}. Then you can use e.g. the full word "keystore" instead of "ks".

I plan to get into reviewing the implementation soon.
                
      was (Author: shv):
    Benoy, I went over your design document. It is a pretty comprehensive description. 
I want to clarify a couple of things. 
# Do I understand correctly that your approach can be used to securely store (encrypt) data even on non-secure (security=simple) clusters?
# So JobClient uses the current user's credentials to obtain keys from the KeyStore, encrypts them with the cluster-public-key and sends them to the cluster along with the user credentials. JobTracker has nothing to do with the keys and passes the encrypted blob over to the TaskTrackers scheduled to execute the tasks. The TT decrypts the user keys using the cluster-private-key and hands them to the local tasks, which is secure as the keys don't travel over the wire. Is it right so far?
# Should the TT be using user credentials to decrypt the blob of keys somehow? Or does it authenticate the user and then decrypt if authentication passes? I did not find this in your document.
# How is the cluster-private-key delivered to TTs?
# I think the configuration parameter naming needs some changes. The names should not start with {{mapreduce.job}}. Based on your examples you can just encrypt an HDFS file without spawning any actual jobs. In this case seeing {{mapreduce.job.*}} is confusing.
My suggestion is to prefix all parameters with simply {{crypto.*}}. Then you can use e.g. the full word "keystore" instead of "ks".

I plan to get into reviewing the implementation soon.
                  

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478230#comment-13478230 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

+1. I agree. A more generic framework is useful in addressing encryption in components other than MR. Let us work on it together.
                

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425286#comment-13425286 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

To Rob's questions:

Different encryption keys for different files: At this point, the PGPCodec supports only one secret key/key pair for all input files. 
What we need is the ability to specify a secret key/key pair per input file. 
Another enhancement would be to specify a secret key/key pair per phase, such as map->output and reduce->output.
As you mentioned, this mapping has to be specified via configuration.
I'll try to add these two enhancements. 
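
One possible shape for such a per-file mapping is sketched below. The mapping syntax (comma-separated path-prefix=key-alias pairs) and every name in it are assumptions made for illustration, not configuration that exists in the patch. The lookup picks the longest matching path prefix and falls back to a default alias, which would correspond to today's single-key behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: resolve which key alias applies to an input file from a config string.
public class PerFileKeySketch {

    // Parses "prefix=alias,prefix=alias" into an ordered map; malformed entries are skipped.
    static Map<String, String> parseMapping(String spec) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split("=", 2);
            if (parts.length == 2 && !parts[0].trim().isEmpty()) {
                mapping.put(parts[0].trim(), parts[1].trim());
            }
        }
        return mapping;
    }

    // Returns the alias of the longest path prefix that matches, or the default alias.
    static String keyAliasFor(String path, Map<String, String> mapping, String defaultAlias) {
        String best = null;
        for (String prefix : mapping.keySet()) {
            if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null ? defaultAlias : mapping.get(best);
    }

    public static void main(String[] args) {
        // Illustrative value; the property that would carry it is not defined in this thread.
        Map<String, String> mapping = parseMapping("/data/hr=hr-key, /data/hr/payroll=payroll-key");
        System.out.println(keyAliasFor("/data/hr/payroll/2012.pgp", mapping, "default-key"));
        System.out.println(keyAliasFor("/data/sales/q1.pgp", mapping, "default-key"));
    }
}
```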

Decryption/Encryption of different columns within the same file: This is actually left to the MapReduce programmer, who has to do the decryption/encryption of the fields programmatically. The programmer can choose to use different keys for different fields in the MapReduce program. Multiple keys can be retrieved from the keystore, and these keys can be accessed in the mapper/reducer using the credentials API.
In a higher-level interface like Hive, it may be possible to add metadata to specify the key name. Another reviewer has also recommended adding this capability to Hive, to identify an encrypted field and specify the key (the name of the key) to be used to decrypt/encrypt it.
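
As a rough illustration of field-level encryption with a different key per field, the sketch below encrypts two columns of a comma-separated record independently. It assumes AES-GCM from the standard javax.crypto API rather than PGP, takes the raw key bytes as parameters instead of fetching them through the credentials API, and uses hypothetical class and method names.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: each sensitive field is encrypted on its own, with its own key.
public class FieldCryptoSketch {
    private static final int IV_LEN = 12; // bytes, prepended to each encrypted field

    // Returns Base64(iv || ciphertext), which contains no commas and so fits in a CSV field.
    static String encryptField(String value, byte[] rawKey) {
        try {
            byte[] iv = new byte[IV_LEN];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rawKey, "AES"),
                    new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
            byte[] out = Arrays.copyOf(iv, IV_LEN + ciphertext.length);
            System.arraycopy(ciphertext, 0, out, IV_LEN, ciphertext.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encryptField; fails if the key is not the one used for encryption.
    static String decryptField(String token, byte[] rawKey) {
        try {
            byte[] in = Base64.getDecoder().decode(token);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(rawKey, "AES"),
                    new GCMParameterSpec(128, in, 0, IV_LEN));
            byte[] plain = cipher.doFinal(in, IV_LEN, in.length - IV_LEN);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In a real job these keys would come from the keystore; random keys stand in here.
        byte[] ssnKey = new byte[16];
        byte[] cardKey = new byte[16];
        new SecureRandom().nextBytes(ssnKey);
        new SecureRandom().nextBytes(cardKey);

        String[] fields = "alice,123-45-6789,4111111111111111".split(",");
        fields[1] = encryptField(fields[1], ssnKey);  // one key for the SSN column
        fields[2] = encryptField(fields[2], cardKey); // another key for the card column
        System.out.println(String.join(",", fields));
        System.out.println(decryptField(fields[1], ssnKey));
    }
}
```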

Thanks for the review and recommendations, Rob. Please let me know if I have not answered the question correctly.
                

[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Description: 
When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
The feature adds PGP encryption as codec and additional utilities to perform encryption related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 








  was:
When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external secure keystore machine. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
The feature adds PGP encryption as codec and additional utilities to perform encryption related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 








    

        

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449260#comment-13449260 ] 

Aaron T. Myers commented on MAPREDUCE-4491:
-------------------------------------------

bq. This is an important point as we do not want Tasktracker to decrypt the blob of keys and blindly hand over to Tasks. The JobClient stores JobId along with keys as part of the encrypted blob. The taskTracker decrypts the encrypted blob, verifies that the JobId in the encrypted blob matches JobId of the task. The keys are handed over to Tasks only if the JobId verification is successful. This ensures that keys are handed over to the correct tasks.

Unless I'm missing something, this seems to be insecure unless secure authentication (i.e. Kerberos) is enabled, since someone could connect to the TT from a different task and simply report a different JobId. Or do I misunderstand somehow?
                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement.
> Update: The patches are uploaded to subtasks. 


[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425244#comment-13425244 ] 

Alejandro Abdelnur commented on MAPREDUCE-4491:
-----------------------------------------------

Benoy, I've done a quick read of the doc. A couple of initial questions:

* If a compression codec is used for encryption, do you lose the compression capabilities when using encryption, or will it work as a composition?
* For the keystores, are you proposing to store them in HDFS and use file system permissions to protect them? I'm not sure if I understood this part correctly. If that is the case, then HDFS-3637 would ensure secure transfer.

I'll read the design doc in more detail later this week.


                

        

[jira] [Work started] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAPREDUCE-4491 started by Benoy Antony.


        

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Rob Weltman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424216#comment-13424216 ] 

Rob Weltman commented on MAPREDUCE-4491:
----------------------------------------

If you want to use different encryption keys for different files (or even for different columns within the same file), how do you identify the right key from the Safe or Keystore, i.e. where is the mapping maintained? Would that be an additional layer on top of this?

                

        

[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Description: 
When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement.

Update: The patches are uploaded to subtasks. 








  was:
When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 








    

        

[jira] [Updated] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoy Antony updated MAPREDUCE-4491:
------------------------------------

    Description: 
When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 








  was:
When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 

The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
The feature adds PGP encryption as codec and additional utilities to perform encryption related steps.


The design document is attached. It explains the requirement, design and use cases.
Kindly review and comment. Collaboration is very much welcome.

I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement. 








    

        

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451050#comment-13451050 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

Key Protection is simple to explain.
The JobClient retrieves keys from a configured keystore, encrypts the keys along with the JobId using the cluster public key, and submits the encrypted blob as part of the job credentials.
The TaskTracker decrypts the encrypted blob using the cluster private key during job localization and verifies that the JobId inside the encrypted blob matches the JobId of the task. During task launch, the keys are made available to the child (task) process as an environment variable.

Since the JobId is part of the encrypted blob, a replay attack is prevented by the JobId verification. It is easy to add integrity protection as well.
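
The flow above can be sketched roughly as follows. This is a simplified illustration and not the code in the patch: the class KeyBlob, its method names, and the blob layout are hypothetical, and plain RSA-OAEP is used only because the example payload is tiny.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import javax.crypto.Cipher;

// Hypothetical sketch of the key-protection blob: JobId plus one secret key,
// sealed with the cluster public key and opened with the cluster private key.
final class KeyBlob {
    private static final String TRANSFORM = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    // Stand-in for the cluster key pair, which a real cluster would provision out of band.
    static KeyPair newClusterKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not generate cluster key pair", e);
        }
    }

    // JobClient side: bind the secret key to the JobId, then encrypt both together.
    static byte[] seal(String jobId, byte[] secretKey, PublicKey clusterPublicKey) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(jobId);
            out.writeInt(secretKey.length);
            out.write(secretKey);
            out.flush();
            Cipher cipher = Cipher.getInstance(TRANSFORM);
            cipher.init(Cipher.ENCRYPT_MODE, clusterPublicKey);
            return cipher.doFinal(buf.toByteArray());
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("could not seal key blob", e);
        }
    }

    // TaskTracker side: decrypt, then release the key only if the JobId matches the task's job.
    static byte[] open(byte[] blob, String taskJobId, PrivateKey clusterPrivateKey) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORM);
            cipher.init(Cipher.DECRYPT_MODE, clusterPrivateKey);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(cipher.doFinal(blob)));
            String sealedJobId = in.readUTF();
            if (!sealedJobId.equals(taskJobId)) {
                throw new SecurityException("blob was sealed for " + sealedJobId + ", not " + taskJobId);
            }
            byte[] secretKey = new byte[in.readInt()];
            in.readFully(secretKey);
            return secretKey;
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("could not open key blob", e);
        }
    }
}
```

Because RSA-OAEP with a 2048-bit key can only carry a payload of under 200 bytes, a blob holding several keys would more likely be encrypted with a fresh symmetric key that is itself wrapped with the cluster public key. The JobId check is the same either way.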

Now, the scheme was designed to be used in a secure cluster. It is worth exploring whether it can be used in a non-secure cluster.

One issue is with the cluster private key. It should be accessible only to the TaskTracker process. If access is determined by the user's permissions, then tasks should be run as a different user. But it need not be the job owner; it can be a fixed user.

I believe you are bringing up another issue in this regard.
If a rogue task can make a TT launch another rogue task with a JobId matching the one inside the encrypted blob, then the keys are available to the newly launched rogue task.
That's a good point. Basically the rogue task is acting as a JT/AppMaster. I am not sure whether that is possible. Even if it's possible, there should be ways to detect it.




                

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433401#comment-13433401 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

One of the goals of this feature is to achieve encryption of files in transit and at rest (when stored on disk). One way to achieve this goal is to depend on software/hardware that provides encryption in the local file system, plus rely on HDFS-3637 and MR shuffle encryption.

This jira explores an alternative approach to the problem without depending on special software to do local file system encryption.

The key advantages of this approach over the local file system encryption approach are

1) A file can be decrypted only if the user provides the correct key. Even if someone manages to read the file, he cannot read its contents without the key. The user's possession of the key is required in addition to his read permission, so there are two levels of protection.

There could be cases where a user accidentally sets "read" permissions for everyone, or where a superuser reads the file. Even then, this scheme protects the data.

2) No dependency on local file system encryption software. This approach allows encryption without such special setup.

3) A file is decrypted/encrypted only during processing and not when it is read. This results in fewer encryption/decryption operations.


Other key points:

1) Encrypted and plain text files can coexist in a normal file system.

2) Developers can plug in other encryption algorithms/standards (CMS, AES, custom encryption) and thus have more flexibility.

3) Allows transporting keys/passwords/tokens from JobClient to tasks for use cases other than encryption, like connecting to a web service. MAPREDUCE-4491 adds key protection, and encryption uses it.

4) Keys can be managed in one central location. JobClient gets them on behalf of the user like any other application.

If we look at these two approaches from a higher level, we can see that the local file system approach is an internal approach to encryption and the MAPREDUCE-4491 approach is an external approach. These two choices are also available in normal (non-distributed) application development, where developers can rely on the file system to provide encryption or do the encryption themselves. There are tradeoffs and flexibilities in both approaches, and we choose based on our use cases and needs. So I believe we should provide these two alternatives in Hadoop.

In addition, this feature allows key protection in general, which can be used for purposes other than encryption. The keys will also be encrypted when stored on disk and decrypted only in memory.

                

       

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Jerry Chen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477634#comment-13477634 ] 

Jerry Chen commented on MAPREDUCE-4491:
---------------------------------------

Hi Benoy,
I am Haifeng from Intel, and we were discussing this feature offline. I really appreciate your initiation of this work. We also see the importance of encryption and decryption in Hadoop when we are dealing with sensitive data.

Just as you pointed out, the functional requirements are more or less the same. For the Hadoop community, we wish to get a high-level abstraction that provides a foundation for these requirements in different Hadoop components (such as HDFS, MapReduce, HBase) while enabling different implementations, such as different encryption algorithms or different ways of key management by different parties/companies, so that the concept is not bound to a specific implementation. Just as we discussed offline, the driving forces for such an abstraction are summarized as follows:

1. Encryption and decryption need to be supported in different components and usage models. For example, we may use the HDFS client API and a codec directly to encrypt and decrypt an HDFS file; we may use MapReduce to process an encrypted file and output an encrypted file; and HBase may need to store its files (such as HFiles) in an encrypted way.

2. The community may have different implementations of encryption codecs and different ways of providing keys. CompressionCodec provides us a foundation for related work. But CompressionCodec is not enough for encryption and decryption, because CompressionCodec assumes it is initialized from the Hadoop Configuration, while encryption/decryption may need a per-file crypto context such as the key. With a crypto abstraction layer, we can share common features such as "Provide different keys for different input files of a MapReduce job.", rather than having each implementation go its own way in the MapReduce core and finally turn into a mess.

Based on these driving forces, your work done and our offline discussions, we refined our work and would like to propose the following,

1. For Hadoop common, a new CryptoCodec interface which extends CompressionCodec, which adding the methods of getCryptoContext/setCryptoContext. Just as CompressionCodec, it will initialize its global settings from Configuration. But CryptoCodec will receive its crypto context (the Key, for example) through CryptoContext object setting by setCryptoContext, allowing different usage cases such as "direct use CryptoCodec to encrypt/decrypt a HDFS file by direct providing the CryptoContext(Key)" or "Map Reduce way of using CryptoCodec that a CryptoContext(Key) is choosed per file based on some policy".

Any specific crypto implementation are under this umbrella and will implement CryptoCodec. The PGPCodec is pretty good fit into a implementation of CryptoCodec. And we also are able to implements our splittable CryptoCodec.

2. For MapReduce, use CryptoContextProvider interface to abstract implementation specific service and allowing the MapReduce core is able to written shared code of retrieveing the CryptoContext of a specific file from a CryptoContextProvider and pass to the CryptoCodec in using. Different CryptoContextProvider implementations can implement different ways of deciding the CryptoContext and different ways of retrieving Keys from different Key Stores. We can provide basic and common implementations of CryptoContextProviders such as "A CryptoContextProvider provides CryptoContext for a file by regular expression matching the file path and get the key from a java KeyStore" while not preventing users to implement or extends their own if existing implementation doesn't satisfy their requirements.

CryptoContextProvider configurations are passed through the Hadoop job configuration and the job Credentials (as credential secret keys), and the implementation of CryptoContextProvider can choose whether or not to encrypt the secret keys stored in the job Credentials.
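
The regular-expression example above could look roughly like the following sketch. It is an illustration under assumptions, not the attached code: the keyFor method, the RegexKeyStoreProvider class, the rule syntax, and the key aliases are all made up, and for brevity the provider returns the Key directly instead of wrapping it in a CryptoContext. Only the java.security.KeyStore calls are real JDK API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyStore;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical provider contract: decide which key applies to a file. */
interface CryptoContextProvider {
    Key keyFor(String filePath);
}

/** Picks a key alias by matching the file path, then reads that alias from a KeyStore. */
final class RegexKeyStoreProvider implements CryptoContextProvider {
    private static final class Rule {
        final Pattern pathPattern;
        final String keyAlias;
        Rule(Pattern pathPattern, String keyAlias) {
            this.pathPattern = pathPattern;
            this.keyAlias = keyAlias;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final KeyStore keyStore;
    private final char[] password;

    RegexKeyStoreProvider(KeyStore keyStore, char[] password) {
        this.keyStore = keyStore;
        this.password = password.clone();
    }

    RegexKeyStoreProvider addRule(String pathRegex, String keyAlias) {
        rules.add(new Rule(Pattern.compile(pathRegex), keyAlias));
        return this;
    }

    /** Returns the alias of the first rule whose regex matches the whole path, or null. */
    String aliasFor(String filePath) {
        for (Rule rule : rules) {
            if (rule.pathPattern.matcher(filePath).matches()) return rule.keyAlias;
        }
        return null;
    }

    @Override public Key keyFor(String filePath) {
        String alias = aliasFor(filePath);
        if (alias == null) throw new IllegalArgumentException("No key rule matches " + filePath);
        try {
            Key key = keyStore.getKey(alias, password);
            if (key == null) throw new IllegalStateException("No key under alias " + alias);
            return key;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Cannot read key " + alias, e);
        }
    }
}

final class ProviderSketch {
    /** Builds an in-memory JCEKS keystore holding two demo AES keys. */
    static KeyStore demoKeyStore(char[] password) {
        try {
            KeyStore ks = KeyStore.getInstance("JCEKS");
            ks.load(null, password); // null stream: start from an empty keystore
            ks.setKeyEntry("sales-key", aes("0123456789abcdef"), password, null);
            ks.setKeyEntry("hr-key", aes("fedcba9876543210"), password, null);
            return ks;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static Key aes(String sixteenChars) {
        return new SecretKeySpec(sixteenChars.getBytes(StandardCharsets.US_ASCII), "AES");
    }

    static CryptoContextProvider demoProvider() {
        char[] password = "changeit".toCharArray();
        return new RegexKeyStoreProvider(demoKeyStore(password), password)
            .addRule("/data/sales/.*", "sales-key")
            .addRule("/data/hr/.*", "hr-key");
    }

    public static void main(String[] args) {
        Key key = demoProvider().keyFor("/data/sales/2012/part-00000");
        System.out.println(key.getAlgorithm() + " key, " + key.getEncoded().length + " bytes");
    }
}
```

Rules are checked in insertion order and the first match wins, so more specific patterns would need to be added before general ones. A path that matches no rule is rejected instead of falling back to a default key.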

I attached the Java files of these interfaces and basic structures in the Attachments section to demonstrate the concepts, and I hope we can have a design document for these high-level points once we have had enough discussion and come to an agreement.

Again, thanks for your patience and time.

                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS for analysis. The keys are stored in an external keystore. 
> The feature adds a customizable framework to integrate different types of keystores, support for Java KeyStore, read keys from keystores, and transport keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for further refinement.
> Update: The patches are uploaded to subtasks. 


[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Benoy Antony (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443864#comment-13443864 ] 

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

1.	Do I understand correctly that your approach can be used to securely store (encrypt) data even on non-secure (security=simple) clusters?   
        
You are right. If the TaskTracker and Task processes are owned by different users, then this approach can be used to encrypt/decrypt data in a non-secure cluster. This does not require each task to run as the job owner; a fixed user other than the TT user is sufficient. The cluster private key can be made readable only by the TaskTracker user, so that the Tasks cannot get hold of it. However, it requires the use of LinuxTaskController to spawn tasks as a different user, and some code changes to enable this via configuration.
            
2.	So JobClient uses current user credentials to obtain keys from the KeyStore, encrypts them with cluster-public-key and sends to the cluster along with the user credentials. JobTracker has nothing to do with the keys and passes the encrypted blob over to TaskTrackers scheduled to execute the tasks. TT decrypts the user keys using private-cluster-key and hands them to the local tasks, which is secure as keys don't travel over the wires. Is it right so far?
	
That is correct. It's a clear and concise explanation of this straightforward approach. Please note that although the design is described in terms of TaskTrackers and TaskControllers (1.0 terminology), the implementation is available for both 1.0 and 2.0.

3.	TT should be using user credentials to decrypt the blob of keys somehow? Or does it authenticate the user and then decrypts if authentication passes? I did not find it in your document.
		
This is an important point, as we do not want the TaskTracker to decrypt the blob of keys and blindly hand them over to Tasks. The JobClient stores the JobId along with the keys as part of the encrypted blob. The TaskTracker decrypts the encrypted blob and verifies that the JobId in the blob matches the JobId of the task. The keys are handed over to Tasks only if the JobId verification succeeds. This ensures that keys are handed over to the correct tasks.
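
A rough sketch of this JobId check follows. It is only an illustration under assumptions: the blob layout (JobId, key length, key bytes), the class and method names, and the choice of RSA-OAEP are made up here and may differ from the patch. Plain RSA also limits the payload to well under 200 bytes with a 2048-bit key, so a real blob carrying several keys would presumably need a different scheme.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import javax.crypto.Cipher;

final class KeyBlobSketch {
    private static final String RSA_OAEP = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    /** Stand-in for the cluster key pair; in the design the private half stays with the TT user. */
    static KeyPair newClusterKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** JobClient side: bind the JobId to the user key, then encrypt for the cluster. */
    static byte[] seal(String jobId, byte[] userKey, PublicKey clusterPublicKey) {
        try {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(plain);
            data.writeUTF(jobId);
            data.writeInt(userKey.length);
            data.write(userKey);
            data.flush();
            Cipher cipher = Cipher.getInstance(RSA_OAEP);
            cipher.init(Cipher.ENCRYPT_MODE, clusterPublicKey);
            return cipher.doFinal(plain.toByteArray());
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("Cannot seal key blob", e);
        }
    }

    /** TaskTracker side: decrypt, then release the key only if the JobId matches the task's job. */
    static byte[] open(byte[] blob, PrivateKey clusterPrivateKey, String taskJobId) {
        try {
            Cipher cipher = Cipher.getInstance(RSA_OAEP);
            cipher.init(Cipher.DECRYPT_MODE, clusterPrivateKey);
            DataInputStream data =
                new DataInputStream(new ByteArrayInputStream(cipher.doFinal(blob)));
            String blobJobId = data.readUTF();
            if (!blobJobId.equals(taskJobId)) {
                throw new SecurityException(
                    "Blob was issued for " + blobJobId + ", not " + taskJobId);
            }
            byte[] userKey = new byte[data.readInt()];
            data.readFully(userKey);
            return userKey;
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("Cannot open key blob", e);
        }
    }

    public static void main(String[] args) {
        KeyPair cluster = newClusterKeyPair();
        byte[] userKey = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        byte[] blob = seal("job_201208290001_0001", userKey, cluster.getPublic());

        byte[] handedOver = open(blob, cluster.getPrivate(), "job_201208290001_0001");
        System.out.println("key handed to task: " + handedOver.length + " bytes");

        try {
            open(blob, cluster.getPrivate(), "job_201208290001_0002");
        } catch (SecurityException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the JobId sits inside the encrypted payload, a blob copied from one job cannot be presented on behalf of a task of another job without failing the comparison in open.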

4.	How cluster-private-key is delivered to TTs?

The TTs can use an implementation of the KeyProvider interface to retrieve keys. The implementation can be set in the cluster configuration. The default key provider is a Java keystore based provider, in which the private key is stored in a Java keystore file on the TT machines. This is the same scheme web servers use to store their private keys. It is possible to plug in more complex key storage mechanisms via configuration.
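
As a sketch of how such a pluggable provider might be wired, assuming a KeyProvider interface with init and getKey methods and three made-up property names (hadoop.crypto.keyprovider.class, hadoop.crypto.keystore.path, hadoop.crypto.keystore.password); none of these are taken from the patch. The demo stores a symmetric key in the keystore file, because a private key entry needs a certificate chain that the JDK's public API cannot easily generate in a few lines.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyStore;
import java.util.Properties;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical plug-in contract; the real KeyProvider interface may differ. */
interface KeyProvider {
    void init(Properties conf);
    Key getKey(String alias);
}

/** Reads keys from a Java keystore file named in the configuration. */
final class JavaKeyStoreKeyProvider implements KeyProvider {
    private KeyStore keyStore;
    private char[] password;

    @Override public void init(Properties conf) {
        password = conf.getProperty("hadoop.crypto.keystore.password").toCharArray();
        Path file = Paths.get(conf.getProperty("hadoop.crypto.keystore.path"));
        try (InputStream in = Files.newInputStream(file)) {
            keyStore = KeyStore.getInstance("JCEKS");
            keyStore.load(in, password);
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Cannot load keystore " + file, e);
        }
    }

    @Override public Key getKey(String alias) {
        try {
            return keyStore.getKey(alias, password); // null if the alias is absent
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Cannot read key " + alias, e);
        }
    }
}

final class KeyProviders {
    /** Instantiates the class named in the configuration, defaulting to the keystore provider. */
    static KeyProvider fromConfiguration(Properties conf) {
        String className = conf.getProperty(
            "hadoop.crypto.keyprovider.class", JavaKeyStoreKeyProvider.class.getName());
        try {
            KeyProvider provider = Class.forName(className)
                .asSubclass(KeyProvider.class).getDeclaredConstructor().newInstance();
            provider.init(conf);
            return provider;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create key provider " + className, e);
        }
    }

    /** Writes a throwaway keystore file and returns a configuration pointing at it. */
    static Properties demoConfiguration() {
        try {
            Path file = Files.createTempFile("cluster", ".jceks");
            file.toFile().deleteOnExit();
            char[] password = "changeit".toCharArray();
            byte[] raw = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
            KeyStore ks = KeyStore.getInstance("JCEKS");
            ks.load(null, password); // null stream: start from an empty keystore
            ks.setKeyEntry("cluster-key", new SecretKeySpec(raw, "AES"), password, null);
            try (OutputStream out = Files.newOutputStream(file)) {
                ks.store(out, password);
            }
            Properties conf = new Properties();
            conf.setProperty("hadoop.crypto.keystore.path", file.toString());
            conf.setProperty("hadoop.crypto.keystore.password", "changeit");
            return conf;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyProvider provider = fromConfiguration(demoConfiguration());
        System.out.println("loaded " + provider.getClass().getSimpleName());
    }
}
```

Swapping in a different key storage mechanism then only means setting the provider class property to another implementation; the calling code keeps using getKey.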

5. I think the naming of the configuration parameters needs some changes. They should not start with mapreduce.job. Based on your examples, you can just encrypt an HDFS file without spawning any actual jobs. In this case, seeing mapreduce.job.* seems confusing.
My suggestion is to simply prefix all parameters with hadoop.crypto.* Then you can use, e.g., the full word "keystore" instead of "ks".

The distributed utility to encrypt/decrypt an HDFS file actually spawns map jobs. Irrespective of that, I think it makes perfect sense to rename the configurations to hadoop.crypto.*, as this approach is also useful in non-MapReduce situations. I'll change the configuration names.

I plan to get into reviewing the implementation soon.  

Thanks and please post your comments.
                

[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

Posted by "Plamen Jeliazkov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447981#comment-13447981 ] 

Plamen Jeliazkov commented on MAPREDUCE-4491:
---------------------------------------------

Great work, Benoy!

This looks like a very neat feature to add. I am all in support. I like your similarity with the compressor / decompressor interfaces and the ease of the implementation to plug-in any keystores.

I am in the midst of applying your patches and doing a small test locally and will reply back with any results I find.
                