Posted to common-dev@hadoop.apache.org by Amandeep Khurana <am...@gmail.com> on 2009/03/18 01:43:09 UTC

Design for security in Hadoop

Hi

I've been working on security in Hadoop and have come up with a design for
it. I ran some basic experiments to evaluate the design. Here's the
report.

Feedback/comments/discussions on this would be great.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

Re: Design for security in Hadoop

Posted by Steve Loughran <st...@apache.org>.
Amandeep Khurana wrote:
> Thanks for the feedback Steve.
> 
> My response on the points that you have mentioned are written inline below.
> 
> Amandeep
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
> 
> 
> On Thu, Mar 19, 2009 at 4:31 AM, Steve Loughran <st...@apache.org> wrote:
> 
>> Amandeep Khurana wrote:
>>
>>> Apparently, the file attached was striped off. Here's the link for where
>>> you
>>> can get it:
>>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf<http://www.soe.ucsc.edu/%7Eakhurana/Hadoop_Security.pdf>
>>>
>>> Amandeep
>>>
>>>
>>>
>> This is a good paper with test data to go alongside the theory
>> Introduction
>> ========
>> -I'd cite NFS as a good equivalent design, the same "we trust you to be who
>> you say you are" protocol, similar assumptions about the network ("only
>> trusted machines get on it")
>> -If EC2 does not meet these requirements, you could argue it's  fault of
>> EC2; there's no fundamental reason why it can't offer private VPNs for
>> clusters the way other infrastructure (VMWare) can
>> -the whoami call is done by the command line client; different clients
>> don't even have to do that. Mine doesn't.
>> -it is not the "superuser" in unix sense, "root", that runs jobs, it is
>> whichever user started hadoop on that node. It can still be a locked down
>> user with limited machine rights.
> 
> 
> I'll look into the NFS security stuff in detail and then add it later.


The key point about NFS security is that there was none, because in the
early eighties the idea of a Linux laptop getting on your wifi network was
not conceivable, so you really could trust workstations. It was only
with PC-NFS that the assumptions started to fail.

> 
> Where did EC2 come into picture?

It's an example of a place where Hadoop is deployed where the assumption
that only trusted users have network access (and/or only fixed IP
addresses can join the cluster) doesn't hold.

> 
> Yes, the whoami can be bypassed, thats why the whole thing around
> authentication.
> 
> By superuser, I meant the user who starts the hadoop instance... Will make
> it clearer in the writing.

OK

> 
> 
>>
>> Attacks
>> ====
>> Add
>>  -unauthorised nodes spoofing other IP addresses (via ARP attacks) and
>> becoming nodes in the cluster. You could acquire and then keep or destroy
>> data, or pretend to do work and return false values.  Or come up as a spoof
>> namenode datanode and disrupt all work.
>> -denial of service attacks: too many heartbeats, etc
>> -spoof clients running malicious code on the tasktrackers.
> 
> 
> I havent looked these attacks. This paper is not focussing on that. This can
> definitely be looked at and incorporated at a later stage. Lets go step by
> step. (Debatable)

I was just broadening the list of attacks. Spoofed nodes joining the
cluster are something to fear.
> 
>>
>> Protocol
>> ======
>> -SSL does need to deal with trust; unless you want to pay for every server
>> certificate (you may be able to share them), you'll
>> need to set up your own CA and issuing private certs -leaving you with the
>> problem of securiing distributing CA public keys and getting SSL private
>> keys out to nodes securely (and not anything on the net trying to use your
>> kickstart server to boot a VM with the same mac address as a trusted server
>> just to get at those keys)
> 
> 
> SSL is a possible solution but the details arent the focus of this design.
> Regarding the other keys, there is a format around which they are created
> and you dont need a CA for that.
> 
> 
>>
>> -I'll have to get somebody who understands security protocols to review the
>> paper. One area I'd flag as trouble is that on virtual machines, clock drift
>> can be choppy and non-linear. You also have to worry about clients not being
>> in the right time zone. It is good for everything to work off one clock (say
>> the namenode) rather than their own. Amazon's S3 authentication protocol has
>> this bug, as do the bits of WS-DM which take absolute times rather than
>> relative ones (presumably to make operations idempotent). A the very least,
>> the namenode needs an operation to return its current time, which callers
>> can then work off
> 
> 
> The time issue is definitely a concern and has to be somehow cracked. The
> namenode giving its time is a good idea. But the sync would still be
> important. There is a way to sync the time across the cluster. I dont
> remember it clearly, but I have it on my "little" cluster. I'll look that
> up.
> 

NTP is the normal protocol, everyone tries to use it. But asking the NN 
for its clock would avoid having to rely on everything being in sync at 
the OS level -and would let the client detect when its clock had drifted 
too far off for a conversation. One recurrent problem of mine is 
machines that are on NTP but whose time zones are wrong; they are 
perfectly accurate to the second but 8 hours out.
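
A rough sketch of what I mean by working off the NN's clock, in Python for
brevity. Everything here is hypothetical: fetch_namenode_time stands in for
an RPC that does not exist today, and the 300 second tolerance is an
arbitrary assumption, not a proposed default.

```python
import time
from typing import Callable


def clock_offset(fetch_namenode_time: Callable[[], float],
                 local_clock: Callable[[], float] = time.time) -> float:
    """Estimate (namenode time - local time) in seconds.

    fetch_namenode_time is a hypothetical call returning the namenode's
    clock as seconds since the epoch. The local reading is taken at the
    midpoint of the round trip to roughly cancel network latency.
    """
    sent = local_clock()
    remote = fetch_namenode_time()
    received = local_clock()
    midpoint = (sent + received) / 2.0
    return remote - midpoint


def within_tolerance(offset_seconds: float,
                     max_drift_seconds: float = 300.0) -> bool:
    """True if the local clock is close enough to the namenode's clock."""
    return abs(offset_seconds) <= max_drift_seconds


def namenode_now(offset_seconds: float,
                 local_clock: Callable[[], float] = time.time) -> float:
    """Current time expressed on the namenode's clock."""
    return local_clock() + offset_seconds
```

A client would compute the offset once per conversation, refuse to continue
(or warn) when within_tolerance is false, and stamp any time-limited request
with namenode_now(offset) rather than its own wall clock.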

-steve

Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
Thanks for the feedback Steve.

My responses to the points you mentioned are written inline below.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Thu, Mar 19, 2009 at 4:31 AM, Steve Loughran <st...@apache.org> wrote:

> Amandeep Khurana wrote:
>
>> Apparently, the file attached was striped off. Here's the link for where
>> you
>> can get it:
>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf<http://www.soe.ucsc.edu/%7Eakhurana/Hadoop_Security.pdf>
>>
>> Amandeep
>>
>>
>>
> This is a good paper with test data to go alongside the theory
> Introduction
> ========
> -I'd cite NFS as a good equivalent design, the same "we trust you to be who
> you say you are" protocol, similar assumptions about the network ("only
> trusted machines get on it")
> -If EC2 does not meet these requirements, you could argue it's  fault of
> EC2; there's no fundamental reason why it can't offer private VPNs for
> clusters the way other infrastructure (VMWare) can
> -the whoami call is done by the command line client; different clients
> don't even have to do that. Mine doesn't.
> -it is not the "superuser" in unix sense, "root", that runs jobs, it is
> whichever user started hadoop on that node. It can still be a locked down
> user with limited machine rights.


I'll look into the NFS security stuff in detail and then add it later.

Where did EC2 come into the picture?

Yes, the whoami can be bypassed; that's the reason for the whole
authentication design.

By superuser, I meant the user who starts the hadoop instance... Will make
it clearer in the writing.


>
>
> Attacks
> ====
> Add
>  -unauthorised nodes spoofing other IP addresses (via ARP attacks) and
> becoming nodes in the cluster. You could acquire and then keep or destroy
> data, or pretend to do work and return false values.  Or come up as a spoof
> namenode datanode and disrupt all work.
> -denial of service attacks: too many heartbeats, etc
> -spoof clients running malicious code on the tasktrackers.


I haven't looked at these attacks; the paper does not focus on them. They can
definitely be looked at and incorporated at a later stage. Let's go step by
step. (Debatable)

>
>
> Protocol
> ======
> -SSL does need to deal with trust; unless you want to pay for every server
> certificate (you may be able to share them), you'll
> need to set up your own CA and issuing private certs -leaving you with the
> problem of securiing distributing CA public keys and getting SSL private
> keys out to nodes securely (and not anything on the net trying to use your
> kickstart server to boot a VM with the same mac address as a trusted server
> just to get at those keys)


SSL is a possible solution but the details aren't the focus of this design.
Regarding the other keys, there is a format around which they are created
and you don't need a CA for that.


>
>
> -I'll have to get somebody who understands security protocols to review the
> paper. One area I'd flag as trouble is that on virtual machines, clock drift
> can be choppy and non-linear. You also have to worry about clients not being
> in the right time zone. It is good for everything to work off one clock (say
> the namenode) rather than their own. Amazon's S3 authentication protocol has
> this bug, as do the bits of WS-DM which take absolute times rather than
> relative ones (presumably to make operations idempotent). A the very least,
> the namenode needs an operation to return its current time, which callers
> can then work off


The time issue is definitely a concern and has to be solved somehow. The
namenode giving its time is a good idea, but the sync would still be
important. There is a way to sync the time across the cluster. I don't
remember it clearly, but I have it on my "little" cluster. I'll look that
up.


>
>
> Implementation
> -any  implementation should be allowed to use different (userid,
> credentials)  than (whoami , ~/.hadoop). This is to allow workflow servers
> and the like to schedule work as different users.
> -server side should log success/failures to different Log categories; with
> that an JMX instrumentation you can track security attacks.


Yes, that's the intention. So, you log into the system by giving a command
like
bin/hadoop login <userid>
The namenode asks for a password and authenticates it against the underlying
Unix system (or a separate user oracle if we want that).


>
>
> Overall, a nice paper. Do you have the patches to try it out on a bigger
> cluster?
>
>
Thanks! Just my first attempt at writing a paper. Glad you like it and gave
some valuable feedback.

The code that I added is kind of crude right now. It can be tested on a
large cluster, but I'd rather wait for some more inputs from others who've
been working on security or have thoughts around it. If this design is
accepted by everyone, I can go ahead and write up the code properly and we
can test it thereafter.

Re: Design for security in Hadoop

Posted by Steve Loughran <st...@apache.org>.
Amandeep Khurana wrote:
> Apparently, the file attached was striped off. Here's the link for where you
> can get it:
> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
> 
> Amandeep
> 
> 

This is a good paper with test data to go alongside the theory
Introduction
========
-I'd cite NFS as a good equivalent design, the same "we trust you to be 
who you say you are" protocol, similar assumptions about the network 
("only trusted machines get on it")
-If EC2 does not meet these requirements, you could argue it's the fault of
EC2; there's no fundamental reason why it can't offer private VPNs for
clusters the way other infrastructure (VMWare) can
-the whoami call is done by the command line client; different clients 
don't even have to do that. Mine doesn't.
-it is not the "superuser" in unix sense, "root", that runs jobs, it is 
whichever user started hadoop on that node. It can still be a locked 
down user with limited machine rights.

Attacks
====
Add
-unauthorised nodes spoofing other IP addresses (via ARP attacks) and
becoming nodes in the cluster. You could acquire and then keep or
destroy data, or pretend to do work and return false values.  Or come up
as a spoof namenode or datanode and disrupt all work.
-denial of service attacks: too many heartbeats, etc
-spoof clients running malicious code on the tasktrackers.

Protocol
======
-SSL does need to deal with trust; unless you want to pay for every
server certificate (you may be able to share them), you'll
need to set up your own CA and issue private certs -leaving you with
the problem of securely distributing CA public keys and getting SSL
private keys out to nodes securely (and not having anything on the net try
to use your kickstart server to boot a VM with the same MAC address as a
trusted server just to get at those keys)

-I'll have to get somebody who understands security protocols to review 
the paper. One area I'd flag as trouble is that on virtual machines, 
clock drift can be choppy and non-linear. You also have to worry about 
clients not being in the right time zone. It is good for everything to 
work off one clock (say the namenode) rather than their own. Amazon's S3 
authentication protocol has this bug, as do the bits of WS-DM which take 
absolute times rather than relative ones (presumably to make operations 
idempotent). At the very least, the namenode needs an operation to return
its current time, which callers can then work off.

Implementation
-any  implementation should be allowed to use different (userid, 
credentials)  than (whoami , ~/.hadoop). This is to allow workflow 
servers and the like to schedule work as different users.
-server side should log success/failures to different Log categories;
with that and JMX instrumentation you can track security attacks.
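
To illustrate the separate log categories, here is a minimal sketch using
Python's standard logging module rather than Hadoop's own Java logging. The
category names and the record_login helper are made up for illustration, and
the JMX side is not shown.

```python
import logging

# Hypothetical category names; one logger per outcome so operators can
# route, filter, or alert on failures independently of successes.
AUTH_SUCCESS = logging.getLogger("security.auth.success")
AUTH_FAILURE = logging.getLogger("security.auth.failure")


def record_login(user: str, remote_addr: str, ok: bool) -> None:
    """Log an authentication attempt to the category matching its outcome."""
    if ok:
        AUTH_SUCCESS.info("login ok user=%s addr=%s", user, remote_addr)
    else:
        AUTH_FAILURE.warning("login failed user=%s addr=%s", user, remote_addr)
```

With the two categories split, a burst of entries in the failure category
from one address is easy to spot without wading through normal logins.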

Overall, a nice paper. Do you have the patches to try it out on a bigger 
cluster?




Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
>
>
>
>
> On 3/25/09 12:12 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
>
> >>
> >>
> >> On 3/20/09 2:47 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
> >>
> >>>
> >>> 2. The Jira doesnt have cover the access control aspect of things. As a
> >>> client, I can skip talking to the NN and get blocks from the DN
> straight
> >>> away. There is no way to prevent it. This paper takes care of that
> aspect
> >> as
> >>> well.
> >>>
> >>
> >> Have you looked at HADOOP-4359? In that JIRA, we discussed the idea of
> >> using
> >> public-key signed capabilities and dismissed it in favor of
> symmetric-key
> >> based capabilities. That said, you're welcome to explore the public-key
> >> idea
> >> further.
> >
> >
> > Yes, I read through that. The issue with that approach is that the moment
> a
> > single DN gets compromised somehow (which isnt a big deal in a big system
> > containing 1000s of nodes), the symmetric key gets exposed and the entire
> > system is compromised. The whole idea of asymmetric key crypto is to
> allow
> > only a single authorized prinicipal to sign stuff.
> >
> Yes, I discussed this point in the JIRA. It's a trade-off between security
> and performance and I think it's worth taking for our cluster setup. In our
> setup, all the nodes of a cluster are located in the same datacenter and
> managed in the same way. While securing 1000 nodes is certain harder than
> securing one node, it's not like you have 1000 desktops spread around.
> You're welcome to submit a patch for the public-key solution. It can be
> useful for some other cluster setups.
>

Makes sense... Performance is definitely a concern, but if you look at the
results that I got out of the basic testing I did, the overhead really isn't big.


>
> Kan
>
>

Re: Design for security in Hadoop

Posted by Kan Zhang <ka...@yahoo-inc.com>.


On 3/25/09 12:12 PM, "Amandeep Khurana" <am...@gmail.com> wrote:

>> 
>> 
>> On 3/20/09 2:47 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
>> 
>>> 
>>> 2. The Jira doesnt have cover the access control aspect of things. As a
>>> client, I can skip talking to the NN and get blocks from the DN straight
>>> away. There is no way to prevent it. This paper takes care of that aspect
>> as
>>> well.
>>> 
>> 
>> Have you looked at HADOOP-4359? In that JIRA, we discussed the idea of
>> using
>> public-key signed capabilities and dismissed it in favor of symmetric-key
>> based capabilities. That said, you're welcome to explore the public-key
>> idea
>> further.
> 
> 
> Yes, I read through that. The issue with that approach is that the moment a
> single DN gets compromised somehow (which isnt a big deal in a big system
> containing 1000s of nodes), the symmetric key gets exposed and the entire
> system is compromised. The whole idea of asymmetric key crypto is to allow
> only a single authorized prinicipal to sign stuff.
> 
Yes, I discussed this point in the JIRA. It's a trade-off between security
and performance and I think it's worth taking for our cluster setup. In our
setup, all the nodes of a cluster are located in the same datacenter and
managed in the same way. While securing 1000 nodes is certainly harder than
securing one node, it's not like you have 1000 desktops spread around.
You're welcome to submit a patch for the public-key solution. It can be
useful for some other cluster setups.

Kan


Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
>
>
> On 3/20/09 2:47 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
>
> >
> > 2. The Jira doesnt have cover the access control aspect of things. As a
> > client, I can skip talking to the NN and get blocks from the DN straight
> > away. There is no way to prevent it. This paper takes care of that aspect
> as
> > well.
> >
>
> Have you looked at HADOOP-4359? In that JIRA, we discussed the idea of
> using
> public-key signed capabilities and dismissed it in favor of symmetric-key
> based capabilities. That said, you're welcome to explore the public-key
> idea
> further.


Yes, I read through that. The issue with that approach is that the moment a
single DN gets compromised somehow (which isn't hard to imagine in a big system
containing 1000s of nodes), the symmetric key gets exposed and the entire
system is compromised. The whole idea of asymmetric key crypto is to allow
only a single authorized principal to sign stuff.
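
To make the concern concrete, here is a minimal sketch of a symmetric-key
capability using HMAC from the Python standard library. The token fields
(user, block id, expiry) and the function names are my own assumptions for
illustration; this is not the format discussed in HADOOP-4359.

```python
import hashlib
import hmac


def sign_capability(shared_key: bytes, user: str, block_id: str,
                    expiry: int) -> str:
    """Return an HMAC-SHA256 tag over the capability fields.

    Assumes the fields themselves never contain the '|' separator.
    """
    message = f"{user}|{block_id}|{expiry}".encode("utf-8")
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()


def verify_capability(shared_key: bytes, user: str, block_id: str,
                      expiry: int, tag: str, now: int) -> bool:
    """Check that the capability is unexpired and the tag matches."""
    if now > expiry:
        return False
    expected = sign_capability(shared_key, user, block_id, expiry)
    return hmac.compare_digest(expected, tag)
```

Verification needs the same key as signing, so every DN that can check a
capability can also call sign_capability for any user and block. With a
public-key signature the DNs would hold only the verification key, at the
cost of slower signing and verification.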


> Kan
>
>

Re: Design for security in Hadoop

Posted by Kan Zhang <ka...@yahoo-inc.com>.
Yes, an additional benefit of using Hadoop proprietary "delegation tokens"
for delegation as described in HADOOP-4343, as opposed to using Kerberos
TGT/Service tickets, is that Kerberos is only used at the "edge" of Hadoop.
Delegation tokens don't depend on Kerberos and can be coupled with
non-Kerberos authentication mechanisms (such as SSL) used at the edge.

Kan


On 3/24/09 4:37 PM, "Brian Bockelman" <bb...@cse.unl.edu> wrote:

> A related meta comment.
> 
> Our community uses X509 for a single-sign-on solution for a few
> thousand physicists.  There's been increased interest in HDFS lately,
> and it would be very attractive to this community if Hadoop used a
> lightweight, but secure solution based upon Kerberos as in HADOOP-4343
> (something like kerberos to initialize a session token and use that
> with the service).
> 
> This would be especially useful because the likely implementation
> would use JSSE - we'd be able to replace the kerberos implementation
> and, with a little work, drop the Globus implementation into place.
> We'd be able to use our single-sign-on and make the organization very
> happy.
> 
> Brian
> 
> On Mar 24, 2009, at 11:29 PM, Raghu Angadi wrote:
> 
>> 
>> I haven't looked into the proposal, but a meta comment:
>> 
>> I don't think there is a real reason for Hadoop to favor this design
>> or only stay with HADOOP-4343 or another proposal at this state. It
>> is healthy if we have different designs and implementation proceed
>> independently. If you are willing to, I think you should proceed
>> with a prototype so that others interested can play with. This is
>> true not just for this feature, but many others as well.
>> 
>> This of course should not discourage others from reviewing your
>> design.
>> 
>> Raghu.
>> 
>> Amandeep Khurana wrote:
>>> Bouncing the thread... Waiting to hear from people about the
>>> proposal.
>>> Amandeep Khurana
>>> Computer Science Graduate Student
>>> University of California, Santa Cruz
>>> On Fri, Mar 20, 2009 at 2:47 PM, Amandeep Khurana
>>> <am...@gmail.com> wrote:
>>>> 1. The Jira covers only authentication using Kerberos. I dont think
>>>> Kerberos is the best way to do it since I feel the scalability is
>>>> limited.
>>>> All keys have to be negotiated by the Kerberos server. The design
>>>> in the
>>>> paper has a little different protocol for authentication.
>>>> 
>>>> 2. The Jira doesnt have cover the access control aspect of things.
>>>> As a
>>>> client, I can skip talking to the NN and get blocks from the DN
>>>> straight
>>>> away. There is no way to prevent it. This paper takes care of that
>>>> aspect as
>>>> well.
>>>> 
>>>> 
>>>> Amandeep Khurana
>>>> Computer Science Graduate Student
>>>> University of California, Santa Cruz
>>>> 
>>>> 
>>>> On Fri, Mar 20, 2009 at 12:54 PM, Doug Cutting
>>>> <cu...@apache.org> wrote:
>>>> 
>>>>> Amandeep Khurana wrote:
>>>>> 
>>>>>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf<http://www.soe.ucsc
>>>>>> .edu/%7Eakhurana/Hadoop_Security.pdf
>>>>>>> 
>>>>>> 
>>>>> How does this relate to the current proposal in Jira?
>>>>> 
>>>>> https://issues.apache.org/jira/browse/HADOOP-4343
>>>>> 
>>>>> Doug
>>>>> 
>>>> 
> 


Re: Design for security in Hadoop

Posted by Brian Bockelman <bb...@cse.unl.edu>.
A related meta comment.

Our community uses X509 for a single-sign-on solution for a few  
thousand physicists.  There's been increased interest in HDFS lately,  
and it would be very attractive to this community if Hadoop used a  
lightweight, but secure solution based upon Kerberos as in HADOOP-4343  
(something like kerberos to initialize a session token and use that  
with the service).

This would be especially useful because the likely implementation  
would use JSSE - we'd be able to replace the kerberos implementation  
and, with a little work, drop the Globus implementation into place.   
We'd be able to use our single-sign-on and make the organization very  
happy.

Brian

On Mar 24, 2009, at 11:29 PM, Raghu Angadi wrote:

>
> I haven't looked into the proposal, but a meta comment:
>
> I don't think there is a real reason for Hadoop to favor this design  
> or only stay with HADOOP-4343 or another proposal at this state. It  
> is healthy if we have different designs and implementation proceed  
> independently. If you are willing to, I think you should proceed  
> with a prototype so that others interested can play with. This is  
> true not just for this feature, but many others as well.
>
> This of course should not discourage others from reviewing your  
> design.
>
> Raghu.
>
> Amandeep Khurana wrote:
>> Bouncing the thread... Waiting to hear from people about the  
>> proposal.
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>> On Fri, Mar 20, 2009 at 2:47 PM, Amandeep Khurana  
>> <am...@gmail.com> wrote:
>>> 1. The Jira covers only authentication using Kerberos. I dont think
>>> Kerberos is the best way to do it since I feel the scalability is  
>>> limited.
>>> All keys have to be negotiated by the Kerberos server. The design  
>>> in the
>>> paper has a little different protocol for authentication.
>>>
>>> 2. The Jira doesnt have cover the access control aspect of things.  
>>> As a
>>> client, I can skip talking to the NN and get blocks from the DN  
>>> straight
>>> away. There is no way to prevent it. This paper takes care of that  
>>> aspect as
>>> well.
>>>
>>>
>>> Amandeep Khurana
>>> Computer Science Graduate Student
>>> University of California, Santa Cruz
>>>
>>>
>>> On Fri, Mar 20, 2009 at 12:54 PM, Doug Cutting  
>>> <cu...@apache.org> wrote:
>>>
>>>> Amandeep Khurana wrote:
>>>>
>>>>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf<http://www.soe.ucsc.edu/%7Eakhurana/Hadoop_Security.pdf 
>>>>> >
>>>>>
>>>> How does this relate to the current proposal in Jira?
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-4343
>>>>
>>>> Doug
>>>>
>>>


Re: Design for security in Hadoop

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
I haven't looked into the proposal, but a meta comment:

I don't think there is a real reason for Hadoop to favor this design or
only stay with HADOOP-4343 or another proposal at this stage. It is
healthy if we have different designs and implementations proceed
independently. If you are willing to, I think you should proceed with a
prototype so that others who are interested can play with it. This is true
not just for this feature, but many others as well.

This of course should not discourage others from reviewing your design.

Raghu.

Amandeep Khurana wrote:
> Bouncing the thread... Waiting to hear from people about the proposal.
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
> 
> 
> On Fri, Mar 20, 2009 at 2:47 PM, Amandeep Khurana <am...@gmail.com> wrote:
> 
>> 1. The Jira covers only authentication using Kerberos. I dont think
>> Kerberos is the best way to do it since I feel the scalability is limited.
>> All keys have to be negotiated by the Kerberos server. The design in the
>> paper has a little different protocol for authentication.
>>
>> 2. The Jira doesnt have cover the access control aspect of things. As a
>> client, I can skip talking to the NN and get blocks from the DN straight
>> away. There is no way to prevent it. This paper takes care of that aspect as
>> well.
>>
>>
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>>
>>
>> On Fri, Mar 20, 2009 at 12:54 PM, Doug Cutting <cu...@apache.org> wrote:
>>
>>> Amandeep Khurana wrote:
>>>
>>>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf<http://www.soe.ucsc.edu/%7Eakhurana/Hadoop_Security.pdf>
>>>>
>>> How does this relate to the current proposal in Jira?
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-4343
>>>
>>> Doug
>>>
>>
> 


Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
Bouncing the thread... Waiting to hear from people about the proposal.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Mar 20, 2009 at 2:47 PM, Amandeep Khurana <am...@gmail.com> wrote:

> 1. The Jira covers only authentication using Kerberos. I dont think
> Kerberos is the best way to do it since I feel the scalability is limited.
> All keys have to be negotiated by the Kerberos server. The design in the
> paper has a little different protocol for authentication.
>
> 2. The Jira doesnt have cover the access control aspect of things. As a
> client, I can skip talking to the NN and get blocks from the DN straight
> away. There is no way to prevent it. This paper takes care of that aspect as
> well.
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Fri, Mar 20, 2009 at 12:54 PM, Doug Cutting <cu...@apache.org> wrote:
>
>> Amandeep Khurana wrote:
>>
>>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf<http://www.soe.ucsc.edu/%7Eakhurana/Hadoop_Security.pdf>
>>>
>>
>> How does this relate to the current proposal in Jira?
>>
>> https://issues.apache.org/jira/browse/HADOOP-4343
>>
>> Doug
>>
>
>

Re: Design for security in Hadoop

Posted by Kan Zhang <ka...@yahoo-inc.com>.


On 3/20/09 2:47 PM, "Amandeep Khurana" <am...@gmail.com> wrote:

> 
> 2. The Jira doesnt have cover the access control aspect of things. As a
> client, I can skip talking to the NN and get blocks from the DN straight
> away. There is no way to prevent it. This paper takes care of that aspect as
> well.
> 

Have you looked at HADOOP-4359? In that JIRA, we discussed the idea of using
public-key signed capabilities and dismissed it in favor of symmetric-key
based capabilities. That said, you're welcome to explore the public-key idea
further.

Kan


Re: Design for security in Hadoop

Posted by Steve Loughran <st...@apache.org>.
Amandeep Khurana wrote:
> On Wed, Mar 25, 2009 at 12:23 PM, Kan Zhang <ka...@yahoo-inc.com> wrote:
> 
>>
>>
>> On 3/25/09 2:49 AM, "Doug Cutting" <cu...@apache.org> wrote:
>>
>>>> 2. The Jira doesnt have cover the access control aspect of things. As a
>>>> client, I can skip talking to the NN and get blocks from the DN straight
>>>> away. There is no way to prevent it. This paper takes care of that
>> aspect as
>>>> well.
>>> The intent is that access to a block on a datanode will require
>>> authentication.  Currently it does not, but as security features are
>>> added this clearly must change.  HADOOP-4343 does not mention how this
>>> will be done, but I believe it must be implemented in the same timeframe
>>> as namenode authentication.
>>>
>> We plan to use capability tokens issued by NN to control accesses to DN
>> (see
>> HADOOP-4359). If DN authenticates users, those capability tokens can be
>> made
>> non-transferable. This will improve security since stolen tokens can't be
>> used by the attacker. Another benefit of having authentication is to be
>> able
>> to establish an encrypted communication channel afterwards (if the
>> authentication protocol used supports it). However, I think DN user
>> authentication may not be necessary for many use cases and can be addressed
>> after NN authentication is done.
> 
> 
> Got it. There is no user authentication at the DN. I'm not sure why you got
> that impression. Authentication is done only once by the NN. Thereafter its
> only capabilities being passed around. However, there are 2 main
> differences:
> 1. You plan to use symmetric key and I proposed asymmetric key.
> 2. The authentation protocol you plan to use is Kerberos and I dont think
> thats scalable. Hence a different one that my paper talks about.

Brian's points about x509 integration are relevant -they are people who 
have to worry about trust.

There's a separate issue bubbling up here and that is US government 
export rules regarding encryption and the like. Apache has to deal with 
that already, and has a page covering the status:
http://www.apache.org/licenses/exports/

Generally, if you use jsch or the bouncy castle implementations of JSSE
then it's not your project's problem. Building security and encryption 
support more directly into the app is something that needs to be looked 
at very carefully.  It's where legal issues take priority over coding ones.



Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
On Wed, Mar 25, 2009 at 12:23 PM, Kan Zhang <ka...@yahoo-inc.com> wrote:

>
>
>
> On 3/25/09 2:49 AM, "Doug Cutting" <cu...@apache.org> wrote:
>
> >> 2. The Jira doesnt have cover the access control aspect of things. As a
> >> client, I can skip talking to the NN and get blocks from the DN straight
> >> away. There is no way to prevent it. This paper takes care of that
> aspect as
> >> well.
> >
> > The intent is that access to a block on a datanode will require
> > authentication.  Currently it does not, but as security features are
> > added this clearly must change.  HADOOP-4343 does not mention how this
> > will be done, but I believe it must be implemented in the same timeframe
> > as namenode authentication.
> >
>
> We plan to use capability tokens issued by NN to control accesses to DN
> (see
> HADOOP-4359). If DN authenticates users, those capability tokens can be
> made
> non-transferable. This will improve security since stolen tokens can't be
> used by the attacker. Another benefit of having authentication is to be
> able
> to establish an encrypted communication channel afterwards (if the
> authentication protocol used supports it). However, I think DN user
> authentication may not be necessary for many use cases and can be addressed
> after NN authentication is done.


Got it. There is no user authentication at the DN. I'm not sure why you got
that impression. Authentication is done only once by the NN. Thereafter it's
only capabilities being passed around. However, there are 2 main
differences:
1. You plan to use symmetric keys and I proposed asymmetric keys.
2. The authentication protocol you plan to use is Kerberos and I don't think
that's scalable. Hence the different one that my paper talks about.





>
>
> Kan
>
>

Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
On Wed, Mar 25, 2009 at 1:43 PM, Kan Zhang <ka...@yahoo-inc.com> wrote:

>
>
>
> On 3/25/09 1:04 PM, "Kan Zhang" <ka...@yahoo-inc.com> wrote:
>
> >
> >
> >
> > On 3/25/09 12:15 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
> >
> >> On Wed, Mar 25, 2009 at 2:49 AM, Doug Cutting <cu...@apache.org>
> wrote:
> >>
> >>> Amandeep Khurana wrote:
> >>>
> >>>> 1. The Jira covers only authentication using Kerberos. I dont think
> >>>> Kerberos
> >>>> is the best way to do it since I feel the scalability is limited. All
> keys
> >>>> have to be negotiated by the Kerberos server.
> >>>>
> >>>
> >>> The design in HADOOP-4343 seeks to minimize the number of key
> negotiations.
> >>>  Do you think that's insufficient?  If so, please add a comment on that
> >>> issue.
> >>
> >>
> >> The NN doing key negotiations is fundamentally not feasible. Thats the
> >> limitation of Kerberos and there's only a certain degree to which it can
> be
> >> optimized. The design I proposed in the paper is a little different from
> >> Kerberos, where the clients negotiate the keys. This frees up the NN
> from
> >> the responsibility to do this task.
> >>
> > You've lost me. What are you referring to when you say key negotiations?
> As
> > far as I read from your paper, you didn't propose anything new for the
> > authentication between NN and the user, simply mentioning it will be a
> > Kerberos like protocol. If you are referring to those capabilities for
> > accessing DN, those are issued by NN, right?
> >
> My bad. I read your doc again and I guess you are referring to the protocol
> you proposed in the paper for the authentication to datanode using namenode
> as a trusted third-party. But the namenode is certainly involved in the
> issuing of the ticket, right? Whereas if you use Kerberos, that task can be
> off-loaded to the Kerberos KDC.


The NN issues a ticket to a client once, and the client then goes ahead and
negotiates the keys. So we don't need a Kerberos KDC, and no other principal
in the system is loaded. At the same time, the NN has full control over
who gets into the system.

>
>
> Kan
>
>

Re: Design for security in Hadoop

Posted by Kan Zhang <ka...@yahoo-inc.com>.


On 3/25/09 1:04 PM, "Kan Zhang" <ka...@yahoo-inc.com> wrote:

> 
> 
> 
> On 3/25/09 12:15 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
> 
>> On Wed, Mar 25, 2009 at 2:49 AM, Doug Cutting <cu...@apache.org> wrote:
>> 
>>> Amandeep Khurana wrote:
>>> 
>>>> 1. The Jira covers only authentication using Kerberos. I dont think
>>>> Kerberos
>>>> is the best way to do it since I feel the scalability is limited. All keys
>>>> have to be negotiated by the Kerberos server.
>>>> 
>>> 
>>> The design in HADOOP-4343 seeks to minimize the number of key negotiations.
>>>  Do you think that's insufficient?  If so, please add a comment on that
>>> issue.
>> 
>> 
>> The NN doing key negotiations is fundamentally not feasible. Thats the
>> limitation of Kerberos and there's only a certain degree to which it can be
>> optimized. The design I proposed in the paper is a little different from
>> Kerberos, where the clients negotiate the keys. This frees up the NN from
>> the responsibility to do this task.
>> 
> You've lost me. What are you referring to when you say key negotiations? As
> far as I read from your paper, you didn't propose anything new for the
> authentication between NN and the user, simply mentioning it will be a
> Kerberos like protocol. If you are referring to those capabilities for
> accessing DN, those are issued by NN, right?
> 
My bad. I read your doc again and I guess you are referring to the protocol
you proposed in the paper for the authentication to datanode using namenode
as a trusted third-party. But the namenode is certainly involved in the
issuing of the ticket, right? Whereas if you use Kerberos, that task can be
off-loaded to the Kerberos KDC.

Kan


Re: Design for security in Hadoop

Posted by Kan Zhang <ka...@yahoo-inc.com>.


On 3/25/09 12:15 PM, "Amandeep Khurana" <am...@gmail.com> wrote:

> On Wed, Mar 25, 2009 at 2:49 AM, Doug Cutting <cu...@apache.org> wrote:
> 
>> Amandeep Khurana wrote:
>> 
>>> 1. The Jira covers only authentication using Kerberos. I dont think
>>> Kerberos
>>> is the best way to do it since I feel the scalability is limited. All keys
>>> have to be negotiated by the Kerberos server.
>>> 
>> 
>> The design in HADOOP-4343 seeks to minimize the number of key negotiations.
>>  Do you think that's insufficient?  If so, please add a comment on that
>> issue.
> 
> 
> The NN doing key negotiations is fundamentally not feasible. Thats the
> limitation of Kerberos and there's only a certain degree to which it can be
> optimized. The design I proposed in the paper is a little different from
> Kerberos, where the clients negotiate the keys. This frees up the NN from
> the responsibility to do this task.
> 
You've lost me. What are you referring to when you say key negotiations? As
far as I read from your paper, you didn't propose anything new for the
authentication between NN and the user, simply mentioning it will be a
Kerberos like protocol. If you are referring to those capabilities for
accessing DN, those are issued by NN, right?

Kan


Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
On Wed, Mar 25, 2009 at 2:49 AM, Doug Cutting <cu...@apache.org> wrote:

> Amandeep Khurana wrote:
>
>> 1. The Jira covers only authentication using Kerberos. I dont think
>> Kerberos
>> is the best way to do it since I feel the scalability is limited. All keys
>> have to be negotiated by the Kerberos server.
>>
>
> The design in HADOOP-4343 seeks to minimize the number of key negotiations.
>  Do you think that's insufficient?  If so, please add a comment on that
> issue.


The NN doing key negotiations is fundamentally not feasible. That's the
limitation of Kerberos, and there's only a certain degree to which it can be
optimized. The design I proposed in the paper is a little different from
Kerberos: the clients negotiate the keys. This frees the NN from
the responsibility of doing this task.



>
>> 2. The Jira doesnt have cover the access control aspect of things. As a
>> client, I can skip talking to the NN and get blocks from the DN straight
>> away. There is no way to prevent it. This paper takes care of that aspect
>> as
>> well.
>>
>
> The intent is that access to a block on a datanode will require
> authentication.  Currently it does not, but as security features are added
> this clearly must change.  HADOOP-4343 does not mention how this will be
> done, but I believe it must be implemented in the same timeframe as namenode
> authentication.


Agreed.


>
>
> As Raghu said, the security design for Hadoop is far from complete and your
> contributions here are very welcome.


Got that.


>
>
> Doug
>
>

Re: Design for security in Hadoop

Posted by Kan Zhang <ka...@yahoo-inc.com>.


On 3/25/09 2:49 AM, "Doug Cutting" <cu...@apache.org> wrote:

>> 2. The Jira doesnt have cover the access control aspect of things. As a
>> client, I can skip talking to the NN and get blocks from the DN straight
>> away. There is no way to prevent it. This paper takes care of that aspect as
>> well.
> 
> The intent is that access to a block on a datanode will require
> authentication.  Currently it does not, but as security features are
> added this clearly must change.  HADOOP-4343 does not mention how this
> will be done, but I believe it must be implemented in the same timeframe
> as namenode authentication.
> 

We plan to use capability tokens issued by NN to control accesses to DN (see
HADOOP-4359). If DN authenticates users, those capability tokens can be made
non-transferable. This will improve security since stolen tokens can't be
used by the attacker. Another benefit of having authentication is to be able
to establish an encrypted communication channel afterwards (if the
authentication protocol used supports it). However, I think DN user
authentication may not be necessary for many use cases and can be addressed
after NN authentication is done.

Kan
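
To make the capability-token idea above concrete, here is a minimal sketch,
assuming the NN and the DNs share a symmetric key and the token is an HMAC
over (user, block id, expiry). Every name here (SHARED_KEY, issue_token,
verify_token, the token fields) is hypothetical and is not taken from
HADOOP-4359 or from the paper. It only illustrates why a token bound to an
authenticated user is non-transferable.

```python
import hashlib
import hmac
import time

# Hypothetical key shared by the namenode (NN) and the datanodes (DN).
SHARED_KEY = b"demo-key-not-for-production"


def _mac(user, block_id, expiry, key):
    # Assumes user and block_id never contain the "|" separator.
    payload = f"{user}|{block_id}|{expiry}".encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def issue_token(user, block_id, ttl_seconds, now=None, key=SHARED_KEY):
    """NN side: bind user, block and expiry together under an HMAC."""
    now = time.time() if now is None else now
    expiry = int(now) + ttl_seconds
    return {
        "user": user,
        "block_id": block_id,
        "expiry": expiry,
        "mac": _mac(user, block_id, expiry, key),
    }


def verify_token(token, authenticated_user, block_id, now=None, key=SHARED_KEY):
    """DN side: accept the token only if it is intact, fresh and matches.

    authenticated_user is None when the DN does not authenticate users;
    the token then behaves as a bearer (transferable) capability.
    """
    now = time.time() if now is None else now
    expected = _mac(token["user"], token["block_id"], token["expiry"], key)
    if not hmac.compare_digest(expected, token["mac"]):
        return False  # forged or modified token
    if token["expiry"] < now:
        return False  # expired
    if token["block_id"] != block_id:
        return False  # issued for a different block
    if authenticated_user is not None and token["user"] != authenticated_user:
        return False  # stolen token presented by another user
    return True
```

With DN-side user authentication, a token issued to one user is rejected when
another user presents it. Without it (authenticated_user=None), whoever holds
the token can use it until it expires, which is the trade-off discussed above.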


Re: Design for security in Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Amandeep Khurana wrote:
> 1. The Jira covers only authentication using Kerberos. I dont think Kerberos
> is the best way to do it since I feel the scalability is limited. All keys
> have to be negotiated by the Kerberos server.

The design in HADOOP-4343 seeks to minimize the number of key 
negotiations.  Do you think that's insufficient?  If so, please add a 
comment on that issue.

> 2. The Jira doesnt have cover the access control aspect of things. As a
> client, I can skip talking to the NN and get blocks from the DN straight
> away. There is no way to prevent it. This paper takes care of that aspect as
> well.

The intent is that access to a block on a datanode will require 
authentication.  Currently it does not, but as security features are 
added this clearly must change.  HADOOP-4343 does not mention how this 
will be done, but I believe it must be implemented in the same timeframe 
as namenode authentication.

As Raghu said, the security design for Hadoop is far from complete and 
your contributions here are very welcome.

Doug


Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
1. The Jira covers only authentication using Kerberos. I don't think Kerberos
is the best way to do it, since I feel its scalability is limited. All keys
have to be negotiated by the Kerberos server. The design in the paper has a
slightly different protocol for authentication.

2. The Jira doesn't cover the access control aspect of things. As a
client, I can skip talking to the NN and get blocks from the DN straight
away. There is no way to prevent it. This paper takes care of that aspect as
well.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Mar 20, 2009 at 12:54 PM, Doug Cutting <cu...@apache.org> wrote:

> Amandeep Khurana wrote:
>
>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
>>
>
> How does this relate to the current proposal in Jira?
>
> https://issues.apache.org/jira/browse/HADOOP-4343
>
> Doug
>

Re: Design for security in Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Amandeep Khurana wrote:
> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf

How does this relate to the current proposal in Jira?

https://issues.apache.org/jira/browse/HADOOP-4343

Doug

Re: Design for security in Hadoop

Posted by Amandeep Khurana <am...@gmail.com>.
Apparently, the attached file was stripped off. Here's the link where you
can get it:
http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Mar 17, 2009 at 5:43 PM, Amandeep Khurana <am...@gmail.com> wrote:

> Hi
>
> I've been working on security in Hadoop and have come up with a design for
> the same. I ran some basic experiments to evaluate the design. Here's the
> report for the same.
>
> Feedback/comments/discussions on this would be great.
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>