Posted to common-dev@hadoop.apache.org by larry mccay <lm...@apache.org> on 2013/09/04 21:06:11 UTC

[DISCUSS] Security Efforts and Branching

Hello Kai, Jerry and common-dev'ers -

I would like to try and get a game plan together for how we go about
getting some of these larger security changes into branches that are
manageable, reviewable and ultimately mergeable in a timely manner.

In order to even start this discussion, I think we need an inventory of the
high level projects that are underway in parallel. We can then identify
those that are at the point where patches can be used to seed a branch.
This will give us some insight into how to break it into phases.

Off the top of my head, I can think of the following high level efforts:

1. Pluggable Authentication and Token based SSO
2. CryptoFS for volume level encryption
3. Hive Table/Column Level Encryption (admittedly this is Hive work but it
will leverage common work done in Hadoop)
4. Authorization

Now, #1 and #2 above have related Jiras and a number of patches available
and are therefore early contenders for branching.

#1 has a draft for an initial iteration that was discussed in another
thread and I will attach a pdf version of the iteration-1 proposal to this
mail.

I propose that we converge on an initial plan based on further discussion
of the attached iteration and file a Jira to represent that iteration. We
can then break down the larger patches on existing Jiras to fit into the
constrained scope of the agreed upon iteration and attach them to subtasks
of the iteration Jira.

We can then seed a Pluggable Authentication and Token based SSO branch with
those related patches from H-9392, H-9534, H-9781.

Now, whether we introduce a whole central SSO service in that branch is up
for discussion but I personally think that it will violate the "keeping it
small and manageable" goal. I am wondering whether a branch for security
services would do well to decouple the consumers from a specific
implementation that happens to be remote. Then within the Pluggable
Authentication branch - we can concentrate on the consumer level and local
implementations.

I assume that the CryptoFS work is also intended to be done within the
branches and we have to therefore consider how to leverage common code for
things like key access for encryption/decryption and signing/verifying.
This sort of thing is being introduced by H-9534 as part of the Pluggable
Authentication branch in support of JWT tokens. So, we will have to think
through what branches are required for Crypto in the near term.

Perhaps, we can concentrate on those portions of crypto that will be of
immediate benefit to iteration-1 and leave higher order CryptoFS stuff to
another iteration? I don't think that we want an explosion of branches at
any given time. If we can limit it to specific areas, close down on the
iteration, and get it merged before creating a new set of branches, that
would be best. Again, ease of review, test and merge is important for us.

I am curious how development across related branches like these would work
though. If the service work needs to leverage work from the other, how do we
do that easily? Can we branch a branch? Will that require both to be ready
to merge at the same time?

Perhaps, low-level dependencies can be duplicated for some time and then
consolidated later?

Anyway, specific questions:

Does the proposal to start with the attached iteration-1 draft to create an
iteration Jira make sense to everyone?

Does anyone have specific suggestions regarding the best way for managing
branches that should be decoupled but at the same time leverage common code?

Any other thoughts or insight?

thanks,

--larry

RE: [DISCUSS] Security Efforts and Branching

Posted by "Zheng, Kai" <ka...@intel.com>.
Larry, and all

Apologies for not responding sooner. I have read your proposals and thought about how to collaborate well and speed things up for all of us. From the community discussions around the Hadoop Summit, TokenAuth should be a pluggable full stack that accommodates different implementations. HADOOP-9392 reflects that thinking and came up with the breakdown attached to the JIRA. To simplify the discussion, I will try to illustrate it here at a very high level as follows.

Simply we would have:
TokenAuth = TokenAuth framework + TokenAuth implementation (HAS) + TokenAuth integration

= TokenAuth framework =
It first defines TokenAuth as the desired pluggable framework that defines and provides the required APIs, protocols, flows, and facilities, along with common implementations of related constructs, entities, and even services. The framework is subject to continued discussion and should be defined as a common effort of the community. It is important that the framework be pluggable in all the key places so that particular solutions can supply their own product-level implementations. Based on this framework, we could build the HAS implementation. Initially, we have the following items to think about in order to define the relevant APIs and provide core facilities for the framework; the list is expected to grow. (A minimal sketch of a common token definition follows the list.)
1. Common token definition;
2. TokenAuthn method for Hadoop RPC;
3. Authentication Service;
4. Identity Token Service;
5. Access Token Service;
6. Fine grained authorization;
7. Attribute Service;
8. Token authentication client;
9. Token cache;
10. Common configuration across TokenAuth;
11. Hadoop token command;
12. Key Provider;
13. Web SSO support;
14. REST SSO support;
15. Auditing support.
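
As a strawman for item 1, a common token definition could be as small as the following interface. All names and methods here are hypothetical illustrations for discussion, not APIs from HADOOP-9392 or HADOOP-9533:

    // Hypothetical sketch of a common token API for the TokenAuth framework.
    // All names and signatures here are illustrative only.
    import java.util.Map;

    public interface SecurityToken {

        enum TokenType { IDENTITY, ACCESS }

        TokenType getType();
        String getSubject();                  // the authenticated principal
        String getIssuer();                   // e.g. the issuing AuthenticationServer
        long getExpiration();                 // epoch millis after which the token is invalid
        Map<String, String> getAttributes();  // extension point for later iterations
        byte[] toBytes();                     // wire form, to be signed by the issuer
    }

Keeping identity and access tokens behind one interface like this would let the Key Provider and signing facilities treat both uniformly.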

= TokenAuth implementation (HAS) =
This defines and implements the Hadoop AuthN/AuthZ Server (HAS) based on the TokenAuth framework. HAS is a centralized server that addresses AAA (Authentication, Authorization, Auditing) concerns for Hadoop across the ecosystem. The 'A' of HAS could stand for "Authentication", "Authorization", or "Auditing", depending on which role(s) HAS is provisioned with. HAS is a complete, enterprise-ready security solution based on the TokenAuth framework that utilizes the common facilities the framework provides. It customizes and provides all the necessary implementations of the constructs, entities, and services defined in the framework that enterprise deployments require. Initially we have the following for the implementation:
1. Provide common and management facilities including configuration loading/syncing mechanism, auditing and logging support, shared high availability approach, REST support and so on;
2. Implement the Authentication Server role for HAS, implementing the Authentication Service and Identity Token Service defined in the framework. The authentication engine can be configured with a chain of authentication modules to support multi-factor authentication (a sketch of such a chain follows this list). In particular, it will support LDAP authentication;
3. Implement Authorization Server role for HAS, implementing Access Token Service;
4. Implement centralized administration for fine-grained authorization for Authorization Server role. Optional in initial iteration;
5. Implement Attribute Service for HAS, to allow integration of third party attribute authorities. Optional in initial iteration.
6. Provide an authorization enforcement library for Hadoop services to enforce security policies, utilizing the related services provided by the Authorization Server. Optional in initial iteration.
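
As a rough illustration of the module chain in item 2, the engine could require every configured factor to succeed. The names below (AuthenticationChain, Module) are made up for this sketch and are not defined by either JIRA:

    // Hypothetical sketch of a chained, multi-factor authentication engine.
    import java.util.List;
    import java.util.Map;

    public class AuthenticationChain {

        /** One pluggable factor: an LDAP bind, an OTP check, and so on. */
        public interface Module {
            boolean authenticate(Map<String, Object> credentials);
        }

        private final List<Module> modules;

        public AuthenticationChain(List<Module> modules) {
            this.modules = modules;
        }

        /** Multi-factor: every configured module must succeed. */
        public boolean authenticate(Map<String, Object> credentials) {
            if (modules.isEmpty()) {
                return false;
            }
            for (Module module : modules) {
                if (!module.authenticate(credentials)) {
                    return false;
                }
            }
            return true;
        }
    }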

= TokenAuth integration =
This includes tasks that employ the TokenAuth framework and the relevant implementation(s) to enable support for various Hadoop components across the ecosystem in typical enterprise deployments. Currently we have the following in mind:
1. Enable Web SSO flow for web interfaces like HDFS and YARN;
2. Enable REST SSO flow for REST interface like Oozie;
3. Add Thrift and Hive JDBC support using TokenAuth. We consider this support because it is an important interface for enterprises to interact with data;
4. Enable access to ZooKeeper using TokenAuth, since it is widely used as the coordinator across the ecosystem.

I regard decoupling the pluggable framework from any specific implementation as important: we are addressing similar requirements, but we have different implementation considerations in approaches like the ones represented by HADOOP-9392 and HADOOP-9533. For example, to support pluggable authentication HADOOP-9392 prefers JAAS based authentication modules while HADOOP-9533 suggests using Apache Shiro. With this decoupling we could best collaborate and contribute. As far as I understood, you might agree with this approach, as can be seen in your recent email: "decouple the pluggable framework from any specific central server implementation". If I understood you correctly, do you think for the initial iteration we have to have two central servers, a HAS server and an HSSO server? If not, do you think it works for us to have HAS as a community effort alongside the TokenAuth framework, with both of us contributing to the implementation?

To proceed, I will try to align our two proposals, complementing yours and addressing your concerns as follows.

= Iteration Endstate =
Besides what you mentioned from the user's view, how about adding this consideration:
Additionally, the initial iteration would also lay down the ground TokenAuth framework with well defined APIs, protocols, flows, and core facilities for implementations. The framework should avoid rework and major changes for future implementations.

= Terminology and Naming =
It would be great if we can unify the related terminology in this effort, at least at the framework level. This could probably be achieved in the process of defining the relevant APIs for the TokenAuth framework.

= Project scope =
It's great we have the common in-scope list for the first iteration, as you mentioned:
Usecases:
client types: REST, CLI, UI
authentication types: Simple, Kerberos, authentication/LDAP, federation/SAML

We might also consider OAuth 2.0 support. In any case, please note that by defining this in-scope list we record what is required as must-have in the iteration, as enforcement of our consensus; however, it should not prevent any relevant parties from contributing more in the meantime, unless it is not appropriate at the time.

= Branch =
As you mentioned, we may have different branches for different features, with merging in mind. Another approach is just having one branch with the relevant security features; the review and merge work can still be JIRA based.

1. Based on your proposal, how about the following as the branch(es) scope:
1)  Pluggable Authentication and Token based SSO
2)  CryptoFS for volume level encryption (HCFS)
3) Pluggable UGI change
4) Key management system
5) Unified authorization

2. With the above scope in mind, a candidate branch name could be 'security-branch' instead of 'tokenauth-branch'. How about creating the branch now if we don't have other concerns?

3. Check-in philosophy. Agree with your proposal, with slight concerns:
In terms of check-in philosophy, we should take a review then check-in approach to the branch with lazy consensus - wherein we do not need to explicitly +1 every check-in to the branch but we will honor any -1's with discussion to resolve before checking in. This will provide us each with the opportunity to track the work being done and ensure that we understand it and find that it meets the intended goals.

We might need explicit +1s; otherwise we would need to define a pending time window to wait before checking in.
One issue we would like to clarify: does voting also include the security branch committers?

= JIRA =
We might not need an additional umbrella JIRA for now since we already have HADOOP-9392 and HADOOP-9533. By the way, I would suggest we use the existing feature JIRAs to discuss relevant and specific issues as we go. Leveraging these JIRAs we might avoid too much detail in the common-dev thread, and it is also easy to track the relevant discussions.

I agree it's a good point to start with an inventory of the existing JIRAs. We can do that if there are no other concerns. We will then provide the full list of breakdown JIRAs and attach it to HADOOP-9392 for further collaboration.

Regards,
Kai

From: larry mccay [mailto:larry.mccay@gmail.com]
Sent: Wednesday, September 18, 2013 6:27 AM
To: Zheng, Kai; Chen, Haifeng; common-dev@hadoop.apache.org
Subject: Re: [DISCUSS] Security Efforts and Branching

All -

I apologize for not following up sooner. I have been heads down on some other matters that required my attention.

It seems that it may be easier to move forward by gaining consensus a little bit at a time rather than trying to hit the ground running where the other thread left off.

Would it be agreeable to everyone to start with an inventory of the existing Jiras that have patches available or nearly available so that we can determine what concrete bits we have to start with?

Once we get that done, we can try and frame a set of goals to make up the initial iteration and determine what from the inventory will be leveraged in that iteration.

Does this sound reasonable to everyone?
Would anyone like to propose another starting point?

thanks,

--larry

On Wed, Sep 4, 2013 at 4:26 PM, larry mccay <la...@gmail.com> wrote:
It doesn't look like the PDF made it all the way through to the archives and maybe even to recipients - so the following is the text version of the iteration-1 draft:

Iteration 1: Pluggable User Authentication and Federation

Introduction
The intent of this effort is to bootstrap the development of pluggable token-based authentication mechanisms to support certain goals of enterprise authentication integrations. By restricting the scope of this effort, we hope to provide immediate benefit to the community while keeping the initial contribution to a manageable size that can be easily reviewed, understood and extended with further development through follow up JIRAs and related iterations.

Iteration Endstate
Once complete, this effort will have extended the authentication mechanisms - for all client types - from the existing Simple, Kerberos and Plain (for RPC) to include LDAP authentication and SAML based federation. In addition, users will be able to plug in additional/custom authentication mechanisms of their own preference.

Project Scope
The scope of this effort is a subset of the features covered by the overviews of HADOOP-9392 and HADOOP-9533. This effort concentrates on enabling Hadoop to issue and accept/validate SSO tokens of its own. The pluggable authentication mechanism within the SASL/RPC layer and the authentication filter pluggability for REST and UI components will be leveraged and extended to support the results of this effort.

Out of Scope
In order to scope the initial deliverable as a minimum viable product, a handful of things have been simplified or left out of scope for this effort. This is not to say that these aspects are not useful or not needed, but that they are not necessary for this iteration. We do, however, need to ensure that we don't do anything to preclude adding them in future iterations.
1. Additional Attributes - the result of authentication will continue to use the existing hadoop tokens and identity representations. Additional attributes used for finer grained authorization decisions will be added through follow-up efforts.
2. Token revocation - the ability to revoke issued identity tokens will be added later
3. Multi-factor authentication - this will likely require additional attributes and is not necessary for this iteration.
4. Authorization changes - we will require additional attributes for the fine-grained access control plans. This is not needed for this iteration.
5. Domains - we assume a single flat domain for all users
6. Kinit alternative - we can leverage existing REST clients such as cURL to retrieve tokens through authentication and federation for the time being
7. A specific authentication framework isn't really necessary within the REST endpoints for this iteration. If one is available then we can use it; otherwise we can leverage existing things like Apache Shiro within a servlet filter (a minimal sketch follows this list).
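
As a rough illustration of item 7, here is a minimal sketch of BASIC authentication handled by Apache Shiro inside a plain servlet filter. Shiro environment setup (the SecurityManager and realm configuration) and most error handling are omitted, and the class name is made up:

    // Minimal sketch: BASIC credentials checked via Apache Shiro inside a
    // plain servlet filter. Shiro environment setup and error handling omitted.
    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.commons.codec.binary.Base64;
    import org.apache.shiro.SecurityUtils;
    import org.apache.shiro.authc.AuthenticationException;
    import org.apache.shiro.authc.UsernamePasswordToken;

    public class ShiroBasicAuthFilter implements Filter {

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;

            String header = request.getHeader("Authorization");
            if (header == null || !header.startsWith("Basic ")) {
                response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
                return;
            }
            // Decode the "user:password" payload of the BASIC header.
            String[] creds = new String(
                Base64.decodeBase64(header.substring(6)), "UTF-8").split(":", 2);
            if (creds.length != 2) {
                response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
                return;
            }
            try {
                // The configured Shiro realm (e.g. an LDAP realm) authenticates.
                SecurityUtils.getSubject().login(new UsernamePasswordToken(creds[0], creds[1]));
                chain.doFilter(req, res);
            } catch (AuthenticationException e) {
                response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
            }
        }

        @Override public void init(FilterConfig config) {}
        @Override public void destroy() {}
    }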

In Scope
What is in scope for this effort is defined by the usecases described below. Components required for supporting the usecases are summarized for each client type. Each component is a candidate for a JIRA subtask - though multiple components are likely to be included in a JIRA to represent a set of functionality rather than individual JIRAs per component.

Terminology and Naming
The terms and names of components within this document are merely descriptive of the functionality that they represent. Any similarity to, or difference from, names or terms found in other documents is not intended to make any statement about those other documents or the descriptions within them. This document represents the pluggable authentication mechanisms and server functionality required to replace Kerberos.

Ultimately, the naming of the implementation classes will be a product of the patches accepted by the community.

Usecases:
client types: REST, CLI, UI
authentication types: Simple, Kerberos, authentication/LDAP, federation/SAML

Simple and Kerberos
Simple and Kerberos usecases continue to work as they do today. Authentication/LDAP and Federation/SAML are added through the existing pluggability points, either as they are or with required extensions. Either way, continued support for Simple and Kerberos must not require changes to existing deployments in the field as a result of this effort.

REST
USECASE REST-1 Authentication/LDAP:
For REST clients (a client-side sketch follows this usecase), we will provide the ability to:
1. use cURL to Authenticate via LDAP through an IdP endpoint exposed by an AuthenticationServer instance via REST calls to:
   a. authenticate - passing username/password returning a hadoop id_token
   b. get-access-token - from the TokenGrantingService by passing the hadoop id_token as an Authorization: Bearer token along with the desired service name (master service name) returning a hadoop access token
2. Successfully invoke a hadoop service REST API passing the hadoop access token through an HTTP header as an Authorization Bearer token
   a. validation of the incoming token on the service endpoint is accomplished by an SSOAuthenticationHandler
3. Successfully block access to a REST resource when presenting a hadoop access token intended for a different service
   a. validation of the incoming token on the service endpoint is accomplished by an SSOAuthenticationHandler
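
To make the REST-1 flow concrete, here is a rough client-side sketch in plain Java. The endpoint paths (/authenticate, /get-access-token on an AuthenticationServer at authserver:8443), the parameter names, and the assumption that the token comes back as the first line of the response body are all illustrative placeholders, not anything specified by this draft:

    // Sketch of the REST-1 flow from a client's point of view.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RestAuthFlow {

        static String post(String endpoint, String authHeader, String body) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            if (authHeader != null) {
                conn.setRequestProperty("Authorization", authHeader);
            }
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(body.getBytes("UTF-8"));
            out.close();
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String token = in.readLine();  // assume the token is the first response line
            in.close();
            return token;
        }

        public static void main(String[] args) throws Exception {
            // 1a. authenticate: username/password in, hadoop id_token out.
            String idToken = post("https://authserver:8443/authenticate", null,
                "username=guest&password=secret");
            // 1b. get-access-token: id_token in (as a Bearer token), access token out.
            String accessToken = post("https://authserver:8443/get-access-token",
                "Bearer " + idToken, "service=webhdfs");
            // 2. call the service REST API with the access token; the
            //    SSOAuthenticationHandler on the service side validates it.
            System.out.println("Authorization: Bearer " + accessToken);
        }
    }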

USECASE REST-2 Federation/SAML:
We will also provide federation capabilities for REST clients such that:
1. acquire a SAML assertion token from a trusted IdP (shibboleth?) and persist it in a permissions protected file - i.e. ~/.hadoop_tokens/.idp_token
2. use cURL to Federate a token from a trusted IdP through an SP endpoint exposed by an AuthenticationServer(FederationServer?) instance via REST calls to:
   a. federate - passing a SAML assertion as an Authorization: Bearer token returning a hadoop id_token
      - can copy and paste from the command line or use cat to include the persisted token, e.g. -H "Authorization: Bearer $(cat ~/.hadoop_tokens/.idp_token)"
   b. get-access-token - from the TokenGrantingService by passing the hadoop id_token as an Authorization: Bearer token along with the desired service name (master service name), returning a hadoop access token
3. Successfully invoke a hadoop service REST API passing the hadoop access token through an HTTP header as an Authorization Bearer token
   a. validation of the incoming token on the service endpoint is accomplished by an SSOAuthenticationHandler
4. Successfully block access to a REST resource when presenting a hadoop access token intended for a different service
   a. validation of the incoming token on the service endpoint is accomplished by an SSOAuthenticationHandler
REQUIRED COMPONENTS for REST USECASES:
COMP-1. REST client - cURL or similar
COMP-2. REST endpoint for BASIC authentication to LDAP - IdP endpoint example - returning hadoop id_token
COMP-3. REST endpoint for federation with SAML Bearer token - shibboleth SP?|OpenSAML? - returning hadoop id_token
COMP-4. REST TokenGrantingServer endpoint for acquiring hadoop access tokens from hadoop id_tokens
COMP-5. SSOAuthenticationHandler to validate incoming hadoop access tokens
COMP-6. some source of a SAML assertion - shibboleth IdP?
COMP-7. hadoop token and authority implementations
COMP-8. core services for crypto support for signing, verifying and PKI management (a JDK-based signing/verifying sketch follows)
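
For COMP-8, the signing and verifying pieces can be sketched with nothing but the JDK's java.security API; key distribution and PKI management are the hard parts and are not shown here:

    // Sketch of COMP-8: sign a token's wire form with the issuer's private
    // key and verify it with the corresponding public key.
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.PrivateKey;
    import java.security.PublicKey;
    import java.security.Signature;

    public class TokenSigner {

        public static byte[] sign(byte[] tokenBytes, PrivateKey key) throws Exception {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);
            signer.update(tokenBytes);
            return signer.sign();
        }

        public static boolean verify(byte[] tokenBytes, byte[] sig, PublicKey key) throws Exception {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key);
            verifier.update(tokenBytes);
            return verifier.verify(sig);
        }

        public static void main(String[] args) throws Exception {
            // In a real deployment the keys come from PKI management, not from here.
            KeyPair pair = KeyPairGenerator.getInstance("RSA").generateKeyPair();
            byte[] token = "id_token:guest:1380000000000".getBytes("UTF-8");
            byte[] sig = sign(token, pair.getPrivate());
            System.out.println(verify(token, sig, pair.getPublic())); // true
        }
    }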

CLI
USECASE CLI-1 Authentication/LDAP:
For CLI/RPC clients, we will provide the ability to:
1. use cURL to Authenticate via LDAP through an IdP endpoint exposed by an AuthenticationServer instance via REST calls to:
   a. authenticate - passing username/password returning a hadoop id_token
      - for RPC clients we need to persist the returned hadoop identity token in a file protected by fs permissions so that it may be leveraged until expiry (see the sketch after this usecase)
      - directing the returned response to a file may suffice for now, something like ">~/.hadoop_tokens/.id_token"
2. use hadoop CLI to invoke RPC API on a specific hadoop service
   a. the RPC client negotiates a TokenAuth method through the SASL layer; the hadoop id_token is retrieved from ~/.hadoop_tokens/.id_token and passed as an Authorization: Bearer token to the get-access-token REST endpoint exposed by the TokenGrantingService, returning a hadoop access token
   b. RPC server side validates the presented hadoop access token and continues to serve request
   c. Successfully invoke a hadoop service RPC API
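
A small sketch of the token persistence described in step 1, assuming the token is stored one-per-file under ~/.hadoop_tokens as the draft suggests; the owner-only permission calls stand in for "protected by fs permissions":

    // Sketch of client-side token persistence for CLI/RPC clients.
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;

    public class TokenFile {

        static final File TOKEN = new File(System.getProperty("user.home"),
            ".hadoop_tokens/.id_token");

        public static void save(String idToken) throws Exception {
            TOKEN.getParentFile().mkdirs();
            FileWriter out = new FileWriter(TOKEN);
            out.write(idToken);
            out.close();
            // Restrict the file to the owner: the "fs permissions" part.
            TOKEN.setReadable(false, false);
            TOKEN.setReadable(true, true);
            TOKEN.setWritable(false, false);
            TOKEN.setWritable(true, true);
        }

        public static String load() throws Exception {
            BufferedReader in = new BufferedReader(new FileReader(TOKEN));
            String token = in.readLine();
            in.close();
            return "Bearer " + token;  // ready to use as an Authorization header
        }
    }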

USECASE CLI-2 Federation/SAML:
For CLI/RPC clients, we will provide the ability to:
1. acquire a SAML assertion token from a trusted IdP (shibboleth?) and persist it in a permissions protected file - i.e. ~/.hadoop_tokens/.idp_token
2. use cURL to Federate a token from a trusted IdP through an SP endpoint exposed by an AuthenticationServer(FederationServer?) instance via REST calls to:
   a. federate - passing a SAML assertion as an Authorization: Bearer token returning a hadoop id_token
      - can copy and paste from the command line or use cat to include the previously persisted token, e.g. -H "Authorization: Bearer $(cat ~/.hadoop_tokens/.idp_token)"
3. use hadoop CLI to invoke RPC API on a specific hadoop service
   a. the RPC client negotiates a TokenAuth method through the SASL layer; the hadoop id_token is retrieved from ~/.hadoop_tokens/.id_token and passed as an Authorization: Bearer token to the get-access-token REST endpoint exposed by the TokenGrantingService, returning a hadoop access token
   b. RPC server side validates the presented hadoop access token and continues to serve request
   c. Successfully invoke a hadoop service RPC API

REQUIRED COMPONENTS for CLI USECASES - (beyond those required for REST):
COMP-9. TokenAuth Method negotiation, etc
COMP-10. Client side implementation to leverage REST endpoint for acquiring hadoop access tokens given a hadoop id_token
COMP-11. Server side implementation to validate incoming hadoop access tokens (a validation sketch follows)
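
For COMP-11, a rough sketch of what server-side validation could look like. The token wire format used here (user:service:expiry:signature) is entirely made up for illustration, and a real SSOAuthenticationHandler would sit behind hadoop-auth's AuthenticationHandler contract rather than being a plain class:

    // Sketch of COMP-11: server-side validation of an incoming access token.
    import javax.servlet.http.HttpServletRequest;

    public class AccessTokenValidator {

        private final String serviceName;  // the service this endpoint represents

        public AccessTokenValidator(String serviceName) {
            this.serviceName = serviceName;
        }

        /** Returns the authenticated user, or null if the token is unacceptable. */
        public String validate(HttpServletRequest request) {
            String header = request.getHeader("Authorization");
            if (header == null || !header.startsWith("Bearer ")) {
                return null;
            }
            String[] parts = header.substring("Bearer ".length()).split(":");
            if (parts.length != 4) {
                return null;
            }
            String user = parts[0], service = parts[1];
            long expiry = Long.parseLong(parts[2]);
            // Reject tokens intended for other services (the blocked-access
            // usecases above) and expired tokens; the signature check that
            // would use COMP-8's verify() is omitted here.
            if (!serviceName.equals(service) || System.currentTimeMillis() > expiry) {
                return null;
            }
            return user;
        }
    }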

UI
Various Hadoop services have their own web UI consoles for administration and end user interactions. These consoles need to benefit from the pluggability of authentication mechanisms as well, to be on par with the access control of the cluster REST and RPC APIs.
Web consoles are protected with a WebSSOAuthenticationHandler, which will be configured for either authentication or federation.

USECASE UI-1 Authentication/LDAP:
For the authentication usecase (a sketch of the interception step follows this flow):
1. User's browser requests access to a UI console page
2. WebSSOAuthenticationHandler intercepts the request and redirects the browser to an IdP web endpoint exposed by the AuthenticationServer passing the requested url as the redirect_url
3. IdP web endpoint presents the user with a FORM over https
   a. user provides username/password and submits the FORM
4. AuthenticationServer authenticates the user with provided credentials against the configured LDAP server and:
   a. leverages a servlet filter or other authentication mechanism for the endpoint and authenticates the user with a simple LDAP bind with username and password
   b. acquires a hadoop id_token and uses it to acquire the required hadoop access token which is added as a cookie
   c. redirects the browser to the original service UI resource via the provided redirect_url
5. WebSSOAuthenticationHandler for the original UI resource interrogates the incoming request for an authcookie that contains an access token and, upon finding one:
   a. validates the incoming token
   b. returns the AuthenticationToken as per AuthenticationHandler contract
   c. AuthenticationFilter adds the hadoop auth cookie with the expected token
   d. serves requested resource for valid tokens
   e. subsequent requests are handled by the AuthenticationFilter recognition of the hadoop auth cookie
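
A minimal sketch of the interception step (2) and the cookie check (5), using only the servlet API. The cookie name, parameter name, and IdP address are placeholders, and real token validation is elided:

    // Sketch of the WebSSOAuthenticationHandler interception step: if the
    // request carries no auth cookie, redirect the browser to the IdP login
    // endpoint with the original URL as the redirect_url.
    import java.net.URLEncoder;
    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class WebSsoRedirect {

        static final String IDP_LOGIN_URL = "https://authserver:8443/idp/login";
        static final String AUTH_COOKIE = "hadoop.auth";

        public static boolean ensureAuthenticated(HttpServletRequest request,
                HttpServletResponse response) throws Exception {
            Cookie[] cookies = request.getCookies();
            if (cookies != null) {
                for (Cookie c : cookies) {
                    // A real handler validates the token inside the cookie here.
                    if (AUTH_COOKIE.equals(c.getName())) {
                        return true;  // already authenticated, serve the resource
                    }
                }
            }
            // Step 2 of the flow: send the browser to the IdP FORM login,
            // remembering where it originally wanted to go.
            String original = request.getRequestURL().toString();
            response.sendRedirect(IDP_LOGIN_URL + "?redirect_url="
                + URLEncoder.encode(original, "UTF-8"));
            return false;
        }
    }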

USECASE UI-2 Federation/SAML:
For the federation usecase:
1. User's browser requests access to a UI console page
2. WebSSOAuthenticationHandler intercepts the request and redirects the browser to an SP web endpoint exposed by the AuthenticationServer passing the requested url as the redirect_url. This endpoint:
   a. is dedicated to redirecting to the external IdP, passing the required parameters, which may include a redirect_url back to itself, as well as encoding the original redirect_url so that it can recover it on the way back to the client
3. the IdP:
   a. challenges the user for credentials and authenticates the user
   b. creates appropriate token/cookie and redirects back to the AuthenticationServer endpoint
4. AuthenticationServer endpoint:
   a. extracts the expected token/cookie from the incoming request and validates it
   b. creates a hadoop id_token
   c. acquires a hadoop access token for the id_token
   d. creates appropriate cookie and redirects back to the original redirect_url - being the requested resource
5. WebSSOAuthenticationHandler for the original UI resource interrogates the incoming request for an authcookie that contains an access token and, upon finding one:
   a. validates the incoming token
   b. returns the AuthenticationToken as per the AuthenticationHandler contract
   c. AuthenticationFilter adds the hadoop auth cookie with the expected token
   d. serves requested resource for valid tokens
   e. subsequent requests are handled by the AuthenticationFilter recognition of the hadoop auth cookie
REQUIRED COMPONENTS for UI USECASES:
COMP-12. WebSSOAuthenticationHandler
COMP-13. IdP Web Endpoint within AuthenticationServer for FORM based login
COMP-14. SP Web Endpoint within AuthenticationServer for 3rd party token federation

On Wed, Sep 4, 2013 at 3:06 PM, larry mccay <lm...@apache.org> wrote:
Hello Kai, Jerry and common-dev'ers -

I would like to try and get a game plan together for how we go about getting some of these larger security changes into branches that are manageable, reviewable and ultimately mergeable in a timely manner.

In order to even start this discussion, I think we need an inventory of the high level projects that are underway in parallel. We can then identify those that are at the point where patches can be used to seed a branch. This will give us some insight into how to break it into phases.

Off the top of my head, I can think of the following high level efforts:

1. Pluggable Authentication and Token based SSO
2. CryptoFS for volume level encryption
3. Hive Table/Column Level Encryption (admittedly this is Hive work but it will leverage common work done in Hadoop)
4. Authorization

Now, #1 and #2 above have related Jiras and a number of patches available and are therefore early contenders for branching.

#1 has a draft for an initial iteration that was discussed in another thread and I will attach a pdf version of the iteration-1 proposal to this mail.

I propose that we converge on an initial plan based on further discussion of the attached iteration and file a Jira to represent that iteration. We can then break down the larger patches on existing Jiras to fit into the constrained scope of the agreed upon iteration and attach them to subtasks of the iteration Jira.

We can then seed a Pluggable Authentication and Token based SSO branch with those related patches from H-9392, H-9534, H-9781.

Now, whether we introduce a whole central SSO service in that branch is up for discussion but I personally think that it will violate the "keeping it small and manageable" goal. I am wondering whether a branch for security services would do well to decouple the consumers from a specific implementation that happens to be remote. Then within the Pluggable Authentication branch - we can concentrate on the consumer level and local implementations.

I assume that the CryptoFS work is also intended to be done within the branches and we have to therefore consider how to leverage common code for things like key access for encryption/decryption and signing/verifying. This sort of thing is being introduced by H-9534 as part of the Pluggable Authentication branch in support of JWT tokens. So, we will have to think through what branches are required for Crypto in the near term.

Perhaps, we can concentrate on those portions of crypto that will be of immediate benefit to iteration-1 and leave higher order CryptoFS stuff to another iteration? I don't think that we want an explosion of branches at any given time. If we can limit it to specific areas, close down on the iteration, and get it merged before creating a new set of branches, that would be best. Again, ease of review, test and merge is important for us.

I am curious how development across related branches like these would work though. If the service work needs to leverage work from the other, how do we do that easily? Can we branch a branch? Will that require both to be ready to merge at the same time?

Perhaps, low-level dependencies can be duplicated for some time and then consolidated later?

Anyway, specific questions:

Does the proposal to start with the attached iteration-1 draft to create an iteration Jira make sense to everyone?

Does anyone have specific suggestions regarding the best way for managing branches that should be decoupled but at the same time leverage common code?

Any other thoughts or insight?

thanks,

--larry




