Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/10 01:59:44 UTC

[Hadoop Wiki] Update of "Hbase/HBaseTokenAuthentication" by GaryHelmling


The "Hbase/HBaseTokenAuthentication" page has been changed by GaryHelmling.
The comment on this change is: Initial draft.
http://wiki.apache.org/hadoop/Hbase/HBaseTokenAuthentication

--------------------------------------------------

New page:
= HBase Token Authentication =
While HBase security now supports Kerberos authentication for client RPC connections, this is only part of the puzzle for integration with secure Hadoop.  Kerberos authentication is only used for direct client access to HDFS.  The Hadoop MapReduce framework instead uses a DIGEST-MD5 authentication scheme, where the client is granted a signed "delegation token" and a secret "token authenticator" (the HMAC-SHA1 of the delegation token, keyed with a NameNode (NN) secret key) when a MapReduce job is submitted.  The token and authenticator are serialized to a secure location in HDFS, so that the spawned Child processes can deserialize the credentials and use them to re-authenticate to the NN as the submitting user.

Since Kerberos credentials are not available in the MapReduce task execution context, any client attempt to authenticate to HBase from a task will fail.  As a result, HBase connections will need to support an alternate authentication scheme, similar to the one used by the Hadoop MapReduce framework.

=== Goals ===
The main considerations for supporting Map``Reduce task authentication are:

 1. The implementation should avoid any changes to core Hadoop code.  Any changes in Hadoop will require a great deal more review and discussion to potentially be accepted, and would necessitate running a forked version of Hadoop for some time.
 1. Any changes should be transparent to existing map-reduce user code.  We shouldn't require any new APIs to be used for authentication, for example.
 1. Changes to the job submission process, such as using a wrapper or utility to submit map-reduce jobs, are preferable to any changes requiring user code modifications.

== HBase Authentication Tokens ==
While Hadoop user delegation tokens provide an existing means of Map``Reduce task authentication, their reliance on a secret key stored in memory on the Name``Node makes them inaccessible for authentication in HBase.  Fortunately, the Hadoop security implementation and the Map``Reduce job submission and execution code provide a generalized framework for token handling.  Building on top of this, we can provide token-based authentication from MR tasks to HBase without any core Hadoop or Map``Reduce changes.

=== Proposal: Adding an HBase user token ===
 1. extend {{{org.apache.hadoop.security.token.TokenIdentifier}}} with our own token implementation
 1. implement {{{org.apache.hadoop.security.token.SecretManager}}}
 1. master will generate a secret key for signing and authenticating tokens
   a. will need to persist somewhere (zookeeper?) to allow for master restarts and failover
   a. will need to distribute generated secret key to RS
     i. could be on region checkin/heartbeats, though stack is removing those
     i. could be distributed through zookeeper as well
 1. add a helper like {{{TableMapReduceUtil.initJob()}}} to use when submitting a new job
   a. will obtain a new token from master
   a. add token to Credentials instance
   a. normal {{{JobClient}}} code will serialize Credentials for MR job
 1. when running MR job, Credentials will be deserialized from secure location
   a. HBaseClient will look in credentials for any relevant tokens

==== Limitations ====
 1. It doesn't appear we'll be able to use the existing delegation token renewal mechanism (but do we really need token renewal?)

=== Token ===
The HBase authentication token is modeled directly after the Hadoop user delegation token.  We have dropped support for a designated renewer, however, as we will not be able to support HBase token renewal without modifying core Map``Reduce code.  The token will consist of:
 * Token``ID:
   1. Owner ID -- Username that this token will authenticate as
   1. Issue date -- timestamp (in msec) when this token was generated
   1. Expire date -- timestamp (in msec) at which this token expires
   1. Sequence -- to ensure uniqueness
 * Token``Authenticator := HMAC_SHA1(master key, Token``ID)
 * Authentication Token := (Token``ID, Token``Authenticator)
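
The token structure above can be sketched with JDK-only classes.  Class and field names here are illustrative only (this is not the actual HBase implementation), but the layout mirrors the Token``ID fields and the HMAC_SHA1 authenticator exactly as defined:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch of the token structure above -- not the actual
// HBase implementation.
public class TokenSketch {

    // Serialize the TokenID fields (owner, issue date, expire date,
    // sequence) in Writable style.
    public static byte[] serializeTokenId(String owner, long issueDate,
            long expireDate, long sequence) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(owner);
        out.writeLong(issueDate);
        out.writeLong(expireDate);
        out.writeLong(sequence);
        out.flush();
        return bytes.toByteArray();
    }

    // TokenAuthenticator := HMAC_SHA1(master key, TokenID)
    public static byte[] computeAuthenticator(byte[] masterKey, byte[] tokenId)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(masterKey, "HmacSHA1"));
        return mac.doFinal(tokenId);
    }

    public static void main(String[] args) throws Exception {
        byte[] id = serializeTokenId("alice", 1000L, 2000L, 1L);
        byte[] auth = computeAuthenticator(
                "master-secret".getBytes(java.nio.charset.StandardCharsets.UTF_8), id);
        System.out.println("authenticator bytes: " + auth.length);
    }
}
```

Note that the authenticator is deterministic: anyone holding the master key can regenerate it from the Token``ID alone, which is what makes server-side validation possible without storing per-token state.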

==== Authentication ====
HBase token authentication builds on top of DIGEST-MD5 authentication support provided by Hadoop RPC.  HBase token authentication follows the same process as Hadoop user delegation token authentication by the !NameNode:
 1. Client sends Token``ID to server
 1. Server uses Token``ID and the in-memory master secret key to regenerate Token``Authenticator
 1. Server validates Token``ID, checks for expiration
 1. Server and client then use Token``Authenticator as the shared secret to negotiate DIGEST-MD5 authentication
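
The server-side portion of the handshake (steps 2-4 above) can be sketched as follows.  This is an illustrative, JDK-only sketch, not the actual HBase {{{SecretManager}}} code:

```java
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch of server-side token validation (steps 2-4 above);
// not the actual HBase SecretManager code.
public class TokenValidatorSketch {

    // Step 2: regenerate the TokenAuthenticator from the presented
    // TokenID bytes and the in-memory master secret key.
    public static byte[] authenticator(byte[] masterKey, byte[] tokenId)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(masterKey, "HmacSHA1"));
        return mac.doFinal(tokenId);
    }

    // Step 3: reject expired tokens, then compare the regenerated
    // authenticator against the client's secret in constant time.
    public static boolean validate(byte[] masterKey, byte[] tokenId,
            long expireDate, byte[] presented, long now) throws Exception {
        if (now > expireDate) {
            return false; // token expired
        }
        byte[] expected = authenticator(masterKey, tokenId);
        // MessageDigest.isEqual avoids leaking a timing side channel
        return MessageDigest.isEqual(expected, presented);
    }
}
```

In the real flow the shared secret is not compared directly; it becomes the password for the DIGEST-MD5 SASL negotiation (step 4), so it never crosses the wire in the clear.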

==== Master Secret Key ====
Authentication relies on a secret key generated at runtime on the master and used to generate Authentication Tokens for clients.  Tokens will be generated on the master for Kerberos authenticated clients, but token based authentication will need to be allowed on all masters and region servers in a cluster.  So the master will need a means to distribute the secret key to other cluster nodes.

The master will also need to write the secret key to persistent storage in order for authentication tokens to survive a cluster restart.
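
A minimal sketch of the key lifecycle, assuming HMAC-SHA1 as the signing algorithm (per the token definition above) and Base64 as the wire/storage encoding.  Names are illustrative, not the actual HMaster implementation:

```java
import java.util.Base64;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch only: generating the master secret key at runtime
// and encoding it for persistence/distribution (e.g. via zookeeper).
public class MasterKeySketch {

    // Generate a fresh HMAC-SHA1 signing key on master startup.
    public static SecretKey generateKey() throws Exception {
        return KeyGenerator.getInstance("HmacSHA1").generateKey();
    }

    // Encode the key material as a string suitable for storing in
    // zookeeper or on disk.
    public static String encode(SecretKey key) {
        return Base64.getEncoder().encodeToString(key.getEncoded());
    }

    // Rebuild the key on a region server, or on the master after restart.
    public static SecretKey decode(String encoded) {
        return new SecretKeySpec(Base64.getDecoder().decode(encoded), "HmacSHA1");
    }
}
```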

==== Implementation ====
 1. Extend {{{org.apache.hadoop.security.token.TokenIdentifier}}} with new HBase type
 1. Implement {{{org.apache.hadoop.security.token.TokenSelector}}} to pull out HBase type tokens
 1. Extend {{{org.apache.hadoop.security.token.SecretManager}}} with implementation to generate HBase tokens.  This will be used on HMaster to generate HBase tokens, and on HRegionServer to validate tokens for authentication.

=== Map Reduce Flow ===
For all of this to work without changes to Hadoop and MapReduce code, we have two key requirements:
 1. We must be able to add our own tokens to the MR job Credentials instance at job submission time (and the job must be able to serialize our token correctly with the rest of the job info)
 1. The Child task executing on each node must deserialize our token and add it to the {{{UserGroupInformation}}} instance so it can later be picked up by the HBase client for authentication

==== Job Submission ====
 1. Add a new utility class {{{SecureMapReduceUtil}}} with a static helper method, something like {{{void initAuthentication(Job job)}}}
   a. Call Master to obtain a new authentication token for the logged in user
     * Token will only be returned if user is authenticated via Kerberos, same as HDFS
   a. Add HBase token to job credentials -- {{{job.getCredentials().addToken(Text alias, Token)}}}
     * {{{FileSystem.getCanonicalServiceName()}}} is used as the alias for HDFS delegation tokens, what should we use?
 1. {{{Job.submit()}}} is later called normally, which should serialize token with the rest of the job credentials
   a. {{{JobTracker.submitJob()}}} receives the credentials via RPC and adds them to a {{{JobInProgress}}} instance added to the job queue
   a. Scheduler will write out the tokens when the job is run.  {{{JobInProgress.initTasks()}}} -> {{{generateAndStoreTokens()}}} -> {{{Credentials.writeTokenStorageFile()}}}
   a. The serialized tokens will be written to {{{<jobdir>/jobToken}}}
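
The token handoff above can be sketched as a simple write/read roundtrip.  This is an illustrative simplification: the real flow goes through {{{Credentials.writeTokenStorageFile()}}} on submission and {{{TokenCache.loadTokens()}}} in the Child, and the on-disk format here is not the actual {{{Credentials}}} format:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the token file handoff: the job client writes
// alias -> token bytes, and the Child task later reads them back.
// Simplified format; the real flow uses Credentials/TokenCache.
public class TokenFileSketch {

    public static void write(Path file, Map<String, byte[]> tokens)
            throws IOException {
        try (DataOutputStream out =
                new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(tokens.size());
            for (Map.Entry<String, byte[]> e : tokens.entrySet()) {
                out.writeUTF(e.getKey());           // token alias
                out.writeInt(e.getValue().length);  // token length
                out.write(e.getValue());            // token bytes
            }
        }
    }

    public static Map<String, byte[]> read(Path file) throws IOException {
        Map<String, byte[]> tokens = new LinkedHashMap<>();
        try (DataInputStream in =
                new DataInputStream(Files.newInputStream(file))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                String alias = in.readUTF();
                byte[] token = new byte[in.readInt()];
                in.readFully(token);
                tokens.put(alias, token);
            }
        }
        return tokens;
    }
}
```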

==== Job Execution on Task Nodes ====
 1. On task start, {{{Child.main()}}} will read in a copy of the tokens from the local filesystem (local path passed as an environment variable), using {{{TokenCache.loadTokens()}}}
 1. Each token is added to the child task {{{UserGroupInformation}}} instance used to run the local task
 1. Any HBase connections opened by the task will inherit the same UGI
 1. A {{{TokenInfo}}} annotation on the {{{HRegionInterface}}} and {{{HMasterInterface}}} protocol interfaces identifies the HBase {{{TokenSelector}}} implementation, which is then used to extract the relevant authentication token from the UGI's credentials
 1. Using the HBase authentication token, the authentication process proceeds as above