Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/06 22:41:22 UTC

[Pig Wiki] Update of "Howl/HowlAuthentication" by AlanGates

http://wiki.apache.org/pig/Howl/HowlAuthentication

--------------------------------------------------

New page:
This page lists use cases for authentication related to Howl and attempts to outline the changes required to enable those use cases.

== Background and terminology ==

The Hadoop Security (!HadoopS) release uses Kerberos to provide authentication. On a secure cluster, the cluster servers (Namenode (nn), Jobtracker (jt), datanode, tasktracker) are themselves Kerberos (service) principals, end users are user principals, and users and these services mutually authenticate to each other using Kerberos tickets. !HadoopS uses security tokens called "delegation tokens" (these are NOT Kerberos tickets but a Hadoop-specific security token) to authenticate the map/reduce tasks. At job submission time, once the job client has used the user's Kerberos ticket to authenticate to the namenode and jobtracker, the namenode hands it delegation tokens that the tasks can use to talk to the namenode. These delegation tokens are stored in the "credential store" for the job, and the jobtracker automatically renews them for the job up to a maximum lifetime of 7 days.
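As a concrete illustration, here is a minimal sketch of this token hand-off using Hadoop's map/reduce security APIs (class and method names may vary across Hadoop versions; the input path is illustrative):

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class TokenExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "example");
    // Fetch delegation tokens for the namenodes owning these paths and add
    // them to the job's Credentials (the "credential store"); the jobtracker
    // renews them for the job, up to the 7-day maximum lifetime.
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] { new Path("/user/alice/input") }, job.getConfiguration());
    // ... configure mapper/reducer/input/output, then job.submit();
  }
}
}}}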

=== Oozie use case ===
Oozie is a service which users use to submit jobs to the !HadoopS cluster. It somewhat resembles the Howl server, since the Howl server also needs to act on behalf of users while accessing the !DFS. Users authenticate to oozie, and the oozie service then acts on behalf of the user while working with the jobtracker or namenode. For this to work, both the namenode and jobtracker need to recognize the "oozie" principal as a "proxy user" principal (i.e. a principal that can act on behalf of other users). In addition, the namenode and jobtracker need to know the possible IPs for the proxy user service and the list of users or groups (i.e. all users belonging to the group would be allowed) on whose behalf the oozie principal can act. This proxy user list and associated information is maintained in a configuration read by the namenode and jobtracker. Once the user authenticates to oozie, oozie authenticates itself to the nn/jt using the oozie principal and uses !UserGroupInformation.doAs() to obtain a !JobClient object associated with the real user (it needs the real username for the doAs(), which it gets from the user authentication). Through this process, oozie adds delegation tokens (actually the !JobClient code does this in a subsequent submitJob()) for the jt and primary nn into the new !JobClient to pass on to the launcher map task for the Pig/MR job. If the Pig script/MR job needs to access more than the primary namenode, an oozie parameter should be used to specify the list of nns that need to be accessed, and oozie will get delegation tokens for all of them through the jobclient.
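A minimal sketch of this proxy-user pattern, assuming the service (e.g. "oozie") has already logged in as its own Kerberos principal via a keytab; the Hadoop classes are real, the helper method is hypothetical:

{{{
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
  // Hypothetical helper: build a JobClient that acts as the real user.
  public static JobClient jobClientFor(String realUserName, final JobConf conf)
      throws Exception {
    // The service's own login (its Kerberos principal) is the "real user"
    // of the proxy relationship; realUserName is the end user.
    UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
        realUserName, UserGroupInformation.getLoginUser());
    return proxyUgi.doAs(new PrivilegedExceptionAction<JobClient>() {
      public JobClient run() throws Exception {
        // Created inside doAs(), so this JobClient is associated with the
        // real user; a later submitJob() adds the jt/nn delegation tokens.
        return new JobClient(conf);
      }
    });
  }
}
}}}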

== Changes required in Howl ==
   * Howl server will need to run as a proxy user principal. So at deployment time, the configuration of the nn and jt will need to be updated to recognize the "howl" principal as a "proxy user" principal (see the configuration sketch after this list). A "howl" net group (similar to oozie) will be needed, and all users who want to use Howl will need to add themselves to the "howl" group.
   * Howl server will also need to hand out delegation tokens (like the nn) so that the output committer task can use them to authenticate to the Howl server to "publish" partitions. Apart from the output committer, oozie will also request Howl delegation tokens and hand them to the corresponding Pig/mapred jobs.
   * End users of Howl using Pig/Hive/Map Reduce/Howl cli (and not using oozie) would authenticate to Howl using Kerberos tickets in the thrift api calls. As noted in the point above, the output committer task would authenticate to the Howl server using the Howl delegation token in the publish_partition api call. So the thrift calls need to support both Kerberos based and delegation token based authentication. '''There should be a property which is honored to run the metastore without any authentication; preferably this should be the same property that Hadoop uses for non-secure operation.'''
   * Howl server code should change to use !UserGroupInformation.doAs() so that all operations are performed as the real user. The real user's username would be needed to invoke doAs() (hopefully there is some way to get this from the Kerberos ticket with which the user authenticated).
   * !HowlOutputFormat will need to get delegation tokens from the Howl server in checkOutputSpecs() and store the token into the Hadoop credential store so that it can be passed to the tasks. Specifically the !OutputCommitter task will use this token to authenticate to the Howl server to invoke the publish_partition API call.
   * The JT should renew the Howl delegation token so it is kept valid for long running jobs (this might be difficult since the JT will need to make a thrift call to renew the delegation token). For the short term we will simply set the timeout on these delegation tokens to be long; in the future the JT can handle renewing them.
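A sketch of the deployment-time configuration change from the first bullet above. The proxy-user property names are standard Hadoop settings; the group and host values are illustrative. hadoop.security.authentication is the Hadoop property referred to in the third bullet (set it to "simple" for non-secure operation):

{{{
<!-- core-site.xml on the nn/jt: allow the "howl" principal to act on
     behalf of members of the "howl" group, from the given hosts. -->
<property>
  <name>hadoop.proxyuser.howl.groups</name>
  <value>howl</value>
</property>
<property>
  <name>hadoop.proxyuser.howl.hosts</name>
  <value>howlserver.example.com</value>
</property>
<!-- "kerberos" for a secure cluster, "simple" to run without any
     authentication. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
}}}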


== Use cases with Howl ==

=== Howl client running DDL commands ===
 * A user does kinit to acquire a Kerberos ticket - this gets him the TGT (ticket granting ticket)
 * The Howl client needs to acquire the service ticket to access the Howl service (This will happen transparently through !HiveMetaStoreClient). This service ticket is used to authenticate the user to the Howl server.
 * The Howl server, after authenticating the user, does a !UserGroupInformation.doAs() call using the real user's username to perform the action requested.
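A client-side sketch of this flow, assuming the user already holds a TGT from kinit. The SASL property names are Hive metastore settings and may vary by version; the service principal and table names are illustrative:

{{{
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class HowlDdlSketch {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Enable Kerberos (SASL) on the thrift connection; the client picks up
    // the user's TGT and acquires the Howl service ticket transparently.
    conf.set("hive.metastore.sasl.enabled", "true");
    conf.set("hive.metastore.kerberos.principal",
        "howl/_HOST@EXAMPLE.COM");  // illustrative service principal
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    // The server authenticates the user and then performs the DDL inside a
    // UserGroupInformation.doAs() as that user.
    client.dropTable("mydb", "old_table");
    client.close();
  }
}
}}}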

=== Pig script reading from and writing to tables in Howl ===
 * A user does kinit to acquire a Kerberos ticket - this gets him the TGT (ticket granting ticket)
 * The !HowlInputFormat needs to acquire the service ticket to access the Howl service (This will happen transparently through !HiveMetaStoreClient). This service ticket is used to authenticate the user to the Howl server.
 * !HowlOutputFormat will need to get delegation tokens from the Howl server in checkOutputSpecs() and store the token into the Hadoop credential store so that it can be passed to the tasks. Specifically the !OutputCommitter task will use this token to authenticate to the Howl server to invoke the publish_partition API call. 
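A sketch of this token hand-off in !HowlOutputFormat. The Howl client interface and the "howl.token" alias are assumptions for illustration, not an existing API:

{{{
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.security.token.Token;

public class HowlOutputFormatSketch {
  // Hypothetical stand-in for the thrift call on the Howl server.
  interface HowlClient {
    Token<?> getDelegationToken(String renewer) throws IOException;
  }

  private final HowlClient howlClient;

  public HowlOutputFormatSketch(HowlClient howlClient) {
    this.howlClient = howlClient;
  }

  public void checkOutputSpecs(JobContext context) throws IOException {
    // Ask the Howl server for a delegation token for the current
    // (Kerberos-authenticated) user; "jt" names the renewer.
    Token<?> token = howlClient.getDelegationToken("jt");
    // Park it in the job's credential store; the OutputCommitter task reads
    // it back to authenticate the publish_partition call.
    context.getCredentials().addToken(new Text("howl.token"), token);
  }
}
}}}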
  
=== Hive query reading from and writing to tables in Howl ===
 * A user does kinit to acquire a Kerberos ticket - this gets him the TGT (ticket granting ticket)
 * The Hive client needs to acquire the service ticket to access the Howl service (This will happen transparently through !HiveMetaStoreClient). This service ticket is used to authenticate the user to the Howl server.

=== Java Map Reduce job reading from and writing to tables in Howl  ===
 * Same as Pig use case?

=== Oozie running a Pig script which reads from or writes to tables in Howl ===
'''How will Oozie know that the Pig script interacts with Howl? This will need some change in oozie to allow the workflow xml to indicate it.'''
 * Once oozie knows that the Pig script may read/write through Howl (maybe through some information in the workflow xml), it should also authenticate to the Howl server and get the Howl delegation token on behalf of the real user (in addition to the usual jt/nn delegation tokens it gets by doing doAs() for creating the jobclient). The Howl delegation token should be added on to the launcher task so it is available to the map task launching the Pig script.
 * The !HowlInputFormat/!HowlOutputFormat code will use the delegation tokens already present to authenticate to Howl server. 
 * The Howl delegation token should get sent to the actual map/reduce tasks of the Pig job and also specifically to an !OutputCommitter task so that it can use it to publish partitions to the Howl server.
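A sketch of how such a task could locate the Howl delegation token among the credentials shipped with the job ("howl.token" is the same hypothetical alias used when the token was stored; the !UserGroupInformation/!Credentials APIs are from later Hadoop versions):

{{{
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class HowlTokenLookup {
  public static Token<?> howlToken() throws IOException {
    // Tokens placed in the job's credential store at submission time are
    // available to the task through its UserGroupInformation.
    Credentials creds = UserGroupInformation.getCurrentUser().getCredentials();
    return creds.getToken(new Text("howl.token"));
  }
}
}}}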
 
=== Oozie running a Java MR job which reads from or writes to tables in Howl ===
'''How will Oozie know that the Java MR job interacts with Howl? This will need some change in oozie to allow the workflow xml to indicate it.'''
 * Same as Pig?

=== Tools like DAQ invoke Howl API calls to register data ===
 * These services would simply use their Kerberos tickets to authenticate in the thrift API calls. Apparently DAQ runs as a proxy user, and hence DAQ's use case would be similar to the oozie one.