Posted to common-dev@hadoop.apache.org by Srikrishan Malik <ma...@gmail.com> on 2017/09/13 13:03:34 UTC

impersonation in hadoop

Hello,

I was trying to understand how impersonation works in a Hadoop environment.
I found a few resources like:
About doAs and proxy users:
http://dewoods.com/blog/hadoop-kerberos-guide
and about tokens:
https://hortonworks.com/blog/the-role-of-delegation-tokens-in-apache-hadoop-security/
..

But I was not able to connect all the dots w.r.t. the full flow of operations.
My current understanding is:
1. user does a kinit and executes an end-user-facing program like
beeline, spark-submit, etc.
2. The program is app-specific and gets service tickets for HDFS
3. It then gets tokens for all the services it may need during the job
execution and saves the tokens in an HDFS directory.
4. The program then connects to a job executor (using a service ticket for
the job executor??), e.g. YARN, with the job info and the token path.
5. The job executor gets the tokens and initializes UGI, and all
communication with HDFS is done using the tokens; Kerberos tickets
are not used. (A sketch of this step follows the list.)
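
To make item 5 concrete, here is roughly what I imagine the executor
side looks like with the Hadoop Java API, assuming the submitter wrote
the tokens out with Credentials.writeTokenStorageFile(). The class name
and token path are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.Credentials;
    import org.apache.hadoop.security.UserGroupInformation;

    public class LoadTokens {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read the serialized tokens back from wherever the submitter
        // left them (the path is made up for this example).
        Credentials creds = Credentials.readTokenStorageFile(
            new Path("hdfs:///user/sri/tokens.bin"), conf);
        // Attach them to the current UGI; later HDFS calls authenticate
        // with the delegation tokens rather than a Kerberos ticket.
        UserGroupInformation.getCurrentUser().addCredentials(creds);
      }
    }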

Is the above high level understanding correct? (I have more follow up queries.)
Can the token mechanism be skipped and only Kerberos used at each
layer? If so, any resources will help.

My final aim is to write a Spark connector with impersonation support
for a data storage system which does not use Hadoop tokens but
supports Kerberos.

Thanks & regards
-Sri



Re: impersonation in hadoop

Posted by Steve Loughran <st...@hortonworks.com>.
On 13 Sep 2017, at 14:03, Srikrishan Malik <ma...@gmail.com> wrote:

> Hello,
>
> I was trying to understand how impersonation works in a Hadoop environment.
> I found a few resources like:
> About doAs and proxy users:
> http://dewoods.com/blog/hadoop-kerberos-guide
> and about tokens:
> https://hortonworks.com/blog/the-role-of-delegation-tokens-in-apache-hadoop-security/
> ..


also https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details

> But I was not able to connect all the dots w.r.t. the full flow of operations.
> My current understanding is:
> 1. user does a kinit and executes an end-user-facing program like
> beeline, spark-submit, etc.

> 2. The program is app-specific and gets service tickets for HDFS

yes

> 3. It then gets tokens for all the services it may need during the job
> execution

yes

> and saves the tokens in an HDFS directory.

or includes them in IPC calls

> 4. The program then connects to a job executor (using a service ticket for
> the job executor??), e.g. YARN, with the job info and the token path.


If you are using Hadoop RPC, you can include credentials (Kerberos tickets,
Hadoop tokens) in the IPC call; these are encrypted if wire encryption is
turned on.
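
For the submitter side, a rough sketch of gathering delegation tokens
while the Kerberos TGT is still valid. The class name, renewer and
output path are invented for the example; this is the general pattern,
not any particular project's code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.Credentials;

    public class CollectTokens {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Credentials creds = new Credentials();
        // While the Kerberos TGT is valid, ask the filesystem for
        // delegation tokens; "yarn" stands in for the renewer principal.
        FileSystem.get(conf).addDelegationTokens("yarn", creds);
        // Persist the bundle for the job executor to pick up, or hand
        // it over inside the IPC call / container launch context instead.
        creds.writeTokenStorageFile(
            new Path("hdfs:///user/sri/tokens.bin"), conf);
      }
    }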



> 5. The job executor gets the tokens and initializes UGI, and all
> communication with HDFS is done using the tokens; Kerberos tickets
> are not used.

if you want to talk with the identity of the user from an RPC call, you just
use UGI.getCurrentUser().doAs(), as the RPC call will be mapped to the caller
before your code is invoked
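
For the classic proxy-user case, where a service acts on behalf of an
end user without holding that user's credentials, the pattern is
roughly the following. It only works if core-site.xml whitelists the
service through the hadoop.proxyuser.<name>.hosts/groups settings, and
the user names here are invented:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The service is already logged in as itself (e.g. from a
        // keytab); wrap a proxy UGI around the end user's name.
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
            "sri", UserGroupInformation.getLoginUser());
        // Everything inside doAs() runs, and authenticates, as "sri".
        proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
          FileSystem fs = FileSystem.get(conf);
          System.out.println(fs.exists(new Path("/user/sri")));
          return null;
        });
      }
    }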


> Is the above high-level understanding correct? (I have more follow-up queries.)

Nobody really understands it. Nobody understands Kerberos either. Stepping through with a debugger always helps.

> Can the token mechanism be skipped and only Kerberos used at each
> layer? If so, any resources will help.

yes
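
Kerberos-only means every process must be able to authenticate to the
KDC itself, which in practice means a keytab on every host. A minimal
sketch of that pattern (principal, keytab path and class name are
invented for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosOnly {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Authenticate to the KDC directly instead of being handed
        // a delegation token.
        UserGroupInformation.loginUserFromKeytab(
            "sri@EXAMPLE.COM", "/etc/security/keytabs/sri.keytab");
        // RPC to the namenode now authenticates with Kerberos over SASL.
        System.out.println(FileSystem.get(conf).getUri());
      }
    }

Bear in mind that distributing keytabs to every worker is exactly the
problem delegation tokens were invented to avoid.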


> My final aim is to write a Spark connector with impersonation support
> for a data storage system which does not use Hadoop tokens but
> supports Kerberos.


Spark YARN job submission takes a list of filesystems to get delegation tokens for and includes them in job setup. It also looks for Hive and HBase if configured.

Long-lived Spark jobs (streaming) take a keytab; the app master re-authenticates with the KDC regularly, then pushes fresh tokens out to the workers.
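
Spark wires that up when you pass --principal and --keytab to
spark-submit; underneath it is the standard UGI relogin pattern, which
looks roughly like this (principal, keytab path and class name are
invented):

    import org.apache.hadoop.security.UserGroupInformation;

    public class ReloginLoop {
      public static void main(String[] args) throws Exception {
        // Log in once from the keytab.
        UserGroupInformation.loginUserFromKeytab(
            "app/host@EXAMPLE.COM", "/etc/security/keytabs/app.keytab");
        while (true) {
          // Re-login before the TGT expires; this call is a no-op
          // unless the ticket is close to expiry.
          UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
          Thread.sleep(60_000);  // check roughly once a minute
        }
      }
    }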