Posted to user@spark.apache.org by gsvigruha <ge...@lynxanalytics.com> on 2016/09/16 03:43:49 UTC

Impersonate users using the same SparkContext

Hi,

Is there a way to impersonate multiple users using the same SparkContext
(e.g. via the Hadoop proxy-user mechanism described here:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html)
when going through the Spark API?

What I'd like to do is:
1) submit a long-running Spark yarn-client application as a Hadoop
superuser (e.g. "super")
2) impersonate different users with "super" when reading/writing restricted
HDFS files through the Spark API

I know about the --proxy-user flag, but its effect is fixed for the
lifetime of a spark-submit.

I looked at the code and it seems the username is determined first by the
SPARK_USER env var (which seems to always be set) and only then by the
UserGroupInformation:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L2247
What I'd like, I guess, is for the UserGroupInformation to take priority.
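For reference, the lookup there is roughly the following (my paraphrase of
Utils.getCurrentUserName; the exact code at the line linked above may differ):

def getCurrentUserName(): String = {
  // SPARK_USER wins; only if it is unset is the UGI consulted
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
}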

Is there a way to make this work? Thank you!





Re: Impersonate users using the same SparkContext

Posted by Steve Loughran <st...@hortonworks.com>.
> On 16 Sep 2016, at 04:43, gsvigruha <ge...@lynxanalytics.com> wrote:
> 
> Hi,
> 
> Is there a way to impersonate multiple users using the same SparkContext
> (e.g. via the Hadoop proxy-user mechanism described here:
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html)
> when going through the Spark API?
> 
> What I'd like to do is:
> 1) submit a long-running Spark yarn-client application as a Hadoop
> superuser (e.g. "super")
> 2) impersonate different users with "super" when reading/writing restricted
> HDFS files through the Spark API
> 
> I know about the --proxy-user flag, but its effect is fixed for the
> lifetime of a spark-submit.
> 
> I looked at the code and it seems the username is determined first by the
> SPARK_USER env var (which seems to always be set) and only then by the
> UserGroupInformation:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L2247
> What I'd like, I guess, is for the UserGroupInformation to take priority.
> 

If you can get the Kerberos tickets or Hadoop delegation tokens all the way to your code, then you can execute that code in a doAs call; this adopts the Kerberos identity of that context when accessing HDFS, Hive, HBase, etc.:

import java.security.PrivilegedExceptionAction

otherUserUGI.doAs(new PrivilegedExceptionAction[Unit] {
  // everything inside run() executes under otherUserUGI's identity
  def run(): Unit = { /* ... */ }
})
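To connect this to the original question, here is a minimal sketch of per-user
HDFS access from a long-running "super" app. It assumes "super" is configured
as a proxy user in core-site.xml; the user name "alice" and the path are
made-up placeholders:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// "super" is the logged-in user; impersonate "alice" via the proxy-user rules
val alice = UserGroupInformation.createProxyUser("alice",
  UserGroupInformation.getLoginUser)

alice.doAs(new PrivilegedExceptionAction[Unit] {
  def run(): Unit = {
    // a FileSystem obtained inside doAs carries alice's identity
    val fs = FileSystem.get(new Configuration())
    val in = fs.open(new Path("/restricted/alice/data"))  // authorized as "alice"
    in.close()
  }
})

Note the doAs only covers Hadoop calls made in this JVM; work the SparkContext
ships to executors still runs under the application's own credentials.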

If you just want to run something as a different user:

- short-lived: have Oozie set things up
- long-lived: you need the Kerberos keytab of whoever the app needs to run as (see the sketch below).
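For the long-lived case, a minimal keytab-login sketch; the principal and
keytab path are made-up placeholders:

import org.apache.hadoop.security.UserGroupInformation

// log in from a keytab and get a UGI back, without replacing
// the process-wide login user
val appUGI = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "super@EXAMPLE.COM", "/etc/security/keytabs/super.keytab")
// appUGI.doAs(...) then works as shown above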


On an insecure cluster, the identity used to talk to HDFS can actually be set in the env var HADOOP_USER_NAME; you can also use some of the UGI methods, like createRemoteUser() or createProxyUser(), to create the identity to spoof:

val hbase = UserGroupInformation.createRemoteUser("hbase")
hbase.doAs(new PrivilegedExceptionAction[Unit] {
  def run(): Unit = { /* ... act as "hbase" ... */ }
})
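One caveat with the HADOOP_USER_NAME route: the env var is picked up when the
UGI first logs in, so it has to be set before the JVM starts. A quick sanity
check, assuming HADOOP_USER_NAME=alice was exported before launch on an
insecure cluster:

// prints "alice" if HADOOP_USER_NAME=alice was set before JVM start
println(UserGroupInformation.getCurrentUser().getShortUserName())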


Some possibly useful information:

https://www.youtube.com/watch?v=Xz2tPmK2cKg
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/

