Posted to issues@spark.apache.org by "Maziyar PANAHI (JIRA)" <ji...@apache.org> on 2018/11/17 13:11:00 UTC

[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

     [ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maziyar PANAHI updated SPARK-26101:
-----------------------------------
    Description: 
Hello,

I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark session (Zeppelin, shell, or spark-submit), my real username is impersonated successfully. That allows YARN to place the job in the right queue based on the username, and HDFS applies the correct permissions.

Example (running Spark as user `panahi`):

```
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
```

However, when I use Spark's RDD pipe(), the external process is executed as the `yarn` user. This makes it impossible to use a C/C++ application that needs read/write access to HDFS, because the `yarn` user has no permissions on the real user's directories.
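
The only partial workaround I can see is to keep the external binary away from HDFS entirely and let Spark, which does carry my real username at the Hadoop level, do all the reads and writes. A minimal sketch, assuming the binary can operate purely on stdin/stdout (the paths and the `./my_cpp_filter` name below are placeholders, not my actual application):

```scala
// Workaround sketch: Spark performs the HDFS I/O as the impersonated user,
// so the OS-level identity of the piped binary no longer matters.
// Assumes the binary reads lines from stdin and writes lines to stdout.
val input = sc.textFile("hdfs:///user/panahi/input")   // read as the real user
val piped = input.pipe(Seq("./my_cpp_filter"))         // binary sees only stdin/stdout
piped.saveAsTextFile("hdfs:///user/panahi/output")     // write as the real user
```

This obviously does not help when the application has to open HDFS files itself.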

How to reproduce this issue:

```scala
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
```

Result:

```
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
c: Array[String] = Array(yarn)
```
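
The split between the two identities can also be seen from inside the job. A small diagnostic sketch (assuming a running SparkContext `sc` on YARN; `UserGroupInformation` is the standard Hadoop security API) that compares the OS-level user, which pipe()d programs inherit, with the Hadoop-level user Spark uses for HDFS permissions:

```scala
import org.apache.hadoop.security.UserGroupInformation

// Run a single task on an executor and report both identities from there.
val ids = sc.parallelize(Seq(0), 1).map { _ =>
  val osUser     = System.getProperty("user.name")                      // OS process owner, e.g. "yarn"
  val hadoopUser = UserGroupInformation.getCurrentUser.getShortUserName // e.g. "panahi"
  s"OS-level user: $osUser / Hadoop-level user: $hadoopUser"
}.collect()

ids.foreach(println)
```

This makes the mismatch concrete: the piped child process inherits the executor's OS user, while Spark's own HDFS access uses the real username.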

I believe that since Spark is the actor that invokes this execution inside the YARN cluster, Spark should respect the actual/current username. Alternatively, there may be an impersonation-related configuration between Spark and YARN that covers this situation, but I haven't found one.
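
For completeness, the closest YARN-side mechanism I am aware of (an assumption on my part, not something I have verified on this cluster) is the LinuxContainerExecutor: the DefaultContainerExecutor always launches containers as the NodeManager's `yarn` user, which would explain what pipe() sees, while the LinuxContainerExecutor can launch them as the submitting user. A yarn-site.xml sketch:

```xml
<!-- Sketch only, not verified here. With the LinuxContainerExecutor, YARN can
     run containers (and therefore pipe()d child processes) as the submitting
     user. In nonsecure mode this also requires limit-users=false; in a
     Kerberized cluster it is the default behaviour. The setuid
     container-executor binary must be configured as well. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
```

Even if that works, it is a cluster-wide change, so I still think this should be addressed on the Spark side.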

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26101
>                 URL: https://issues.apache.org/jira/browse/SPARK-26101
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Maziyar PANAHI
>            Priority: Major
>


