Posted to issues@spark.apache.org by "Niels Becker (JIRA)" <ji...@apache.org> on 2015/07/31 22:09:04 UTC

[jira] [Commented] (SPARK-7791) Set user for executors in standalone-mode

    [ https://issues.apache.org/jira/browse/SPARK-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649753#comment-14649753 ] 

Niels Becker commented on SPARK-7791:
-------------------------------------

I ran into the same problem when saving a dataframe as parquet.
Our Environment:
- Ubuntu 14
- Spark 1.4.1 prebuilt for Hadoop 2.6
- GlusterFS 3.7
- Mesos 0.23.0
- Docker 1.7.1

Start _pyspark_ as _sparkuser_ and load some data into a dataframe {{df}}, then run {{df.write.format("parquet").save("/data/test/wikipedia_test.parquet")}}.
_/data_ is a GlusterFS volume mounted on each node.
_/data/test_ permissions (as reported by {{getfacl}}):
{code}
# owner: sparkuser
# group: sparkuser
# flags: -s-
user::rwx
group::rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::r-x
{code}
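For reference, a minimal repro sketch of the steps above; the input path and dataset are illustrative, only the save call matters:
{code}
# Run pyspark as sparkuser; sqlContext is predefined in the 1.4 shell.
# The input file is illustrative - any small dataset reproduces it for us.
df = sqlContext.read.json("/data/test/wikipedia.json")
df.write.format("parquet").save("/data/test/wikipedia_test.parquet")
{code}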

Tomasz described a workaround in [https://www.mail-archive.com/user@spark.apache.org/msg28820.html], but it does not work for us.
The interesting thing is that the {{*.gz.parquet}} files have {noformat}root:sparkuser -rw-r--r--{noformat} permissions,
but the {{*.gz.parquet.crc}} files have {noformat}root:sparkuser -rw-rw-r--{noformat} permissions, as they should (matching the directory's default ACL).
This suggests that Spark does not apply the default file permissions, at least for parquet files.
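One way to observe the mismatch from Python (the part-file name below is illustrative):
{code}
# Print mode and owning uid of a written part file. On our nodes the
# parquet part files show 0644 and uid 0 (root), the .crc files 0664.
import os, stat
p = "/data/test/wikipedia_test.parquet/part-r-00001.gz.parquet"  # illustrative
st = os.stat(p)
print(oct(stat.S_IMODE(st.st_mode)) + " uid=%d" % st.st_uid)
{code}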

I can confirm that setting {{SPARK_USER}} to either {{root}} or {{sparkuser}} (e.g. launching with {{SPARK_USER=sparkuser pyspark}}) has no effect.
Running pyspark as root works.

I assume that all Spark tasks are executed as root and override the default file permissions, without changing the file owner.
So after the job is done, the driver (running as _sparkuser_) tries to rename the files to their final destination but fails for lack of permissions.
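As a blunt stopgap (not a fix), the ownership can be repaired from a root account after the write so the driver can finish the rename/cleanup. A hedged sketch, assuming the output path from above; untested in our setup:
{code}
# Run as root: recursively hand the (partially written) output back to
# sparkuser. This only patches the symptom, not the executor user problem.
import os, pwd, grp
uid = pwd.getpwnam("sparkuser").pw_uid
gid = grp.getgrnam("sparkuser").gr_gid
top = "/data/test/wikipedia_test.parquet"
os.chown(top, uid, gid)
for root, dirs, files in os.walk(top):
    for name in dirs + files:
        os.chown(os.path.join(root, name), uid, gid)
{code}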

> Set user for executors in standalone-mode
> -----------------------------------------
>
>                 Key: SPARK-7791
>                 URL: https://issues.apache.org/jira/browse/SPARK-7791
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>            Reporter: Tomasz Früboes
>
> I'm opening this following a discussion in https://www.mail-archive.com/user@spark.apache.org/msg28633.html
>  Our setup was as follows: Spark (1.3.1, prebuilt for Hadoop 2.6, also 2.4) was installed in standalone mode and started manually from the root account. Everything worked properly apart from operations such as
> rdd.saveAsPickleFile(ofile)
> which end with an exception:
> py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
> : java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_000001/part-r-00002.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-00002.parquet at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
> (files created in _temporary were owned by user root). It would be great if spark could set the user for the executor also in standalone mode. Setting SPARK_USER has no effect here.
> BTW it may be a good idea to add some warning (e.g. during Spark startup) that running from the root account is not a very healthy idea. E.g. mapping this function
> def test(x):
>     f = open('/etc/testTMF.txt', 'w')
>     return 0
> over an RDD creates a file in /etc/ (surprisingly, calls like f.write("text") end with an exception)
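> For reference, a minimal way to trigger it from pyspark (sketch):
> sc.parallelize(range(4)).map(test).collect()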
> Thanks,
>   Tomasz


