Posted to issues@spark.apache.org by "Michał Wesołowski (Jira)" <ji...@apache.org> on 2019/11/15 13:11:00 UTC

[jira] [Created] (SPARK-29916) spark on kubernetes fails with hadoop-3.2 due to the user not existing in executor pod

Michał Wesołowski created SPARK-29916:
-----------------------------------------

             Summary: spark on kubernetes fails with hadoop-3.2 due to the user not existing in executor pod
                 Key: SPARK-29916
                 URL: https://issues.apache.org/jira/browse/SPARK-29916
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.0
            Reporter: Michał Wesołowski


I'm running tests on Kubernetes with the spark-3.0-preview version and hadoop-3.2 libraries.

I needed cloud library support (Azure in particular), so this is a build based on the v3.0.0-preview tag with the cloud profile, since the provided binaries don't include it.

I run a simple computation on AKS (Azure Kubernetes Service) against Azure Data Lake Storage Gen2, and it fails with the following error:
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o49.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.244.2.6, executor 1): java.io.IOException: There is no primary group for UGI localuser(auth:SIMPLE)
        at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:136)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
 {code}
It looks like the Hadoop library expects the user "localuser" to exist in the executor pod. This user is the one who invoked spark-submit on my local machine; I didn't set it explicitly.
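
A minimal sketch of what fails, assuming only a Hadoop 3.2 client on the classpath (class and user name taken from the trace above): AzureBlobFileSystemStore asks the current UGI for its primary group during initialization, and getPrimaryGroupName() throws when group resolution comes back empty because the OS account is missing:
{code:java}
import org.apache.hadoop.security.UserGroupInformation;

public class UgiPrimaryGroupRepro {
    public static void main(String[] args) throws Exception {
        // Build a UGI for a name that has no OS account in this
        // container, as happens when the submitting user's name is
        // carried into the executor pod.
        UserGroupInformation ugi =
            UserGroupInformation.createRemoteUser("localuser");
        // Group resolution shells out to the "id" command; with no such
        // user the group list is empty and this throws
        // "There is no primary group for UGI localuser (auth:SIMPLE)".
        System.out.println(ugi.getPrimaryGroupName());
    }
}
{code}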

I investigated the pods, and this user is set in the SPARK_USER environment variable in both the driver and executor pods.
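
For context, Spark derives the effective user name from that variable; a rough Java equivalent of Spark's Utils.getCurrentUserName() logic (the real implementation is Scala) is:
{code:java}
import java.io.IOException;

import org.apache.hadoop.security.UserGroupInformation;

public class EffectiveUser {
    // SPARK_USER takes precedence; otherwise Spark falls back to the
    // Hadoop UGI short name. Since SPARK_USER is set in the pods, Hadoop
    // ends up resolving groups for a user that only exists on the
    // submitting machine.
    static String currentUserName() throws IOException {
        String env = System.getenv("SPARK_USER");
        return env != null
            ? env
            : UserGroupInformation.getCurrentUser().getShortUserName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(currentUserName());
    }
}
{code}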

Relevant logs from the executor:
{code:java}
19/11/15 12:56:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, localuser); groups with view permissions: Set(); users  with modify permissions: Set(root, localuser); groups with modify permissions: Set() 
...
19/11/15 12:56:53 INFO SecurityManager: Changing view acls to: root,localuser
19/11/15 12:56:53 INFO SecurityManager: Changing modify acls to: root,localuser
19/11/15 12:56:53 INFO SecurityManager: Changing view acls groups to:
19/11/15 12:56:53 INFO SecurityManager: Changing modify acls groups to:
...
19/11/15 12:57:02 WARN ShellBasedUnixGroupsMapping: unable to return groups for user localuser
PartialGroupNameException The user name 'localuser' is not found. id: ‘localuser’: no such user
id: ‘localuser’: no such user
        at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.resolvePartialGroupNames(ShellBasedUnixGroupsMapping.java:294)
        at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:207)
        at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:97)
        at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:51)
        at org.apache.hadoop.security.Groups$GroupCacheLoader.fetchGroupList(Groups.java:387)
        at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:321)
        at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:270)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
        at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
        at org.apache.hadoop.security.Groups.getGroups(Groups.java:228)
        at org.apache.hadoop.security.UserGroupInformation.getGroups(UserGroupInformation.java:1588)
        at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1453)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:136)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)

{code}
One workaround I've found is passing
{code:java}
--proxy-user root {code}
to spark-submit.
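
As far as I can tell this helps because spark-submit then wraps the real user in a proxy UGI, so Hadoop performs group lookups for "root" (which does exist in the container image) instead of the submitting user. A sketch of the mechanism using Hadoop's public proxy-user API:
{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
    public static void main(String[] args) throws Exception {
        // Roughly what --proxy-user root triggers: wrap the real UGI in
        // a proxy UGI and run the application as it, so group resolution
        // happens for "root" rather than "localuser".
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
            "root", UserGroupInformation.getCurrentUser());
        proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
            // Succeeds in the container because root has a primary group.
            System.out.println(UserGroupInformation.getCurrentUser()
                .getPrimaryGroupName());
            return null;
        });
    }
}
{code}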

The rest of the spark-submit command is quite typical for running on Kubernetes. Among the connection details I supply for the storage I connect to is:
{code:java}
  --conf "spark.hadoop.fs.azure.account.auth.type=OAuth" ` {code}
which could be relevant. 
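
For completeness, the ABFS OAuth settings usually come as a group. The sketch below shows the standard hadoop-azure key names set on a plain Hadoop Configuration for illustration; the provider class and placeholder values reflect my setup (client id, secret, and tenant elided):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class AbfsOAuthConf {
    public static void main(String[] args) {
        // The same keys are passed as spark.hadoop.* --conf entries on
        // spark-submit; values here are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.azure.account.auth.type", "OAuth");
        conf.set("fs.azure.account.oauth.provider.type",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
        conf.set("fs.azure.account.oauth2.client.id", "<client-id>");
        conf.set("fs.azure.account.oauth2.client.secret", "<client-secret>");
        conf.set("fs.azure.account.oauth2.client.endpoint",
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token");
    }
}
{code}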