Posted to issues@spark.apache.org by "Ian Hummel (JIRA)" <ji...@apache.org> on 2016/10/21 16:06:58 UTC

[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

    [ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595524#comment-15595524 ] 

Ian Hummel commented on SPARK-5158:
-----------------------------------

I'm running into this now and have done some digging.  My setup is:

- Small, dedicated standalone spark cluster
-- using spark.authenticate.secret
-- spark-env.sh sets HADOOP_CONF_DIR correctly on each node
-- core-site.xml has
--- hadoop.security.authentication = kerberos
--- hadoop.security.authorization = true
- Kerberized HDFS cluster

Reading and writing to HDFS in local mode works fine, provided I have run {{kinit}} beforehand.  Running a distributed job via the standalone cluster does not, seemingly because clients connecting to standalone clusters don't attempt to fetch/forward HDFS delegation tokens.
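
For context, fetching HDFS delegation tokens programmatically is straightforward; the standalone submission path just never does it. A minimal sketch of roughly what the YARN client path does on your behalf (the object name and renewer choice are mine; assumes {{HADOOP_CONF_DIR}} is on the classpath and a valid {{kinit}} login):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object FetchHdfsTokens {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()   // picks up core-site.xml / hdfs-site.xml from the classpath
    val creds = new Credentials()
    // requires a kerberos login (e.g. from kinit) on the submitting machine
    val renewer = UserGroupInformation.getCurrentUser().getShortUserName
    FileSystem.get(conf).addDelegationTokens(renewer, creds)
    creds.getAllTokens.asScala.foreach(t => println(t.getKind + " for " + t.getService))
  }
}
{code}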

What I had hoped would work is ssh'ing onto each standalone worker node individually and running {{kinit}} out-of-process before submitting my job.  I figured that since the executors are launched as my unix user, they would inherit my kerberos context and be able to talk to HDFS, just as they can in local mode.

I verified with a debugger that the {{UserGroupInformation}} in the worker JVMs correctly picks up the fact that the user the process is running as can access the kerberos ticket cache.

But it still doesn't work.

The reason is that the executor process ({{CoarseGrainedExecutorBackend}}) does something like this:

{code}
SparkHadoopUtil.get.runAsSparkUser { () =>
...
env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))  
...
}
{code}

{{runAsSparkUser}} does this:

{code}
  def runAsSparkUser(func: () => Unit) {
    val user = Utils.getCurrentUserName()
    logDebug("running as user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      def run: Unit = func()
    })
  }
{code}

{{createRemoteUser}} does this:

{code}
  public static UserGroupInformation createRemoteUser(String user) {
    if (user == null || user.isEmpty()) {
      throw new IllegalArgumentException("Null user");
    }
    Subject subject = new Subject();
    subject.getPrincipals().add(new User(user));
    UserGroupInformation result = new UserGroupInformation(subject);
    result.setAuthenticationMethod(AuthenticationMethod.SIMPLE);
    return result;
  }
{code}

So effectively, if we had an HDFS delegation token, it would have been copied over in {{transferCredentials}}; but since there is no way for the client to include tokens when the task is submitted over the wire, we end up creating a _blank_ UGI from scratch and losing the Kerberos context.  Subsequent calls to HDFS are attempted with "simple" authentication and everything fails.
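
To make the before/after concrete, here is a minimal sketch of what {{createRemoteUser}} plus {{transferCredentials}} leave you with (the object name is mine; assumes a kinit-ed ticket cache and the kerberos {{core-site.xml}} described above):

{code}
import org.apache.hadoop.security.UserGroupInformation

object UgiCheck {
  def main(args: Array[String]): Unit = {
    // the process's login context, e.g. picked up from the kerberos ticket cache
    val login = UserGroupInformation.getCurrentUser()
    println(login.getAuthenticationMethod)   // expected: KERBEROS
    println(login.hasKerberosCredentials)    // expected: true

    // what runAsSparkUser effectively does: a fresh Subject plus a token-only copy
    val remote = UserGroupInformation.createRemoteUser(login.getShortUserName)
    remote.addCredentials(login.getCredentials())  // transferCredentials copies tokens only
    println(remote.getAuthenticationMethod)  // SIMPLE -- the kerberos context is gone
    println(remote.hasKerberosCredentials)   // false
  }
}
{code}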


One workaround is to obtain an HDFS delegation token out of band, store it in a file, make it available on all worker nodes, and then ensure executors are launched with {{HADOOP_TOKEN_FILE_LOCATION}} set.  To be more specific:

On client machine:
- ensure {{core-site.xml}}, {{hdfs-site.xml}} and {{yarn-site.xml}} are configured properly
- ensure {{HDFS_CONF_DIR}} is set
- Run {{spark-submit --class org.apache.hadoop.hdfs.tools.DelegationTokenFetcher "" --renewer null /nfs/path/to/TOKEN}}

On worker machines:
- ensure {{/nfs/path/to/TOKEN}} is readable

On client machine:
- submit job adding {{--conf "spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/nfs/path/to/TOKEN"}}
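
For what it's worth, the token file can be sanity-checked from code using the same on-disk format that {{UserGroupInformation}} reads from {{HADOOP_TOKEN_FILE_LOCATION}}. A minimal sketch (object name is mine; the default path just matches the example above):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.Credentials

object TokenFileCheck {
  def main(args: Array[String]): Unit = {
    // read the token storage file and list the tokens it contains
    val tokenFile = new Path(args.headOption.getOrElse("/nfs/path/to/TOKEN"))
    val creds = Credentials.readTokenStorageFile(tokenFile, new Configuration())
    creds.getAllTokens.asScala.foreach { t =>
      println(t.getKind + " for " + t.getService)
    }
  }
}
{code}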

There are obviously issues with this in terms of expiration, renewal, etc... just wanted to mention it for the record.


Another workaround is a custom build of Spark that simply comments out the {{runAsSparkUser}} call.  In that case users can have a cron job running {{kinit}} in the background (using a keytab), and the spawned executors will use the inherited kerberos context to talk to HDFS.
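
What that change amounts to is running the executor body under the process's existing login UGI instead of a freshly created, credential-less one. A minimal sketch (the method name is mine; this is not a proposed patch, just the shape of it):

{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// run func under whatever login context the process already has
// (e.g. a kinit-ed ticket cache kept fresh by cron)
def runAsCurrentUser(func: () => Unit): Unit = {
  val ugi = UserGroupInformation.getCurrentUser()
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = func()
  })
}
{code}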

It seems like {{CoarseGrainedExecutorBackend}} is also used by Mesos, and I noticed SPARK-12909.  If security doesn't even work for Mesos or Standalone, why do we even attempt the {{runAsSparkUser}} call?  It honestly seems like there is no reason for it... proxy users are not useful outside of a kerberized context (right?).  There is no real secured user identity when running as a standalone cluster (or, from what I can tell, when running under Mesos), only whatever comes from the unix user the workers are running as.

As it stands, we actually _deescalate_ that user's privileges (by wiping the kerberos context).  Shouldn't we just keep them as they are?  That would make it a lot easier for standalone clusters to interact with a kerberized HDFS.

I know this ticket is more about forwarding keytabs to the executors, but the scenario outlined above also gets to that use case.

Thoughts?

> Allow for keytab-based HDFS security in Standalone mode
> -------------------------------------------------------
>
>                 Key: SPARK-5158
>                 URL: https://issues.apache.org/jira/browse/SPARK-5158
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Matthew Cheah
>            Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS clusters in standalone mode. The main reason we haven't accepted these patches has been that they rely on insecure distribution of token files from the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the Spark driver and executors independently log in and acquire credentials using a keytab. This would work for users who have dedicated, single-tenant Spark clusters (i.e. they are willing to have a keytab on every machine running Spark for their application). It wouldn't address all possible deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated hardware since they are long-running services.


