Posted to dev@spark.apache.org by Tony Kinsley <tk...@gmail.com> on 2016/04/13 06:57:33 UTC

Accessing Secure Hadoop from Mesos cluster

I have been working towards getting some Spark Streaming jobs to run in
Mesos cluster mode (using Docker containers) and write data periodically to
a secure HDFS cluster. Unfortunately, this does not currently seem to be
well supported in Spark (https://issues.apache.org/jira/browse/SPARK-12909).
The problem seems to be that (a) a principal and keytab passed on the
command line are only processed when the backend is YARN, and (b) all the
code for renewing tickets lives in the YARN backend.
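
As far as I can tell, the keytab login that the YARN path performs boils
down to the standard Hadoop UserGroupInformation call. A minimal sketch of
that step (the principal and keytab path below are made up, purely for
illustration):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    // Hypothetical principal and keytab path, for illustration only.
    val principal = "spark/host.example.com@EXAMPLE.COM"
    val keytab = "/etc/security/keytabs/spark.keytab"

    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)

    // Log in to Kerberos from the keytab.
    UserGroupInformation.loginUserFromKeytab(principal, keytab)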


My first attempt to get around this was to build Docker containers that use
a custom entrypoint to run a process manager, with cron running in each
container to periodically run kinit. I was hoping this would work, since
Spark logs in correctly if a TGT already exists (at least in my tests of
manually kinit'ing and running Spark in local mode). However, this hack
will not work (currently, anyway) because the Mesos scheduler does not
specify whether a shell should be used for the command: Mesos defaults to
using the shell and then overrides the entrypoint of the Docker image with
/bin/sh (https://issues.apache.org/jira/browse/MESOS-1770).
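
In principle, the same effect could be achieved in-process rather than via
cron, assuming the keytab is available inside each container, by re-logging
in from the keytab on a timer. A rough sketch:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.hadoop.security.UserGroupInformation

    // Assumes loginUserFromKeytab(...) has already been called, as above.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // Re-login from the keytab if the TGT is close to expiring.
        UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
      }
    }, 1, 1, TimeUnit.HOURS)

That only keeps the local TGT fresh, though; it does nothing to distribute
HDFS delegation tokens to the executors, which is the part the YARN backend
handles.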


Since I have not been able to come up with an acceptable workaround, I am
looking into adding the functionality to Spark itself, but I wanted to
check in to make sure I am not duplicating others' work, and also to get
some general advice on a good approach to the problem. I found an old email
thread that discusses some of the challenges of authenticating correctly to
the NameNodes
(http://comments.gmane.org/gmane.comp.lang.scala.spark.user/14257).


I've noticed that the YARN security settings are namespaced to be specific
to YARN, yet some of the code looks fairly generic
(AMDelegationTokenRenewer.scala and ExecutorDelegationTokenUpdater, for
instance, although I'm not sure about their use of YarnSparkHadoopUtil). It
seems to me that some of this code could be reused across the various
cluster backends. That said, I am fairly new to working with Hadoop and
Spark, and I do not claim to understand the inner workings of YARN or
Mesos, although I feel much more comfortable with Mesos.
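
For what it's worth, the core of what a backend-agnostic renewer would do
does not look YARN-specific to me: log in from the keytab, fetch fresh HDFS
delegation tokens, serialize them, and ship them to the executors. A rough
sketch of the token-fetching step (the distribution mechanism would be the
Mesos-specific piece to design):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.io.DataOutputBuffer
    import org.apache.hadoop.security.{Credentials, UserGroupInformation}

    // Fetch fresh HDFS delegation tokens for the logged-in (keytab) user.
    val conf = new Configuration()
    val creds = new Credentials()
    val renewer = UserGroupInformation.getLoginUser.getShortUserName
    FileSystem.get(conf).addDelegationTokens(renewer, creds)

    // Serialize the credentials so they can be shipped to the executors.
    val buf = new DataOutputBuffer()
    creds.writeTokenStorageToStream(buf)
    val serializedTokens: Array[Byte] = buf.getData.take(buf.getLength)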


I would definitely appreciate some guidance, especially since whatever I or
ViaSat (my employer) get working, we would be interested in contributing
back; we would very much like to avoid maintaining a fork of Spark.

Tony

Re: Accessing Secure Hadoop from Mesos cluster

Posted by Michael Gummelt <mg...@mesosphere.io>.
DCOS Spark 1.6.1 supports Kerberos. It'll be available in DCOS 1.7, to be
released in a couple of weeks.

-- 
Michael Gummelt
Software Engineer
Mesosphere