
Posted to user@spark.apache.org by Daniel Schulz <da...@hotmail.com> on 2015/08/31 12:02:05 UTC

Data Security on Spark-on-HDFS

Hi guys,

In a nutshell: does Spark check and respect user privileges when reading/writing data?

I am curious about data security when Spark runs on top of HDFS, perhaps through YARN. Does Spark run its long-running JVM processes as a single Spark user that makes no distinction between users when accessing data? In other words, is there a gap because the JVM processes are already running, so that Spark ignores the launching user when accessing data on HDFS? Or does Spark only read and write data that the user who launched the job has access to?

What about local storage when running in Standalone mode? And what about access calls to HBase or Hive?

Thanks for taking the time.

Best regards, Daniel.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Data Security on Spark-on-HDFS

Posted by Steve Loughran <st...@hortonworks.com>.

> On 31 Aug 2015, at 11:02, Daniel Schulz <da...@hotmail.com> wrote:
> 
> Hi guys,
> 
> In a nutshell: does Spark check and respect user privileges when reading/writing data?

Yes, in a locked-down YARN cluster, until your tokens expire.

> 
> I am curious about data security when Spark runs on top of HDFS, perhaps through YARN. Does Spark run its long-running JVM processes as a single Spark user that makes no distinction between users when accessing data? In other words, is there a gap because the JVM processes are already running, so that Spark ignores the launching user when accessing data on HDFS? Or does Spark only read and write data that the user who launched the job has access to?

In a Kerberized YARN cluster, the processes run as the specific user who submitted the job (or as whoever the Kerberos ID -> OS ID mapping files say they are), and the delegation tokens are passed up from the client so that the application can talk to HDFS. In Spark 1.5, the Hive credentials are pushed up too.

This means that access is granted with the rights of the user deploying the application, with HDFS checking those rights on every request.
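
As a rough illustration of that per-request check, here is a toy model in Python. It is only a sketch: `ToyFileSystem`, `DelegationToken`, the paths, and the user names are all hypothetical, and none of this is the HDFS or Spark API. It assumes a token that simply records its owner and an expiry time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelegationToken:
    """Hypothetical token: records who submitted the app and when it expires."""
    owner: str
    expires_at: float  # seconds on an arbitrary clock


class ToyFileSystem:
    """Hypothetical stand-in for a secured file system; not the HDFS API."""

    def __init__(self, acl):
        # Maps a path to the set of users allowed to read it.
        self.acl = acl

    def read(self, path, token, now):
        # The check runs on every request, not once when the process starts.
        if now >= token.expires_at:
            raise PermissionError("token expired")
        if token.owner not in self.acl.get(path, set()):
            raise PermissionError(f"{token.owner} may not read {path}")
        return f"contents of {path}"


fs = ToyFileSystem({"/data/sales": {"alice"}})
alice = DelegationToken(owner="alice", expires_at=100.0)
bob = DelegationToken(owner="bob", expires_at=100.0)
```

In this model a read with `alice` succeeds before time 100 and fails afterwards, while a read with `bob` always fails, no matter which long-running process issues the request.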

It also means that when the HDFS delegation tokens expire, your HDFS access goes away. Spark 1.5 addresses this by letting you optionally provide a keytab for the app master, which uses it to re-authenticate with the KDC and then with HDFS. This changes the problem to "getting your cluster ops team to give you a keytab".
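
For reference, the keytab is handed over at submission time through the `--principal` and `--keytab` options of `spark-submit` on YARN. The Python sketch below only assembles such a command line and prints it; the application name, principal, and keytab path are placeholders, and `spark_submit_command` is a hypothetical helper, not part of Spark.

```python
import shlex


def spark_submit_command(app, principal, keytab):
    """Build a spark-submit argument list for a long-running YARN app.

    All argument values are placeholders supplied by the caller.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--principal", principal,  # Kerberos principal to log in as
        "--keytab", keytab,        # keytab used to re-authenticate
        app,
    ]


cmd = spark_submit_command(
    "my_app.py", "alice@EXAMPLE.COM", "/path/to/alice.keytab"
)
# Print a shell-safe rendering of the command instead of running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list, rather than one long string, keeps paths with spaces intact if the command is later run with `subprocess`.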

The new O'Reilly book, Hadoop Security, is the best starting point for Hadoop cluster security; the eBook is a worthwhile investment.

I'm writing a low-level document on the internals, though it's aimed more at developers and people debugging their code than at users of applications: https://github.com/steveloughran/kerberos_and_hadoop/

> 
> What about local storage when running in Standalone mode? And what about access calls to HBase or Hive?
> 

Someone else will have to cover that.