Posted to user@spark.apache.org by dbolshak <bo...@gmail.com> on 2016/10/13 09:50:22 UTC

spark with kerberos

Hello community,

We have a challenge and no idea how to solve it.

The problem:

Say we have the following environment:
1. `cluster A`: this cluster does not use Kerberos and we use it as a source
of data; importantly, we don't manage this cluster.
2. `cluster B`: a small cluster where our Spark application runs and
performs some logic (we manage this cluster and it does not have Kerberos).
3. `cluster C`: this cluster uses Kerberos and we use it to store the results
of our Spark application; we manage this cluster.

Our requirements and conditions not mentioned yet:
1. All clusters are in a single data center, but in different subnetworks.
2. We cannot turn on Kerberos on `cluster A`.
3. We cannot turn off Kerberos on `cluster C`.
4. We can turn Kerberos on/off on `cluster B`; currently it is turned off.
5. The Spark app is built on top of the RDD API and does not depend on
spark-sql.

Does anybody know how to write data using the RDD API to a remote cluster
that is running Kerberos?

-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.denis@gmail.com



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-with-kerberos-tp27894.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: spark with kerberos

Posted by Steve Loughran <st...@hortonworks.com>.
On 19 Oct 2016, at 00:18, Michael Segel <ms...@hotmail.com> wrote:

(Sorry, sent this reply via the wrong account...)

Steve,

Kinda hijacking the thread, but I promise it's still on topic to the OP's issue... ;-)

Usually you will end up having a local Kerberos set up per cluster.
So your machine accounts (hive, yarn, hbase, etc.) are going to be local to the cluster.


Not necessarily... you can share a KDC. And in a land of Active Directory you'd need some trust set up anyway.



So you will have to set up some sort of realm trusts between the clusters.

If you’re going to be setting up security (Kerberos … ick! shivers… ;-) you’re going to want to keep the machine accounts isolated to the cluster.
And the OP said that he didn’t control the other cluster which makes me believe that they are separate.


Good point; you may not be able to get the tickets for cluster C accounts. But if you can log in as a user of cluster C, that may be enough, as discussed below.


I would also think that you would have trouble with the credential… isn't it tied to a user at a specific machine?

There are two types of Kerberos identity: simple "hdfs@REALM" and server-specific "hdfs/server@REALM". The simple ones work just as well in small clusters; it's just that in larger clusters your KDCs (especially AD) tend to interpret an attempt by 200 machines to log in as user "hdfs@REALM" within 30s as an attempt to brute-force a password, and start rejecting logins. The separation into the hdfs/_HOST@REALM style avoids that, and may reduce the damage if a keytab leaks.

If the user submitting work is logged into the KDC of cluster C, e.g.:


kinit user@CLUSTERC


and spark is configured to ask for the extra namenode tokens,

spark.yarn.access.namenodes hdfs://cluster-c:8020


...then Spark MAY ask for those tokens, pass them up to cluster B, and so have them available for talking to cluster C. The submitted job uses those tokens, so it doesn't need to log in to Kerberos itself, and if cluster B is insecure, it doesn't need to worry about credentials and identity there. The HDFS client code just presents the token for cluster C when an attempt to talk to cluster C's datanodes is rejected with an "authenticate yourself" response.
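
Concretely, the submission sequence would be something like this (a sketch only, untested here; the port is the usual NN RPC one, and the class/jar names are placeholders for your own application):

kinit user@CLUSTERC

spark-submit \
  --master yarn \
  --conf spark.yarn.access.namenodes=hdfs://cluster-c:8020 \
  --class com.example.WriteToC \
  app.jar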

The main issue to me is: will that token get picked up and propagated to an insecure cluster, so as to support this operation? Because there's a risk that the ubiquitous static method UserGroupInformation.isSecurityEnabled() is being checked in places, and the cluster itself isn't secure (hadoop.security.authentication in core-site.xml is "simple", not "kerberos"). It looks like org.apache.spark.deploy.yarn.security.HDFSCredentialProvider is doing exactly that (as do HBase and Hive), meaning job submission doesn't fetch tokens unless the launching cluster's own configuration says it is secure.

One thing that could be attempted would be turning authentication to Kerberos just in the job launch config, and seeing if that will collect all the required tokens *without* getting confused by the fact that YARN and HDFS on cluster B don't need them.

spark.hadoop.hadoop.security.authentication=kerberos

I have no idea if this works; you'd have to try it and see.
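
That is, adding one more line to the sketch above (equally untested):

spark-submit \
  --master yarn \
  --conf spark.hadoop.hadoop.security.authentication=kerberos \
  --conf spark.yarn.access.namenodes=hdfs://cluster-c:8020 \
  ...

so that only the job's own configuration claims Kerberos while clusters A and B stay as they are; whether that confuses the insecure services is exactly what you'd have to test.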

(It's been a while since I looked at this, and I drank heavily to forget Kerberos… so I may be a bit fuzzy here.)


denying all knowledge of Kerberos is always a good tactic.

Re: spark with kerberos

Posted by Michael Segel <ms...@hotmail.com>.
(Sorry, sent this reply via the wrong account...)

Steve,

Kinda hijacking the thread, but I promise it's still on topic to the OP's issue... ;-)

Usually you will end up having a local Kerberos set up per cluster.
So your machine accounts (hive, yarn, hbase, etc.) are going to be local to the cluster.

So you will have to set up some sort of realm trusts between the clusters.

If you’re going to be setting up security (Kerberos … ick! shivers… ;-) you’re going to want to keep the machine accounts isolated to the cluster.
And the OP said that he didn’t control the other cluster which makes me believe that they are separate.


I would also think that you would have trouble with the credential… isn't it tied to a user at a specific machine?
(It's been a while since I looked at this, and I drank heavily to forget Kerberos… so I may be a bit fuzzy here.)

Thx

-Mike
On Oct 18, 2016, at 2:59 PM, Steve Loughran <st...@hortonworks.com> wrote:


On 17 Oct 2016, at 22:11, Michael Segel <mi...@hotmail.com> wrote:

@Steve you are going to have to explain what you mean by ‘turn Kerberos on’.

Taken one way… it could mean making cluster B secure and running Kerberos, and then you'd have to create some sort of trust between B and C.



I'd imagined making cluster B a kerberized cluster.

I don't think you need to go near trust relations though; ideally you'd just want the same accounts everywhere if you can. If not, the main thing is that the user submitting the job can get a credential for that far NN at job submission time, and that the credential is propagated all the way to the executors.


Did you mean to turn on Kerberos on the nodes in cluster B, so that each node becomes a trusted client that can connect to C?

OR

Did you mean to turn on Kerberos on the master node (e.g. an edge node) where the data persists if you collect() it, so it's off the cluster on a single machine, and then push it from there, so that only that machine has to have Kerberos running and be trusted by cluster C?


Note: in the second option, I hope I said it correctly, but I believe that you would be collecting the data to a client (edge node) before pushing it out to the secured cluster.





Does that make sense?

On Oct 14, 2016, at 1:32 PM, Steve Loughran <st...@hortonworks.com> wrote:


On 13 Oct 2016, at 10:50, dbolshak <bo...@gmail.com> wrote:

Hello community,

We have a challenge and no idea how to solve it.

The problem:

Say we have the following environment:
1. `cluster A`: this cluster does not use Kerberos and we use it as a source
of data; importantly, we don't manage this cluster.
2. `cluster B`: a small cluster where our Spark application runs and
performs some logic (we manage this cluster and it does not have Kerberos).
3. `cluster C`: this cluster uses Kerberos and we use it to store the results
of our Spark application; we manage this cluster.

Our requirements and conditions not mentioned yet:
1. All clusters are in a single data center, but in different subnetworks.
2. We cannot turn on Kerberos on `cluster A`.
3. We cannot turn off Kerberos on `cluster C`.
4. We can turn Kerberos on/off on `cluster B`; currently it is turned off.
5. The Spark app is built on top of the RDD API and does not depend on
spark-sql.

Does anybody know how to write data using the RDD API to a remote cluster
that is running Kerberos?

If you want to talk to the secure cluster, C, from code running in cluster B, you'll need to turn Kerberos on there. Maybe, just maybe, you could get away with Kerberos staying turned off, provided you, the user launching the application, are logged in to Kerberos yourself and so trusted by cluster C.

One of the problems you are likely to hit with Spark here is that it only collects the tokens needed to talk to HDFS at the time you launch the application, and by default it only knows about the cluster's own filesystem. You will need to tell Spark about the other filesystem at launch time, so it knows to authenticate with it as you and collect the tokens the application itself needs to work with Kerberos.

spark.yarn.access.namenodes=hdfs://cluster-c:8020

-Steve

ps: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/





Re: spark with kerberos

Posted by Steve Loughran <st...@hortonworks.com>.
On 17 Oct 2016, at 22:11, Michael Segel <mi...@hotmail.com> wrote:

@Steve you are going to have to explain what you mean by ‘turn Kerberos on’.

Taken one way… it could mean making cluster B secure and running Kerberos, and then you'd have to create some sort of trust between B and C.



I'd imagined making cluster B a kerberized cluster.

I don't think you need to go near trust relations though; ideally you'd just want the same accounts everywhere if you can. If not, the main thing is that the user submitting the job can get a credential for that far NN at job submission time, and that the credential is propagated all the way to the executors.


Did you mean to turn on Kerberos on the nodes in cluster B, so that each node becomes a trusted client that can connect to C?

OR

Did you mean to turn on Kerberos on the master node (e.g. an edge node) where the data persists if you collect() it, so it's off the cluster on a single machine, and then push it from there, so that only that machine has to have Kerberos running and be trusted by cluster C?


Note: in the second option, I hope I said it correctly, but I believe that you would be collecting the data to a client (edge node) before pushing it out to the secured cluster.





Does that make sense?

On Oct 14, 2016, at 1:32 PM, Steve Loughran <st...@hortonworks.com> wrote:


On 13 Oct 2016, at 10:50, dbolshak <bo...@gmail.com> wrote:

Hello community,

We have a challenge and no idea how to solve it.

The problem:

Say we have the following environment:
1. `cluster A`: this cluster does not use Kerberos and we use it as a source
of data; importantly, we don't manage this cluster.
2. `cluster B`: a small cluster where our Spark application runs and
performs some logic (we manage this cluster and it does not have Kerberos).
3. `cluster C`: this cluster uses Kerberos and we use it to store the results
of our Spark application; we manage this cluster.

Our requirements and conditions not mentioned yet:
1. All clusters are in a single data center, but in different subnetworks.
2. We cannot turn on Kerberos on `cluster A`.
3. We cannot turn off Kerberos on `cluster C`.
4. We can turn Kerberos on/off on `cluster B`; currently it is turned off.
5. The Spark app is built on top of the RDD API and does not depend on
spark-sql.

Does anybody know how to write data using the RDD API to a remote cluster
that is running Kerberos?

If you want to talk to the secure cluster, C, from code running in cluster B, you'll need to turn Kerberos on there. Maybe, just maybe, you could get away with Kerberos staying turned off, provided you, the user launching the application, are logged in to Kerberos yourself and so trusted by cluster C.

One of the problems you are likely to hit with Spark here is that it only collects the tokens needed to talk to HDFS at the time you launch the application, and by default it only knows about the cluster's own filesystem. You will need to tell Spark about the other filesystem at launch time, so it knows to authenticate with it as you and collect the tokens the application itself needs to work with Kerberos.

spark.yarn.access.namenodes=hdfs://cluster-c:8020

-Steve

ps: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/



Re: spark with kerberos

Posted by Steve Loughran <st...@hortonworks.com>.
On 13 Oct 2016, at 10:50, dbolshak <bo...@gmail.com> wrote:

Hello community,

We have a challenge and no idea how to solve it.

The problem:

Say we have the following environment:
1. `cluster A`: this cluster does not use Kerberos and we use it as a source
of data; importantly, we don't manage this cluster.
2. `cluster B`: a small cluster where our Spark application runs and
performs some logic (we manage this cluster and it does not have Kerberos).
3. `cluster C`: this cluster uses Kerberos and we use it to store the results
of our Spark application; we manage this cluster.

Our requirements and conditions not mentioned yet:
1. All clusters are in a single data center, but in different subnetworks.
2. We cannot turn on Kerberos on `cluster A`.
3. We cannot turn off Kerberos on `cluster C`.
4. We can turn Kerberos on/off on `cluster B`; currently it is turned off.
5. The Spark app is built on top of the RDD API and does not depend on
spark-sql.

Does anybody know how to write data using the RDD API to a remote cluster
that is running Kerberos?

If you want to talk to the secure cluster, C, from code running in cluster B, you'll need to turn Kerberos on there. Maybe, just maybe, you could get away with Kerberos staying turned off, provided you, the user launching the application, are logged in to Kerberos yourself and so trusted by cluster C.

One of the problems you are likely to hit with Spark here is that it only collects the tokens needed to talk to HDFS at the time you launch the application, and by default it only knows about the cluster's own filesystem. You will need to tell Spark about the other filesystem at launch time, so it knows to authenticate with it as you and collect the tokens the application itself needs to work with Kerberos.

spark.yarn.access.namenodes=hdfs://cluster-c:8020
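
e.g., set at launch time on the command line (the port here is the usual namenode RPC port; adjust to whatever your cluster C uses):

spark-submit --conf spark.yarn.access.namenodes=hdfs://cluster-c:8020 ...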

-Steve

ps: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/

Re: spark with kerberos

Posted by Saisai Shao <sa...@gmail.com>.
I think security has nothing to do with which API you use, Spark SQL or the
RDD API.

Assuming you're running on a YARN cluster (currently the only cluster
manager that supports Kerberos).

First you need to get a Kerberos TGT in your local spark-submit process;
after being authenticated by Kerberos, Spark can get delegation tokens from
HDFS so that you can communicate with a secure Hadoop cluster. In your case,
since you have to communicate with other remote HDFS clusters, you have to
get tokens from all of them: you can configure "spark.yarn.access.namenodes"
to list all the secure HDFS clusters you want to access, and the Hadoop
client API will then get tokens from all of these clusters, as sketched
below.
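
For example (the host names here are made up; list every secure namenode you
need, comma separated):

spark.yarn.access.namenodes hdfs://cluster-c-nn:8020,hdfs://other-secure-nn:8020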

For the details you could refer to
https://spark.apache.org/docs/latest/running-on-yarn.html.

I didn't try this personally since I haven't had such a requirement; it may
require additional steps that I missed, but you could give it a try.


On Thu, Oct 13, 2016 at 6:38 PM, Denis Bolshakov <bo...@gmail.com>
wrote:

> The problem happens when writing (reading works fine):
>
> rdd.saveAsNewAPIHadoopFile
>
> We use just RDD and HDFS, no other things.
> Spark version 1.6.1.
> `Cluster A` - CDH 5.7.1
> `Cluster B` - vanilla Hadoop 2.6.5
> `Cluster C` - CDH 5.8.0
>
> Best regards,
> Denis
>
> On 13 October 2016 at 13:06, ayan guha <gu...@gmail.com> wrote:
>
>> And a few more details on the Spark version, Hadoop version and
>> distribution would also help...
>>
>> On Thu, Oct 13, 2016 at 9:05 PM, ayan guha <gu...@gmail.com> wrote:
>>
>>> I think one point you need to mention is your target - HDFS, Hive or
>>> HBase (or something else) - and which endpoints are used.
>>>
>>> On Thu, Oct 13, 2016 at 8:50 PM, dbolshak <bo...@gmail.com>
>>> wrote:
>>>
>>>> Hello community,
>>>>
>>>> We have a challenge and no idea how to solve it.
>>>>
>>>> The problem:
>>>>
>>>> Say we have the following environment:
>>>> 1. `cluster A`: this cluster does not use Kerberos and we use it as a
>>>> source of data; importantly, we don't manage this cluster.
>>>> 2. `cluster B`: a small cluster where our Spark application runs and
>>>> performs some logic (we manage this cluster and it does not have
>>>> Kerberos).
>>>> 3. `cluster C`: this cluster uses Kerberos and we use it to store the
>>>> results of our Spark application; we manage this cluster.
>>>>
>>>> Our requirements and conditions not mentioned yet:
>>>> 1. All clusters are in a single data center, but in different
>>>> subnetworks.
>>>> 2. We cannot turn on Kerberos on `cluster A`.
>>>> 3. We cannot turn off Kerberos on `cluster C`.
>>>> 4. We can turn Kerberos on/off on `cluster B`; currently it is turned
>>>> off.
>>>> 5. The Spark app is built on top of the RDD API and does not depend on
>>>> spark-sql.
>>>>
>>>> Does anybody know how to write data using the RDD API to a remote
>>>> cluster that is running Kerberos?
>>>>
>>>> --
>>>> //with Best Regards
>>>> --Denis Bolshakov
>>>> e-mail: bolshakov.denis@gmail.com
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.denis@gmail.com
>
>
>

Re: spark with kerberos

Posted by Denis Bolshakov <bo...@gmail.com>.
The problem happens when writing (reading works fine):

rdd.saveAsNewAPIHadoopFile

We use just RDD and HDFS, no other things.
Spark version 1.6.1.
`Cluster A` - CDH 5.7.1
`Cluster B` - vanilla Hadoop 2.6.5
`Cluster C` - CDH 5.8.0
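
In a simplified form the write looks roughly like this (a sketch, not our
actual code; the output path and record types here are made up):

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.rdd.RDD

// rdd holds the results we want to persist on the kerberized cluster C.
// Nothing here authenticates explicitly; the executors rely on whatever
// delegation tokens were collected at submission time.
def writeToClusterC(rdd: RDD[String]): Unit = {
  rdd
    .map(line => (NullWritable.get(), new Text(line))) // to (key, value) pairs
    .saveAsNewAPIHadoopFile(
      "hdfs://cluster-c:8020/data/results",            // path on cluster C
      classOf[NullWritable],
      classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]])
}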

Best regards,
Denis

On 13 October 2016 at 13:06, ayan guha <gu...@gmail.com> wrote:

> And a few more details on the Spark version, Hadoop version and
> distribution would also help...
>
> On Thu, Oct 13, 2016 at 9:05 PM, ayan guha <gu...@gmail.com> wrote:
>
>> I think one point you need to mention is your target - HDFS, Hive or
>> HBase (or something else) - and which endpoints are used.
>>
>> On Thu, Oct 13, 2016 at 8:50 PM, dbolshak <bo...@gmail.com>
>> wrote:
>>
>>> Hello community,
>>>
>>> We have a challenge and no idea how to solve it.
>>>
>>> The problem:
>>>
>>> Say we have the following environment:
>>> 1. `cluster A`: this cluster does not use Kerberos and we use it as a
>>> source of data; importantly, we don't manage this cluster.
>>> 2. `cluster B`: a small cluster where our Spark application runs and
>>> performs some logic (we manage this cluster and it does not have
>>> Kerberos).
>>> 3. `cluster C`: this cluster uses Kerberos and we use it to store the
>>> results of our Spark application; we manage this cluster.
>>>
>>> Our requirements and conditions not mentioned yet:
>>> 1. All clusters are in a single data center, but in different
>>> subnetworks.
>>> 2. We cannot turn on Kerberos on `cluster A`.
>>> 3. We cannot turn off Kerberos on `cluster C`.
>>> 4. We can turn Kerberos on/off on `cluster B`; currently it is turned
>>> off.
>>> 5. The Spark app is built on top of the RDD API and does not depend on
>>> spark-sql.
>>>
>>> Does anybody know how to write data using the RDD API to a remote
>>> cluster that is running Kerberos?
>>>
>>> --
>>> //with Best Regards
>>> --Denis Bolshakov
>>> e-mail: bolshakov.denis@gmail.com
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.denis@gmail.com

Re: spark with kerberos

Posted by ayan guha <gu...@gmail.com>.
And a few more details on the Spark version, Hadoop version and distribution
would also help...

On Thu, Oct 13, 2016 at 9:05 PM, ayan guha <gu...@gmail.com> wrote:

> I think one point you need to mention is your target - HDFS, Hive or HBase
> (or something else) - and which endpoints are used.
>
> On Thu, Oct 13, 2016 at 8:50 PM, dbolshak <bo...@gmail.com>
> wrote:
>
>> Hello community,
>>
>> We have a challenge and no idea how to solve it.
>>
>> The problem:
>>
>> Say we have the following environment:
>> 1. `cluster A`: this cluster does not use Kerberos and we use it as a
>> source of data; importantly, we don't manage this cluster.
>> 2. `cluster B`: a small cluster where our Spark application runs and
>> performs some logic (we manage this cluster and it does not have
>> Kerberos).
>> 3. `cluster C`: this cluster uses Kerberos and we use it to store the
>> results of our Spark application; we manage this cluster.
>>
>> Our requirements and conditions not mentioned yet:
>> 1. All clusters are in a single data center, but in different
>> subnetworks.
>> 2. We cannot turn on Kerberos on `cluster A`.
>> 3. We cannot turn off Kerberos on `cluster C`.
>> 4. We can turn Kerberos on/off on `cluster B`; currently it is turned off.
>> 5. The Spark app is built on top of the RDD API and does not depend on
>> spark-sql.
>>
>> Does anybody know how to write data using the RDD API to a remote
>> cluster that is running Kerberos?
>>
>> --
>> //with Best Regards
>> --Denis Bolshakov
>> e-mail: bolshakov.denis@gmail.com
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards,
Ayan Guha

Re: spark with kerberos

Posted by ayan guha <gu...@gmail.com>.
I think one point you need to mention is your target - HDFS, Hive or HBase
(or something else) - and which endpoints are used.

On Thu, Oct 13, 2016 at 8:50 PM, dbolshak <bo...@gmail.com> wrote:

> Hello community,
>
> We have a challenge and no idea how to solve it.
>
> The problem:
>
> Say we have the following environment:
> 1. `cluster A`: this cluster does not use Kerberos and we use it as a
> source of data; importantly, we don't manage this cluster.
> 2. `cluster B`: a small cluster where our Spark application runs and
> performs some logic (we manage this cluster and it does not have
> Kerberos).
> 3. `cluster C`: this cluster uses Kerberos and we use it to store the
> results of our Spark application; we manage this cluster.
>
> Our requirements and conditions not mentioned yet:
> 1. All clusters are in a single data center, but in different
> subnetworks.
> 2. We cannot turn on Kerberos on `cluster A`.
> 3. We cannot turn off Kerberos on `cluster C`.
> 4. We can turn Kerberos on/off on `cluster B`; currently it is turned off.
> 5. The Spark app is built on top of the RDD API and does not depend on
> spark-sql.
>
> Does anybody know how to write data using the RDD API to a remote
> cluster that is running Kerberos?
>
> --
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.denis@gmail.com
>


-- 
Best Regards,
Ayan Guha