Posted to common-dev@hadoop.apache.org by Eric Yang <er...@gmail.com> on 2020/05/06 22:32:36 UTC

[DISCUSS] Secure Hadoop without Kerberos

Hi all,

Kerberos was developed decades before web development became popular.
There are some Kerberos limitations that do not work well in Hadoop.  A
few examples of corner cases:

1. A Kerberos principal doesn't encode a port number, so it is difficult to
know whether the principal is coming from an authorized daemon or from a
rogue container trying to forge a service principal.
2. Hadoop Kerberos principals are used as highly privileged principals, a
form of credential used to impersonate end users.
3. Delegation tokens may allow expired users to continue running jobs long
after they are gone, without rechecking whether the end user's credentials
are still valid.
4. Passing different forms of tokens does not work well with cloud provider
security mechanisms.  For example, when passing an AWS STS token for an S3
bucket, there is no renewal mechanism, nor a good way to identify when the
token would expire.

There are companies that work on bridging security mechanisms of different
types, but this is not a primary goal for Hadoop.  Hadoop can benefit from
modernized security using open standards like OpenID Connect, which
proposes to unify web applications using SSO.  This ensures that the client
credentials are transported at each stage of the client-server interaction.
This may improve overall security, and provide a more cloud-native form
factor.  I wonder if there is any interest in the community to enable
Hadoop OpenID Connect integration work?

regards,
Eric

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Eric Yang <er...@gmail.com>.
Hi Steve,

Thank you for sharing the work done to make the Amazon STS token work with
the s3a connector.  This works for direct HDFS to S3 bucket interaction.
Your statement is also spot on that containers running in YARN have no
mechanism to update the triple of session credentials.
If I am not mistaken, the Amazon STS token is not renewable and has a
maximum lifetime of 12 hours.  A new token must be obtained for the AWS
role for long-running containers.  There are a number of ways to fix
session issues for YARN:

1.  RM keeps track of the session and login secrets, and injects the STS
token into the container's running environment periodically.  (A nasty hack
to modify the environment variables of a running process.)
2.  Transport the client access key and secret key to the container, and
the container performs the re-login process.
3.  If the user home directory contains ~/.aws/credentials on all nodes,
this works without code change, but it is an operational nightmare.
4.  Streamline the token handling to use an OIDC JWT token, and have client
libraries always check with the OIDC server to keep the token fresh (a
rough sketch is included below).

Options 1-3 might work with the existing s3a connector with some
modification to the application as well.  Option 4 aims to modify the
Hadoop libraries so that authentication and token renewal happen
transparently.  This allows existing applications to work by swapping jar
files only, without further code modification.  It will also improve
security because session expiration is synchronized.  I am leaning toward
addressing the fundamental problem, and I know the community has spent
years of improvement to get to this point.  However, Hadoop needs a way
forward.  This discussion helps to determine whether it is essential to
support OIDC as an alternate security mechanism, how to do it using
existing code, and how not to break existing code.
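
For illustration, below is a minimal sketch of what option 4 could look like
on the client side, assuming Java 11's HTTP client.  OidcTokenManager is a
hypothetical helper, not an existing Hadoop class, and the token endpoint
and response parsing are assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

// Hypothetical helper: keeps an OIDC access token fresh by calling the
// IdP token endpoint before the current token expires.
public class OidcTokenManager {
  private final URI tokenEndpoint;    // e.g. https://idp.example.com/oauth2/token (assumed)
  private final String refreshToken;  // obtained out of band from the IdP
  private volatile String accessToken;
  private volatile Instant expiresAt = Instant.EPOCH;

  public OidcTokenManager(URI tokenEndpoint, String refreshToken) {
    this.tokenEndpoint = tokenEndpoint;
    this.refreshToken = refreshToken;
  }

  // Returns a valid access token, refreshing it when it is close to expiry.
  public synchronized String getAccessToken() throws Exception {
    if (Instant.now().isAfter(expiresAt.minusSeconds(60))) {
      HttpRequest request = HttpRequest.newBuilder(tokenEndpoint)
          .header("Content-Type", "application/x-www-form-urlencoded")
          .POST(HttpRequest.BodyPublishers.ofString(
              "grant_type=refresh_token&refresh_token=" + refreshToken))
          .build();
      HttpResponse<String> response = HttpClient.newHttpClient()
          .send(request, HttpResponse.BodyHandlers.ofString());
      // Parsing of the JSON response (access_token, expires_in) is omitted;
      // a real client would use a JSON or OIDC library here.
      accessToken = parseAccessToken(response.body());
      expiresAt = Instant.now().plusSeconds(parseExpiresIn(response.body()));
    }
    return accessToken;
  }

  private String parseAccessToken(String json) { /* omitted for brevity */ return json; }
  private long parseExpiresIn(String json) { /* omitted for brevity */ return 3600L; }
}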

regards,
Eric

On Thu, May 21, 2020 at 9:22 AM Steve Loughran <st...@cloudera.com.invalid>
wrote:

> On Wed, 6 May 2020 at 23:32, Eric Yang <er...@gmail.com> wrote:
>
> > Hi all,
> >
> >
> > 4.  Passing different form of tokens does not work well with cloud
> provider
> > security mechanism.  For example, passing AWS sts token for S3 bucket.
> > There is no renewal mechanism, nor good way to identify when the token
> > would expire.
> >
> >
> well, HADOOP-14556 does it fairly well, supporting session and role tokens.
> We even know when they expire because we ask for a duration when we request
> the session/role creds.
> See org.apache.hadoop.fs.s3a.auth.delegation.AbstractS3ATokenIdentifier for
> the core of what we marshall, including encryption secrets.
>
> The main issue there is that Yarn can't refresh those tokens because a new
> triple of session credentials are required; currently token renewal assumes
> the token is unchanged and a request is made to the service to update their
> table of issued tokens. But even if the RM could get back a new token from
> a refresh call, we are left with the problem of "how to get an updated set
> of creds to each process"
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Eric Yang <ey...@apache.org>.
This sounds promising and is really fantastic news.  We look forward to this
feature; let us know what we can do to help.  Thanks

regards,
Eric

On Tue, May 26, 2020 at 10:55 AM Daryn Sharp <da...@verizonmedia.com.invalid>
wrote:

> There’s a few too many issues being mixed here.
>
>
> We aren’t very far from having OIDC support.  The pre-requisite RPC/TLS &
> RPC/mTLS recently completed rollout to our entire production grid.
> Majority of the past year was spent shaking out bugs and ensuring 100%
> compatibility.  There are a few rough edges I need to clean up for a
> community release.
>
>
> A few weeks ago I created a rough POC to leverage RPC/mTLS with OIDC access
> tokens.  Goal is a mTLS cert may be blessed to impersonate with an access
> token.  A compromised service may only be abused to impersonate users that
> have recently accessed said service.
>
>
> Kerberos, mTLs, and OIDC may all be simultaneously supported.  Part of the
> simplicity is regardless of the client’s authn/authz, delegation tokens are
> still acquired by jobs to avoid short-lived identity credential expiration.
>
>
> Credential refreshing is a bigger can of worms that requires careful
> thought and a separate discussion.
>
> On Thu, May 21, 2020 at 12:32 PM Vipin Rathor <v....@gmail.com> wrote:
>
> > Hi Eric,
> >
> > Thanks for starting this discussion.
> >
> > Kerberos was developed decade before web development becomes popular.
> > > There are some Kerberos limitations which does not work well in Hadoop.
> > >
> > Sure, Kerberos was developed long before the web but it was selected as
> de
> > facto authentication mechanism in Hadoop after the internet boom. And it
> > was selected for a reason - it is one of the strongest symmetric key
> based
> > authentication mechanism out there which doesn't transmit the password in
> > the plain text. Kerberos has been around since long and has stood the
> test
> > of time.
> >
> >  Microsoft Active Directory, which is extensively used in many
> > > organizations, is based on Kerberos.
> > >
> > +1 to this.
> > And the fact that Microsoft has put Active Directory in Azure too, tells
> me
> > that AD (and thereof Kerberos) is not going away any time soon.
> >
> > Overall, I agree with Rajive and Craig on this topic. Paving way for the
> > OpenID Connect in Hadoop is a good idea but seeing it as a replacement to
> > Kerberos, needs to be carefully thought out. All the problems, that are
> > described in the original mail, are not really Kerberos issues.
> > Yes, we do understand that making Kerberos work *in a right way* is
> always
> > an uphill task (I'm a long time Kerberos+Hadoop Support Engineer) but
> that
> > shouldn't be the reason to replace it.
> >
> > Hint: CVE-2020-9492
> > >
> > Btw, the CVE-2020-9492 is not accessible right now in the CVE database,
> > maybe it is not yet public.
> >
> > On Thu, May 21, 2020 at 9:22 AM Steve Loughran
> <stevel@cloudera.com.invalid
> > >
> > wrote:
> >
> > > On Wed, 6 May 2020 at 23:32, Eric Yang <er...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > >
> > > > 4.  Passing different form of tokens does not work well with cloud
> > > provider
> > > > security mechanism.  For example, passing AWS sts token for S3
> bucket.
> > > > There is no renewal mechanism, nor good way to identify when the
> token
> > > > would expire.
> > > >
> > > >
> > > well, HADOOP-14556 does it fairly well, supporting session and role
> > tokens.
> > > We even know when they expire because we ask for a duration when we
> > request
> > > the session/role creds.
> > > See org.apache.hadoop.fs.s3a.auth.delegation.AbstractS3ATokenIdentifier
> > for
> > > the core of what we marshall, including encryption secrets.
> > >
> > > The main issue there is that Yarn can't refresh those tokens because a
> > new
> > > triple of session credentials are required; currently token renewal
> > assumes
> > > the token is unchanged and a request is made to the service to update
> > their
> > > table of issued tokens. But even if the RM could get back a new token
> > from
> > > a refresh call, we are left with the problem of "how to get an updated
> > set
> > > of creds to each process"
> > >
> >
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Daryn Sharp <da...@verizonmedia.com.INVALID>.
There are a few too many issues being mixed here.


We aren’t very far from having OIDC support.  The prerequisite RPC/TLS and
RPC/mTLS work recently completed rollout to our entire production grid.
The majority of the past year was spent shaking out bugs and ensuring 100%
compatibility.  There are a few rough edges I need to clean up for a
community release.


A few weeks ago I created a rough POC to leverage RPC/mTLS with OIDC access
tokens.  The goal is that an mTLS cert may be blessed to impersonate with an
access token.  A compromised service may only be abused to impersonate users
that have recently accessed said service.


Kerberos, mTLS, and OIDC may all be simultaneously supported.  Part of the
simplicity is that, regardless of the client’s authn/authz, delegation
tokens are still acquired by jobs to avoid short-lived identity credential
expiration.
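
To make that concrete, here is a rough sketch of how a job client already
collects delegation tokens through the existing FileSystem API, independent
of how the submitter authenticated.  The namenode URI and renewer name
below are illustrative:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public class JobTokenAcquisition {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Credentials creds = new Credentials();
    // Collect delegation tokens for each filesystem the job will touch; the
    // tokens travel with the job so tasks do not depend on the submitter's
    // short-lived identity credential (Kerberos TGT, mTLS cert or OIDC token).
    FileSystem fs = FileSystem.get(new URI("hdfs://nn.example.com:8020/"), conf);
    Token<?>[] issued = fs.addDelegationTokens("yarn", creds);
    for (Token<?> t : issued) {
      System.out.println("Acquired " + t.getKind() + " for " + t.getService());
    }
  }
}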


Credential refreshing is a bigger can of worms that requires careful
thought and a separate discussion.

On Thu, May 21, 2020 at 12:32 PM Vipin Rathor <v....@gmail.com> wrote:

> Hi Eric,
>
> Thanks for starting this discussion.
>
> Kerberos was developed decade before web development becomes popular.
> > There are some Kerberos limitations which does not work well in Hadoop.
> >
> Sure, Kerberos was developed long before the web but it was selected as de
> facto authentication mechanism in Hadoop after the internet boom. And it
> was selected for a reason - it is one of the strongest symmetric key based
> authentication mechanism out there which doesn't transmit the password in
> the plain text. Kerberos has been around since long and has stood the test
> of time.
>
>  Microsoft Active Directory, which is extensively used in many
> > organizations, is based on Kerberos.
> >
> +1 to this.
> And the fact that Microsoft has put Active Directory in Azure too, tells me
> that AD (and thereof Kerberos) is not going away any time soon.
>
> Overall, I agree with Rajive and Craig on this topic. Paving way for the
> OpenID Connect in Hadoop is a good idea but seeing it as a replacement to
> Kerberos, needs to be carefully thought out. All the problems, that are
> described in the original mail, are not really Kerberos issues.
> Yes, we do understand that making Kerberos work *in a right way* is always
> an uphill task (I'm a long time Kerberos+Hadoop Support Engineer) but that
> shouldn't be the reason to replace it.
>
> Hint: CVE-2020-9492
> >
> Btw, the CVE-2020-9492 is not accessible right now in the CVE database,
> maybe it is not yet public.
>
> On Thu, May 21, 2020 at 9:22 AM Steve Loughran <stevel@cloudera.com.invalid
> >
> wrote:
>
> > On Wed, 6 May 2020 at 23:32, Eric Yang <er...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > >
> > > 4.  Passing different form of tokens does not work well with cloud
> > provider
> > > security mechanism.  For example, passing AWS sts token for S3 bucket.
> > > There is no renewal mechanism, nor good way to identify when the token
> > > would expire.
> > >
> > >
> > well, HADOOP-14556 does it fairly well, supporting session and role
> tokens.
> > We even know when they expire because we ask for a duration when we
> request
> > the session/role creds.
> > See org.apache.hadoop.fs.s3a.auth.delegation.AbstractS3ATokenIdentifier
> for
> > the core of what we marshall, including encryption secrets.
> >
> > The main issue there is that Yarn can't refresh those tokens because a
> new
> > triple of session credentials are required; currently token renewal
> assumes
> > the token is unchanged and a request is made to the service to update
> their
> > table of issued tokens. But even if the RM could get back a new token
> from
> > a refresh call, we are left with the problem of "how to get an updated
> set
> > of creds to each process"
> >
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Vipin Rathor <v....@gmail.com>.
Hi Eric,

Thanks for starting this discussion.

Kerberos was developed decade before web development becomes popular.
> There are some Kerberos limitations which does not work well in Hadoop.
>
Sure, Kerberos was developed long before the web, but it was selected as the
de facto authentication mechanism in Hadoop after the internet boom. And it
was selected for a reason - it is one of the strongest symmetric-key-based
authentication mechanisms out there that doesn't transmit the password in
plain text. Kerberos has been around for a long time and has stood the test
of time.

 Microsoft Active Directory, which is extensively used in many
> organizations, is based on Kerberos.
>
+1 to this.
And the fact that Microsoft has put Active Directory in Azure too tells me
that AD (and therefore Kerberos) is not going away any time soon.

Overall, I agree with Rajive and Craig on this topic. Paving the way for
OpenID Connect in Hadoop is a good idea, but seeing it as a replacement for
Kerberos needs to be carefully thought out. The problems described in the
original mail are not really Kerberos issues.
Yes, we do understand that making Kerberos work *in the right way* is always
an uphill task (I'm a long-time Kerberos+Hadoop support engineer), but that
shouldn't be the reason to replace it.

Hint: CVE-2020-9492
>
Btw, CVE-2020-9492 is not accessible right now in the CVE database;
maybe it is not yet public.

On Thu, May 21, 2020 at 9:22 AM Steve Loughran <st...@cloudera.com.invalid>
wrote:

> On Wed, 6 May 2020 at 23:32, Eric Yang <er...@gmail.com> wrote:
>
> > Hi all,
> >
> >
> > 4.  Passing different form of tokens does not work well with cloud
> provider
> > security mechanism.  For example, passing AWS sts token for S3 bucket.
> > There is no renewal mechanism, nor good way to identify when the token
> > would expire.
> >
> >
> well, HADOOP-14556 does it fairly well, supporting session and role tokens.
> We even know when they expire because we ask for a duration when we request
> the session/role creds.
> See org.apache.hadoop.fs.s3a.auth.delegation.AbstractS3ATokenIdentifier for
> the core of what we marshall, including encryption secrets.
>
> The main issue there is that Yarn can't refresh those tokens because a new
> triple of session credentials are required; currently token renewal assumes
> the token is unchanged and a request is made to the service to update their
> table of issued tokens. But even if the RM could get back a new token from
> a refresh call, we are left with the problem of "how to get an updated set
> of creds to each process"
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
On Wed, 6 May 2020 at 23:32, Eric Yang <er...@gmail.com> wrote:

> Hi all,
>
>
> 4.  Passing different form of tokens does not work well with cloud provider
> security mechanism.  For example, passing AWS sts token for S3 bucket.
> There is no renewal mechanism, nor good way to identify when the token
> would expire.
>
>
Well, HADOOP-14556 does it fairly well, supporting session and role tokens.
We even know when they expire because we ask for a duration when we request
the session/role creds.
See org.apache.hadoop.fs.s3a.auth.delegation.AbstractS3ATokenIdentifier for
the core of what we marshall, including encryption secrets.

The main issue there is that YARN can't refresh those tokens because a new
triple of session credentials is required; currently token renewal assumes
the token is unchanged and a request is made to the service to update its
table of issued tokens. But even if the RM could get back a new token from
a refresh call, we are left with the problem of "how to get an updated set
of creds to each process".
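
As a rough sketch of that flow (assuming the bucket is configured with a
session or role delegation token binding via fs.s3a.delegation.token.binding;
the bucket name and renewer are illustrative), a client can ask the S3A
filesystem for a token and inspect the identifier it marshalls:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class S3ATokenProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf);

    // Ask the filesystem to issue a delegation token; "yarn" is only an
    // illustrative renewer name.
    Token<?> token = fs.getDelegationToken("yarn");
    if (token == null) {
      System.out.println("No delegation token binding enabled for this bucket");
      return;
    }
    // The marshalled identifier (an AbstractS3ATokenIdentifier subclass)
    // carries the session credentials, encryption secrets and, for
    // session/role bindings, an expiry derived from the requested duration.
    TokenIdentifier id = token.decodeIdentifier();
    System.out.println("kind=" + token.getKind() + " identifier=" + id);
  }
}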

Re: [EXTERNAL] Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by "Craig.Condit" <Cr...@target.com>.
I have to strongly disagree with making UGI.doAs() private. Just because you
feel that impersonation isn't an important feature does not make it so for
all users. There are many valid use cases which require impersonation, and
in fact I consider this to be one of the differentiating features of the
Hadoop ecosystem. We make use of it heavily to build a variety of services
which would not be possible without this.

Also consider that in addition to gateway services such as Knox being broken
by this change, you would also cripple job schedulers such as Oozie. Running
workloads on YARN as different users is vital to ensure that queue resources
are allocated and accounted for properly, as well as file permissions
enforced. Without impersonation, all users of a cluster would need to be
granted access to talk directly to YARN. Higher-level access points or APIs
would not be possible.
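
To make the use case concrete, here is a minimal sketch of the proxy-user
pattern these services rely on today, using the existing UGI API.  The
principal, keytab path and user names are illustrative, and the service must
be whitelisted via the hadoop.proxyuser.* settings:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The service logs in with its own (privileged) credentials, e.g. a keytab.
    UserGroupInformation service =
        UserGroupInformation.loginUserFromKeytabAndReturnUGI(
            "oozie/host.example.com@EXAMPLE.COM",
            "/etc/security/keytabs/oozie.keytab");
    // It then creates a proxy UGI for the end user it acts on behalf of.
    UserGroupInformation proxy =
        UserGroupInformation.createProxyUser("alice", service);
    // Work done inside doAs() is performed as "alice": file permissions and
    // YARN queue accounting see alice, not the service principal.
    proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
      FileSystem fs = FileSystem.get(conf);
      fs.listStatus(new Path("/user/alice"));
      return null;
    });
  }
}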

Craig Condit

________________________________
From: Eric Yang <er...@gmail.com>
Sent: Wednesday, May 20, 2020 1:57 PM
To: Akira Ajisaka <aa...@apache.org>
Cc: Hadoop Common <co...@hadoop.apache.org>
Subject: [EXTERNAL] Re: [DISCUSS] Secure Hadoop without Kerberos

Hi Akira,

Thank you for the information.  Knox plays a main role in reverse proxy for
Hadoop cluster.  I understand the importance to keep Knox running to
centralize audit log for ingress into the cluster.  Other reverse proxy
solution like Nginx are more feature rich for caching static contents and
load balancer.  It would be great to have ability to use either Knox or
Nginx as reverse proxy solution.  Company wide OIDC is likely to run
independently from Hadoop cluster, but also possible to run in a Hadoop
cluster.  Reverse proxy must have ability to redirects to OIDC where
exposed endpoint is appropriate.

HADOOP-11717 was a good effort to enable SSO integration except it is
written to extend on Kerberos authentication, which prevents decoupling
from Kerberos a reality.  I gathered a few design requirements this
morning, and welcome to contribute:

1.  Encryption is mandatory.  Server certificate validation is required.
2.  Existing token infrastructure for block access token remains the same.
3.  Replace delegation token transport with OIDC JWT token.
4.  Patch token renewer logic to support renew token with OIDC endpoint
before token expires.
5.  Impersonation logic uses service user credentials.  New way to renew
service user credentials securely.
6.  Replace Hadoop RPC SASL transport with TLS because OIDC works with TLS
natively.
7.  Command CLI improvements to use environment variables or files for
accessing client credentials

Downgrade the use of UGI.doAs() to private of Hadoop.  Service should not
run with elevated privileges unless there is a good reason for it (i.e.
loading hive external tables).
I think this is good starting point, and feedback can help to turn these
requirements into tasks.  Let me know what you think.  Thanks

regards,
Eric

On Tue, May 19, 2020 at 9:47 PM Akira Ajisaka <aa...@apache.org> wrote:

> Hi Eric, thank you for starting the discussion.
>
> I'm interested in OpenID Connect (OIDC) integration.
>
> In addition to the benefits (security, cloud native), operating costs may
> be reduced in some companies.
> We have our company-wide OIDC provider and enable SSO for Hadoop Web UIs
> via Knox + OIDC in Yahoo! JAPAN.
> On the other hand, Hadoop administrators have to manage our own KDC
> servers only for Hadoop ecosystems.
> If Hadoop and its ecosystem can support OIDC, we don't have to manage KDC
> and that way operating costs will be reduced.
>
> Regards,
> Akira
>
> On Thu, May 7, 2020 at 7:32 AM Eric Yang <er...@gmail.com> wrote:
>
>> Hi all,
>>
>> Kerberos was developed decade before web development becomes popular.
>> There are some Kerberos limitations which does not work well in Hadoop.  A
>> few examples of corner cases:
>>
>> 1. Kerberos principal doesn't encode port number, it is difficult to know
>> if the principal is coming from an authorized daemon or a hacker container
>> trying to forge service principal.
>> 2. Hadoop Kerberos principals are used as high privileged principal, a
>> form
>> of credential to impersonate end user.
>> 3. Delegation token may allow expired users to continue to run jobs long
>> after they are gone, without rechecking if end user credentials is still
>> valid.
>> 4.  Passing different form of tokens does not work well with cloud
>> provider
>> security mechanism.  For example, passing AWS sts token for S3 bucket.
>> There is no renewal mechanism, nor good way to identify when the token
>> would expire.
>>
>> There are companies that work on bridging security mechanism of different
>> types, but this is not primary goal for Hadoop.  Hadoop can benefit from
>> modernized security using open standards like OpenID Connect, which
>> proposes to unify web applications using SSO.   This ensure the client
>> credentials are transported in each stage of client servers interaction.
>> This may improve overall security, and provide more cloud native form
>> factor.  I wonder if there is any interested in the community to enable
>> Hadoop OpenID Connect integration work?
>>
>> regards,
>> Eric
>>
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Eric Yang <er...@gmail.com>.
Hi Akira,

Thank you for the information.  Knox plays a main role as a reverse proxy
for the Hadoop cluster.  I understand the importance of keeping Knox running
to centralize the audit log for ingress into the cluster.  Other reverse
proxy solutions like Nginx are more feature-rich for caching static content
and load balancing.  It would be great to have the ability to use either
Knox or Nginx as the reverse proxy solution.  A company-wide OIDC provider
is likely to run independently from the Hadoop cluster, but it could also
run inside a Hadoop cluster.  The reverse proxy must be able to redirect to
OIDC wherever the exposed endpoint is appropriate.

HADOOP-11717 was a good effort to enable SSO integration, except that it is
written as an extension of Kerberos authentication, which prevents
decoupling from Kerberos from becoming a reality.  I gathered a few design
requirements this morning, and contributions are welcome:

1.  Encryption is mandatory.  Server certificate validation is required.
2.  The existing token infrastructure for block access tokens remains the
same.
3.  Replace the delegation token transport with an OIDC JWT token.
4.  Patch the token renewer logic to renew the token with the OIDC endpoint
before the token expires.
5.  Impersonation logic uses service user credentials; a new way to renew
service user credentials securely is needed.
6.  Replace the Hadoop RPC SASL transport with TLS, because OIDC works with
TLS natively.
7.  Command CLI improvements to use environment variables or files for
accessing client credentials (a rough sketch follows below).

Downgrade UGI.doAs() to be private to Hadoop.  Services should not
run with elevated privileges unless there is a good reason for it (e.g.
loading Hive external tables).
I think this is a good starting point, and feedback can help to turn these
requirements into tasks.  Let me know what you think.  Thanks
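
A minimal sketch of what requirement 7 could look like for a CLI client.
The HADOOP_OIDC_TOKEN and HADOOP_OIDC_TOKEN_FILE environment variable names
are assumptions, not existing Hadoop settings:

import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for requirement 7: resolve the client's OIDC bearer
// token from an environment variable or a file, never from a CLI argument
// (which would leak into process listings and shell history).
public final class ClientCredentialResolver {
  public static String resolveBearerToken() throws Exception {
    String fromEnv = System.getenv("HADOOP_OIDC_TOKEN");        // assumed name
    if (fromEnv != null && !fromEnv.isEmpty()) {
      return fromEnv.trim();
    }
    String tokenFile = System.getenv("HADOOP_OIDC_TOKEN_FILE"); // assumed name
    if (tokenFile != null) {
      return Files.readString(Path.of(tokenFile)).trim();
    }
    throw new IllegalStateException(
        "No OIDC token found; set HADOOP_OIDC_TOKEN or HADOOP_OIDC_TOKEN_FILE");
  }
}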

regards,
Eric

On Tue, May 19, 2020 at 9:47 PM Akira Ajisaka <aa...@apache.org> wrote:

> Hi Eric, thank you for starting the discussion.
>
> I'm interested in OpenID Connect (OIDC) integration.
>
> In addition to the benefits (security, cloud native), operating costs may
> be reduced in some companies.
> We have our company-wide OIDC provider and enable SSO for Hadoop Web UIs
> via Knox + OIDC in Yahoo! JAPAN.
> On the other hand, Hadoop administrators have to manage our own KDC
> servers only for Hadoop ecosystems.
> If Hadoop and its ecosystem can support OIDC, we don't have to manage KDC
> and that way operating costs will be reduced.
>
> Regards,
> Akira
>
> On Thu, May 7, 2020 at 7:32 AM Eric Yang <er...@gmail.com> wrote:
>
>> Hi all,
>>
>> Kerberos was developed decade before web development becomes popular.
>> There are some Kerberos limitations which does not work well in Hadoop.  A
>> few examples of corner cases:
>>
>> 1. Kerberos principal doesn't encode port number, it is difficult to know
>> if the principal is coming from an authorized daemon or a hacker container
>> trying to forge service principal.
>> 2. Hadoop Kerberos principals are used as high privileged principal, a
>> form
>> of credential to impersonate end user.
>> 3. Delegation token may allow expired users to continue to run jobs long
>> after they are gone, without rechecking if end user credentials is still
>> valid.
>> 4.  Passing different form of tokens does not work well with cloud
>> provider
>> security mechanism.  For example, passing AWS sts token for S3 bucket.
>> There is no renewal mechanism, nor good way to identify when the token
>> would expire.
>>
>> There are companies that work on bridging security mechanism of different
>> types, but this is not primary goal for Hadoop.  Hadoop can benefit from
>> modernized security using open standards like OpenID Connect, which
>> proposes to unify web applications using SSO.   This ensure the client
>> credentials are transported in each stage of client servers interaction.
>> This may improve overall security, and provide more cloud native form
>> factor.  I wonder if there is any interested in the community to enable
>> Hadoop OpenID Connect integration work?
>>
>> regards,
>> Eric
>>
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Akira Ajisaka <aa...@apache.org>.
Hi Eric, thank you for starting the discussion.

I'm interested in OpenID Connect (OIDC) integration.

In addition to the benefits (security, cloud native), operating costs may
be reduced in some companies.
We have a company-wide OIDC provider and enable SSO for Hadoop Web UIs
via Knox + OIDC at Yahoo! JAPAN.
On the other hand, Hadoop administrators have to manage their own KDC
servers only for the Hadoop ecosystem.
If Hadoop and its ecosystem can support OIDC, we won't have to manage the
KDC, and that way operating costs will be reduced.

Regards,
Akira

On Thu, May 7, 2020 at 7:32 AM Eric Yang <er...@gmail.com> wrote:

> Hi all,
>
> Kerberos was developed decade before web development becomes popular.
> There are some Kerberos limitations which does not work well in Hadoop.  A
> few examples of corner cases:
>
> 1. Kerberos principal doesn't encode port number, it is difficult to know
> if the principal is coming from an authorized daemon or a hacker container
> trying to forge service principal.
> 2. Hadoop Kerberos principals are used as high privileged principal, a form
> of credential to impersonate end user.
> 3. Delegation token may allow expired users to continue to run jobs long
> after they are gone, without rechecking if end user credentials is still
> valid.
> 4.  Passing different form of tokens does not work well with cloud provider
> security mechanism.  For example, passing AWS sts token for S3 bucket.
> There is no renewal mechanism, nor good way to identify when the token
> would expire.
>
> There are companies that work on bridging security mechanism of different
> types, but this is not primary goal for Hadoop.  Hadoop can benefit from
> modernized security using open standards like OpenID Connect, which
> proposes to unify web applications using SSO.   This ensure the client
> credentials are transported in each stage of client servers interaction.
> This may improve overall security, and provide more cloud native form
> factor.  I wonder if there is any interested in the community to enable
> Hadoop OpenID Connect integration work?
>
> regards,
> Eric
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Rajive Chittajallu <ra...@ieee.org>.
On Wed, May 6, 2020 at 3:32 PM Eric Yang <er...@gmail.com> wrote:
>
> Hi all,
>
> Kerberos was developed decade before web development becomes popular.
> There are some Kerberos limitations which does not work well in Hadoop.  A
> few examples of corner cases:

Microsoft Active Directory, which is extensively used in many organizations,
is based on Kerberos.

> 1. Kerberos principal doesn't encode port number, it is difficult to know
> if the principal is coming from an authorized daemon or a hacker container
> trying to forge service principal.

Clients use ephemeral ports. I am not sure of the relevance of this statement.

> 2. Hadoop Kerberos principals are used as high privileged principal, a form
> of credential to impersonate end user.

Principals are identities of the user. You can make identities fully
qualified, to include the issuing authority, if you want to. This is not
Kerberos specific.

Remember, Kerberos is an authentication mechanism; how those assertions
are translated into authorization rules is application specific.

Probably reconsider alternatives to auth_to_local rules (a sample mapping
rule is sketched below).
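
For reference, a small sketch of how an auth_to_local mapping works today;
the realm and rule below are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.authentication.util.KerberosName;

public class AuthToLocalDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative rule: map nn/_HOST@EXAMPLE.COM style service principals
    // to the local "hdfs" user, then fall through to the default mapping.
    conf.set("hadoop.security.auth_to_local",
        "RULE:[2:$1/$2@$0](nn/.*@EXAMPLE\\.COM)s/.*/hdfs/\n" +
        "DEFAULT");
    KerberosName.setRules(conf.get("hadoop.security.auth_to_local"));
    KerberosName name = new KerberosName("nn/host1.example.com@EXAMPLE.COM");
    System.out.println(name.getShortName()); // prints "hdfs"
  }
}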

> 3. Delegation token may allow expired users to continue to run jobs long
> after they are gone, without rechecking if end user credentials is still
> valid.

Delegation tokens are a Hadoop-specific implementation whose lifecycle is
outside the scope of Kerberos. Hadoop (NN/RM) could periodically check the
respective IdP policy and revoke tokens, or have a central token
management service, similar to the KMS.

> 4.  Passing different form of tokens does not work well with cloud provider
> security mechanism.  For example, passing AWS sts token for S3 bucket.
> There is no renewal mechanism, nor good way to identify when the token
> would expire.

This is outside the scope of Kerberos.

Assuming you are using YARN, making the RM handle S3 temporary credentials,
similar to HDFS delegation tokens, is something to consider.

> There are companies that work on bridging security mechanism of different
> types, but this is not primary goal for Hadoop.  Hadoop can benefit from
> modernized security using open standards like OpenID Connect, which
> proposes to unify web applications using SSO.   This ensure the client
> credentials are transported in each stage of client servers interaction.
> This may improve overall security, and provide more cloud native form
> factor.  I wonder if there is any interested in the community to enable
> Hadoop OpenID Connect integration work?

End-to-end identity assertion is something Kerberos by itself does not
address. But any implementation should not pass "credentials"; we need a way
to pass signed requests that can be verified along the chain.

>
> regards,
> Eric




Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Eric Yang <er...@gmail.com>.
See my comments inline:

On Wed, May 20, 2020 at 4:50 PM Rajive Chittajallu <ra...@ieee.org> wrote:

> On Wed, May 20, 2020 at 1:47 PM Eric Yang <er...@gmail.com> wrote:
> >
> >> > Kerberos was developed decade before web development becomes popular.
> >> > There are some Kerberos limitations which does not work well in
> Hadoop.  A
> >> > few examples of corner cases:
> >>
> >> Microsoft Active Directory, which is extensively used in many
> organizations,
> >> is based on Kerberos.
> >
> >
> > True, but with rise of Google and AWS.  OIDC seems to be a formidable
> standard that can replace Kerberos for authentication.  I think providing
> an option for the new standard is good for Hadoop.
> >
>
> I think you are referring to Oauth2 and adoption across varies
> significantly across vendors. When one refers to Kerberos, its mostly
> about MIT Kerberos or Microsoft Active Directory. But Oauth2 is a
> specification, implementations vary and are quite prone to bugs. I
> would be very careful in making a generic statement as a "formidable
> standard".
>
> AWS services, atleast in the context of Data processing / Analytics
> does not support Oauth2. Its more of a GCP thing. AWS uses Signed
> requests [1].
>
> [1] https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html


Kerberos is a protocol for authentication.  OIDC is also an authentication
protocol.  MIT Kerberos and OAuth2 are frameworks, not authentication
protocols.  By no means am I suggesting adopting the OAuth2 framework,
because implementing according to the protocol spec is better than
hard-wiring to certain libraries.  We can adopt existing OIDC libraries like
pac4j to reduce the maintenance of implementing the OIDC protocol in Hadoop.
AWS has been offering OIDC authentication for EKS and as an IAM identity
provider.  By offering native OIDC support, Hadoop will be able to access
cloud services that are secured by OIDC more easily.


>
>
>>
> >> > 1. Kerberos principal doesn't encode port number, it is difficult to
> know
> >> > if the principal is coming from an authorized daemon or a hacker
> container
> >> > trying to forge service principal.
> >>
> >> Clients use ephemeral ports. Not sure of what the relevancy of this
> statement.
> >
> > Hint: CVE-2020-9492
> >
>
> Its a reserved one. You can help the conversation by describing a threat
> model.
>

The Hadoop security mailing list has the problem listed, if you are
interested in this area.  Hadoop Kerberos security quirks are off topic for
decoupling Kerberos from Hadoop.


> >> > 2. Hadoop Kerberos principals are used as high privileged principal,
> a form
> >> > of credential to impersonate end user.
> >>
> >> Principals are identities of the user. You can make identities fully
> qualified,
> >> to include issuing authority if you want to. This is not kerberos
> specific.
> >>
> >> Remember, Kerberos is an authentication mechanism, How those assertions
> >> are translated to authorization rules are application specific.
> >>
> >> Probably reconsider alternatives to auth_to_local rules.
> >
> >
> > Trust must be validated.  Hadoop Kerberos principals for service that
> can perform impersonation are equal to root power.  Transport root power
> securely without being intercepted is quite difficult, when services are
> running as root instead of daemons.  There is alternate solution to always
> forward signed end user token, hence, there is no need of validation of
> proxy user credential.  The down side of forwarding signed token is
> difficult to forward multiple tokens of incompatible security mechanism
> because renewal mechanism and expiration time may not be deciphered by the
> transport mechanism.  This is the reason that using SSO token is a good way
> to ensure every libraries and framework abide by same security practice to
> eliminate confused deputy problems.
>
> Trust of what? Service principals should not be used for
> authentication in a client context; they
> are there for server identification.


The trust refers to a service (Oozie/Hive) impersonating the end user: the
namenode issues a delegation token after checking the proxy user ACL.  The
form of token presented to the namenode is a service TGT, not an end user
TGT.  The service TGT is validated against the proxy user ACL on the
namenode to allow the impersonation to happen.  If the service TGT is
intercepted due to a lack of encryption in the RPC or HTTP transport, the
service ticket is vulnerable to replay attacks.
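
For reference, a minimal sketch of the existing setting that mitigates that
interception risk by encrypting Hadoop RPC; it is shown programmatically
here, though it normally lives in core-site.xml and hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;

public class RpcPrivacyExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "privacy" enables SASL confidentiality on Hadoop RPC, so RPC payloads,
    // including any tokens carried over RPC, are encrypted on the wire.
    // "authentication" and "integrity" are the weaker protection levels.
    conf.set("hadoop.rpc.protection", "privacy");
    // DataNode block transfers are protected by a separate setting.
    conf.set("dfs.data.transfer.protection", "privacy");
    System.out.println("RPC protection: " + conf.get("hadoop.rpc.protection"));
  }
}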


>
> OAuth2 (which the OIDC flow is based on) suggests JWTs, which are signed
> tokens. Can you elaborate more on what you mean by "SSO Token"?


An SSO token is a JWT token in this context.  My advice is that there should
only be one token transported, instead of multiple tokens, to prevent the
problem of out-of-sync expiration dates across multiple tokens.
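
For concreteness, a small sketch of how a client could read the expiration
("exp") claim out of a JWT so a single token's lifetime stays authoritative.
It does not verify the signature, and the crude string parsing stands in
for a real JWT/JSON library:

import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Base64;

public class JwtExpiryPeek {
  // Returns the "exp" claim of a JWT as an Instant (sketch only).
  static Instant expiryOf(String jwt) {
    String payload = jwt.split("\\.")[1];
    String json = new String(Base64.getUrlDecoder().decode(payload),
        StandardCharsets.UTF_8);
    int idx = json.indexOf("\"exp\":");
    String tail = json.substring(idx + 6).trim();
    long exp = Long.parseLong(tail.split("[,}\\s]")[0]);
    return Instant.ofEpochSecond(exp);
  }

  public static void main(String[] args) {
    // Illustrative unsigned token with payload {"sub":"alice","exp":1893456000}
    String payload = Base64.getUrlEncoder().withoutPadding().encodeToString(
        "{\"sub\":\"alice\",\"exp\":1893456000}".getBytes(StandardCharsets.UTF_8));
    String jwt = "eyJhbGciOiJub25lIn0." + payload + ".";
    System.out.println("Token expires at " + expiryOf(jwt));
  }
}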


> To improve security for doAs use cases, add context to the calls. Just
> replacing Kerberos with a different authentication mechanism is not going
> to solve the problem.


The focus is to support an alternate security mechanism that may have been
chosen by other companies.  It is not strictly solving any doAs problem, but
it is nice to consider the impact on Hadoop's proxy user implementation.

> And how to improve proxy user use cases varies by application. Asserting
> an 'on-behalf-of' action when there is an active client on the other end
> (e.g. hdfs proxy) would be different from one that is initiated on a
> schedule, e.g. Oozie.


I don't agree that doAs is any different between an HDFS proxy and Oozie.
They both use impersonation power and behave like root programs.  As a
result, they must be treated as root programs, with extra effort to secure
all entry points to avoid security mistakes.


>

> >>
> >> > 3. Delegation token may allow expired users to continue to run jobs
> long
> >> > after they are gone, without rechecking if end user credentials is
> still
> >> > valid.
> >>
> >> Delegation tokens are hadoop specific implementation, whose lifecycle is
> >> outside the scope of Kerberos. Hadoop (NN/RM) can periodically check
> >> respective IDP Policy and revoke tokens. Or have a central token
> >> management service, similar to KMS
> >>
> >> > 4.  Passing different form of tokens does not work well with cloud
> provider
> >> > security mechanism.  For example, passing AWS sts token for S3 bucket.
> >> > There is no renewal mechanism, nor good way to identify when the token
> >> > would expire.
> >>
> >> This is outside the scope of Kerberos.
> >>
> >> Assuming you are using YARN, making RM handle S3 temp credentials,
> >> similar to HDFS delegation tokens is something to consider.
> >>
> >> > There are companies that work on bridging security mechanism of
> different
> >> > types, but this is not primary goal for Hadoop.  Hadoop can benefit
> from
> >> > modernized security using open standards like OpenID Connect, which
> >> > proposes to unify web applications using SSO.   This ensure the client
> >> > credentials are transported in each stage of client servers
> interaction.
> >> > This may improve overall security, and provide more cloud native form
> >> > factor.  I wonder if there is any interested in the community to
> enable
> >> > Hadoop OpenID Connect integration work?
> >>
> >> End to end identity assertion is where Kerberos in it self does not
> address.
> >> But any implementation should not pass "credentials'. Need a way to pass
> >> signed requests, that could be verified along the chain.
> >
> >
> > We agree on this, and OIDC seems like a good option to pass signed
> requests and verifies the signed token.
> >
> >>
> >> >
> >> > regards,
> >> > Eric
>

Re: [DISCUSS] Secure Hadoop without Kerberos

Posted by Rajive Chittajallu <ra...@ieee.org>.
On Wed, May 20, 2020 at 1:47 PM Eric Yang <er...@gmail.com> wrote:
>
>> > Kerberos was developed decade before web development becomes popular.
>> > There are some Kerberos limitations which does not work well in Hadoop.  A
>> > few examples of corner cases:
>>
>> Microsoft Active Directory, which is extensively used in many organizations,
>> is based on Kerberos.
>
>
> True, but with rise of Google and AWS.  OIDC seems to be a formidable standard that can replace Kerberos for authentication.  I think providing an option for the new standard is good for Hadoop.
>

I think you are referring to OAuth2, and adoption varies
significantly across vendors. When one refers to Kerberos, it's mostly
about MIT Kerberos or Microsoft Active Directory. But OAuth2 is a
specification; implementations vary and are quite prone to bugs. I
would be very careful about making a generic statement such as "formidable
standard".

AWS services, at least in the context of data processing / analytics,
do not support OAuth2. It's more of a GCP thing. AWS uses signed
requests [1].

[1] https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html

>>
>> > 1. Kerberos principal doesn't encode port number, it is difficult to know
>> > if the principal is coming from an authorized daemon or a hacker container
>> > trying to forge service principal.
>>
>> Clients use ephemeral ports. Not sure of what the relevancy of this statement.
>
> Hint: CVE-2020-9492
>

It's a reserved one. You can help the conversation by describing a threat model.

>> > 2. Hadoop Kerberos principals are used as high privileged principal, a form
>> > of credential to impersonate end user.
>>
>> Principals are identities of the user. You can make identities fully qualified,
>> to include issuing authority if you want to. This is not kerberos specific.
>>
>> Remember, Kerberos is an authentication mechanism, How those assertions
>> are translated to authorization rules are application specific.
>>
>> Probably reconsider alternatives to auth_to_local rules.
>
>
> Trust must be validated.  Hadoop Kerberos principals for service that can perform impersonation are equal to root power.  Transport root power securely without being intercepted is quite difficult, when services are running as root instead of daemons.  There is alternate solution to always forward signed end user token, hence, there is no need of validation of proxy user credential.  The down side of forwarding signed token is difficult to forward multiple tokens of incompatible security mechanism because renewal mechanism and expiration time may not be deciphered by the transport mechanism.  This is the reason that using SSO token is a good way to ensure every libraries and framework abide by same security practice to eliminate confused deputy problems.

Trust of what? Service principals should not be used for
authentication in a client context; they are
there for server identification.

OAuth2 (which the OIDC flow is based on) suggests JWTs, which are signed
tokens. Can you
elaborate more on what you mean by "SSO Token"?

To improve security for doAs use cases, add context to the calls. Just
replacing Kerberos with a different authentication mechanism is not going
to solve the problem.

And how to improve proxy user use cases varies by application. Asserting
an 'on-behalf-of' action when there is an active client on the other end
(e.g. hdfs proxy) would be different from one that is initiated on a
schedule, e.g. Oozie.


>>
>> > 3. Delegation token may allow expired users to continue to run jobs long
>> > after they are gone, without rechecking if end user credentials is still
>> > valid.
>>
>> Delegation tokens are hadoop specific implementation, whose lifecycle is
>> outside the scope of Kerberos. Hadoop (NN/RM) can periodically check
>> respective IDP Policy and revoke tokens. Or have a central token
>> management service, similar to KMS
>>
>> > 4.  Passing different form of tokens does not work well with cloud provider
>> > security mechanism.  For example, passing AWS sts token for S3 bucket.
>> > There is no renewal mechanism, nor good way to identify when the token
>> > would expire.
>>
>> This is outside the scope of Kerberos.
>>
>> Assuming you are using YARN, making RM handle S3 temp credentials,
>> similar to HDFS delegation tokens is something to consider.
>>
>> > There are companies that work on bridging security mechanism of different
>> > types, but this is not primary goal for Hadoop.  Hadoop can benefit from
>> > modernized security using open standards like OpenID Connect, which
>> > proposes to unify web applications using SSO.   This ensure the client
>> > credentials are transported in each stage of client servers interaction.
>> > This may improve overall security, and provide more cloud native form
>> > factor.  I wonder if there is any interested in the community to enable
>> > Hadoop OpenID Connect integration work?
>>
>> End to end identity assertion is where Kerberos in it self does not address.
>> But any implementation should not pass "credentials'. Need a way to pass
>> signed requests, that could be verified along the chain.
>
>
> We agree on this, and OIDC seems like a good option to pass signed requests and verifies the signed token.
>
>>
>> >
>> > regards,
>> > Eric
