Posted to dev@flink.apache.org by Eron Wright <er...@gmail.com> on 2018/05/10 00:30:38 UTC

[Discuss] FLIP-26 - SSL Mutual Authentication

Hello,

Given that some SSL enhancement bugs have been posted lately, I took some
time to revise FLIP-26, which explores how to harden both external and
internal communication.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80453255

Some recent related issues:
- FLINK-9312 - mutual auth for intra-cluster communication
- FLINK-5030 - original SSL feature work

There's also some recent discussion of how to use Flink SSL effectively in
a Kubernetes environment. The issue is about hostname verification. The
proposal that I've put forward in FLIP-26 is to not use hostname
verification for intra-cluster communication, but rather to rely on a
cluster-internal certificate and a truststore consisting only of that
certificate. Meanwhile, a new "external" certificate would be
configurable for the web/api endpoint and associated with a well-known DNS
name as provided by a K8s Service resource.
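
The intra-cluster part of this proposal can be illustrated with a minimal
sketch. It is written in Python purely for illustration (Flink itself uses
Java keystores and truststores), and the function name and PEM file
parameters are assumptions, not part of FLIP-26. The point it shows is that
the trust store starts empty and receives only the cluster-internal
certificate, hostname verification is switched off, and both sides still
require a certificate from the peer.

```python
import ssl


def internal_tls_context(server_side, keystore_pem=None, truststore_pem=None):
    """Build a TLS context for cluster-internal traffic (illustrative only).

    keystore_pem:   hypothetical PEM file holding the cluster certificate and key
    truststore_pem: hypothetical PEM file holding ONLY the cluster certificate
    """
    protocol = ssl.PROTOCOL_TLS_SERVER if server_side else ssl.PROTOCOL_TLS_CLIENT
    # A bare SSLContext does not load the system trust store.
    ctx = ssl.SSLContext(protocol)
    # No hostname verification: trust is bound to the single cluster
    # certificate rather than to a DNS name.
    ctx.check_hostname = False
    # Mutual authentication: each side demands a certificate from its peer.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if keystore_pem is not None:
        ctx.load_cert_chain(keystore_pem)
    if truststore_pem is not None:
        ctx.load_verify_locations(cafile=truststore_pem)
    return ctx
```

Every process would build both a server-side and a client-side context from
the same two files, since each process acts as server and client for
internal communication.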

Stephan, is this in line with your thinking re FLINK-9312?

Thanks
Eron

Re: [Discuss] FLIP-26 - SSL Mutual Authentication

Posted by Stephan Ewen <se...@apache.org>.
FYI: The 1.6 docs reflect the setup where internal and external SSL are
separately configured, and where internal SSL uses client authentication.

https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/security-ssl.html

On Mon, Aug 13, 2018 at 8:54 AM, Stephan Ewen <se...@apache.org> wrote:

> Sounds good, Eron!
>
> Please go ahead...

Re: [Discuss] FLIP-26 - SSL Mutual Authentication

Posted by Stephan Ewen <se...@apache.org>.
Sounds good, Eron!

Please go ahead...

On Sat, Jul 28, 2018 at 1:33 AM, Eron Wright <er...@gmail.com> wrote:

> As an update to this thread, Stephan opted to split the internal/external
> configuration (by providing overrides for a common SSL configuration):
> https://github.com/apache/flink/pull/6326
>
> Note that Akka doesn't support hostname verification in its 'classic'
> remoting implementation (though the new Artery implementation apparently
> does), and such verification wouldn't apply to the client certificate
> anyway. So the reality is that one should use a limited truststore (never
> the system truststore) for Akka communication.
>
> On the question of routing external communication through the YARN resource
> proxy or Mesos/DCOS admin router, the value proposition is:
> a) simplifies service discovery on the part of external clients,
> b) permits single sign-on (SSO) by delegating authentication to a central
> authority,
> c) facilitates access from outside the cluster, via a public address.
> The main challenge is that the Flink client code must support a more
> diverse array of authentication methods, e.g. Kerberos when communicating
> with the YARN proxy.
>
> Given #6326, the next steps would be (unordered):
> a) create an umbrella issue for the overall effort
> b) dive into the authorization work for external communication
> c) implement auto-generation of a certificate for internal communication
> d) implement TLS on queryable state interface (FLINK-5029)
>
> I'll take care of (a) unless there is any objection.
> -Eron

Re: [Discuss] FLIP-26 - SSL Mutual Authentication

Posted by Eron Wright <er...@gmail.com>.
As an update to this thread, Stephan opted to split the internal/external
configuration (by providing overrides for a common SSL configuration):
https://github.com/apache/flink/pull/6326

Note that Akka doesn't support hostname verification in its 'classic'
remoting implementation (though the new Artery implementation apparently
does), and such verification wouldn't apply to the client certificate
anyway. So the reality is that one should use a limited truststore (never
the system truststore) for Akka communication.

On the question of routing external communication through the YARN resource
proxy or Mesos/DCOS admin router, the value proposition is:
a) simplifies service discovery on the part of external clients,
b) permits single sign-on (SSO) by delegating authentication to a central
authority,
c) facilitates access from outside the cluster, via a public address.
The main challenge is that the Flink client code must support a more
diverse array of authentication methods, e.g. Kerberos when communicating
with the YARN proxy.

Given #6326, the next steps would be (unordered):
a) create an umbrella issue for the overall effort
b) dive into the authorization work for external communication
c) implement auto-generation of a certificate for internal communication
d) implement TLS on queryable state interface (FLINK-5029)

I'll take care of (a) unless there is any objection.
-Eron


On Sun, May 13, 2018 at 5:45 AM Stephan Ewen <ew...@gmail.com> wrote:

> Throwing in some more food for thought:
>
> An alternative to the above proposed separation of internal and external
> SSL would be the following:
>
>   - We separate channel encryption and authentication
>   - We use one common SSL layer (internal and external) that is in both
> cases only responsible for establishing an encrypted connection
>   - Authentication / authorization internally is done by SASL with
> username/password or shared secret.
>   - Authentication externally must be through a proxy and authorization
> based on validating HTTP headers set by the proxy, as discussed above.
>
> Advantages:
>   - There is only one certificate needed, which could also be shared across
> applications
>   - One or two lines in the config authenticate and authorize internal
> communication
>   - One could possibly still fall back to the other mode by skipping
>
> Open Questions / Disadvantages
>   - Given that hostname verification during SSL handshake is not possible
> in many setups, the encrypted channel is vulnerable to man-in-the-middle
> attacks without mutual authentication. Not sure how serious that is,
> because it would need an attacker to have compromised network nodes of the
> cluster already. Is that not a universal issue in the K8s world?
>
> This is anyways a bit hypothetical, because as long as we have Akka beneath
> the RPC layer, we cannot go with that approach.
>
> However, if we want to at least keep the door open towards something like
> that in the future, we would need to set up configuration in such a way
> that we have a "common SSL" configuration (keystore, truststore, etc.) and
> internal/external options that override those. That would anyways be
> helpful for backwards compatibility.
>
> @Eron - what are your thoughts on that?
>
> On Sun, May 13, 2018 at 1:40 AM, Stephan Ewen <ew...@gmail.com> wrote:
>
> > Thank you for bringing this proposal up. It looks very good and we seem to
> > be thinking along very similar lines.
> >
> > Below are some comments and thoughts on the FLIP.
> >
> > *Internal vs. External Connectivity*
> >
> > That is a very helpful distinction, let's build on that.
> >
> >   - I would suggest to eventually treat all communication potentially
> > coming from users as external, meaning Client-to-Dispatcher,
> > Client-to-JobManager (trigger savepoint, change parallelism, ...), Web UI,
> > Queryable State.
> >
> >   - That leaves communication that is only between JobManager/TaskManager/
> > ResourceManager/Dispatcher/HistoryServer as internal.
> >
> >   - I am somewhat operating under the assumption that all external
> > communication will eventually be HTTP/REST. That works best with many
> > setups and is the basis for using service proxies that
> > handle authentication/authorization.
> >
> >
> > In Flink 1.5 and future versions, we have the following update there:
> >
> >   - Akka is now strictly internal connectivity, the clients (except the
> > legacy client) do not use it any more.
> >
> >   - The Blob Server will move to purely internal connectivity in Flink
> > 1.6, where a POST of a job to the Dispatcher has the jars and the JobGraph.
> > That is important for Kubernetes setups, where exposing the BlobServer and
> > querying the blob port causes quite some friction.
> >
> >   - Treating queryable state as "internal connectivity" is fine for now.
> > We should treat it as "external" connectivity in the future if we move it
> > to HTTP/REST.
> >
> >
> > *Internal Connectivity and SSL Mutual Authentication*
> >
> > Simply activating SSL mutual authentication for the internal communication
> > is really low-hanging fruit.
> >
> > Activating client authentication for Akka, network stack Netty (and Blob
> > Server/Client in Flink 1.6) should require no change in the configurations
> > with respect to Flink 1.4. All processes are, with respect to internal
> > communication, simultaneously server and client endpoints. Because of that,
> > they already need KeyStore and TrustStore files for SSL handshakes, where
> > the TrustStore needs to trust the KeyStore Certificate.
> >
> > I personally favor the suggestion made to have a script that generates a
> > self-signed certificate and adds it to "conf" and updates the
> > configuration. That should be picked up by the Yarn and Mesos clients
> > anyways.
> >
> >
> > *External Connectivity*
> >
> > There is a huge surface area and I think we need to give users a way to
> > plug in their own tools.
> > From what I see (and after some discussions with Patrick and Gary) I think
> > it makes sense to look at proxies in a broad way, similar to the approach
> > Eron outlined.
> >
> > The basic approach could be like this:
> >
> >   - Everything goes through HTTPS, so the proxy can work with HTTP headers.
> >   - The proxy handles authentication and possibly authorization. The proxy
> > adds some header, for example a user name, a group id, an authorization
> > token.
> >   - Flink can configure an implementation of an 'authorizer' or validator
> > on the headers to decide whether the request is valid.
> >
> >   - Example 1: The proxy does authentication and adds the user name /
> > group as a header. The Flink-side authorizer simply checks whether the
> > name is in the config (simple ACL-style scheme).
> >   - Example 2: The proxy adds a JSON Web Token and the authorizer
> > validates that token.
> >
> > For secure connections between the Proxy and the Flink Endpoint I would
> > follow Eron's suggestion to use separate KeyStores and TrustStores from
> > those for internal communication.
> >
> > For Yarn and Mesos, I would like to see if we could handle those again as
> > a special case of the proxies above:
> >   - DCOS Admin Router forwards the user authentication token, so that
> > could be another authorizer implementation.
> >   - In YARN we could see if we can implement the IP filter via such an
> > authorizer.
> >
> >
> > *Hostname Verification*
> >
> > For internal communication, and especially on dynamic environments like
> > Kubernetes, it is very hard to work with certificates and have hostname
> > verification on.
> >
> > If we assume internal communication works strictly with a shared secret
> > certificate and with client authentication, does hostname verification
> > actually still add security in that particular setup? My understanding was
> > that hostname verification is important to ensure that not just some valid
> > certificate is presented, but the one bound to the server you want to talk
> > to. If we have only one trusted certificate anyway, isn't that already
> > implied?
> >
> > On the other hand, it is still possible (and potentially valuable) for
> > users in standalone mode to use keystores and truststores from a PKI, in
> > which case there may still be an argument in favor of hostname verification.

Re: [Discuss] FLIP-26 - SSL Mutual Authentication

Posted by Stephan Ewen <ew...@gmail.com>.
Throwing in some more food for thought:

An alternative to the above proposed separation of internal and external
SSL would be the following:

  - We separate channel encryption and authentication
  - We use one common SSL layer (internal and external) that is in both
cases only responsible for establishing an encrypted connection
  - Authentication / authorization internally is done by SASL with
username/password or shared secret.
  - Authentication externally must be through a proxy, with authorization
based on validating HTTP headers set by the proxy, as discussed above.

Advantages:
  - There is only one certificate needed, which could also be shared across
applications
  - One or two lines in the config authenticate and authorize internal
communication
  - One could possibly still fall back to the other mode by skipping

Open Questions / Disadvantages:
  - Given that hostname verification during SSL handshake is not possible
in many setups, the encrypted channel is vulnerable to man-in-the-middle
attacks without mutual authentication. Not sure how serious that is,
because it would need an attacker to have already compromised network nodes
of the cluster. Is that not a universal issue in the K8s world?

This is anyways a bit hypothetical, because as long as we have Akka beneath
the RPC layer, we cannot go with that approach.

However, if we want to at least keep the door open towards something like
that in the future, we would need to set up configuration in such a way
that we have a "common SSL" configuration (keystore, truststore, etc.) and
internal/external options that override those. That would anyways be
helpful for backwards compatibility.
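
To make the override idea concrete, here is a minimal sketch in Python (not
Flink code). The key layout "security.ssl.<option>" for the common value and
"security.ssl.<scope>.<option>" for an override, the scope names "internal"
and "external", and the file paths are all assumptions for illustration.

```python
from typing import Mapping, Optional

# Assumed key layout (hypothetical, for illustration only):
#   security.ssl.<option>          -> common value shared by all endpoints
#   security.ssl.<scope>.<option>  -> override for one scope
COMMON_PREFIX = "security.ssl"


def resolve_ssl_option(config: Mapping[str, str], scope: str, option: str) -> Optional[str]:
    """Return the scope-specific value if set, else the common value, else None."""
    scoped_key = f"{COMMON_PREFIX}.{scope}.{option}"
    common_key = f"{COMMON_PREFIX}.{option}"
    if scoped_key in config:
        return config[scoped_key]
    return config.get(common_key)


# An existing configuration that only sets the common options keeps working;
# the external endpoint gets its own keystore through an override.
config = {
    "security.ssl.keystore": "/path/to/common.keystore",
    "security.ssl.truststore": "/path/to/common.truststore",
    "security.ssl.external.keystore": "/path/to/external.keystore",
}

internal_keystore = resolve_ssl_option(config, "internal", "keystore")
external_keystore = resolve_ssl_option(config, "external", "keystore")
external_truststore = resolve_ssl_option(config, "external", "truststore")
```

With this lookup order, a setup that never mentions "internal" or "external"
behaves exactly as before, which is the backwards-compatibility property
mentioned above.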

@Eron - what are your thoughts on that?

On Sun, May 13, 2018 at 1:40 AM, Stephan Ewen <ew...@gmail.com> wrote:

> Thank you for bringing this proposal up. It looks very good and we seem to
> be thinking along very similar lines.
>
> Below are some comments and thoughts on the FLIP.
>
> *Internal vs. External Connectivity*
>
> That is a very helpful distinction, let's build on that.
>
>   - I would suggest to treat eventually all communication coming
> potentially from users as external, meaning Client-to-Dispatcher,
> Client-to-JobManager (trigger savepoint, change parallelism, ...), Web UI,
> Queryable State.
>
>   - That leaves communication that is only between JobManager/TaskManager/
> ResourceManager/Dispatcher/HistoryServer as internal.
>
>   - I am somewhat operating under the assumption that all external
> communication will eventually be HTTP/REST. That works best with many
> setups and is the basis for using service proxies that
> handle  authentication/authorization.
>
>
> In Flink 1.5 and future versions, we have the following update there:
>
>   - Akka is now strictly internal connectivity, the client (except legacy
> client) do not use it any more.
>
>   - The Blob Server will move to purely internal connectivity in Flink
> 1.6, where a POST of a job to the Dispatcher has the jars and the JobGraph.
> That is important for Kubernetes setups, where exposing the BlobServer and
> querying the blob port causes quite some friction.
>
>   - Treating queryable state as "internal connectivity" is fine for now.
> We should treat it as "external" connectivity in the future if we move it
> to HTTP/REST.
>
>
> *Internal Connectivity and SSL Mutual Authentication*
>
> Simply activating SSL mutual authentication for the internal communication
> is a really low hanging fruit.
>
> Activating client authentication for Akka, network stack Netty (and Blob
> Server/Client in Flink 1.6) should require no change in the configurations
> with respect to Flink 1.4. All processes are, with respect to internal
> communication, simultaneously server and client endpoints. Because of that,
> they already need KeyStore and TrustStore files for SSL handshakes, where
> the TrustStore needs to trust the KeyStore Certificate.
>
> I personally favor the suggestion made to have a script that generates a
> self-signed certificate and adds it to "conf" and updates the
> configuration. That should be picked up by the Yarn and Mesos clients
> anyways.
>
>
> *External Connectivity*
>
> There is a huge surface area and I think we need to give users a way to
> plug in their own tools.
> From what I see (and after some discussions with Patrick and Gary) I think
> it makes sense to look at proxies in a broad way, similar to the approach
> Eron outlined.
>
> The basic approach could be like that:
>
>   - Everything goes through HTTPS, so the proxy can work with HTTP headers.
>   - The proxy handles authentication and possibly authorization. The proxy
> adds some header, for example a user name, a group id, an authorization
> token.
>   - Flink can configure an implementation of an 'authorizer' or validator
> on the headers to decide whether the request is valid.
>
>   - Example 1: The proxy does authentication and adds the user name /
> group as a header. The the Flink-side authorizer simply checks whether the
> name is in the config (simple ACL-style) scheme.
>   - Example 2: The proxy adds an JSON Web Token and the authorizer
> validates that token.
>
> For secure connections between the Proxy and the Flink Endpoint I would
> follow Eron's suggestion, to use separate KeyStores and TrustStores than
> for internal communication.
>
> For Yarn and Mesos, I would like to see if we could handle those again as
> a special case of the proxies above:
>   - DCOS Admin Router forwards the user authentication token, so that
> could be another authorizer implementation.
>   - In YARN we could see if can implement the IP filter via such an
> authorizer.
>
>
> *Hostname Verification*
>
> For internal communication, and especially on dynamic environments like
> Kubernetes, it is very hard to work with certificates and have hostname
> verification on.
>
> If we assume internal communication works strictly with a shared secret
> certificate and with client authentication, does hostname verification
> actually still add security in that particular setup? My understanding was
> that hostname verification is important to not have some valid certificate
> presented, but the one bound to the server you want to talk to. If we have
> anyways one trusted certificate only, isn't that already implied?
>
> On the other hand, it is still possible (and potentially valuable) for
> users in standalone mode to use keystores and truststores from a PKI, in
> which case there may still be an argument in favor of hostname verification.
>
> On Thu, May 10, 2018, 02:30 Eron Wright <er...@gmail.com> wrote:
>
>> Hello,
>>
>> Given that some SSL enhancement bugs have been posted lately, I took some
>> time to revise FLIP-26 which explores how to harden both external and
>> internal communication.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80453255
>>
>> Some recent related issues:
>> - FLINK-9312 - mutual auth for intra-cluster communication
>> - FLINK-5030 - original SSL feature work
>>
>> There's also some recent discussion of how to use Flink SSL effectively in
>> a Kubernetes environment.   The issue is about hostname verification.  The
>> proposal that I've put forward in FLIP-26 is to not use hostname
>> verification for intra-cluster communication, but rather to rely in a
>> cluster-internal certificate and a truststore consisting only of that
>> certificate.   Meanwhile, a new "external" certificate would be
>> configurable for the web/api endpoint and associated with a well-known DNS
>> name as provided by a K8s Service resource.
>>
>> Stephan is this in-line with your thinking re FLINK-9312?
>>
>> Thanks
>> Eron
>>
>

Re: [Discuss] FLIP-26 - SSL Mutual Authentication

Posted by Stephan Ewen <ew...@gmail.com>.
Thank you for bringing this proposal up. It looks very good and we seem to
be thinking along very similar lines.

Below are some comments and thoughts on the FLIP.

*Internal vs. External Connectivity*

That is a very helpful distinction, let's build on that.

  - I would suggest eventually treating all communication that potentially
comes from users as external, meaning Client-to-Dispatcher,
Client-to-JobManager (trigger savepoint, change parallelism, ...), Web UI,
Queryable State.

  - That leaves communication that is only between
JobManager/TaskManager/ResourceManager/Dispatcher/HistoryServer as internal.

  - I am somewhat operating under the assumption that all external
communication will eventually be HTTP/REST. That works best with many
setups and is the basis for using service proxies that
handle authentication/authorization.


In Flink 1.5 and future versions, we have the following update there:

  - Akka is now strictly internal connectivity; the client (except the
legacy client) does not use it any more.

  - The Blob Server will move to purely internal connectivity in Flink 1.6,
where a POST of a job to the Dispatcher has the jars and the JobGraph. That
is important for Kubernetes setups, where exposing the BlobServer and
querying the blob port causes quite some friction.

  - Treating queryable state as "internal connectivity" is fine for now. We
should treat it as "external" connectivity in the future if we move it to
HTTP/REST.


*Internal Connectivity and SSL Mutual Authentication*

Simply activating SSL mutual authentication for the internal communication
is really low-hanging fruit.

Activating client authentication for Akka, network stack Netty (and Blob
Server/Client in Flink 1.6) should require no change in the configurations
with respect to Flink 1.4. All processes are, with respect to internal
communication, simultaneously server and client endpoints. Because of that,
they already need KeyStore and TrustStore files for SSL handshakes, where
the TrustStore needs to trust the KeyStore Certificate.
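
As a rough illustration of what "client authentication" changes on each
side, here is a sketch using Python's standard ssl module (not Flink, Akka,
or Netty code). The function names and the PEM file parameters are
assumptions; the files stand in for the KeyStore and TrustStore.

```python
import ssl
from typing import Optional


def internal_server_context(keystore_pem: Optional[str] = None,
                            truststore_pem: Optional[str] = None) -> ssl.SSLContext:
    """Server side of an internal connection: demands a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Without this line the server would accept any client (one-way SSL).
    ctx.verify_mode = ssl.CERT_REQUIRED
    if keystore_pem:
        # PEM file holding this process's certificate and private key.
        ctx.load_cert_chain(keystore_pem)
    if truststore_pem:
        # PEM file holding the certificate(s) this process trusts.
        ctx.load_verify_locations(cafile=truststore_pem)
    return ctx


def internal_client_context(keystore_pem: Optional[str] = None,
                            truststore_pem: Optional[str] = None) -> ssl.SSLContext:
    """Client side of an internal connection: presents the same certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # The peer certificate is still verified against the truststore
    # (verify_mode stays CERT_REQUIRED); only the hostname check is dropped.
    ctx.check_hostname = False
    if keystore_pem:
        ctx.load_cert_chain(keystore_pem)
    if truststore_pem:
        ctx.load_verify_locations(cafile=truststore_pem)
    return ctx


# Built without files here so the sketch runs anywhere; a real setup would
# pass the same two files to both functions in every process.
server_ctx = internal_server_context()
client_ctx = internal_client_context()
```

Both contexts take the same two files, which mirrors the point above that
every process is simultaneously a server and a client endpoint.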

I personally favor the suggestion made to have a script that generates a
self-signed certificate and adds it to "conf" and updates the
configuration. That should be picked up by the Yarn and Mesos clients
anyways.


*External Connectivity*

There is a huge surface area and I think we need to give users a way to
plug in their own tools.
From what I see (and after some discussions with Patrick and Gary) I think
it makes sense to look at proxies in a broad way, similar to the approach
Eron outlined.

The basic approach could be like this:

  - Everything goes through HTTPS, so the proxy can work with HTTP headers.
  - The proxy handles authentication and possibly authorization. The proxy
adds some header, for example a user name, a group id, an authorization
token.
  - Flink can configure an implementation of an 'authorizer' or validator
on the headers to decide whether the request is valid.

  - Example 1: The proxy does authentication and adds the user name / group
as a header. Then the Flink-side authorizer simply checks whether the name
is in the config (a simple ACL-style scheme).
  - Example 2: The proxy adds a JSON Web Token and the authorizer
validates that token.
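
Example 1 could look roughly like the following Python sketch. It is not a
Flink interface: the Authorizer type, the acl_authorizer function, and the
"X-Forwarded-User" header name are all hypothetical and only illustrate the
shape of a pluggable validator on proxy-set headers.

```python
from typing import Callable, Iterable, Mapping

# An authorizer inspects the request headers and accepts or rejects.
Authorizer = Callable[[Mapping[str, str]], bool]

# Hypothetical header the proxy sets after authenticating the user.
USER_HEADER = "X-Forwarded-User"


def acl_authorizer(allowed_users: Iterable[str]) -> Authorizer:
    """Build an authorizer that accepts only users listed in the config."""
    allowed = frozenset(allowed_users)

    def authorize(headers: Mapping[str, str]) -> bool:
        # HTTP header names are case-insensitive, so normalize before lookup.
        normalized = {name.lower(): value for name, value in headers.items()}
        user = normalized.get(USER_HEADER.lower())
        # A missing header means the request did not pass through the proxy.
        return user is not None and user in allowed

    return authorize


# The allowed names would come from the Flink-side configuration.
authorize = acl_authorizer(["alice", "bob"])
accepted = authorize({"X-Forwarded-User": "alice"})
rejected = authorize({"x-forwarded-user": "mallory"})
```

A check like this is only meaningful if the endpoint accepts connections
from the proxy alone; otherwise any client could set the header itself.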

For secure connections between the Proxy and the Flink Endpoint I would
follow Eron's suggestion to use KeyStores and TrustStores that are separate
from those for internal communication.

For Yarn and Mesos, I would like to see if we could handle those again as a
special case of the proxies above:
  - DCOS Admin Router forwards the user authentication token, so that could
be another authorizer implementation.
  - In YARN we could see if we can implement the IP filter via such an
authorizer.


*Hostname Verification*

For internal communication, and especially in dynamic environments like
Kubernetes, it is very hard to work with certificates and have hostname
verification on.

If we assume internal communication works strictly with a shared secret
certificate and with client authentication, does hostname verification
actually still add security in that particular setup? My understanding was
that hostname verification matters so that the client accepts not just any
valid certificate, but only the one bound to the server it wants to talk to.
If we anyways have only one trusted certificate, isn't that already implied?

On the other hand, it is still possible (and potentially valuable) for
users in standalone mode to use keystores and truststores from a PKI, in
which case there may still be an argument in favor of hostname verification.
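
To illustrate the single-certificate case: trusting exactly one certificate
amounts to pinning it, so the check never looks at a hostname. A minimal
Python sketch, where the function names are hypothetical and the byte
strings are placeholders standing in for DER-encoded certificates:

```python
import hashlib
import hmac


def fingerprint(cert_der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as a hex string."""
    return hashlib.sha256(cert_der).hexdigest()


def is_trusted_peer(presented_der: bytes, trusted_fingerprint: str) -> bool:
    """Accept the peer only if it presents exactly the pinned certificate."""
    # Constant-time comparison; the peer's hostname plays no role here.
    return hmac.compare_digest(fingerprint(presented_der), trusted_fingerprint)


# Placeholder bytes, not real certificates.
cluster_cert = b"placeholder for the DER-encoded cluster-internal certificate"
other_cert = b"placeholder for some other, otherwise valid certificate"

pinned = fingerprint(cluster_cert)
same_cert_ok = is_trusted_peer(cluster_cert, pinned)
other_cert_ok = is_trusted_peer(other_cert, pinned)
```

With a truststore backed by a PKI, many certificates chain to the same root,
so a check like this no longer identifies one specific server, and that is
where hostname verification would come back in.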

On Thu, May 10, 2018, 02:30 Eron Wright <er...@gmail.com> wrote:

> Hello,
>
> Given that some SSL enhancement bugs have been posted lately, I took some
> time to revise FLIP-26 which explores how to harden both external and
> internal communication.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80453255
>
> Some recent related issues:
> - FLINK-9312 - mutual auth for intra-cluster communication
> - FLINK-5030 - original SSL feature work
>
> There's also some recent discussion of how to use Flink SSL effectively in
> a Kubernetes environment.   The issue is about hostname verification.  The
> proposal that I've put forward in FLIP-26 is to not use hostname
> verification for intra-cluster communication, but rather to rely in a
> cluster-internal certificate and a truststore consisting only of that
> certificate.   Meanwhile, a new "external" certificate would be
> configurable for the web/api endpoint and associated with a well-known DNS
> name as provided by a K8s Service resource.
>
> Stephan is this in-line with your thinking re FLINK-9312?
>
> Thanks
> Eron
>