Posted to user@flink.apache.org by Edward Alexander Rojas Clavijo <ed...@gmail.com> on 2018/03/27 13:24:52 UTC

SSL config on Kubernetes - Dynamic IP

Hi all,

Currently I have a Flink 1.4 cluster running on Kubernetes, with SSL
configured according to
https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.

However, as the IPs of the nodes are dynamic (by the nature of
Kubernetes), we rely only on the DNS names, which we can control using
Kubernetes services. So we add to the Subject Alternative Name (SAN) the
flink-jobmanager DNS name and also the DNS names of the task managers,
*.flink-taskmanager-svc (each task manager has a DNS name of the form
flink-taskmanager-0.flink-taskmanager-svc).
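
To make the SAN setup above concrete, here is a rough sketch of how a
wildcard DNS entry is compared with a host name. The class SanMatchSketch and
its methods are hypothetical helpers written only for illustration; they are
not Flink or JDK code, and the real JDK check is stricter. The point is that
an IP literal never matches a DNS-type SAN entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Flink or the JDK.
final class SanMatchSketch {

    /** Returns true if a DNS-style SAN entry (optionally "*.suffix") matches the host. */
    static boolean matchesDnsName(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return p.equals(h);
        }
        // Simplification: the wildcard stands for exactly one left-most label.
        int firstDot = h.indexOf('.');
        if (firstDot <= 0) {
            return false;
        }
        return h.substring(firstDot).equals(p.substring(1));
    }

    /** Returns true if any SAN entry matches the host. */
    static boolean matchesAny(List<String> sans, String host) {
        for (String san : sans) {
            if (matchesDnsName(san, host)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sans = Arrays.asList("flink-jobmanager", "*.flink-taskmanager-svc");
        // A task manager DNS name is covered by the wildcard entry.
        System.out.println(matchesAny(sans, "flink-taskmanager-0.flink-taskmanager-svc"));
        // An IP literal is not covered by any DNS entry.
        System.out.println(matchesAny(sans, "172.30.247.163"));
    }
}
```

With these entries the first call prints true and the second prints false,
which is the situation the certificate is designed for: peers must be
addressed by name, not by IP.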

Additionally, we set the jobmanager.rpc.address property on all the nodes,
and each task manager sets the taskmanager.host property, all matching the
names on the certificate.

This works well when running a job with parallelism set to 1: SSL
validation succeeds and the JobManager can communicate with the TaskManager
and vice versa.

But when we set the parallelism to more than 1, SSL validation fails with
exceptions like this:

Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
    at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
    at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
    at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
    at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
    ... 21 more


From the logs I see the JobManager is correctly registering the
TaskManagers:

org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.

And each TaskManager is correctly configured to use the hostname for
communication:

org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
...
akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
...
org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
...
org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)



But even with that, it seems the TaskManagers are using the IP to
communicate with each other, and the SSL validation fails.
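
As a small illustration of the suspicion above (a sketch only, not taken from
the Flink code base): a java.net.InetAddress carries both a host name and an
IP, and which accessor a client uses decides what the certificate is checked
against. The class HostVersusAddressSketch and its known() helper are
hypothetical names; the address is built from literals so no DNS lookup is
involved.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration class; not part of Flink.
final class HostVersusAddressSketch {

    /** Builds an InetAddress from a known name and IPv4 octets without any DNS lookup. */
    static InetAddress known(String hostName, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    hostName, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // Only thrown for an illegal address length, which cannot happen here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress tm = known(
                "flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local",
                172, 30, 247, 163);
        // Same "name/ip" form as the server address in the NettyConfig log line.
        System.out.println(tm);
        // The name: what a DNS entry in the SAN could match.
        System.out.println(tm.getHostName());
        // The IP literal: what the failing check in the stack trace was given.
        System.out.println(tm.getHostAddress());
    }
}
```

If the client side hands the value of getHostAddress() to the SSL layer, the
check ends up in HostnameChecker.matchIP, as in the stack trace, regardless
of how the TaskManager registered itself.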

Do you know if it's possible to make the TaskManagers use the hostname
instead of the IP to communicate?
Or do you have any advice on getting the SSL configuration to work in this
environment?

Thanks in advance.

Regards,
Edward

Re: SSL config on Kubernetes - Dynamic IP

Posted by Christophe Jolif <cj...@gmail.com>.
By the way Fabian, any chance this issue is looked into / the PR considered
for 1.5?

--
Christophe

On Wed, Apr 4, 2018 at 2:41 PM, Fabian Hueske <fh...@gmail.com> wrote:

> Thank you Edward and Christophe!
>
> 2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <
> edward.rojascl@gmail.com>:
>
>> Hi all,
>>
>> I did some tests based on the PR Christophe mentioned above and by making
>> a change on the NettyClient to use CanonicalHostName instead of
>> HostNameAddress to identify the server, the SSL validation works!!
>>
>> I created a PR with this change:
>> https://github.com/apache/flink/pull/5789
>>
>> Regards,
>> Edward
>>
>> 2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <
>> edward.rojascl@gmail.com>:
>>
>>> Hi Till,
>>>
>>> I just created the JIRA ticket:
>>> https://issues.apache.org/jira/browse/FLINK-9103
>>>
>>> I added the JobManager and TaskManager logs. I hope this helps to resolve
>>> the issue.
>>>
>>> Regards,
>>> Edward
>>>
>>> 2018-03-27 17:48 GMT+02:00 Till Rohrmann <tr...@apache.org>:
>>>
>>>> Hi Edward,
>>>>
>>>> Could you please file a JIRA issue for this problem? It might be as
>>>> simple as the TaskManager's network stack using the IP instead of the
>>>> hostname, as you suggested. But we have to look into this to be sure. The
>>>> logs of the JobManager and of the TaskManagers would also be helpful.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cj...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> I suspect this relates to:
>>>>> https://issues.apache.org/jira/browse/FLINK-5030
>>>>>
>>>>> There was a PR for it at some point, but nothing has been done so
>>>>> far. It seems the current code explicitly uses the IP rather than the
>>>>> hostname for the Netty SSL configuration.
>>>>>
>>>>> Without that, I really wonder how people can reasonably use SSL on a
>>>>> Kubernetes-based Flink cluster, as every time a pod is (re)started it
>>>>> can theoretically get a different IP. Or am I missing something?
>>>>>
>>>>> --
>>>>> Christophe

Re: SSL config on Kubernetes - Dynamic IP

Posted by Fabian Hueske <fh...@gmail.com>.
Thank you Edward and Christophe!


Re: SSL config on Kubernetes - Dynamic IP

Posted by Edward Alexander Rojas Clavijo <ed...@gmail.com>.
Hi all,

I did some tests based on the PR Christophe mentioned above and by making a
change on the NettyClient to use CanonicalHostName instead of
HostNameAddress to identify the server, the SSL validation works!!

I created a PR with this change: https://github.com/apache/flink/pull/5789
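
For readers who want to see why the name handed to the SSL layer matters,
here is a minimal JSSE sketch. It is not the code from the PR; the class
PeerHostSketch and the method clientEngine are hypothetical, and the default
SSLContext is used only so the example is self-contained. The peer host given
to createSSLEngine is what the certificate is checked against once endpoint
identification is enabled.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical illustration class; not the change made in the Flink PR.
final class PeerHostSketch {

    /** Creates a client-mode engine that verifies the server certificate against peerHost. */
    static SSLEngine clientEngine(String peerHost, int port) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine(peerHost, port);
            engine.setUseClientMode(true);
            SSLParameters params = engine.getSSLParameters();
            // "HTTPS" turns on host name verification during the handshake.
            params.setEndpointIdentificationAlgorithm("HTTPS");
            engine.setSSLParameters(params);
            return engine;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        // Passing the IP means the certificate needs a matching IP entry in its SAN.
        SSLEngine byIp = clientEngine("172.30.247.163", 6121);
        // Passing the host name means a DNS entry in the SAN can match.
        SSLEngine byName = clientEngine(
                "flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local", 6121);
        System.out.println(byIp.getPeerHost());
        System.out.println(byName.getPeerHost());
    }
}
```

No handshake is performed here; the sketch only shows where the peer identity
comes from on the client side.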

Regards,
Edward


Re: SSL config on Kubernetes - Dynamic IP

Posted by Edward Alexander Rojas Clavijo <ed...@gmail.com>.
Hi Till,

I just created the JIRA ticket:
https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs. I hope this helps to resolve the
issue.

Regards,
Edward

-- 
*Edward Alexander Rojas Clavijo*
Software Engineer
Hybrid Cloud
IBM France

Re: SSL config on Kubernetes - Dynamic IP

Posted by Sampath Bhat <sa...@gmail.com>.
Hi Edward,

If it suits your purpose, you can set this parameter in flink-conf.yaml to
suppress hostname checking of certificates:
security.ssl.verify-hostname: false

Secondly, I am also running Flink 1.4 on K8s. I used to get the same error
stack trace as you mentioned while the blob client was trying to connect
to the blob server. That issue was resolved by creating a certificate with
only the job manager service name as SAN, and it is working fine.
However, I have not submitted a job with higher parallelism. Since you say
the issue appears when the parallelism is higher, I guess that multiple
task managers are not able to communicate among themselves. Make sure you
have exposed the task manager services correctly; the logs will surely
help.

Jolif, you can use a StatefulSet object in K8s to ensure that the same
hostname is used even if the pod restarts (the pod IP itself is not
guaranteed to stay the same).

On Tue, Mar 27, 2018 at 9:18 PM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Edward,
>
> could you please file a JIRA issue for this problem. It might be as simple
> as that the TaskManager's network stack uses the IP instead of the hostname
> as you suggested. But we have to look into this to be sure. Also the logs
> of the JobManager as well as the TaskManagers could be helpful.
>
> Cheers,
> Till
>
> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cj...@gmail.com>
> wrote:
>
>>
>> I suspect this relates to:
>> https://issues.apache.org/jira/browse/FLINK-5030
>>
>> For which there was a PR at some point but nothing has been done so far.
>> It seems the current code explicitly uses the IP vs Hostname for Netty SSL
>> configuration.
>>
>> Without that, I really wonder how people can reasonably use SSL on a
>> Kubernetes-based Flink cluster, as every time a pod is (re)started it can
>> theoretically get a different IP. Or am I missing something?
>>
>> --
>> Christophe
>>
>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <
>> edward.rojascl@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Currently I have a Flink 1.4 cluster running on Kubernetes, with SSL
>>> configured according to
>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>
>>> However, as the IPs of the nodes are dynamic (by the nature of
>>> Kubernetes), we rely only on the DNS names, which we can control using
>>> Kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>> flink-jobmanager DNS name and also the DNS names of the task managers,
>>> *.flink-taskmanager-svc (each task manager has a DNS name of the form
>>> flink-taskmanager-0.flink-taskmanager-svc).
>>>
>>> Additionally, we set the jobmanager.rpc.address property on all the nodes,
>>> and each task manager sets the taskmanager.host property, all matching the
>>> names on the certificate.
>>>
>>> This works well when running a job with parallelism set to 1: SSL
>>> validation succeeds and the JobManager can communicate with the TaskManager
>>> and vice versa.
>>>
>>> But when we set the parallelism to more than 1, SSL validation fails with
>>> exceptions like this:
>>>
>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>     at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>     at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>     at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>     at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>     ... 21 more
>>>
>>>
>>> From the logs I see the Jobmanager is correctly registering the
>>> taskmanagers:
>>>
>>> org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>
>>> And also each taskmanager is correctly registered to use the hostname
>>> for communication:
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>> ...
>>> akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>> ...
>>> org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>> ...
>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>
>>>
>>>
>>> But even with that, it seems like the taskmanagers are using the IP to
>>> communicate between them and the SSL validation fails.
>>>
>>> Do you know if it's possible to make the taskmanagers use the
>>> hostname to communicate instead of the IP?
>>> or
>>> Do you have any advice to get the SSL configuration to work in this
>>> environment?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Edward
>>>
>>
>>
>>
>> --
>> Christophe
>>
>
>

Re: SSL config on Kubernetes - Dynamic IP

Posted by Till Rohrmann <tr...@apache.org>.
Hi Edward,

Could you please file a JIRA issue for this problem? It might be as simple
as the TaskManager's network stack using the IP instead of the hostname,
as you suggested. But we have to look into this to be sure. The logs of the
JobManager and the TaskManagers would also be helpful.
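
To make the suspicion concrete, here is a minimal Java sketch of how a peer's
hostname can be lost before the TLS layer ever sees it. It is not taken from
Flink's code: the class and method names are hypothetical, and whether Flink's
network stack really does something equivalent is exactly what needs to be
checked. It only relies on the documented behaviour of
InetSocketAddress.getHostString(), which returns the hostname if one is known
and the IP literal otherwise.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch (not Flink code): how a peer hostname can get lost. */
final class PeerHostSketch {

    /** Keeps the configured hostname; no DNS lookup is performed. */
    static String peerFromHostname(String hostname, int port) {
        return InetSocketAddress.createUnresolved(hostname, port).getHostString();
    }

    /** Builds the address from raw bytes only, so just the IP literal is left. */
    static String peerFromRawAddress(byte[] ipv4, int port) {
        try {
            InetAddress address = InetAddress.getByAddress(ipv4);
            return new InetSocketAddress(address, port).getHostString();
        } catch (UnknownHostException e) {
            // Only thrown when the byte array has an illegal length.
            throw new IllegalArgumentException("not a valid IP address", e);
        }
    }

    public static void main(String[] args) {
        byte[] ip = {(byte) 172, 30, (byte) 247, (byte) 163};
        // Hostname survives: this is what a DNS SAN could be checked against.
        System.out.println(peerFromHostname("flink-taskmanager-1.flink-taskmanager-svc", 6121));
        // Hostname is gone: only an IP SAN could match this string.
        System.out.println(peerFromRawAddress(ip, 6121));
    }
}
```

If the string handed to the SSL engine looks like the second case, the
exception in the original report is the expected result, whatever the
certificate's DNS entries say.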

Cheers,
Till

On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cj...@gmail.com> wrote:

>
> I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030
>
> For which there was a PR at some point but nothing has been done so far.
> It seems the current code explicitly uses the IP vs Hostname for Netty SSL
> configuration.
>
> Without that fix, I'm really wondering how people can reasonably use SSL on a
> Kubernetes-based Flink cluster, since every time a pod is (re)started it can
> theoretically get a different IP. Or am I missing something?
>
> --
> Christophe

Re: SSL config on Kubernetes - Dynamic IP

Posted by Christophe Jolif <cj...@gmail.com>.
I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030

There was a PR for it at some point, but nothing has been done so far. It
seems the current code explicitly uses the IP rather than the hostname for
the Netty SSL configuration.

Without that fix, I'm really wondering how people can reasonably use SSL on a
Kubernetes-based Flink cluster, since every time a pod is (re)started it can
theoretically get a different IP. Or am I missing something?
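
To illustrate why a DNS-only certificate cannot pass when the peer is
identified by IP, here is a simplified Java sketch of the identity check. It
is an assumption-laden illustration, not the JDK's actual HostnameChecker
code: the class and method names are made up, IPv6 is ignored, and the
wildcard rule is reduced to "the left-most label may be *". The SAN entries
are the ones from Edward's description.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Simplified illustration of SAN matching (not the JDK's implementation). */
final class SanMatchSketch {

    /** True if the peer string is a dotted-quad IPv4 literal (IPv6 omitted). */
    static boolean isIpv4Literal(String peer) {
        return peer.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    /** Matches a host against a DNS SAN whose left-most label may be "*". */
    static boolean dnsMatches(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return p.equals(h);
        }
        int firstDot = h.indexOf('.');
        // The wildcard covers exactly one label; the remainder must be identical.
        return firstDot > 0 && h.substring(firstDot).equals(p.substring(1));
    }

    /** IP peers are checked only against IP SANs, hostnames only against DNS SANs. */
    static boolean peerMatches(String peer, List<String> dnsSans, List<String> ipSans) {
        if (isIpv4Literal(peer)) {
            return ipSans.contains(peer);
        }
        return dnsSans.stream().anyMatch(san -> dnsMatches(san, peer));
    }

    public static void main(String[] args) {
        List<String> dnsSans = Arrays.asList("flink-jobmanager", "*.flink-taskmanager-svc");
        List<String> ipSans = Collections.emptyList(); // pod IPs are not known in advance
        System.out.println(peerMatches("flink-taskmanager-0.flink-taskmanager-svc", dnsSans, ipSans));
        System.out.println(peerMatches("172.30.247.163", dnsSans, ipSans));
    }
}
```

The point is only that the two kinds of SAN entries are never mixed: once the
peer is given as an IP, the DNS entries in the certificate are not consulted.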

--
Christophe
