Posted to user@flink.apache.org by Nikola Hrusov <n....@gmail.com> on 2020/08/12 07:05:53 UTC

Hostname for taskmanagers when running in docker

Hello,

After upgrading the Flink docker image from 1.9 to 1.11.1, the hostnames
of the taskmanagers reported to our metrics show up as IPs (e.g.
10.0.23.101) instead of hostnames.

In the docker compose file we specify the hostname as such:


hostname: "taskmanager-{{.Node.Hostname}}"

Is there another way of achieving this?

Regards,
Nikola Hrusov

Re: Hostname for taskmanagers when running in docker

Posted by Nikola Hrusov <n....@gmail.com>.
Hello,

Thank you both for your input. I accidentally pressed Reply instead of
Reply all; thanks for bringing the discussion back to the user list.

As it stands, there are 2 ways to configure the hostname: 1) using Docker's
hostname property on a service, or 2) using Flink's explicit
taskmanager.host configuration option.
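
For illustration, the two options might look roughly like this in a compose
file. This is only a sketch: the image name flinkimage, the service name
taskmanager01 and the FLINK_PROPERTIES layout are borrowed from the examples
quoted later in this thread, and the two services are shown side by side
purely for comparison.

```yaml
services:
  # Option 1: Docker's hostname property (Swarm expands the template per node).
  taskmanager:
    image: flinkimage
    command: taskmanager
    hostname: "taskmanager-{{.Node.Hostname}}"
    deploy:
      mode: global

  # Option 2: Flink's explicit setting, passed via FLINK_PROPERTIES.
  # In a plain docker-compose setup the service name (taskmanager01)
  # doubles as a resolvable hostname.
  taskmanager01:
    image: flinkimage
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.host: taskmanager01
```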

Prior to 1.11, the taskmanager.host variable was not needed. The cluster
did not seem to take the Docker hostname into account for addressing: under
jobmanager -> taskmanagers, the taskmanagers were listed by their internal
IPs. So the default internal IPs were used by the cluster, while Docker's
hostname attribute was used for the metrics. The metrics format was
flink.host.<host>.job.xxxx, with <host> replaced by the value I had put in
the Docker hostname attribute. My only issue is that I couldn't find
anything in the documentation about such a change, and suddenly my metrics
were no longer using Docker's hostname.

I will try to use the metrics scope and pass the hostname there instead.

Regards,
Nikola Hrusov


On Thu, Nov 5, 2020 at 1:31 AM Chesnay Schepler <ch...@apache.org> wrote:

> There is no convenient cosmetic way to achieve what you want.
> The only approach that would currently work is hard-coding the host into
> the configuration of each taskmanager via the metrics.scope.* configuration
> options.
>
> On 11/4/2020 8:14 PM, Robert Metzger wrote:
>
> I hope it's fine that I moved our discussion back on the list.
>
> You cannot set an arbitrary hostname for the Flink configuration key
> "taskmanager.host". It must be a valid, resolvable hostname within the
> Flink cluster so that the RPC services can reach each other.
> I don't think there's a way to define a custom taskmanager name in the
> Flink metrics.
>
> I'm adding Chesnay to the conversation, since he's very familiar with the
> metrics system.
>
>
> On Wed, Nov 4, 2020 at 6:08 PM Nikola Hrusov <n....@gmail.com> wrote:
>
>> Hello Robert,
>>
>> Thank you for your reply.
>>
>> What you have described is more or less the issue I am experiencing.
>>
>> When using the example you provided (with taskmanager.host
>> "taskmanager01" and not custom one), if you set the *scale* to 3 you
>> will get all jobs under the same hostname, even if they are run on
>> different servers. In that way I cannot identify which server has run the
>> job or which taskmanager. I can only base it on the 32-length hash which
>> changes every time a taskmanager restarts, so it is not a good indicator.
>>
>> The current setup I have is 3 servers and each one of them gets a
>> hostname like "taskmanager-node01" (where servers are node01, node02,
>> node03). That is running in docker swarm, with flink 1.10.2
>>
>>   taskmanager:
>>     image: flinkimage
>>     hostname: "taskmanager-{{.Node.Hostname}}"
>>     command: taskmanager
>>     deploy:
>>       mode: global
>>
>>
>> This is how the taskmanager is being deployed. The `taskmanager.host` is
>> not specified and not needed. The hostname used to be used only for the
>> metrics as far as I can tell.
>>
>>
>> My goal is to put the server name (hostname) in the metrics so I can
>> identify which jobs are running on which server. Previously, Flink's
>> metrics used the provided hostname by default instead of the IP address.
>> Is that change in Flink's behaviour intentional?
>>
>>
>> Regards,
>> Nikola Hrusov
>>
>>
>>
>> On Tue, Nov 3, 2020 at 11:26 PM Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> Hey Nikola,
>>> sorry for the delayed response.
>>>
>>> I just tried the docker-compose files you've provided, and
>>> "docker-compose -f docker-compose.yml up" works for me -- metrics are shown
>>> in the UI, and I'm able to submit jobs via the web UI and the command line
>>> client.
>>>
>>> I got "docker-compose_with_hostname.yml" to work after the following
>>> change:
>>>
>>>
>>> diff --git a/docker-compose_with_hostname.yml b/docker-compose_with_hostname.yml
>>> index d876cb5..c34073f 100644
>>> --- a/docker-compose_with_hostname.yml
>>> +++ b/docker-compose_with_hostname.yml
>>> @@ -26,7 +26,7 @@ services:
>>>          FLINK_PROPERTIES=
>>>          jobmanager.rpc.address: jobmanager
>>>          taskmanager.numberOfTaskSlots: 2
>>> -        taskmanager.host: "taskmanager-node01"
>>> +        taskmanager.host: "taskmanager01"
>>>          metrics.reporter.grph.factory.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
>>>          metrics.reporter.grph.host: graphite
>>>          metrics.reporter.grph.port: 2003
>>>
>>> The name of the docker-compose service is the hostname.
>>>
>>> On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov <n....@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am still trying to find out how to properly set up a cluster with
>>>> Flink 1.11 and receive metrics keyed by hostname.
>>>> In my previous email I outlined that I have to choose between a) receiving
>>>> proper metrics and b) running my jobs. Ideally I should be able to do both,
>>>> as this is possible with Flink 1.10.
>>>>
>>>> Can somebody shed some light on this matter?
>>>>
>>>> Regards,
>>>> Nikola Hrusov
>>>>
>>>>
>>>> On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <n....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Xintong,
>>>>>
>>>>> I have tried using the configuration option *taskmanager.host*, but that
>>>>> actually makes things worse. I have made a simple docker compose setup
>>>>> to make the problem easier to reproduce and explain. You can find the
>>>>> compose files here: https://github.com/nikobearrr/flink-hostname-metrics
>>>>>
>>>>> I have made 2 nearly identical compose files, each of which starts a
>>>>> Flink jobmanager and taskmanager together with Graphite (for metrics).
>>>>> One is called docker-compose.yml and the other
>>>>> docker-compose_with_hostname.yml.
>>>>> The only difference between the two is line #29, which sets the
>>>>> taskmanager.host variable. Both expose port 8081 for the Flink cluster UI
>>>>> and port 8082 for the Graphite UI.
>>>>>
>>>>>
>>>>> Running the setup without the taskmanager.host
>>>>> When you run the compose file without the taskmanager.host variable, the
>>>>> cluster starts just fine and the taskmanager registers. Running a job on
>>>>> that cluster works fine too.
>>>>> The issue is that the metrics in the Graphite UI show the IP (in this
>>>>> case 172.20.0.3) instead of the hostname. That was not the case with
>>>>> Flink versions prior to 1.11.
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> Running the setup with the taskmanager.host
>>>>> Once I run the compose which includes the taskmanager.host variable I
>>>>> can see the cluster UI and it starts up fine. Also the metrics come
>>>>> correctly:
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> However, I then found something wrong.
>>>>>
>>>>> The first thing is that when I go to
>>>>> http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics
>>>>> (where 70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager), I
>>>>> start getting these logs in my console:
>>>>>
>>>>>> jobmanager_1     | 2020-10-27 15:59:44,366 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
>>>>>> jobmanager_1     | 2020-10-27 15:59:55,236 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
>>>>>> jobmanager_1     | 2020-10-27 16:00:07,219 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
>>>>>
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> Also, the metrics for the taskmanager do not show up, as seen in the
>>>>> picture above. If you do not use "taskmanager.host", the metrics show up
>>>>> and there are no such WARN logs for UnknownHostException.
>>>>> More importantly, we also see failures when we run batch jobs, apparently
>>>>> for the same reason. The jobs fail on submission. This only happens when
>>>>> we explicitly set the "taskmanager.host" variable.
>>>>> from taskmanager:
>>>>>
>>>>>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
>>>>>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 with leader id 00000000-0000-0000-0000-000000000000.
>>>>>> 2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved JobManager address, beginning registration
>>>>>> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - *Registration at JobManager was declined: Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname*
>>>>>> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and re-attempting registration in 30000 ms
>>>>>> 2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:1, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId: a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
>>>>>> 2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.
>>>>>
>>>>>
>>>>>
>>>>> from jobmanager:
>>>>>
>>>>>> 2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
>>>>>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
>>>>>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
>>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile ResourceProfile{UNKNOWN} from resource manager.
>>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{UNKNOWN} for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id a4dfa7dfa91bbce7cd091aa1512c0745.
>>>>>> 2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR org.apache.flink.runtime.jobmaster.JobMaster - *Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Both the taskmanager and jobmanager agree on 1 thing:
>>>>> "taskmanager-node01: No address associated with hostname".
>>>>> Setting the hostname explicitly helps the metrics in graphite, but
>>>>> then the job submission/execution does not work, which is even worse than
>>>>> not having the metrics.
>>>>>
>>>>> So my question is: Is there anything more which needs to be set when
>>>>> using the taskmanager.host config? Or perhaps I am doing something wrong
>>>>> with the setup?
>>>>>
>>>>> Regards,
>>>>> Nikola Hrusov
>>>>>
>>>>>
>>>>> On Fri, Aug 14, 2020 at 6:26 AM Xintong Song <to...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Nikola,
>>>>>>
>>>>>> I'm not entirely sure about how this happened. Would need some more
>>>>>> information to investigate, such as the complete configurations for
>>>>>> taskmanagers in your docker compose file, and the taskmanager logs.
>>>>>>
>>>>>> One quick thing you may try is to explicitly set the configuration
>>>>>> option `taskmanager.host` for your task managers, see if that is reflected
>>>>>> in the metrics.
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <n....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> After upgrading the docker image for flink to 1.11.1 from 1.9 the
>>>>>>> hostname of the taskmanagers reported to our metrics show as IPs (e.g.
>>>>>>> 10.0.23.101) instead of hostnames.
>>>>>>>
>>>>>>> In the docker compose file we specify the hostname as such:
>>>>>>>
>>>>>>>
>>>>>>> hostname: "taskmanager-{{.Node.Hostname}}"
>>>>>>>
>>>>>>> Is there another way of achieving this?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Nikola Hrusov
>>>>>>>
>>>>>>
>

Re: Hostname for taskmanagers when running in docker

Posted by Chesnay Schepler <ch...@apache.org>.
There is no convenient cosmetic way to achieve what you want.
The only approach that would currently work is hard-coding the host into 
the configuration of each taskmanager via the metrics.scope.* 
configuration options.
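
As a rough sketch of that approach, the scope formats below replace the
<host> variable with a fixed name. The four keys are Flink's
taskmanager-side scope options, and the remainder of each value is assumed
to mirror the documented defaults. Using the Swarm template
{{.Node.Hostname}} inside an environment value is also an assumption (Swarm
documents template expansion for hostname, mounts and env); without Swarm,
each taskmanager would need its literal name written out instead. The image
name flinkimage is taken from the compose snippet in this thread.

```yaml
services:
  taskmanager:
    image: flinkimage
    command: taskmanager
    deploy:
      mode: global
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        metrics.scope.tm: taskmanager-{{.Node.Hostname}}.taskmanager.<tm_id>
        metrics.scope.tm.job: taskmanager-{{.Node.Hostname}}.taskmanager.<tm_id>.<job_name>
        metrics.scope.task: taskmanager-{{.Node.Hostname}}.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
        metrics.scope.operator: taskmanager-{{.Node.Hostname}}.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
```

With taskmanager.host left unset, RPC addressing should stay on its default,
so only the metric names are affected.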

On 11/4/2020 8:14 PM, Robert Metzger wrote:
> I hope it's fine that I moved our discussion back on the list.
>
> You can not put an arbitrary hostname for the Flink configuration key 
> "taskmanager.host". It must be a valid, resolvable hostname within the 
> Flink cluster so that the RPC services can reach each other.
> I don't think there's a way to define a custom taskmanager name in the 
> Flink metrics.
>
> I'm adding Chesnay to the conversation, since he's very familiar with 
> the metrics system.
>
>
> On Wed, Nov 4, 2020 at 6:08 PM Nikola Hrusov <n.hrusov@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Hello Robert,
>
>     Thank you for your reply.
>
>     What you have described is more or less the issue I am experiencing.
>
>     When using the example you provided (with taskmanager.host
>     "taskmanager01" and not custom one), if you set the *scale* to 3
>     you will get all jobs under the same hostname, even if they are
>     run on different servers. In that way I cannot identify which
>     server has run the job or which taskmanager. I can only base it on
>     the 32-length hash which changes every time a taskmanager
>     restarts, so it is not a good indicator.
>
>     The current setup I have is 3 servers and each one of them gets a
>     hostname like "taskmanager-node01" (where servers are node01,
>     node02, node03). That is running in docker swarm, with flink 1.10.2
>
>           taskmanager:
>             image: flinkimage
>             hostname: "taskmanager-{{.Node.Hostname}}"
>             command: taskmanager
>             deploy:
>               mode: global
>
>
>     This is how the taskmanager is being deployed. The
>     `taskmanager.host` is not specified and not needed. The hostname
>     used to be used only for the metrics as far as I can tell.
>
>
>     My goal is to be able to put the server name (hostname) in metrics
>     so I can identify which jobs are running on them. By default the
>     metrics on flink would use the hostname provided instead of the IP
>     address.
>     Is that change in flink's behaviour intentional/on purpose?
>
>
>     Regards
>     ,
>     Nikola Hrusov
>
>
>
>     On Tue, Nov 3, 2020 at 11:26 PM Robert Metzger
>     <rmetzger@apache.org <ma...@apache.org>> wrote:
>
>         Hey Nikola,
>         sorry for the delayed response.
>
>         I just tried the docker-compose files you've provided, and
>         "docker-compose -f docker-compose.yml up" works for me --
>         metrics are shown in the UI, and I'm able to submit jobs via
>         the web UI and the command line client.
>
>         I got to work the "docker-compose_with_hostname.yml" after the
>         following change:
>
>
>         diff --git a/docker-compose_with_hostname.yml
>         b/docker-compose_with_hostname.yml
>         index d876cb5..c34073f 100644
>         --- a/docker-compose_with_hostname.yml
>         +++ b/docker-compose_with_hostname.yml
>         @@ -26,7 +26,7 @@ services:
>                  FLINK_PROPERTIES=
>                  jobmanager.rpc.address: jobmanager
>                  taskmanager.numberOfTaskSlots: 2
>         -        taskmanager.host: "taskmanager-node01"
>         +        taskmanager.host: "taskmanager01"
>                  metrics.reporter.grph.factory.class:
>         org.apache.flink.metrics.graphite.GraphiteReporterFactory
>                  metrics.reporter.grph.host: graphite
>                  metrics.reporter.grph.port: 2003
>
>         The name of the docker-compose service is the hostname.
>
>         On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov
>         <n.hrusov@gmail.com <ma...@gmail.com>> wrote:
>
>             Hello,
>
>             I am still trying to find how to properly setup a cluster
>             with flink 1.11 and receive metrics on the hostnames.
>             In my previous email I outlined I need to choose: a)
>             receiving proper metrics or b) running my jobs. Ideally I
>             should be able to do both as this is possible with flink 1.10
>
>             Can somebody shed some light on this matter?
>
>             Regards
>             ,
>             Nikola Hrusov
>
>
>             On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov
>             <n.hrusov@gmail.com <ma...@gmail.com>> wrote:
>
>                 Hi Xintong,
>
>                 I have tried using the configuration
>                 *taskmanager.host*, but that actually makes it even
>                 worse. I have made a simple setup with docker compose
>                 to reproduce/explain it easier. You can find the
>                 compose files here:
>                 https://github.com/nikobearrr/flink-hostname-metrics
>
>                 I have made 2 identical compose files which can start
>                 a flink jobmanager and taskmanager together with
>                 graphite (for metrics). One is
>                 called docker-compose.yml and the second
>                 one docker-compose_with_hostname.yml
>                 The only difference between those two is line #29
>                 which is the taskmanager.host variable. They both
>                 expose port 8081 for flink cluster UI and port 8082
>                 for graphite UI.
>
>
>                 Running the setup without the taskmanager.host
>                 When you run the compose without the taskmanager.host
>                 variable the cluster starts just fine and the
>                 taskmanager registers. Running a job on that cluster
>                 would be just fine.
>                 The issue is that if you check the metrics in the
>                 Graphite UI instead of the hostname it will show the
>                 IP (in this case 172.20.0.3). That was not the case
>                 with prior 1.11 version of flink.
>
>                 image.png
>
>
>
>
>
>
>                 Running the setup with the taskmanager.host
>                 Once I run the compose which includes
>                 the taskmanager.host variable I can see the cluster UI
>                 and it starts up fine. Also the metrics come correctly:
>
>                 image.png
>
>                 However, now I found something wrong.
>
>                 The first thing is that when you go to
>                 http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics
>                 where  70704dc334ac8007925409c575e42d7d is the GUID of
>                 the taskmanager I start getting those logs in my console:
>
>                     jobmanager_1     | 2020-10-27 15:59:44,366 WARN
>                      akka.remote.ReliableDeliverySupervisor          
>                             [] - Association with remote system
>                     [akka.tcp://flink-metrics@taskmanager-node01:42269]
>                     has failed, address is now gated for [50] ms.
>                     Reason: [Association failed with
>                     [akka.tcp://flink-metrics@taskmanager-node01:42269]]
>                     Caused by: [java.net.UnknownHostException:
>                     taskmanager-node01: Name or service not known]
>                     jobmanager_1     | 2020-10-27 15:59:55,236 WARN
>                      akka.remote.ReliableDeliverySupervisor          
>                                 [] - Association with remote system
>                     [akka.tcp://flink-metrics@taskmanager-node01:42269]
>                     has failed, address is now gated for [50] ms.
>                     Reason: [Association failed with
>                     [akka.tcp://flink-metrics@taskmanager-node01:42269]]
>                     Caused by: [java.net.UnknownHostException:
>                     taskmanager-node01: Name or service not known]
>                     jobmanager_1     | 2020-10-27 16:00:07,219 WARN
>                      akka.remote.ReliableDeliverySupervisor          
>                                 [] - Association with remote system
>                     [akka.tcp://flink-metrics@taskmanager-node01:42269]
>                     has failed, address is now gated for [50] ms.
>                     Reason: [Association failed with
>                     [akka.tcp://flink-metrics@taskmanager-node01:42269]]
>                     Caused by: [java.net.UnknownHostException:
>                     taskmanager-node01: Name or service not known]
>
>
>                 image.png
>
>                 Also the metrics for the taskmanager do not show as
>                 shown on the picture above. If you do not use
>                 "taskmanager.host" then metrics show and there are no
>                 such WARN logs for UnknownHostException.
>                 More importantly, we also see issues with this when we
>                 run batch jobs for what seems the same issue. The jobs
>                 fail on submission. This only happens when we
>                 explicitly set "taskmanager.host" variable
>
>                 from taskmanager:
>
>                     2020-10-27T17:11:55.646Z
>                     [flink-akka.actor.default-dispatcher-2] INFO
>                     org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService
>                     - Add job 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job
>                     leader monitoring.
>                     2020-10-27T17:11:55.646Z
>                     [flink-akka.actor.default-dispatcher-2] INFO
>                     org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService
>                     - Try to register at job manager
>                     akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
>                     with leader id 00000000-0000-0000-0000-000000000000.
>                     2020-10-27T17:11:55.652Z
>                     [flink-akka.actor.default-dispatcher-2] INFO
>                     org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService
>                     - Resolved JobManager address, beginning registration
>                     2020-10-27T17:11:55.657Z
>                     [flink-akka.actor.default-dispatcher-4] INFO
>                     org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService
>                     - *Registration at JobManager was declined: Could
>                     not accept TaskManager registration. TaskManager
>                     address taskmanager-node01 cannot be resolved.
>                     taskmanager-node01: No address associated with
>                     hostname
>                     *2020-10-27T17:11:55.657Z
>                     [flink-akka.actor.default-dispatcher-4] INFO
>                     org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService
>                     - Pausing and re-attempting registration in 30000 ms
>                     2020-10-27T17:12:25.646Z
>                     [flink-akka.actor.default-dispatcher-4] INFO
>                     org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl
>                     - Free slot TaskSlot(index:1, state:ALLOCATED,
>                     resource profile:
>                     ResourceProfile{cpuCores=1.0000000000000000,
>                     taskHeapMemory=62.400mb (65431141 bytes),
>                     taskOffHeapMemory=0 bytes, managedMemory=62.720mb
>                     (65766687 bytes), networkMemory=15.680mb (16441671
>                     bytes)}, allocationId:
>                     a4dfa7dfa91bbce7cd091aa1512c0745, jobId:
>                     6b6f5ee3a1ab7556fd0db64de0f7cb1d).
>                     2020-10-27T17:12:25.647Z
>                     [flink-akka.actor.default-dispatcher-4] INFO
>                     org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService
>                     - Remove job 6b6f5ee3a1ab7556fd0db64de0f7cb1d from
>                     job leader monitoring.
>
>
>
>                 from jobmanager:
>
>                     2020-10-27T17:11:55.670Z
>                     [flink-akka.actor.default-dispatcher-36] INFO
>                     org.apache.flink.runtime.jobmaster.JobMaster -
>                     Connecting to ResourceManager
>                     akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
>                     2020-10-27T17:11:55.671Z
>                     [flink-akka.actor.default-dispatcher-40] INFO
>                     org.apache.flink.runtime.jobmaster.JobMaster -
>                     Resolved ResourceManager address, beginning
>                     registration
>                     2020-10-27T17:11:55.671Z
>                     [flink-akka.actor.default-dispatcher-36] INFO
>                     org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>                     - Registering job manager
>                     00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
>                     for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>                     2020-10-27T17:11:55.672Z
>                     [flink-akka.actor.default-dispatcher-35] INFO
>                     org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>                     - Registered job manager
>                     00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
>                     for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>                     2020-10-27T17:11:55.672Z
>                     [flink-akka.actor.default-dispatcher-36] INFO
>                     org.apache.flink.runtime.jobmaster.JobMaster -
>                     JobManager successfully registered at
>                     ResourceManager, leader id:
>                     00000000000000000000000000000000.
>                     2020-10-27T17:11:55.672Z
>                     [flink-akka.actor.default-dispatcher-36] INFO
>                     org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl
>                     - Requesting new slot
>                     [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}]
>                     and profile ResourceProfile{UNKNOWN} from resource
>                     manager.
>                     2020-10-27T17:11:55.672Z
>                     [flink-akka.actor.default-dispatcher-40] INFO
>                     org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>                     - Request slot with profile
>                     ResourceProfile{UNKNOWN} for job
>                     6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation
>                     id a4dfa7dfa91bbce7cd091aa1512c0745.
>                     2020-10-27T17:11:55.686Z
>                     [flink-akka.actor.default-dispatcher-40] ERROR
>                     org.apache.flink.runtime.jobmaster.JobMaster -
>                     *Could not accept TaskManager registration.
>                     TaskManager address taskmanager-node01 cannot be
>                     resolved. taskmanager-node01: No address
>                     associated with hostname*
>
>
>
>
>                 Both the taskmanager and jobmanager agree on 1 thing:
>                 "taskmanager-node01: No address associated with
>                 hostname".
>                 Setting the hostname explicitly helps the metrics in
>                 graphite, but then the job submission/execution does
>                 not work, which is even worse than not having the metrics.
>
>                 So my question is: Is there anything more which needs
>                 to be set when using the taskmanager.host config? Or
>                 perhaps I am doing something wrong with the setup?
>
>                 Regards
>                 ,
>                 Nikola Hrusov
>
>
>                 On Fri, Aug 14, 2020 at 6:26 AM Xintong Song
>                 <tonysong820@gmail.com <ma...@gmail.com>>
>                 wrote:
>
>                     Hi Nikola,
>
>                     I'm not entirely sure about how this happened.
>                     Would need some more information to investigate,
>                     such as the complete configurations for
>                     taskmanagers in your docker compose file, and the
>                     taskmanager logs.
>
>                     One quick thing you may try is to explicitly set
>                     the configuration option `taskmanager.host` for
>                     your task managers, see if that is reflected in
>                     the metrics.
>
>                     Thank you~
>
>                     Xintong Song
>
>
>
>                     On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov
>                     <n....@gmail.com>
>                     wrote:
>
>                         Hello,
>
>                         After upgrading the docker image for flink to
>                         1.11.1 from 1.9 the hostname of the
>                         taskmanagers reported to our metrics show as
>                         IPs (e.g. 10.0.23.101) instead of hostnames.
>
>                         In the docker compose file we specify the
>                         hostname as such:
>
>                         */hostname: "taskmanager-{{.Node.Hostname}}"/*
>
>                         Is there another way of achieving this?
>
>                         Regards
>                         ,
>                         Nikola Hrusov
>


Re: Hostname for taskmanagers when running in docker

Posted by Robert Metzger <rm...@apache.org>.
I hope it's fine that I moved our discussion back on the list.

You cannot put an arbitrary hostname in the Flink configuration key
"taskmanager.host". It must be a valid hostname that is resolvable within
the Flink cluster, so that the RPC services can reach each other.
I don't think there's a way to define a custom taskmanager name in the
Flink metrics.

I'm adding Chesnay to the conversation, since he's very familiar with the
metrics system.


On Wed, Nov 4, 2020 at 6:08 PM Nikola Hrusov <n....@gmail.com> wrote:

> Hello Robert,
>
> Thank you for your reply.
>
> What you have described is more or less the issue I am experiencing.
>
> When using the example you provided (with taskmanager.host "taskmanager01"
> and not custom one), if you set the *scale* to 3 you will get all jobs
> under the same hostname, even if they are run on different servers. In that
> way I cannot identify which server has run the job or which taskmanager. I
> can only base it on the 32-length hash which changes every time a
> taskmanager restarts, so it is not a good indicator.
>
> The current setup I have is 3 servers and each one of them gets a hostname
> like "taskmanager-node01" (where servers are node01, node02, node03). That
> is running in docker swarm, with flink 1.10.2
>
>   taskmanager:
>>     image: flinkimage
>>     hostname: "taskmanager-{{.Node.Hostname}}"
>>     command: taskmanager
>>     deploy:
>>       mode: global
>
>
> This is how the taskmanager is being deployed. The `taskmanager.host` is
> not specified and not needed. The hostname used to be used only for the
> metrics as far as I can tell.
>
>
> My goal is to be able to put the server name (hostname) in metrics so I
> can identify which jobs are running on them. By default the metrics on
> flink would use the hostname provided instead of the IP address.
> Is that change in flink's behaviour intentional/on purpose?
>
>
> Regards
> ,
> Nikola Hrusov
>
>
>
> On Tue, Nov 3, 2020 at 11:26 PM Robert Metzger <rm...@apache.org>
> wrote:
>
>> Hey Nikola,
>> sorry for the delayed response.
>>
>> I just tried the docker-compose files you've provided, and
>> "docker-compose -f docker-compose.yml up" works for me -- metrics are shown
>> in the UI, and I'm able to submit jobs via the web UI and the command line
>> client.
>>
>> I got to work the "docker-compose_with_hostname.yml" after the following
>> change:
>>
>>
>> diff --git a/docker-compose_with_hostname.yml
>> b/docker-compose_with_hostname.yml
>> index d876cb5..c34073f 100644
>> --- a/docker-compose_with_hostname.yml
>> +++ b/docker-compose_with_hostname.yml
>> @@ -26,7 +26,7 @@ services:
>>          FLINK_PROPERTIES=
>>          jobmanager.rpc.address: jobmanager
>>          taskmanager.numberOfTaskSlots: 2
>> -        taskmanager.host: "taskmanager-node01"
>> +        taskmanager.host: "taskmanager01"
>>          metrics.reporter.grph.factory.class:
>> org.apache.flink.metrics.graphite.GraphiteReporterFactory
>>          metrics.reporter.grph.host: graphite
>>          metrics.reporter.grph.port: 2003
>>
>> The name of the docker-compose service is the hostname.
>>
>> On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov <n....@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am still trying to find how to properly setup a cluster with flink
>>> 1.11 and receive metrics on the hostnames.
>>> In my previous email I outlined I need to choose: a) receiving proper
>>> metrics or b) running my jobs. Ideally I should be able to do both as this
>>> is possible with flink 1.10
>>>
>>> Can somebody shed some light on this matter?
>>>
>>> Regards
>>> ,
>>> Nikola Hrusov
>>>
>>>
>>> On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <n....@gmail.com>
>>> wrote:
>>>
>>>> Hi Xintong,
>>>>
>>>> I have tried using the configuration *taskmanager.host*, but that
>>>> actually makes it even worse. I have made a simple setup with docker
>>>> compose to reproduce/explain it easier. You can find the compose files
>>>> here: https://github.com/nikobearrr/flink-hostname-metrics
>>>>
>>>> I have made 2 identical compose files which can start a flink
>>>> jobmanager and taskmanager together with graphite (for metrics). One is
>>>> called docker-compose.yml and the second
>>>> one docker-compose_with_hostname.yml
>>>> The only difference between those two is line #29 which is the
>>>> taskmanager.host variable. They both expose port 8081 for flink cluster UI
>>>> and port 8082 for graphite UI.
>>>>
>>>>
>>>> Running the setup without the taskmanager.host
>>>> When you run the compose without the taskmanager.host variable the
>>>> cluster starts just fine and the taskmanager registers. Running a job on
>>>> that cluster would be just fine.
>>>> The issue is that if you check the metrics in the Graphite UI instead
>>>> of the hostname it will show the IP (in this case 172.20.0.3). That was not
>>>> the case with prior 1.11 version of flink.
>>>>
>>>> [image: image.png]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Running the setup with the taskmanager.host
>>>> Once I run the compose which includes the taskmanager.host variable I
>>>> can see the cluster UI and it starts up fine. Also the metrics come
>>>> correctly:
>>>>
>>>> [image: image.png]
>>>>
>>>> However, now I found something wrong.
>>>>
>>>> The first thing is that when you go to
>>>> http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics
>>>> where  70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager I
>>>> start getting those logs in my console:
>>>>
>>>> jobmanager_1     | 2020-10-27 15:59:44,366 WARN
>>>>>  akka.remote.ReliableDeliverySupervisor                       [] -
>>>>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>>>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>>>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>>>>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>>>>> known]
>>>>> jobmanager_1     | 2020-10-27 15:59:55,236 WARN
>>>>>  akka.remote.ReliableDeliverySupervisor                       [] -
>>>>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>>>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>>>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>>>>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>>>>> known]
>>>>> jobmanager_1     | 2020-10-27 16:00:07,219 WARN
>>>>>  akka.remote.ReliableDeliverySupervisor                       [] -
>>>>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>>>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>>>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>>>>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>>>>> known]
>>>>
>>>>
>>>> [image: image.png]
>>>>
>>>> Also the metrics for the taskmanager do not show as shown on the
>>>> picture above. If you do not use "taskmanager.host" then metrics show and
>>>> there are no such WARN logs for UnknownHostException.
>>>> More importantly, we also see issues with this when we run batch jobs
>>>> for what seems the same issue. The jobs fail on submission. This only
>>>> happens when we explicitly set "taskmanager.host" variable
>>>>
>>>> from taskmanager:
>>>>
>>>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
>>>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job
>>>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
>>>>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
>>>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to
>>>>> register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
>>>>> with leader id 00000000-0000-0000-0000-000000000000.
>>>>> 2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO
>>>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved
>>>>> JobManager address, beginning registration
>>>>> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO
>>>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService -
>>>>> *Registration at JobManager was declined: Could not accept TaskManager
>>>>> registration. TaskManager address taskmanager-node01 cannot be resolved.
>>>>> taskmanager-node01: No address associated with hostname*
>>>>> 2020-10-27T17:11:55.657Z
>>>>> [flink-akka.actor.default-dispatcher-4] INFO
>>>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and
>>>>> re-attempting registration in 30000 ms
>>>>> 2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO
>>>>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot
>>>>> TaskSlot(index:1, state:ALLOCATED, resource profile:
>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb
>>>>> (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb
>>>>> (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId:
>>>>> a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
>>>>> 2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO
>>>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job
>>>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.
>>>>
>>>>
>>>>
>>>> from jobmanager:
>>>>
>>>> 2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster - Connecting to
>>>>> ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
>>>>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager
>>>>> address, beginning registration
>>>>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>>>> Registering job manager 00000000000000000000000000000000
>>>>> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
>>>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>>>> Registered job manager 00000000000000000000000000000000
>>>>> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
>>>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully
>>>>> registered at ResourceManager, leader id: 00000000000000000000000000000000.
>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new
>>>>> slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile
>>>>> ResourceProfile{UNKNOWN} from resource manager.
>>>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>>>> Request slot with profile ResourceProfile{UNKNOWN} for job
>>>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id
>>>>> a4dfa7dfa91bbce7cd091aa1512c0745.
>>>>> 2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40]
>>>>> ERROR org.apache.flink.runtime.jobmaster.JobMaster - *Could not
>>>>> accept TaskManager registration. TaskManager address taskmanager-node01
>>>>> cannot be resolved. taskmanager-node01: No address associated with hostname*
>>>>
>>>>
>>>>
>>>>
>>>> Both the taskmanager and jobmanager agree on 1 thing:
>>>> "taskmanager-node01: No address associated with hostname".
>>>> Setting the hostname explicitly helps the metrics in graphite, but then
>>>> the job submission/execution does not work, which is even worse than not
>>>> having the metrics.
>>>>
>>>> So my question is: Is there anything more which needs to be set when
>>>> using the taskmanager.host config? Or perhaps I am doing something wrong
>>>> with the setup?
>>>>
>>>> Regards
>>>> ,
>>>> Nikola Hrusov
>>>>
>>>>
>>>> On Fri, Aug 14, 2020 at 6:26 AM Xintong Song <to...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Nikola,
>>>>>
>>>>> I'm not entirely sure about how this happened. Would need some more
>>>>> information to investigate, such as the complete configurations for
>>>>> taskmanagers in your docker compose file, and the taskmanager logs.
>>>>>
>>>>> One quick thing you may try is to explicitly set the configuration
>>>>> option `taskmanager.host` for your task managers, see if that is reflected
>>>>> in the metrics.
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <n....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> After upgrading the docker image for flink to 1.11.1 from 1.9 the
>>>>>> hostname of the taskmanagers reported to our metrics show as IPs (e.g.
>>>>>> 10.0.23.101) instead of hostnames.
>>>>>>
>>>>>> In the docker compose file we specify the hostname as such:
>>>>>>
>>>>>>
>>>>>> *hostname: "taskmanager-{{.Node.Hostname}}"*
>>>>>>
>>>>>> Is there another way of achieving this?
>>>>>>
>>>>>> Regards
>>>>>> ,
>>>>>> Nikola Hrusov
>>>>>>
>>>>>

Re: Hostname for taskmanagers when running in docker

Posted by Robert Metzger <rm...@apache.org>.
Hey Nikola,
sorry for the delayed response.

I just tried the docker-compose files you've provided, and "docker-compose
-f docker-compose.yml up" works for me -- metrics are shown in the UI, and
I'm able to submit jobs via the web UI and the command line client.

I got the "docker-compose_with_hostname.yml" to work after the following
change:


diff --git a/docker-compose_with_hostname.yml b/docker-compose_with_hostname.yml
index d876cb5..c34073f 100644
--- a/docker-compose_with_hostname.yml
+++ b/docker-compose_with_hostname.yml
@@ -26,7 +26,7 @@ services:
         FLINK_PROPERTIES=
         jobmanager.rpc.address: jobmanager
         taskmanager.numberOfTaskSlots: 2
-        taskmanager.host: "taskmanager-node01"
+        taskmanager.host: "taskmanager01"
         metrics.reporter.grph.factory.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
         metrics.reporter.grph.host: graphite
         metrics.reporter.grph.port: 2003

The name of the docker-compose service is the hostname.
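
To illustrate that point, here is a minimal sketch of how such a compose
file might be laid out. The service names, the image tag and the exact
environment layout are assumptions made for illustration; they are not
copied from the repository linked in this thread. The idea is only that the
value of taskmanager.host matches a name the jobmanager container can
resolve, which on a compose network includes the service name.

```yaml
# Hypothetical docker-compose fragment. Service names and image tag are
# assumptions for illustration only.
version: "3"
services:
  jobmanager:
    image: flink:1.11.1
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  # The service name doubles as a DNS name on the compose network, so
  # taskmanager.host below is set to exactly this name.
  taskmanager01:
    image: flink:1.11.1
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        taskmanager.host: taskmanager01
```

Note that a fixed value like this is shared by every replica of the service,
so it probably only suits a setup with one taskmanager per service
definition.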

On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov <n....@gmail.com> wrote:

> Hello,
>
> I am still trying to find how to properly setup a cluster with flink 1.11
> and receive metrics on the hostnames.
> In my previous email I outlined I need to choose: a) receiving proper
> metrics or b) running my jobs. Ideally I should be able to do both as this
> is possible with flink 1.10
>
> Can somebody shed some light on this matter?
>
> Regards
> ,
> Nikola Hrusov
>
>
> On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <n....@gmail.com> wrote:
>
>> Hi Xintong,
>>
>> I have tried using the configuration *taskmanager.host*, but that
>> actually makes it even worse. I have made a simple setup with docker
>> compose to reproduce/explain it easier. You can find the compose files
>> here: https://github.com/nikobearrr/flink-hostname-metrics
>>
>> I have made 2 identical compose files which can start a flink jobmanager
>> and taskmanager together with graphite (for metrics). One is
>> called docker-compose.yml and the second
>> one docker-compose_with_hostname.yml
>> The only difference between those two is line #29 which is the
>> taskmanager.host variable. They both expose port 8081 for flink cluster UI
>> and port 8082 for graphite UI.
>>
>>
>> Running the setup without the taskmanager.host
>> When you run the compose without the taskmanager.host variable the
>> cluster starts just fine and the taskmanager registers. Running a job on
>> that cluster would be just fine.
>> The issue is that if you check the metrics in the Graphite UI instead of
>> the hostname it will show the IP (in this case 172.20.0.3). That was not
>> the case with prior 1.11 version of flink.
>>
>> [image: image.png]
>>
>>
>>
>>
>>
>>
>> Running the setup with the taskmanager.host
>> Once I run the compose which includes the taskmanager.host variable I can
>> see the cluster UI and it starts up fine. Also the metrics come correctly:
>>
>> [image: image.png]
>>
>> However, now I found something wrong.
>>
>> The first thing is that when you go to
>> http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics
>> where  70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager I
>> start getting those logs in my console:
>>
>> jobmanager_1     | 2020-10-27 15:59:44,366 WARN
>>>  akka.remote.ReliableDeliverySupervisor                       [] -
>>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>>> known]
>>> jobmanager_1     | 2020-10-27 15:59:55,236 WARN
>>>  akka.remote.ReliableDeliverySupervisor                       [] -
>>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>>> known]
>>> jobmanager_1     | 2020-10-27 16:00:07,219 WARN
>>>  akka.remote.ReliableDeliverySupervisor                       [] -
>>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>>> known]
>>
>>
>> [image: image.png]
>>
>> Also the metrics for the taskmanager do not show as shown on the picture
>> above. If you do not use "taskmanager.host" then metrics show and there are
>> no such WARN logs for UnknownHostException.
>> More importantly, we also see issues with this when we run batch jobs for
>> what seems the same issue. The jobs fail on submission. This only happens
>> when we explicitly set "taskmanager.host" variable
>>
>> from taskmanager:
>>
>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job
>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
>>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to
>>> register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
>>> with leader id 00000000-0000-0000-0000-000000000000.
>>> 2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO
>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved
>>> JobManager address, beginning registration
>>> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO
>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService -
>>> *Registration at JobManager was declined: Could not accept TaskManager
>>> registration. TaskManager address taskmanager-node01 cannot be resolved.
>>> taskmanager-node01: No address associated with hostname*
>>> 2020-10-27T17:11:55.657Z
>>> [flink-akka.actor.default-dispatcher-4] INFO
>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and
>>> re-attempting registration in 30000 ms
>>> 2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO
>>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot
>>> TaskSlot(index:1, state:ALLOCATED, resource profile:
>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb
>>> (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb
>>> (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId:
>>> a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
>>> 2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO
>>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job
>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.
>>
>>
>>
>> from jobmanager:
>>
>> 2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster - Connecting to
>>> ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
>>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager
>>> address, beginning registration
>>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>> Registering job manager 00000000000000000000000000000000
>>> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>> Registered job manager 00000000000000000000000000000000
>>> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully
>>> registered at ResourceManager, leader id: 00000000000000000000000000000000.
>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new
>>> slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile
>>> ResourceProfile{UNKNOWN} from resource manager.
>>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>> Request slot with profile ResourceProfile{UNKNOWN} for job
>>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id
>>> a4dfa7dfa91bbce7cd091aa1512c0745.
>>> 2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR
>>> org.apache.flink.runtime.jobmaster.JobMaster - *Could not accept
>>> TaskManager registration. TaskManager address taskmanager-node01 cannot be
>>> resolved. taskmanager-node01: No address associated with hostname*
>>
>>
>>
>>
>> Both the taskmanager and jobmanager agree on 1 thing:
>> "taskmanager-node01: No address associated with hostname".
>> Setting the hostname explicitly helps the metrics in graphite, but then
>> the job submission/execution does not work, which is even worse than not
>> having the metrics.
>>
>> So my question is: Is there anything more which needs to be set when
>> using the taskmanager.host config? Or perhaps I am doing something wrong
>> with the setup?
>>
>> Regards
>> ,
>> Nikola Hrusov
>>
>>
>> On Fri, Aug 14, 2020 at 6:26 AM Xintong Song <to...@gmail.com>
>> wrote:
>>
>>> Hi Nikola,
>>>
>>> I'm not entirely sure about how this happened. Would need some more
>>> information to investigate, such as the complete configurations for
>>> taskmanagers in your docker compose file, and the taskmanager logs.
>>>
>>> One quick thing you may try is to explicitly set the configuration
>>> option `taskmanager.host` for your task managers, see if that is reflected
>>> in the metrics.
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>>
>>>
>>> On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <n....@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> After upgrading the docker image for flink to 1.11.1 from 1.9 the
>>>> hostname of the taskmanagers reported to our metrics show as IPs (e.g.
>>>> 10.0.23.101) instead of hostnames.
>>>>
>>>> In the docker compose file we specify the hostname as such:
>>>>
>>>>
>>>> *hostname: "taskmanager-{{.Node.Hostname}}"*
>>>>
>>>> Is there another way of achieving this?
>>>>
>>>> Regards
>>>> ,
>>>> Nikola Hrusov
>>>>
>>>

Re: Hostname for taskmanagers when running in docker

Posted by Nikola Hrusov <n....@gmail.com>.
Hello,

I am still trying to find out how to properly set up a cluster with Flink
1.11 and receive metrics keyed by hostname.
In my previous email I outlined that I currently have to choose between
a) receiving proper metrics and b) running my jobs. Ideally I should be
able to do both, as this is possible with Flink 1.10.

Can somebody shed some light on this matter?

Regards
,
Nikola Hrusov


On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <n....@gmail.com> wrote:

> Hi Xintong,
>
> I have tried using the configuration *taskmanager.host*, but that
> actually makes it even worse. I have made a simple setup with docker
> compose to reproduce/explain it easier. You can find the compose files
> here: https://github.com/nikobearrr/flink-hostname-metrics
>
> I have made 2 identical compose files which can start a flink jobmanager
> and taskmanager together with graphite (for metrics). One is
> called docker-compose.yml and the second
> one docker-compose_with_hostname.yml
> The only difference between those two is line #29 which is the
> taskmanager.host variable. They both expose port 8081 for flink cluster UI
> and port 8082 for graphite UI.
>
>
> Running the setup without the taskmanager.host
> When you run the compose without the taskmanager.host variable the cluster
> starts just fine and the taskmanager registers. Running a job on that
> cluster would be just fine.
> The issue is that if you check the metrics in the Graphite UI instead of
> the hostname it will show the IP (in this case 172.20.0.3). That was not
> the case with prior 1.11 version of flink.
>
> [image: image.png]
>
>
>
>
>
>
> Running the setup with the taskmanager.host
> Once I run the compose which includes the taskmanager.host variable I can
> see the cluster UI and it starts up fine. Also the metrics come correctly:
>
> [image: image.png]
>
> However, now I found something wrong.
>
> The first thing is that when you go to
> http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics
> where  70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager I
> start getting those logs in my console:
>
> jobmanager_1     | 2020-10-27 15:59:44,366 WARN
>>  akka.remote.ReliableDeliverySupervisor                       [] -
>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>> known]
>> jobmanager_1     | 2020-10-27 15:59:55,236 WARN
>>  akka.remote.ReliableDeliverySupervisor                       [] -
>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>> known]
>> jobmanager_1     | 2020-10-27 16:00:07,219 WARN
>>  akka.remote.ReliableDeliverySupervisor                       [] -
>> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
>> [java.net.UnknownHostException: taskmanager-node01: Name or service not
>> known]
>
>
> [image: image.png]
>
> Also the metrics for the taskmanager do not show as shown on the picture
> above. If you do not use "taskmanager.host" then metrics show and there are
> no such WARN logs for UnknownHostException.
> More importantly, we also see issues with this when we run batch jobs for
> what seems the same issue. The jobs fail on submission. This only happens
> when we explicitly set "taskmanager.host" variable
>
> from taskmanager:
>
> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job
>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
>> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to
>> register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
>> with leader id 00000000-0000-0000-0000-000000000000.
>> 2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO
>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved
>> JobManager address, beginning registration
>> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO
>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService -
>> *Registration at JobManager was declined: Could not accept TaskManager
>> registration. TaskManager address taskmanager-node01 cannot be resolved.
>> taskmanager-node01: No address associated with hostname*
>> 2020-10-27T17:11:55.657Z
>> [flink-akka.actor.default-dispatcher-4] INFO
>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and
>> re-attempting registration in 30000 ms
>> 2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO
>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot
>> TaskSlot(index:1, state:ALLOCATED, resource profile:
>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb
>> (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb
>> (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId:
>> a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
>> 2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO
>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job
>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.
>
>
>
> from jobmanager:
>
> 2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO
>> org.apache.flink.runtime.jobmaster.JobMaster - Connecting to
>> ResourceManager
>> akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO
>> org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager
>> address, beginning registration
>> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>> Registering job manager 00000000000000000000000000000000
>> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>> Registered job manager 00000000000000000000000000000000
>> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
>> org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully
>> registered at ResourceManager, leader id: 00000000000000000000000000000000.
>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new
>> slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile
>> ResourceProfile{UNKNOWN} from resource manager.
>> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>> Request slot with profile ResourceProfile{UNKNOWN} for job
>> 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id
>> a4dfa7dfa91bbce7cd091aa1512c0745.
>> 2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR
>> org.apache.flink.runtime.jobmaster.JobMaster - *Could not accept
>> TaskManager registration. TaskManager address taskmanager-node01 cannot be
>> resolved. taskmanager-node01: No address associated with hostname*
>
>
>
>
> Both the taskmanager and jobmanager agree on 1 thing: "taskmanager-node01:
> No address associated with hostname".
> Setting the hostname explicitly helps the metrics in graphite, but then
> the job submission/execution does not work, which is even worse than not
> having the metrics.
>
> So my question is: Is there anything more which needs to be set when using
> the taskmanager.host config? Or perhaps I am doing something wrong with the
> setup?
>
> Regards,
> Nikola Hrusov
>
>
> On Fri, Aug 14, 2020 at 6:26 AM Xintong Song <to...@gmail.com>
> wrote:
>
>> Hi Nikola,
>>
>> I'm not entirely sure about how this happened. Would need some more
>> information to investigate, such as the complete configurations for
>> taskmanagers in your docker compose file, and the taskmanager logs.
>>
>> One quick thing you may try is to explicitly set the configuration option
>> `taskmanager.host` for your task managers, see if that is reflected in the
>> metrics.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <n....@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> After upgrading the docker image for flink to 1.11.1 from 1.9 the
>>> hostname of the taskmanagers reported to our metrics show as IPs (e.g.
>>> 10.0.23.101) instead of hostnames.
>>>
>>> In the docker compose file we specify the hostname as such:
>>>
>>>
>>> *hostname: "taskmanager-{{.Node.Hostname}}"*
>>>
>>> Is there another way of achieving this?
>>>
>>> Regards,
>>> Nikola Hrusov
>>>
>>

Re: Hostname for taskmanagers when running in docker

Posted by Nikola Hrusov <n....@gmail.com>.
Hi Xintong,

I have tried setting the *taskmanager.host* configuration option, but that
actually makes things worse. I have put together a simple docker compose
setup to make the problem easier to reproduce and explain. You can find the
compose files here:
https://github.com/nikobearrr/flink-hostname-metrics

I have made two nearly identical compose files, each of which starts a flink
jobmanager and taskmanager together with graphite (for metrics). One is
called docker-compose.yml and the other docker-compose_with_hostname.yml.
The only difference between the two is line #29, which sets the
taskmanager.host variable. Both expose port 8081 for the flink cluster UI
and port 8082 for the graphite UI.


Running the setup without the taskmanager.host
When you run the compose file without the taskmanager.host variable, the
cluster starts just fine and the taskmanager registers. Running a job on
that cluster works as well.
The issue is that the metrics in the Graphite UI show the IP (in this case
172.20.0.3) instead of the hostname. That was not the case with flink
versions prior to 1.11.

[image: image.png]

Running the setup with the taskmanager.host
When I run the compose file that includes the taskmanager.host variable, the
cluster starts up fine and I can see the cluster UI. The metrics also arrive
with the correct hostname:

[image: image.png]

However, other things now go wrong.

The first is that when I go to
http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics
(where 70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager), I
start getting these logs in my console:

jobmanager_1     | 2020-10-27 15:59:44,366 WARN
>  akka.remote.ReliableDeliverySupervisor                       [] -
> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
> [java.net.UnknownHostException: taskmanager-node01: Name or service not
> known]
> jobmanager_1     | 2020-10-27 15:59:55,236 WARN
>  akka.remote.ReliableDeliverySupervisor                       [] -
> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
> [java.net.UnknownHostException: taskmanager-node01: Name or service not
> known]
> jobmanager_1     | 2020-10-27 16:00:07,219 WARN
>  akka.remote.ReliableDeliverySupervisor                       [] -
> Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by:
> [java.net.UnknownHostException: taskmanager-node01: Name or service not
> known]
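
The WARN lines above all name the same host. As a rough illustration only, a
small Python sketch can count which hostnames appear in such
UnknownHostException messages, even when the mail client has wrapped and
quoted the log lines. The function name unresolved_hosts and the regular
expression are assumptions made for this example and are not part of flink.

```python
import re
from collections import Counter

# Matches "java.net.UnknownHostException: <host>" and captures the host name.
# The pattern is an assumption based on the log lines quoted in this thread.
UNKNOWN_HOST = re.compile(r"java\.net\.UnknownHostException:\s*([^\s:\]]+)")


def unresolved_hosts(log_text):
    """Count host names reported in UnknownHostException messages."""
    # Strip mail quote markers and re-join lines that were hard-wrapped.
    flattened = " ".join(
        line.lstrip("> ").strip() for line in log_text.splitlines()
    )
    return Counter(UNKNOWN_HOST.findall(flattened))
```

Feeding it the jobmanager output (for example the text captured from
"docker-compose logs jobmanager") gives a quick per-host count of failed
lookups.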


[image: image.png]

Also, the taskmanager metrics are not displayed in the UI, as shown in the
picture above. If you do not use "taskmanager.host", the metrics are
displayed and there are no such WARN logs for UnknownHostException.
More importantly, batch jobs fail on submission, apparently for the same
reason. This only happens when we explicitly set the "taskmanager.host"
variable.

from taskmanager:

2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job
> 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
> 2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to
> register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3
> with leader id 00000000-0000-0000-0000-000000000000.
> 2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved
> JobManager address, beginning registration
> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService -
> *Registration at JobManager was declined: Could not accept TaskManager
> registration. TaskManager address taskmanager-node01 cannot be resolved.
> taskmanager-node01: No address associated with hostname*
> 2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and
> re-attempting registration in 30000 ms
> 2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot
> TaskSlot(index:1, state:ALLOCATED, resource profile:
> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb
> (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb
> (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId:
> a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
> 2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO
> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job
> 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.



from jobmanager:

2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO
> org.apache.flink.runtime.jobmaster.JobMaster - Connecting to
> ResourceManager
> akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager
> address, beginning registration
> 2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 for job
> 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
> org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully
> registered at ResourceManager, leader id: 00000000000000000000000000000000.
> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new
> slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id
> a4dfa7dfa91bbce7cd091aa1512c0745.
> 2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR
> org.apache.flink.runtime.jobmaster.JobMaster - *Could not accept
> TaskManager registration. TaskManager address taskmanager-node01 cannot be
> resolved. taskmanager-node01: No address associated with hostname*




Both the taskmanager and the jobmanager agree on one thing:
"taskmanager-node01: No address associated with hostname".
Setting the hostname explicitly fixes the metrics in graphite, but then job
submission/execution does not work, which is even worse than not having the
metrics.
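
One way to check the resolution problem independently of flink is to ask the
resolver directly from inside the jobmanager container. The following is a
minimal sketch, assuming a Python interpreter is available there (which may
not be true for the stock flink image); the helper name resolves is
hypothetical. It only uses socket.getaddrinfo from the standard library.

```python
import socket


def resolves(hostname):
    """Return the sorted IP addresses a name resolves to, or [] if none."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The resolver could not find the name, which is the same condition
        # the JobManager reports as "No address associated with hostname".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})
```

Calling resolves("taskmanager-node01") and resolves("jobmanager") from the
jobmanager container would show whether the name set in taskmanager.host is
known to the docker network at all; an empty list for the first name would
match the errors in the logs above.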

So my question is: is there anything else that needs to be set when using
the taskmanager.host config? Or am I perhaps doing something wrong in the
setup?

Regards,
Nikola Hrusov


On Fri, Aug 14, 2020 at 6:26 AM Xintong Song <to...@gmail.com> wrote:

> Hi Nikola,
>
> I'm not entirely sure about how this happened. Would need some more
> information to investigate, such as the complete configurations for
> taskmanagers in your docker compose file, and the taskmanager logs.
>
> One quick thing you may try is to explicitly set the configuration option
> `taskmanager.host` for your task managers, see if that is reflected in the
> metrics.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <n....@gmail.com> wrote:
>
>> Hello,
>>
>> After upgrading the docker image for flink to 1.11.1 from 1.9 the
>> hostname of the taskmanagers reported to our metrics show as IPs (e.g.
>> 10.0.23.101) instead of hostnames.
>>
>> In the docker compose file we specify the hostname as such:
>>
>>
>> *hostname: "taskmanager-{{.Node.Hostname}}"*
>>
>> Is there another way of achieving this?
>>
>> Regards,
>> Nikola Hrusov
>>
>

Re: Hostname for taskmanagers when running in docker

Posted by Xintong Song <to...@gmail.com>.
Hi Nikola,

I'm not entirely sure how this happened. I would need some more information
to investigate, such as the complete configuration for the taskmanagers in
your docker compose file, and the taskmanager logs.

One quick thing you may try is to explicitly set the configuration option
`taskmanager.host` for your task managers, and see whether that is reflected
in the metrics.

Thank you~

Xintong Song



On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <n....@gmail.com> wrote:

> Hello,
>
> After upgrading the docker image for flink to 1.11.1 from 1.9 the hostname
> of the taskmanagers reported to our metrics show as IPs (e.g. 10.0.23.101)
> instead of hostnames.
>
> In the docker compose file we specify the hostname as such:
>
>
> *hostname: "taskmanager-{{.Node.Hostname}}"*
>
> Is there another way of achieving this?
>
> Regards,
> Nikola Hrusov
>