Posted to user@flink.apache.org by "Kumar Bolar, Harshith" <hk...@arity.com> on 2020/01/16 11:06:45 UTC

Taskmanager fails to connect to Jobmanager [Could not find any IPv4 address that is not loopback or link-local. Using localhost address.]

Hi all,

We were previously using RHEL for our Flink machines. I'm currently working on moving them over to Ubuntu. When I start the task manager, it fails to connect to the job manager with the following message -

2020-01-16 10:54:42,777 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to select the network interface and address to use by connecting to the leading JobManager.
2020-01-16 10:54:42,778 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2020-01-16 10:54:52,780 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not find any IPv4 address that is not loopback or link-local. Using localhost address.

The network interfaces on the machine look like this:


ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.16.75.30  netmask 255.255.255.128  broadcast 10.16.75.127
        ether 02:f1:8b:34:75:51  txqueuelen 1000  (Ethernet)
        RX packets 69370  bytes 80369110 (80.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 28787  bytes 2898540 (2.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 9562  bytes 1596138 (1.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9562  bytes 1596138 (1.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


Note: On RHEL, the primary network interface was eth0. Could this be the issue?
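
As a quick sanity check on the question above, the addresses from the ifconfig output can be sorted into the categories the warning names. The sketch below uses only Python's standard ipaddress module. The classify helper is a hypothetical name, and this only illustrates the address categories; it is not Flink's actual interface-selection code.

```python
import ipaddress


def classify(addr: str) -> str:
    """Label an IPv4 address as 'loopback', 'link-local', or 'candidate'."""
    ip = ipaddress.IPv4Address(addr)
    if ip.is_loopback:
        return "loopback"
    if ip.is_link_local:
        return "link-local"
    # Anything else is at least a plausible address to bind to.
    return "candidate"


# Addresses copied from the ifconfig output in this post.
interfaces = {"ens5": "10.16.75.30", "lo": "127.0.0.1"}

for name, addr in interfaces.items():
    print(f"{name}: {addr} -> {classify(addr)}")
```

The address on ens5 is neither loopback nor link-local, which hints that the rename from eth0 to ens5 is probably not the cause by itself.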

Here's the full task manager log - https://paste.ubuntu.com/p/vgh96FHzRq/

Thanks
Harshith

Re: Re: Taskmanager fails to connect to Jobmanager [Could not find any IPv4 address that is not loopback or link-local. Using localhost address.]

Posted by "Kumar Bolar, Harshith" <hk...@arity.com>.
Thank you, Yangze and Yang. 

It turns out that the high-availability.cluster-id values on the TM and JM were different. After updating them to match, the issue went away.
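
For anyone hitting the same problem, a mismatch like this can be spotted by diffing the high-availability.* entries of the two flink-conf.yaml files. The following is a minimal sketch, assuming flat "key: value" lines. The helper names and the sample values (zk1:2181, /cluster_a, /cluster_b) are made up for illustration; only the configuration keys are real Flink options.

```python
def parse_flink_conf(text: str) -> dict:
    """Parse flat 'key: value' lines, skipping blanks and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values such as 'zk1:2181' survive.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def ha_mismatches(tm_conf: dict, jm_conf: dict) -> dict:
    """Return {key: (tm_value, jm_value)} for differing high-availability keys."""
    keys = {k for k in tm_conf.keys() | jm_conf.keys()
            if k.startswith("high-availability")}
    return {k: (tm_conf.get(k), jm_conf.get(k))
            for k in sorted(keys)
            if tm_conf.get(k) != jm_conf.get(k)}


# Hypothetical file contents; in practice read each machine's flink-conf.yaml.
JM_CONF = """
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181
high-availability.cluster-id: /cluster_a
"""
TM_CONF = """
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181
high-availability.cluster-id: /cluster_b
"""

for key, (tm, jm) in ha_mismatches(parse_flink_conf(TM_CONF),
                                   parse_flink_conf(JM_CONF)).items():
    print(f"{key}: TM={tm} JM={jm}")
```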



Re: Taskmanager fails to connect to Jobmanager [Could not find any IPv4 address that is not loopback or link-local. Using localhost address.]

Posted by Yangze Guo <ka...@gmail.com>.
Hi, Harshith

To add to Yang's reply: the issue seems to be that something went wrong
when trying to connect to the ResourceManager. There are two
possibilities: either the ResourceManager leader did not write the
znode, or the TaskExecutor failed to connect to it. Turning on DEBUG
logging would help a lot. You could also check the content of the
znode "/leader/resource_manager_lock" in ZooKeeper.
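
To make the znode location concrete, here is a rough sketch of how the full path might be composed. It assumes the layout <root>/<cluster-id>/leader/resource_manager_lock, with /flink and /default as the defaults for high-availability.zookeeper.path.root and high-availability.cluster-id. Both the layout and the defaults are assumptions to verify against the Flink version in use, and resource_manager_znode is a hypothetical helper name.

```python
def resource_manager_znode(root: str = "/flink",
                           cluster_id: str = "/default") -> str:
    """Join root, cluster id, and the leader znode into one absolute path.

    Assumed layout: <root>/<cluster-id>/leader/resource_manager_lock
    """
    parts = (root, cluster_id, "leader", "resource_manager_lock")
    # Strip stray slashes so '/flink' and 'flink/' produce the same result.
    cleaned = [p.strip("/") for p in parts]
    return "/" + "/".join(p for p in cleaned if p)


# Two processes configured with different cluster ids (hypothetical values)
# would end up looking at different znodes.
print(resource_manager_znode(cluster_id="/cluster_a"))
print(resource_manager_znode(cluster_id="/cluster_b"))
```

Under that assumption, a TaskManager and a JobManager with different cluster ids would read and write different paths, so the leader information would never be seen by the other side.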

Best,
Yangze Guo


Re: Taskmanager fails to connect to Jobmanager [Could not find any IPv4 address that is not loopback or link-local. Using localhost address.]

Posted by Yang Wang <da...@gmail.com>.
Hi Kumar Bolar, Harshith,

Could you please check the jobmanager log to find out what address Akka
is listening on?
Also, please check whether that address can be used to connect to the
jobmanager from the taskmanager machine.
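
One simple way to run that check from the taskmanager machine is a plain TCP connection attempt. This is a minimal sketch using Python's standard socket module; can_connect is a hypothetical helper, and the host and port in the usage comment are placeholders for whatever address the jobmanager log reports.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (placeholder host and port, replace with the real address):
#   can_connect("jobmanager-host", 6123)
```

A successful connection only shows that the port is reachable; it does not confirm that the process listening there is the leading jobmanager.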

BTW, if you could share the DEBUG-level logs of the jobmanager and
taskmanager, it would help a lot in finding the root cause.


Best,
Yang
