Posted to dev@flink.apache.org by "Niklas Semmler (Jira)" <ji...@apache.org> on 2022/03/06 21:02:00 UTC

[jira] [Created] (FLINK-26503) Multi-network deployments may lead to hard-to-debug issues

Niklas Semmler created FLINK-26503:
--------------------------------------

             Summary: Multi-network deployments may lead to hard-to-debug issues
                 Key: FLINK-26503
                 URL: https://issues.apache.org/jira/browse/FLINK-26503
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.15.0
            Reporter: Niklas Semmler


The way the TaskManager defines its receiving network address may create hard-to-debug issues for deployment setups where a TaskManager can reach a JobManager over more than one interface. Additionally, the following conditions have to be met:
* The JobManager is unreachable for some time on the TaskManager's start-up
* The receiving address {{taskmanager.host}} is not specified while the {{taskmanager.bind-host}} is set to a concrete value (and not {{0.0.0.0}}).

*Background*
A TaskManager has two settings to configure its network: {{taskmanager.bind-host}} and {{taskmanager.host}}. 
* {{taskmanager.bind-host}} decides which interfaces the TaskManager binds to on start-up. When not set, it defaults to {{0.0.0.0}} (the most permissive setting). With FLINK-24474, this setting is now set to {{localhost}} via the configuration file. This line is disabled for Docker setups.
* {{taskmanager.host}} specifies what the TaskManager announces as its receiving address (i.e., where it wants to be reached by the JobManager). When not set, the TaskManager cycles through the available interfaces (a slight simplification) and attempts to connect to the JobManager via each interface. It chooses the address of the interface that allows a successful connection. If no connection can be made before a timeout, it chooses the address of the hostname.
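
The selection of the receiving address described above can be sketched roughly as follows. This is an illustration only, not Flink's implementation: the function names, the {{max_rounds}} parameter (a stand-in for the timeout), and the injected {{probe}} callback are all assumptions made for the example.

```python
import socket
from typing import Callable, Iterable


def can_connect_from(local_ip: str, jm_host: str, jm_port: int,
                     timeout_s: float = 1.0) -> bool:
    """Try a TCP connection to the JobManager with local_ip as source address."""
    try:
        with socket.create_connection((jm_host, jm_port), timeout=timeout_s,
                                      source_address=(local_ip, 0)):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unusable source addresses.
        return False


def choose_receiving_address(candidate_ips: Iterable[str],
                             probe: Callable[[str], bool],
                             hostname_ip: str,
                             max_rounds: int = 3) -> str:
    """Return the first candidate address whose probe succeeds.

    Falls back to hostname_ip when no probe succeeds within max_rounds
    passes over the candidates (hypothetical stand-in for the timeout).
    """
    candidates = list(candidate_ips)
    for _ in range(max_rounds):
        for ip in candidates:
            if probe(ip):
                return ip
    # The JobManager was unreachable the whole time: announce the
    # address of the hostname, whatever interface it belongs to.
    return hostname_ip
```

A caller would pass something like {{lambda ip: can_connect_from(ip, jm_host, jm_port)}} as the probe. The fallback branch is the one that matters for this issue: it returns an address without checking that the TaskManager is actually bound to it.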

In network scenarios with multiple interfaces connecting the TaskManager to the JobManager, a user could specify an address and interface with {{taskmanager.bind-host}} that differ from the address and network defined by {{taskmanager.host}}.

*Example*
A straightforward way to create such a scenario is to set {{taskmanager.bind-host}} to {{localhost}}. In this situation (and under the conditions mentioned above), the TaskManager will fall back to the external IP address of the hostname while binding to localhost. This means that the JobManager will be able to receive packets from the TaskManager, but will not be able to respond.
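
The mismatch in this example can be expressed as a small check on the two addresses. The sketch below assumes both values are already IP literals ({{127.0.0.1}} standing in for {{localhost}}); the function name and the sample address {{203.0.113.7}} are hypothetical.

```python
import ipaddress


def binding_accepts(bind_ip: str, announced_ip: str) -> bool:
    """Return True if a socket bound to bind_ip can be reached at announced_ip."""
    bind = ipaddress.ip_address(bind_ip)
    announced = ipaddress.ip_address(announced_ip)
    if bind.version != announced.version:
        # Conservative: this sketch ignores dual-stack sockets.
        return False
    if bind.is_unspecified:
        # A wildcard bind (0.0.0.0 or ::) listens on every local address.
        return True
    # A concrete bind address only receives traffic sent to that address.
    return bind == announced
```

With the values from the example, {{binding_accepts("127.0.0.1", "203.0.113.7")}} is false, while the default wildcard bind {{0.0.0.0}} would accept the same announced address.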

*Proposed solution*
Can we help users debug these scenarios by adding log messages? Could we note that the IP address of the TaskManager is not on the JobManager's network (as defined by the interface the JobManager binds to) and give this as a hint?
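
One possible shape for such a hint, as a minimal sketch: it assumes the JobManager's network is known in CIDR notation, which the issue does not specify how to obtain. The function name, logger name, and message wording are hypothetical.

```python
import ipaddress
import logging
from typing import Optional

log = logging.getLogger("network-hint")  # hypothetical logger name


def network_mismatch_hint(taskmanager_ip: str,
                          jobmanager_network: str) -> Optional[str]:
    """Return a hint if taskmanager_ip lies outside jobmanager_network."""
    address = ipaddress.ip_address(taskmanager_ip)
    # strict=False accepts an interface address such as 10.0.0.5/24.
    network = ipaddress.ip_network(jobmanager_network, strict=False)
    if address in network:
        return None
    return (f"TaskManager address {address} is not on the JobManager's "
            f"network {network}; check taskmanager.host and "
            f"taskmanager.bind-host.")


def log_hint_if_needed(taskmanager_ip: str, jobmanager_network: str) -> None:
    """Emit the hint as a warning when the addresses do not match."""
    hint = network_mismatch_hint(taskmanager_ip, jobmanager_network)
    if hint is not None:
        log.warning(hint)
```

Returning the message separately from logging it keeps the check easy to test. Whether a warning is the right log level is an open question for the actual change.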



--
This message was sent by Atlassian Jira
(v8.20.1#820001)