Posted to user@flink.apache.org by Komal Mariam <ko...@gmail.com> on 2019/09/11 05:13:35 UTC

Problem starting taskexecutor daemons in 3 node cluster

I'm trying to set up a 3 node Flink cluster (version 1.9) on the following
machines:

Node 1 (Master): 4 GB (3.8 GB), Core2 Duo 2.80GHz, Ubuntu 16.04 LTS
Node 2 (Slave): 16 GB, Core i7 3.40GHz, Ubuntu 16.04 LTS
Node 3 (Slave): 16 GB, Core i7 3.40GHz, Ubuntu 16.04 LTS

I have followed the instructions on:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/cluster_setup.html

I have defined the IP/address for "jobmanager.rpc.address" in
conf/flink-conf.yaml in the following format: master@master-node1-hostname

Slaves in conf/slaves:  slave@slave-node2-hostname
                        slave@slave-node3-hostname
                        master@master-node1-hostname (using the master
                        machine for task execution too)
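
For reference, a minimal version of the two files in the plain-hostname
form (a sketch; the hostnames are placeholders and must resolve on every
node, and jobmanager.rpc.address takes just a host or IP, not a user@host
pair):

    # conf/flink-conf.yaml
    jobmanager.rpc.address: master-node1-hostname
    jobmanager.rpc.port: 6123

    # conf/slaves -- one worker host per line
    slave-node2-hostname
    slave-node3-hostname
    master-node1-hostname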


However, my problem is that when running bin/start-cluster.sh on the master
node, it fails to start the taskexecutor daemon on both slave nodes. It only
starts the taskexecutor daemon and standalonesession daemon on
master@master-node1-hostname (Node 1).

I have tried both passwordless SSH and password SSH on all machines, but the
result is the same. In the latter case, it does ask for the
slave@slave-node2-hostname and slave@slave-node3-hostname passwords, but
fails to display any message like "Starting taskexecutor daemon on xxxx"
after that.
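
(For the passwordless setup I used the usual key-based approach; a sketch,
assuming OpenSSH and that the listed accounts exist on each node:)

    # on the master node: generate a key once, then copy it to every
    # host that appears in conf/slaves (including the master itself)
    ssh-keygen -t ed25519
    ssh-copy-id slave@slave-node2-hostname
    ssh-copy-id slave@slave-node3-hostname
    ssh-copy-id master@master-node1-hostname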

I switched my master node to Node 2 and set Node 1 as a slave. It was able to
start taskexecutor daemons on both Node 2 and Node 3 successfully, but did
nothing for Node 1.

I'd appreciate it if you could advise on what the problem here could be and
how I can resolve it.

Best Regards,
Komal

Re: Problem starting taskexecutor daemons in 3 node cluster

Posted by Till Rohrmann <tr...@apache.org>.
SSH access to the nodes and the nodes being able to talk to each other are
separate issues. The former is only used for starting the Flink cluster.
Once the cluster is started, Flink only requires that the nodes can talk to
each other (independent of SSH).
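
(As an illustration, you can even bypass SSH entirely and start the daemons
by hand on each machine; a sketch, assuming the same Flink 1.9 directory
layout on every node:)

    # on the master node
    ./bin/jobmanager.sh start

    # on each worker node; the TaskManager still has to reach
    # jobmanager.rpc.address:6123 over the network to register
    ./bin/taskmanager.sh start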

Cheers,
Till

On Tue, Sep 17, 2019 at 7:39 AM Komal Mariam <ko...@gmail.com> wrote:


Re: Problem starting taskexecutor daemons in 3 node cluster

Posted by Komal Mariam <ko...@gmail.com>.
Hi Till,

Thank you for the reply. I tried to SSH into each of the nodes from every
other node, and they can all connect to each other. It's just that, for some
reason, all the other worker nodes cannot connect to the JobManager on
150.82.218.218:6123 (Node 1).

I got around the problem by putting the master node (JobManager) on Node 2
and making 150.82.218.218 a slave (TaskManager).

Now all nodes, including 150.82.218.218, are showing up in the new
JobManager's UI, and I can see my jobs getting distributed between them too.

For now, all my nodes have password-based SSH. Do you think this issue could
be because I have not set up passwordless SSH? If start-cluster.sh can start
the nodes with password SSH, why is it important to set up passwordless SSH
(aside from convenience)?

Best Regards,
Komal

On Fri, 13 Sep 2019 at 18:31, Till Rohrmann <tr...@apache.org> wrote:


Re: Problem starting taskexecutor daemons in 3 node cluster

Posted by Till Rohrmann <tr...@apache.org>.
Hi Komal,

Could you check that every node can reach the other nodes? It looks a little
bit as if the TaskManager cannot talk to the JobManager running on
150.82.218.218:6123.
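
(A quick way to check; a sketch, assuming standard Linux tooling on the
workers:)

    # run from each worker node; both should succeed if the JobManager
    # at 150.82.218.218 is reachable on its RPC port
    ping -c 3 150.82.218.218
    nc -zv 150.82.218.218 6123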

Cheers,
Till

On Thu, Sep 12, 2019 at 9:30 AM Komal Mariam <ko...@gmail.com> wrote:


Re: Problem starting taskexecutor daemons in 3 node cluster

Posted by Komal Mariam <ko...@gmail.com>.
I managed to fix it; however, I ran into another problem that I would
appreciate help in resolving.

It turns out that the username was different on each of the three nodes.
Giving them all the same username fixed the issue, i.e.:
same_username@slave-node2-hostname
same_username@slave-node3-hostname
same_username@master-node1-hostname

In fact, because the usernames are the same, I can just save them in the
conf files as:
slave-node2-hostname
slave-node3-hostname
master-node1-hostname
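
(If the usernames had to stay different, I believe per-host entries in
~/.ssh/config on the machine running start-cluster.sh would achieve the
same thing; a sketch, untested:)

    # ~/.ssh/config
    Host slave-node2-hostname
        User username_on_node2
    Host slave-node3-hostname
        User username_on_node3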

However, for some reason my worker nodes don't show up in the available
Task Managers in the web UI.

The taskexecutor log says the following:
... (clipped for brevity)
2019-09-12 15:56:36,625 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - --------------------------------------------------------------------------------
2019-09-12 15:56:36,631 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-09-12 15:56:36,647 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Maximum number of open file descriptors is 1048576.
2019-09-12 15:56:36,710 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, 150.82.218.218
2019-09-12 15:56:36,711 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2019-09-12 15:56:36,712 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 1024m
2019-09-12 15:56:36,713 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 1024m
2019-09-12 15:56:36,714 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-09-12 15:56:36,715 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
2019-09-12 15:56:36,717 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.execution.failover-strategy, region
2019-09-12 15:56:37,097 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-09-12 15:56:37,221 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-09-12 15:56:37,305 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-09-12 15:56:38,142 INFO  org.apache.flink.configuration.Configuration                  - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-09-12 15:56:38,169 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to select the network interface and address to use by connecting to the leading JobManager.
2019-09-12 15:56:38,170 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2019-09-12 15:56:38,185 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved new target address /150.82.218.218:6123.
2019-09-12 15:56:39,691 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Trying to connect to address /150.82.218.218:6123
2019-09-12 15:56:39,693 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address 'salman-hpc/127.0.1.1': Invalid argument (connect failed)
2019-09-12 15:56:39,696 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/150.82.219.73': No route to host (Host unreachable)
2019-09-12 15:56:39,698 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
2019-09-12 15:56:39,748 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/150.82.219.73': connect timed out
2019-09-12 15:56:39,750 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%lo': Network is unreachable (connect failed)
2019-09-12 15:56:39,751 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2019-09-12 15:56:39,753 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
"flink-komal-taskexecutor-0-salman-hpc.log" 157L, 29954C

I'd appreciate help regarding the issue.

Best Regards,
Komal


On Wed, 11 Sep 2019 at 14:13, Komal Mariam <ko...@gmail.com> wrote:
