You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Matthias Seiler <ma...@campus.tu-berlin.de> on 2021/03/15 10:12:42 UTC
Re: Netty LocalTransportException: Sending the partition request to
'null' failed
Hi Arvid,
I listened to ports with netcat and connected via telnet and each node
can connect to the other and itself.
The `/etc/hosts` file looks like this
```
127.0.0.1 localhost
127.0.1.1 node-2.example.com node-2
<ip-node-1> node-1
```
Is the second line the reason it fails? I also replaced all hostnames
with IP addresses in the config files (flink-conf, workers, masters) but
without effect...
Do you have any ideas what else I could try?
Thanks again,
Matthias
On 2/24/21 2:17 PM, Arvid Heise wrote:
> Hi Matthias,
>
> most of the debug statements are just noise. You can ignore that.
>
> Something with your network seems fishy to me. Either taskmanager 1
> cannot connect to taskmanager 2 (and vice versa), or the taskmanager
> cannot connect locally.
>
> I found this fragment, which seems suspicious
>
> Failed to connect to /127.0.*1*.1:32797. Giving up.
>
> localhost is usually 127.0.0.1. Can you double check that you connect
> from all machines to all machines (including themselves) by opening
> trivial text sockets on random ports?
>
> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
> <matthias.seiler@campus.tu-berlin.de
> <ma...@campus.tu-berlin.de>> wrote:
>
> Hi Till,
>
> thanks for the hint, you seem about right. Setting the log level
> to DEBUG reveals more information, but I don't know what to do
> about it.
>
> All logs throw some Java related exceptions:
> `java.lang.UnsupportedOperationException: Reflective
> setAccessible(true) disabled`
> and
> `java.lang.IllegalAccessException: class
> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
> cannot access class jdk.internal.misc.Unsafe (in module java.base)
> because module java.base does not export jdk.internal.misc to
> unnamed module`
>
> The log of node-2's TaskManager reveals connection problems:
> `org.apache.flink.runtime.net.ConnectionUtils [] -
> Failed to connect from address 'node-2/127.0.1.1
> <http://127.0.1.1>': Invalid argument (connect failed)`
> `java.net.ConnectException: Invalid argument (connect failed)`
>
> What's more, both TaskManagers (node-1 and node-2) are having
> trouble to load
> `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
> but load some version eventually.
>
>
> There is quite a lot going on here that I don't understand. Can
> you (or someone) shed some light on it and tell me what I could try?
>
> Some more information:
> I appended the following to the `/etc/hosts` file:
> ```
> <ip-node-1> node-1
> <ip-node-2> node-2
> ```
> And the `flink/conf/workers` consists of:
> ```
> node-1
> node-2
> ```
>
> Thanks,
> Matthias
>
> P.S. I attached the logs for further reference. `<ip-node-1>` is
> of course the real IP address instead.
>
>
> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>> Hi Matthias,
>>
>> Can you make sure that node-1 and node-2 can talk to each other?
>> It looks to me that node-2 fails to open a connection to the
>> other TaskManager. Maybe the logs give some more insights. You
>> can change the log level to DEBUG to gather more information.
>>
>> Cheers,
>> Till
>>
>> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>> <matthias.seiler@campus.tu-berlin.de
>> <ma...@campus.tu-berlin.de>> wrote:
>>
>> Hi Everyone,
>>
>> I'm trying to setup a Flink cluster in standealone mode with two
>> machines. However, running a job throws the following exception:
>> `org.apache.flink.runtime.io
>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed`
>>
>> Here is some background:
>>
>> Machines:
>> - node-1: JobManager, TaskManager
>> - node-2: TaskManager
>>
>> flink-conf-yaml looks like this:
>> ```
>> jobmanager.rpc.address: node-1
>> taskmanager.numberOfTaskSlots: 8
>> parallelism.default: 2
>> cluster.evenly-spread-out-slots: true
>> ```
>>
>> Deploying the cluster works: I can see both TaskManagers in
>> the WebUI.
>>
>> I ran the streaming WordCount example: `flink run
>> flink-1.12.1/examples/streaming/WordCount.jar --input
>> lorem-ipsum.txt`
>> - the job has been submitted
>> - job failed (with the above exception)
>> - the log of the node-2 also shows the exception, the other
>> logs are
>> fine (graceful stop)
>>
>> I played around with the config and observed that
>> - if parallelism is set to 1, node-1 gets all the slots and
>> node-2 none
>> - if parallelism is set to 2, each TaskManager occupies 1
>> TaskSlot (but
>> fails because of node-2)
>>
>> I suspect, that the problem must be with the communication
>> between
>> TaskManagers
>> - job runs successful if
>> - node-1 is the **only** node with x TaskManagers (tested
>> with x=1
>> and x=2)
>> - node-2 is the **only** node with x TaskManagers (tested
>> with x=1
>> and x=2)
>> - job fails if
>> - node-1 **and** node-2 have one TaskManager
>>
>> The full exception is:
>> ```
>> org.apache.flink.client.program.ProgramInvocationException:
>> The main
>> method caused an error:
>> org.apache.flink.client.program.ProgramInvocationException:
>> Job failed
>> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>> // ... Job failed, Recovery is suppressed by
>> NoRestartBackoffTimeStrategy, ...
>> Caused by:
>> org.apache.flink.runtime.io
>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed.
>> at
>> org.apache.flink.runtime.io
>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>> at
>> org.apache.flink.runtime.io
>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> at java.base/java.lang.Thread.run(Thread.java:834)
>> Caused by: java.nio.channels.ClosedChannelException
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>> ... 11 more
>> ```
>>
>> Thanks in advance,
>> Matthias
>>
Re: Netty LocalTransportException: Sending the partition request to
'null' failed
Posted by Matthias Seiler <ma...@campus.tu-berlin.de>.
Thanks a bunch! I replaced 127.0.1.1 with the actual IP address and it
works now :)
On 3/15/21 3:22 PM, Robert Metzger wrote:
> Hey Matthias,
>
> are you sure you can connect to 127.0.1.1, since everything between
> 127.0.0.1 and 127.255.255.255 is bound to the loopback device?:
> https://serverfault.com/a/363098 <https://serverfault.com/a/363098>
>
>
>
> On Mon, Mar 15, 2021 at 11:13 AM Matthias Seiler
> <matthias.seiler@campus.tu-berlin.de
> <ma...@campus.tu-berlin.de>> wrote:
>
> Hi Arvid,
>
> I listened to ports with netcat and connected via telnet and each
> node can connect to the other and itself.
>
> The `/etc/hosts` file looks like this
> ```
> 127.0.0.1 localhost
> 127.0.1.1 node-2.example.com <http://node-2.example.com> node-2
>
> <ip-node-1> node-1
> ```
> Is the second line the reason it fails? I also replaced all
> hostnames with IP addresses in the config files (flink-conf,
> workers, masters) but without effect...
>
> Do you have any ideas what else I could try?
>
> Thanks again,
> Matthias
>
> On 2/24/21 2:17 PM, Arvid Heise wrote:
>> Hi Matthias,
>>
>> most of the debug statements are just noise. You can ignore that.
>>
>> Something with your network seems fishy to me. Either taskmanager
>> 1 cannot connect to taskmanager 2 (and vice versa), or the
>> taskmanager cannot connect locally.
>>
>> I found this fragment, which seems suspicious
>>
>> Failed to connect to /127.0.*1*.1:32797. Giving up.
>>
>> localhost is usually 127.0.0.1. Can you double check that you
>> connect from all machines to all machines (including themselves)
>> by opening trivial text sockets on random ports?
>>
>> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
>> <matthias.seiler@campus.tu-berlin.de
>> <ma...@campus.tu-berlin.de>> wrote:
>>
>> Hi Till,
>>
>> thanks for the hint, you seem about right. Setting the log
>> level to DEBUG reveals more information, but I don't know
>> what to do about it.
>>
>> All logs throw some Java related exceptions:
>> `java.lang.UnsupportedOperationException: Reflective
>> setAccessible(true) disabled`
>> and
>> `java.lang.IllegalAccessException: class
>> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>> cannot access class jdk.internal.misc.Unsafe (in module
>> java.base) because module java.base does not export
>> jdk.internal.misc to unnamed module`
>>
>> The log of node-2's TaskManager reveals connection problems:
>> `org.apache.flink.runtime.net.ConnectionUtils
>> [] - Failed to connect from address 'node-2/127.0.1.1
>> <http://127.0.1.1>': Invalid argument (connect failed)`
>> `java.net.ConnectException: Invalid argument (connect failed)`
>>
>> What's more, both TaskManagers (node-1 and node-2) are having
>> trouble to load
>> `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>> but load some version eventually.
>>
>>
>> There is quite a lot going on here that I don't understand.
>> Can you (or someone) shed some light on it and tell me what I
>> could try?
>>
>> Some more information:
>> I appended the following to the `/etc/hosts` file:
>> ```
>> <ip-node-1> node-1
>> <ip-node-2> node-2
>> ```
>> And the `flink/conf/workers` consists of:
>> ```
>> node-1
>> node-2
>> ```
>>
>> Thanks,
>> Matthias
>>
>> P.S. I attached the logs for further reference. `<ip-node-1>`
>> is of course the real IP address instead.
>>
>>
>> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>> Hi Matthias,
>>>
>>> Can you make sure that node-1 and node-2 can talk to each
>>> other? It looks to me that node-2 fails to open a connection
>>> to the other TaskManager. Maybe the logs give some more
>>> insights. You can change the log level to DEBUG to gather
>>> more information.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>>> <matthias.seiler@campus.tu-berlin.de
>>> <ma...@campus.tu-berlin.de>> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I'm trying to setup a Flink cluster in standealone mode
>>> with two
>>> machines. However, running a job throws the following
>>> exception:
>>> `org.apache.flink.runtime.io
>>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>> Sending the partition request to 'null' failed`
>>>
>>> Here is some background:
>>>
>>> Machines:
>>> - node-1: JobManager, TaskManager
>>> - node-2: TaskManager
>>>
>>> flink-conf-yaml looks like this:
>>> ```
>>> jobmanager.rpc.address: node-1
>>> taskmanager.numberOfTaskSlots: 8
>>> parallelism.default: 2
>>> cluster.evenly-spread-out-slots: true
>>> ```
>>>
>>> Deploying the cluster works: I can see both TaskManagers
>>> in the WebUI.
>>>
>>> I ran the streaming WordCount example: `flink run
>>> flink-1.12.1/examples/streaming/WordCount.jar --input
>>> lorem-ipsum.txt`
>>> - the job has been submitted
>>> - job failed (with the above exception)
>>> - the log of the node-2 also shows the exception, the
>>> other logs are
>>> fine (graceful stop)
>>>
>>> I played around with the config and observed that
>>> - if parallelism is set to 1, node-1 gets all the slots
>>> and node-2 none
>>> - if parallelism is set to 2, each TaskManager occupies
>>> 1 TaskSlot (but
>>> fails because of node-2)
>>>
>>> I suspect, that the problem must be with the
>>> communication between
>>> TaskManagers
>>> - job runs successful if
>>> - node-1 is the **only** node with x TaskManagers
>>> (tested with x=1
>>> and x=2)
>>> - node-2 is the **only** node with x TaskManagers
>>> (tested with x=1
>>> and x=2)
>>> - job fails if
>>> - node-1 **and** node-2 have one TaskManager
>>>
>>> The full exception is:
>>> ```
>>> org.apache.flink.client.program.ProgramInvocationException:
>>> The main
>>> method caused an error:
>>> org.apache.flink.client.program.ProgramInvocationException:
>>> Job failed
>>> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>> // ... Job failed, Recovery is suppressed by
>>> NoRestartBackoffTimeStrategy, ...
>>> Caused by:
>>> org.apache.flink.runtime.io
>>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>> Sending the partition request to 'null' failed.
>>> at
>>> org.apache.flink.runtime.io
>>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>> at
>>> org.apache.flink.runtime.io
>>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>> at java.base/java.lang.Thread.run(Thread.java:834)
>>> Caused by: java.nio.channels.ClosedChannelException
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>> ... 11 more
>>> ```
>>>
>>> Thanks in advance,
>>> Matthias
>>>
Re: Netty LocalTransportException: Sending the partition request to
'null' failed
Posted by Robert Metzger <rm...@apache.org>.
Hey Matthias,
are you sure you can connect to 127.0.1.1, since everything between
127.0.0.1 and 127.255.255.255 is bound to the loopback device?:
https://serverfault.com/a/363098
On Mon, Mar 15, 2021 at 11:13 AM Matthias Seiler <
matthias.seiler@campus.tu-berlin.de> wrote:
> Hi Arvid,
>
> I listened to ports with netcat and connected via telnet and each node can
> connect to the other and itself.
>
> The `/etc/hosts` file looks like this
> ```
> 127.0.0.1 localhost
> 127.0.1.1 node-2.example.com node-2
>
> <ip-node-1> node-1
> ```
> Is the second line the reason it fails? I also replaced all hostnames with
> IP addresses in the config files (flink-conf, workers, masters) but without
> effect...
>
> Do you have any ideas what else I could try?
>
> Thanks again,
> Matthias
>
> On 2/24/21 2:17 PM, Arvid Heise wrote:
>
> Hi Matthias,
>
> most of the debug statements are just noise. You can ignore that.
>
> Something with your network seems fishy to me. Either taskmanager 1 cannot
> connect to taskmanager 2 (and vice versa), or the taskmanager cannot
> connect locally.
>
> I found this fragment, which seems suspicious
>
> Failed to connect to /127.0.*1*.1:32797. Giving up.
>
> localhost is usually 127.0.0.1. Can you double check that you connect from
> all machines to all machines (including themselves) by opening trivial text
> sockets on random ports?
>
> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler <
> matthias.seiler@campus.tu-berlin.de> wrote:
>
>> Hi Till,
>>
>> thanks for the hint, you seem about right. Setting the log level to DEBUG
>> reveals more information, but I don't know what to do about it.
>>
>> All logs throw some Java related exceptions:
>> `java.lang.UnsupportedOperationException: Reflective setAccessible(true)
>> disabled`
>> and
>> `java.lang.IllegalAccessException: class
>> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>> cannot access class jdk.internal.misc.Unsafe (in module java.base) because
>> module java.base does not export jdk.internal.misc to unnamed module`
>>
>> The log of node-2's TaskManager reveals connection problems:
>> `org.apache.flink.runtime.net.ConnectionUtils [] - Failed
>> to connect from address 'node-2/127.0.1.1': Invalid argument (connect
>> failed)`
>> `java.net.ConnectException: Invalid argument (connect failed)`
>>
>> What's more, both TaskManagers (node-1 and node-2) are having trouble to
>> load `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>> but load some version eventually.
>>
>>
>> There is quite a lot going on here that I don't understand. Can you (or
>> someone) shed some light on it and tell me what I could try?
>>
>> Some more information:
>> I appended the following to the `/etc/hosts` file:
>> ```
>> <ip-node-1> node-1
>> <ip-node-2> node-2
>> ```
>> And the `flink/conf/workers` consists of:
>> ```
>> node-1
>> node-2
>> ```
>>
>> Thanks,
>> Matthias
>>
>> P.S. I attached the logs for further reference. `<ip-node-1>` is of
>> course the real IP address instead.
>>
>>
>> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>
>> Hi Matthias,
>>
>> Can you make sure that node-1 and node-2 can talk to each other? It looks
>> to me that node-2 fails to open a connection to the other TaskManager.
>> Maybe the logs give some more insights. You can change the log level to
>> DEBUG to gather more information.
>>
>> Cheers,
>> Till
>>
>> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler <
>> matthias.seiler@campus.tu-berlin.de> wrote:
>>
>>> Hi Everyone,
>>>
>>> I'm trying to setup a Flink cluster in standealone mode with two
>>> machines. However, running a job throws the following exception:
>>> `org.apache.flink.runtime.io
>>> .network.netty.exception.LocalTransportException:
>>> Sending the partition request to 'null' failed`
>>>
>>> Here is some background:
>>>
>>> Machines:
>>> - node-1: JobManager, TaskManager
>>> - node-2: TaskManager
>>>
>>> flink-conf-yaml looks like this:
>>> ```
>>> jobmanager.rpc.address: node-1
>>> taskmanager.numberOfTaskSlots: 8
>>> parallelism.default: 2
>>> cluster.evenly-spread-out-slots: true
>>> ```
>>>
>>> Deploying the cluster works: I can see both TaskManagers in the WebUI.
>>>
>>> I ran the streaming WordCount example: `flink run
>>> flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
>>> - the job has been submitted
>>> - job failed (with the above exception)
>>> - the log of the node-2 also shows the exception, the other logs are
>>> fine (graceful stop)
>>>
>>> I played around with the config and observed that
>>> - if parallelism is set to 1, node-1 gets all the slots and node-2 none
>>> - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot (but
>>> fails because of node-2)
>>>
>>> I suspect, that the problem must be with the communication between
>>> TaskManagers
>>> - job runs successful if
>>> - node-1 is the **only** node with x TaskManagers (tested with x=1
>>> and x=2)
>>> - node-2 is the **only** node with x TaskManagers (tested with x=1
>>> and x=2)
>>> - job fails if
>>> - node-1 **and** node-2 have one TaskManager
>>>
>>> The full exception is:
>>> ```
>>> org.apache.flink.client.program.ProgramInvocationException: The main
>>> method caused an error:
>>> org.apache.flink.client.program.ProgramInvocationException: Job failed
>>> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>> // ... Job failed, Recovery is suppressed by
>>> NoRestartBackoffTimeStrategy, ...
>>> Caused by:
>>> org.apache.flink.runtime.io
>>> .network.netty.exception.LocalTransportException:
>>> Sending the partition request to 'null' failed.
>>> at
>>> org.apache.flink.runtime.io
>>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>> at
>>> org.apache.flink.runtime.io
>>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>> at java.base/java.lang.Thread.run(Thread.java:834)
>>> Caused by: java.nio.channels.ClosedChannelException
>>> at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>> ... 11 more
>>> ```
>>>
>>> Thanks in advance,
>>> Matthias
>>>
>>>