You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Matthias Seiler <ma...@campus.tu-berlin.de> on 2021/02/16 09:57:03 UTC

Netty LocalTransportException: Sending the partition request to 'null' failed

Hi Everyone,

I'm trying to setup a Flink cluster in standealone mode with two
machines. However, running a job throws the following exception:
`org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
Sending the partition request to 'null' failed`

Here is some background:

Machines:
- node-1: JobManager, TaskManager
- node-2: TaskManager

flink-conf-yaml looks like this:
```
jobmanager.rpc.address: node-1
taskmanager.numberOfTaskSlots: 8
parallelism.default: 2
cluster.evenly-spread-out-slots: true
```

Deploying the cluster works: I can see both TaskManagers in the WebUI.

I ran the streaming WordCount example: `flink run
flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
- the job has been submitted
- job failed (with the above exception)
- the log of the node-2 also shows the exception, the other logs are
fine (graceful stop)

I played around with the config and observed that
- if parallelism is set to 1, node-1 gets all the slots and node-2 none
- if parallelism is set to 2, each TaskManager occupies 1 TaskSlot (but
fails because of node-2)

I suspect, that the problem must be with the communication between
TaskManagers
- job runs successful if
    - node-1 is the **only** node with x TaskManagers (tested with x=1
and x=2)
    - node-2 is the **only** node with x TaskManagers (tested with x=1
and x=2)
- job fails if
    - node-1 **and** node-2 have one TaskManager

The full exception is:
```
org.apache.flink.client.program.ProgramInvocationException: The main
method caused an error:
org.apache.flink.client.program.ProgramInvocationException: Job failed
(JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
// ... Job failed, Recovery is suppressed by
NoRestartBackoffTimeStrategy, ...
Caused by:
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
Sending the partition request to 'null' failed.
    at
org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
    at
org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
    at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
    at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
    at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
    at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.nio.channels.ClosedChannelException
    at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
    ... 11 more
```

Thanks in advance,
Matthias


Re: Netty LocalTransportException: Sending the partition request to 'null' failed

Posted by Matthias Seiler <ma...@campus.tu-berlin.de>.
Thanks a bunch! I replaced 127.0.1.1 with the actual IP address and it
works now :)

On 3/15/21 3:22 PM, Robert Metzger wrote:
> Hey Matthias,
>
> are you sure you can connect to 127.0.1.1, since everything between
> 127.0.0.1 and  127.255.255.255 is bound to the loopback device?:
> https://serverfault.com/a/363098 <https://serverfault.com/a/363098>
>
>
>
> On Mon, Mar 15, 2021 at 11:13 AM Matthias Seiler
> <matthias.seiler@campus.tu-berlin.de
> <ma...@campus.tu-berlin.de>> wrote:
>
>     Hi Arvid,
>
>     I listened to ports with netcat and connected via telnet and each
>     node can connect to the other and itself.
>
>     The `/etc/hosts` file looks like this
>     ```
>     127.0.0.1   localhost
>     127.0.1.1   node-2.example.com <http://node-2.example.com>   node-2
>
>     <ip-node-1>   node-1
>     ```
>     Is the second line the reason it fails? I also replaced all
>     hostnames with IP addresses in the config files (flink-conf,
>     workers, masters) but without effect...
>
>     Do you have any ideas what else I could try?
>
>     Thanks again,
>     Matthias
>
>     On 2/24/21 2:17 PM, Arvid Heise wrote:
>>     Hi Matthias,
>>
>>     most of the debug statements are just noise. You can ignore that.
>>
>>     Something with your network seems fishy to me. Either taskmanager
>>     1 cannot connect to taskmanager 2 (and vice versa), or the
>>     taskmanager cannot connect locally.
>>
>>     I found this fragment, which seems suspicious
>>
>>     Failed to connect to /127.0.*1*.1:32797. Giving up.
>>
>>     localhost is usually 127.0.0.1. Can you double check that you
>>     connect from all machines to all machines (including themselves)
>>     by opening trivial text sockets on random ports?
>>
>>     On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
>>     <matthias.seiler@campus.tu-berlin.de
>>     <ma...@campus.tu-berlin.de>> wrote:
>>
>>         Hi Till,
>>
>>         thanks for the hint, you seem about right. Setting the log
>>         level to DEBUG reveals more information, but I don't know
>>         what to do about it.
>>
>>         All logs throw some Java related exceptions:
>>         `java.lang.UnsupportedOperationException: Reflective
>>         setAccessible(true) disabled`
>>         and
>>         `java.lang.IllegalAccessException: class
>>         org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>>         cannot access class jdk.internal.misc.Unsafe (in module
>>         java.base) because module java.base does not export
>>         jdk.internal.misc to unnamed module`
>>
>>         The log of node-2's TaskManager reveals connection problems:
>>         `org.apache.flink.runtime.net.ConnectionUtils                
>>         [] - Failed to connect from address 'node-2/127.0.1.1
>>         <http://127.0.1.1>': Invalid argument (connect failed)`
>>         `java.net.ConnectException: Invalid argument (connect failed)`
>>
>>         What's more, both TaskManagers (node-1 and node-2) are having
>>         trouble to load
>>         `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>>         but load some version eventually.
>>
>>
>>         There is quite a lot going on here that I don't understand.
>>         Can you (or someone) shed some light on it and tell me what I
>>         could try?
>>
>>         Some more information:
>>         I appended the following to the `/etc/hosts` file:
>>         ```
>>         <ip-node-1> node-1
>>         <ip-node-2> node-2
>>         ```
>>         And the `flink/conf/workers` consists of:
>>         ```
>>         node-1
>>         node-2
>>         ```
>>
>>         Thanks,
>>         Matthias
>>
>>         P.S. I attached the logs for further reference. `<ip-node-1>`
>>         is of course the real IP address instead.
>>
>>
>>         On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>>         Hi Matthias,
>>>
>>>         Can you make sure that node-1 and node-2 can talk to each
>>>         other? It looks to me that node-2 fails to open a connection
>>>         to the other TaskManager. Maybe the logs give some more
>>>         insights. You can change the log level to DEBUG to gather
>>>         more information.
>>>
>>>         Cheers,
>>>         Till
>>>
>>>         On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>>>         <matthias.seiler@campus.tu-berlin.de
>>>         <ma...@campus.tu-berlin.de>> wrote:
>>>
>>>             Hi Everyone,
>>>
>>>             I'm trying to setup a Flink cluster in standealone mode
>>>             with two
>>>             machines. However, running a job throws the following
>>>             exception:
>>>             `org.apache.flink.runtime.io
>>>             <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>>             Sending the partition request to 'null' failed`
>>>
>>>             Here is some background:
>>>
>>>             Machines:
>>>             - node-1: JobManager, TaskManager
>>>             - node-2: TaskManager
>>>
>>>             flink-conf-yaml looks like this:
>>>             ```
>>>             jobmanager.rpc.address: node-1
>>>             taskmanager.numberOfTaskSlots: 8
>>>             parallelism.default: 2
>>>             cluster.evenly-spread-out-slots: true
>>>             ```
>>>
>>>             Deploying the cluster works: I can see both TaskManagers
>>>             in the WebUI.
>>>
>>>             I ran the streaming WordCount example: `flink run
>>>             flink-1.12.1/examples/streaming/WordCount.jar --input
>>>             lorem-ipsum.txt`
>>>             - the job has been submitted
>>>             - job failed (with the above exception)
>>>             - the log of the node-2 also shows the exception, the
>>>             other logs are
>>>             fine (graceful stop)
>>>
>>>             I played around with the config and observed that
>>>             - if parallelism is set to 1, node-1 gets all the slots
>>>             and node-2 none
>>>             - if parallelism is set to 2, each TaskManager occupies
>>>             1 TaskSlot (but
>>>             fails because of node-2)
>>>
>>>             I suspect, that the problem must be with the
>>>             communication between
>>>             TaskManagers
>>>             - job runs successful if
>>>                 - node-1 is the **only** node with x TaskManagers
>>>             (tested with x=1
>>>             and x=2)
>>>                 - node-2 is the **only** node with x TaskManagers
>>>             (tested with x=1
>>>             and x=2)
>>>             - job fails if
>>>                 - node-1 **and** node-2 have one TaskManager
>>>
>>>             The full exception is:
>>>             ```
>>>             org.apache.flink.client.program.ProgramInvocationException:
>>>             The main
>>>             method caused an error:
>>>             org.apache.flink.client.program.ProgramInvocationException:
>>>             Job failed
>>>             (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>>             // ... Job failed, Recovery is suppressed by
>>>             NoRestartBackoffTimeStrategy, ...
>>>             Caused by:
>>>             org.apache.flink.runtime.io
>>>             <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>>             Sending the partition request to 'null' failed.
>>>                 at
>>>             org.apache.flink.runtime.io
>>>             <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>>                 at
>>>             org.apache.flink.runtime.io
>>>             <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>                 at java.base/java.lang.Thread.run(Thread.java:834)
>>>             Caused by: java.nio.channels.ClosedChannelException
>>>                 at
>>>             org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>>                 ... 11 more
>>>             ```
>>>
>>>             Thanks in advance,
>>>             Matthias
>>>

Re: Netty LocalTransportException: Sending the partition request to 'null' failed

Posted by Robert Metzger <rm...@apache.org>.
Hey Matthias,

are you sure you can connect to 127.0.1.1, since everything between
127.0.0.1 and  127.255.255.255 is bound to the loopback device?:
https://serverfault.com/a/363098



On Mon, Mar 15, 2021 at 11:13 AM Matthias Seiler <
matthias.seiler@campus.tu-berlin.de> wrote:

> Hi Arvid,
>
> I listened to ports with netcat and connected via telnet and each node can
> connect to the other and itself.
>
> The `/etc/hosts` file looks like this
> ```
> 127.0.0.1   localhost
> 127.0.1.1   node-2.example.com   node-2
>
> <ip-node-1>   node-1
> ```
> Is the second line the reason it fails? I also replaced all hostnames with
> IP addresses in the config files (flink-conf, workers, masters) but without
> effect...
>
> Do you have any ideas what else I could try?
>
> Thanks again,
> Matthias
>
> On 2/24/21 2:17 PM, Arvid Heise wrote:
>
> Hi Matthias,
>
> most of the debug statements are just noise. You can ignore that.
>
> Something with your network seems fishy to me. Either taskmanager 1 cannot
> connect to taskmanager 2 (and vice versa), or the taskmanager cannot
> connect locally.
>
> I found this fragment, which seems suspicious
>
> Failed to connect to /127.0.*1*.1:32797. Giving up.
>
> localhost is usually 127.0.0.1. Can you double check that you connect from
> all machines to all machines (including themselves) by opening trivial text
> sockets on random ports?
>
> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler <
> matthias.seiler@campus.tu-berlin.de> wrote:
>
>> Hi Till,
>>
>> thanks for the hint, you seem about right. Setting the log level to DEBUG
>> reveals more information, but I don't know what to do about it.
>>
>> All logs throw some Java related exceptions:
>> `java.lang.UnsupportedOperationException: Reflective setAccessible(true)
>> disabled`
>> and
>> `java.lang.IllegalAccessException: class
>> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>> cannot access class jdk.internal.misc.Unsafe (in module java.base) because
>> module java.base does not export jdk.internal.misc to unnamed module`
>>
>> The log of node-2's TaskManager reveals connection problems:
>> `org.apache.flink.runtime.net.ConnectionUtils                 [] - Failed
>> to connect from address 'node-2/127.0.1.1': Invalid argument (connect
>> failed)`
>> `java.net.ConnectException: Invalid argument (connect failed)`
>>
>> What's more, both TaskManagers (node-1 and node-2) are having trouble to
>> load `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>> but load some version eventually.
>>
>>
>> There is quite a lot going on here that I don't understand. Can you (or
>> someone) shed some light on it and tell me what I could try?
>>
>> Some more information:
>> I appended the following to the `/etc/hosts` file:
>> ```
>> <ip-node-1> node-1
>> <ip-node-2> node-2
>> ```
>> And the `flink/conf/workers` consists of:
>> ```
>> node-1
>> node-2
>> ```
>>
>> Thanks,
>> Matthias
>>
>> P.S. I attached the logs for further reference. `<ip-node-1>` is of
>> course the real IP address instead.
>>
>>
>> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>
>> Hi Matthias,
>>
>> Can you make sure that node-1 and node-2 can talk to each other? It looks
>> to me that node-2 fails to open a connection to the other TaskManager.
>> Maybe the logs give some more insights. You can change the log level to
>> DEBUG to gather more information.
>>
>> Cheers,
>> Till
>>
>> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler <
>> matthias.seiler@campus.tu-berlin.de> wrote:
>>
>>> Hi Everyone,
>>>
>>> I'm trying to setup a Flink cluster in standealone mode with two
>>> machines. However, running a job throws the following exception:
>>> `org.apache.flink.runtime.io
>>> .network.netty.exception.LocalTransportException:
>>> Sending the partition request to 'null' failed`
>>>
>>> Here is some background:
>>>
>>> Machines:
>>> - node-1: JobManager, TaskManager
>>> - node-2: TaskManager
>>>
>>> flink-conf-yaml looks like this:
>>> ```
>>> jobmanager.rpc.address: node-1
>>> taskmanager.numberOfTaskSlots: 8
>>> parallelism.default: 2
>>> cluster.evenly-spread-out-slots: true
>>> ```
>>>
>>> Deploying the cluster works: I can see both TaskManagers in the WebUI.
>>>
>>> I ran the streaming WordCount example: `flink run
>>> flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
>>> - the job has been submitted
>>> - job failed (with the above exception)
>>> - the log of the node-2 also shows the exception, the other logs are
>>> fine (graceful stop)
>>>
>>> I played around with the config and observed that
>>> - if parallelism is set to 1, node-1 gets all the slots and node-2 none
>>> - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot (but
>>> fails because of node-2)
>>>
>>> I suspect, that the problem must be with the communication between
>>> TaskManagers
>>> - job runs successful if
>>>     - node-1 is the **only** node with x TaskManagers (tested with x=1
>>> and x=2)
>>>     - node-2 is the **only** node with x TaskManagers (tested with x=1
>>> and x=2)
>>> - job fails if
>>>     - node-1 **and** node-2 have one TaskManager
>>>
>>> The full exception is:
>>> ```
>>> org.apache.flink.client.program.ProgramInvocationException: The main
>>> method caused an error:
>>> org.apache.flink.client.program.ProgramInvocationException: Job failed
>>> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>> // ... Job failed, Recovery is suppressed by
>>> NoRestartBackoffTimeStrategy, ...
>>> Caused by:
>>> org.apache.flink.runtime.io
>>> .network.netty.exception.LocalTransportException:
>>> Sending the partition request to 'null' failed.
>>>     at
>>> org.apache.flink.runtime.io
>>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>>     at
>>> org.apache.flink.runtime.io
>>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>     at java.base/java.lang.Thread.run(Thread.java:834)
>>> Caused by: java.nio.channels.ClosedChannelException
>>>     at
>>>
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>>     ... 11 more
>>> ```
>>>
>>> Thanks in advance,
>>> Matthias
>>>
>>>

Re: Netty LocalTransportException: Sending the partition request to 'null' failed

Posted by Matthias Seiler <ma...@campus.tu-berlin.de>.
Hi Arvid,

I listened to ports with netcat and connected via telnet and each node
can connect to the other and itself.

The `/etc/hosts` file looks like this
```
127.0.0.1   localhost
127.0.1.1   node-2.example.com   node-2

<ip-node-1>   node-1
```
Is the second line the reason it fails? I also replaced all hostnames
with IP addresses in the config files (flink-conf, workers, masters) but
without effect...

Do you have any ideas what else I could try?

Thanks again,
Matthias

On 2/24/21 2:17 PM, Arvid Heise wrote:
> Hi Matthias,
>
> most of the debug statements are just noise. You can ignore that.
>
> Something with your network seems fishy to me. Either taskmanager 1
> cannot connect to taskmanager 2 (and vice versa), or the taskmanager
> cannot connect locally.
>
> I found this fragment, which seems suspicious
>
> Failed to connect to /127.0.*1*.1:32797. Giving up.
>
> localhost is usually 127.0.0.1. Can you double check that you connect
> from all machines to all machines (including themselves) by opening
> trivial text sockets on random ports?
>
> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
> <matthias.seiler@campus.tu-berlin.de
> <ma...@campus.tu-berlin.de>> wrote:
>
>     Hi Till,
>
>     thanks for the hint, you seem about right. Setting the log level
>     to DEBUG reveals more information, but I don't know what to do
>     about it.
>
>     All logs throw some Java related exceptions:
>     `java.lang.UnsupportedOperationException: Reflective
>     setAccessible(true) disabled`
>     and
>     `java.lang.IllegalAccessException: class
>     org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>     cannot access class jdk.internal.misc.Unsafe (in module java.base)
>     because module java.base does not export jdk.internal.misc to
>     unnamed module`
>
>     The log of node-2's TaskManager reveals connection problems:
>     `org.apache.flink.runtime.net.ConnectionUtils                 [] -
>     Failed to connect from address 'node-2/127.0.1.1
>     <http://127.0.1.1>': Invalid argument (connect failed)`
>     `java.net.ConnectException: Invalid argument (connect failed)`
>
>     What's more, both TaskManagers (node-1 and node-2) are having
>     trouble to load
>     `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>     but load some version eventually.
>
>
>     There is quite a lot going on here that I don't understand. Can
>     you (or someone) shed some light on it and tell me what I could try?
>
>     Some more information:
>     I appended the following to the `/etc/hosts` file:
>     ```
>     <ip-node-1> node-1
>     <ip-node-2> node-2
>     ```
>     And the `flink/conf/workers` consists of:
>     ```
>     node-1
>     node-2
>     ```
>
>     Thanks,
>     Matthias
>
>     P.S. I attached the logs for further reference. `<ip-node-1>` is
>     of course the real IP address instead.
>
>
>     On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>     Hi Matthias,
>>
>>     Can you make sure that node-1 and node-2 can talk to each other?
>>     It looks to me that node-2 fails to open a connection to the
>>     other TaskManager. Maybe the logs give some more insights. You
>>     can change the log level to DEBUG to gather more information.
>>
>>     Cheers,
>>     Till
>>
>>     On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>>     <matthias.seiler@campus.tu-berlin.de
>>     <ma...@campus.tu-berlin.de>> wrote:
>>
>>         Hi Everyone,
>>
>>         I'm trying to setup a Flink cluster in standealone mode with two
>>         machines. However, running a job throws the following exception:
>>         `org.apache.flink.runtime.io
>>         <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>         Sending the partition request to 'null' failed`
>>
>>         Here is some background:
>>
>>         Machines:
>>         - node-1: JobManager, TaskManager
>>         - node-2: TaskManager
>>
>>         flink-conf-yaml looks like this:
>>         ```
>>         jobmanager.rpc.address: node-1
>>         taskmanager.numberOfTaskSlots: 8
>>         parallelism.default: 2
>>         cluster.evenly-spread-out-slots: true
>>         ```
>>
>>         Deploying the cluster works: I can see both TaskManagers in
>>         the WebUI.
>>
>>         I ran the streaming WordCount example: `flink run
>>         flink-1.12.1/examples/streaming/WordCount.jar --input
>>         lorem-ipsum.txt`
>>         - the job has been submitted
>>         - job failed (with the above exception)
>>         - the log of the node-2 also shows the exception, the other
>>         logs are
>>         fine (graceful stop)
>>
>>         I played around with the config and observed that
>>         - if parallelism is set to 1, node-1 gets all the slots and
>>         node-2 none
>>         - if parallelism is set to 2, each TaskManager occupies 1
>>         TaskSlot (but
>>         fails because of node-2)
>>
>>         I suspect, that the problem must be with the communication
>>         between
>>         TaskManagers
>>         - job runs successful if
>>             - node-1 is the **only** node with x TaskManagers (tested
>>         with x=1
>>         and x=2)
>>             - node-2 is the **only** node with x TaskManagers (tested
>>         with x=1
>>         and x=2)
>>         - job fails if
>>             - node-1 **and** node-2 have one TaskManager
>>
>>         The full exception is:
>>         ```
>>         org.apache.flink.client.program.ProgramInvocationException:
>>         The main
>>         method caused an error:
>>         org.apache.flink.client.program.ProgramInvocationException:
>>         Job failed
>>         (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>         // ... Job failed, Recovery is suppressed by
>>         NoRestartBackoffTimeStrategy, ...
>>         Caused by:
>>         org.apache.flink.runtime.io
>>         <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>         Sending the partition request to 'null' failed.
>>             at
>>         org.apache.flink.runtime.io
>>         <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>             at
>>         org.apache.flink.runtime.io
>>         <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>             at java.base/java.lang.Thread.run(Thread.java:834)
>>         Caused by: java.nio.channels.ClosedChannelException
>>             at
>>         org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>             ... 11 more
>>         ```
>>
>>         Thanks in advance,
>>         Matthias
>>

Re: Netty LocalTransportException: Sending the partition request to 'null' failed

Posted by Arvid Heise <ar...@apache.org>.
Hi Matthias,

most of the debug statements are just noise. You can ignore that.

Something with your network seems fishy to me. Either taskmanager 1 cannot
connect to taskmanager 2 (and vice versa), or the taskmanager cannot
connect locally.

I found this fragment, which seems suspicious

Failed to connect to /127.0.*1*.1:32797. Giving up.

localhost is usually 127.0.0.1. Can you double check that you connect from
all machines to all machines (including themselves) by opening trivial text
sockets on random ports?

On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler <
matthias.seiler@campus.tu-berlin.de> wrote:

> Hi Till,
>
> thanks for the hint, you seem about right. Setting the log level to DEBUG
> reveals more information, but I don't know what to do about it.
>
> All logs throw some Java related exceptions:
> `java.lang.UnsupportedOperationException: Reflective setAccessible(true)
> disabled`
> and
> `java.lang.IllegalAccessException: class
> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
> cannot access class jdk.internal.misc.Unsafe (in module java.base) because
> module java.base does not export jdk.internal.misc to unnamed module`
>
> The log of node-2's TaskManager reveals connection problems:
> `org.apache.flink.runtime.net.ConnectionUtils                 [] - Failed
> to connect from address 'node-2/127.0.1.1': Invalid argument (connect
> failed)`
> `java.net.ConnectException: Invalid argument (connect failed)`
>
> What's more, both TaskManagers (node-1 and node-2) are having trouble to
> load `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
> but load some version eventually.
>
>
> There is quite a lot going on here that I don't understand. Can you (or
> someone) shed some light on it and tell me what I could try?
>
> Some more information:
> I appended the following to the `/etc/hosts` file:
> ```
> <ip-node-1> node-1
> <ip-node-2> node-2
> ```
> And the `flink/conf/workers` consists of:
> ```
> node-1
> node-2
> ```
>
> Thanks,
> Matthias
>
> P.S. I attached the logs for further reference. `<ip-node-1>` is of course
> the real IP address instead.
>
>
> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>
> Hi Matthias,
>
> Can you make sure that node-1 and node-2 can talk to each other? It looks
> to me that node-2 fails to open a connection to the other TaskManager.
> Maybe the logs give some more insights. You can change the log level to
> DEBUG to gather more information.
>
> Cheers,
> Till
>
> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler <
> matthias.seiler@campus.tu-berlin.de> wrote:
>
>> Hi Everyone,
>>
>> I'm trying to setup a Flink cluster in standealone mode with two
>> machines. However, running a job throws the following exception:
>> `org.apache.flink.runtime.io
>> .network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed`
>>
>> Here is some background:
>>
>> Machines:
>> - node-1: JobManager, TaskManager
>> - node-2: TaskManager
>>
>> flink-conf-yaml looks like this:
>> ```
>> jobmanager.rpc.address: node-1
>> taskmanager.numberOfTaskSlots: 8
>> parallelism.default: 2
>> cluster.evenly-spread-out-slots: true
>> ```
>>
>> Deploying the cluster works: I can see both TaskManagers in the WebUI.
>>
>> I ran the streaming WordCount example: `flink run
>> flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
>> - the job has been submitted
>> - job failed (with the above exception)
>> - the log of the node-2 also shows the exception, the other logs are
>> fine (graceful stop)
>>
>> I played around with the config and observed that
>> - if parallelism is set to 1, node-1 gets all the slots and node-2 none
>> - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot (but
>> fails because of node-2)
>>
>> I suspect, that the problem must be with the communication between
>> TaskManagers
>> - job runs successful if
>>     - node-1 is the **only** node with x TaskManagers (tested with x=1
>> and x=2)
>>     - node-2 is the **only** node with x TaskManagers (tested with x=1
>> and x=2)
>> - job fails if
>>     - node-1 **and** node-2 have one TaskManager
>>
>> The full exception is:
>> ```
>> org.apache.flink.client.program.ProgramInvocationException: The main
>> method caused an error:
>> org.apache.flink.client.program.ProgramInvocationException: Job failed
>> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>> // ... Job failed, Recovery is suppressed by
>> NoRestartBackoffTimeStrategy, ...
>> Caused by:
>> org.apache.flink.runtime.io
>> .network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed.
>>     at
>> org.apache.flink.runtime.io
>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>     at
>> org.apache.flink.runtime.io
>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>     at java.base/java.lang.Thread.run(Thread.java:834)
>> Caused by: java.nio.channels.ClosedChannelException
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>     ... 11 more
>> ```
>>
>> Thanks in advance,
>> Matthias
>>
>>

Re: Netty LocalTransportException: Sending the partition request to 'null' failed

Posted by Matthias Seiler <ma...@campus.tu-berlin.de>.
Hi Till,

thanks for the hint, you seem about right. Setting the log level to
DEBUG reveals more information, but I don't know what to do about it.

All logs throw some Java related exceptions:
`java.lang.UnsupportedOperationException: Reflective setAccessible(true)
disabled`
and
`java.lang.IllegalAccessException: class
org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
cannot access class jdk.internal.misc.Unsafe (in module java.base)
because module java.base does not export jdk.internal.misc to unnamed
module`

The log of node-2's TaskManager reveals connection problems:
`org.apache.flink.runtime.net.ConnectionUtils                 [] -
Failed to connect from address 'node-2/127.0.1.1': Invalid argument
(connect failed)`
`java.net.ConnectException: Invalid argument (connect failed)`

What's more, both TaskManagers (node-1 and node-2) are having trouble to
load
`org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
but load some version eventually.


There is quite a lot going on here that I don't understand. Can you (or
someone) shed some light on it and tell me what I could try?

Some more information:
I appended the following to the `/etc/hosts` file:
```
<ip-node-1> node-1
<ip-node-2> node-2
```
And the `flink/conf/workers` consists of:
```
node-1
node-2
```

Thanks,
Matthias

P.S. I attached the logs for further reference. `<ip-node-1>` is of
course the real IP address instead.


On 2/16/21 1:56 PM, Till Rohrmann wrote:
> Hi Matthias,
>
> Can you make sure that node-1 and node-2 can talk to each other? It
> looks to me that node-2 fails to open a connection to the other
> TaskManager. Maybe the logs give some more insights. You can change
> the log level to DEBUG to gather more information.
>
> Cheers,
> Till
>
> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
> <matthias.seiler@campus.tu-berlin.de
> <ma...@campus.tu-berlin.de>> wrote:
>
>     Hi Everyone,
>
>     I'm trying to setup a Flink cluster in standealone mode with two
>     machines. However, running a job throws the following exception:
>     `org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>     Sending the partition request to 'null' failed`
>
>     Here is some background:
>
>     Machines:
>     - node-1: JobManager, TaskManager
>     - node-2: TaskManager
>
>     flink-conf-yaml looks like this:
>     ```
>     jobmanager.rpc.address: node-1
>     taskmanager.numberOfTaskSlots: 8
>     parallelism.default: 2
>     cluster.evenly-spread-out-slots: true
>     ```
>
>     Deploying the cluster works: I can see both TaskManagers in the WebUI.
>
>     I ran the streaming WordCount example: `flink run
>     flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
>     - the job has been submitted
>     - job failed (with the above exception)
>     - the log of the node-2 also shows the exception, the other logs are
>     fine (graceful stop)
>
>     I played around with the config and observed that
>     - if parallelism is set to 1, node-1 gets all the slots and node-2
>     none
>     - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot
>     (but
>     fails because of node-2)
>
>     I suspect, that the problem must be with the communication between
>     TaskManagers
>     - job runs successful if
>         - node-1 is the **only** node with x TaskManagers (tested with x=1
>     and x=2)
>         - node-2 is the **only** node with x TaskManagers (tested with x=1
>     and x=2)
>     - job fails if
>         - node-1 **and** node-2 have one TaskManager
>
>     The full exception is:
>     ```
>     org.apache.flink.client.program.ProgramInvocationException: The main
>     method caused an error:
>     org.apache.flink.client.program.ProgramInvocationException: Job failed
>     (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>     // ... Job failed, Recovery is suppressed by
>     NoRestartBackoffTimeStrategy, ...
>     Caused by:
>     org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>     Sending the partition request to 'null' failed.
>         at
>     org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>         at
>     org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at
>     org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at java.base/java.lang.Thread.run(Thread.java:834)
>     Caused by: java.nio.channels.ClosedChannelException
>         at
>     org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>         ... 11 more
>     ```
>
>     Thanks in advance,
>     Matthias
>

Re: Netty LocalTransportException: Sending the partition request to 'null' failed

Posted by Till Rohrmann <tr...@apache.org>.
Hi Matthias,

Can you make sure that node-1 and node-2 can talk to each other? It looks
to me that node-2 fails to open a connection to the other TaskManager.
Maybe the logs give some more insights. You can change the log level to
DEBUG to gather more information.

Cheers,
Till

On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler <
matthias.seiler@campus.tu-berlin.de> wrote:

> Hi Everyone,
>
> I'm trying to setup a Flink cluster in standealone mode with two
> machines. However, running a job throws the following exception:
> `org.apache.flink.runtime.io
> .network.netty.exception.LocalTransportException:
> Sending the partition request to 'null' failed`
>
> Here is some background:
>
> Machines:
> - node-1: JobManager, TaskManager
> - node-2: TaskManager
>
> flink-conf-yaml looks like this:
> ```
> jobmanager.rpc.address: node-1
> taskmanager.numberOfTaskSlots: 8
> parallelism.default: 2
> cluster.evenly-spread-out-slots: true
> ```
>
> Deploying the cluster works: I can see both TaskManagers in the WebUI.
>
> I ran the streaming WordCount example: `flink run
> flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
> - the job has been submitted
> - job failed (with the above exception)
> - the log of the node-2 also shows the exception, the other logs are
> fine (graceful stop)
>
> I played around with the config and observed that
> - if parallelism is set to 1, node-1 gets all the slots and node-2 none
> - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot (but
> fails because of node-2)
>
> I suspect, that the problem must be with the communication between
> TaskManagers
> - job runs successful if
>     - node-1 is the **only** node with x TaskManagers (tested with x=1
> and x=2)
>     - node-2 is the **only** node with x TaskManagers (tested with x=1
> and x=2)
> - job fails if
>     - node-1 **and** node-2 have one TaskManager
>
> The full exception is:
> ```
> org.apache.flink.client.program.ProgramInvocationException: The main
> method caused an error:
> org.apache.flink.client.program.ProgramInvocationException: Job failed
> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
> // ... Job failed, Recovery is suppressed by
> NoRestartBackoffTimeStrategy, ...
> Caused by:
> org.apache.flink.runtime.io
> .network.netty.exception.LocalTransportException:
> Sending the partition request to 'null' failed.
>     at
> org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>     at
> org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at
>
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.nio.channels.ClosedChannelException
>     at
>
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>     ... 11 more
> ```
>
> Thanks in advance,
> Matthias
>
>