Posted to user-zh@flink.apache.org by chanamper <ch...@163.com> on 2020/02/24 02:32:49 UTC

Failure of a Flink job aggregating Kafka data

Hi all, I have a question. A Flink job (version 1.8.0) reads data from Kafka, performs an aggregation, and writes the results back to Kafka. After running for a while, it hits the exception below and then crashes. How can this be resolved? Thanks.

2020-02-19 10:45:45,314 ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue  - Encountered error while consuming partitions
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
        at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
        at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
        at java.lang.Thread.run(Thread.java:748)
2020-02-19 10:45:45,317 INFO  org.apache.kafka.clients.producer.KafkaProducer               - [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
2020-02-19 10:45:45,412 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Memory usage stats: [HEAP: 98/6912/6912 MB, NON HEAP: 81/83/-1 MB (used/committed/max)]

2020-02-19 10:45:45,413 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Direct memory stats: Count: 24596, Total Capacity: 806956211, Used Memory: 806956212

2020-02-19 10:50:31,351 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: aj-flinknode01/9.186.36.80:56983
2020-02-19 10:50:31,351 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@aj-flinknode01:56983] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@aj-flinknode01:56983]] Caused by: [Connection refused: aj-flinknode01/9.186.36.80:56983]
2020-02-19 10:50:55,419 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@aj-flinknode01:45703] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-02-19 10:50:56,370 INFO  org.apache.flink.yarn.YarnResourceManager                     - Closing TaskExecutor connection container_1578492316659_0830_01_000006 because: Container [pid=30031,containerID=container_1578492316659_0830_01_000006] is running beyond physical memory limits. Current usage: 10.0 GB of 10 GB physical memory used; 11.8 GB of 21 GB virtual memory used. Killing container.
Dump of the process-tree for container_1578492316659_0830_01_000006 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 30068 30031 30031 30031 (java) 277668 18972 12626370560 2630988 /data/jdk1.8.0_211/bin/java -Xms6912m -Xmx6912m -XX:MaxDirectMemorySize=3328m -XX:+UseG1GC -Dlog.file=/data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
        |- 30031 30027 30031 30031 (bash) 0 0 11001856 329 /bin/bash -c /data/jdk1.8.0_211/bin/java -Xms6912m -Xmx6912m -XX:MaxDirectMemorySize=3328m -XX:+UseG1GC -Dlog.file=/data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> /data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.out 2> /data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.err

Re: Failure of a Flink job aggregating Kafka data

Posted by zhisheng <zh...@gmail.com>.
The key message in the log is this one:

    Closing TaskExecutor connection container_1578492316659_0830_01_000006
    because: Container [pid=30031,containerID=container_1578492316659_0830_01_000006]
    is running beyond physical memory limits. Current usage: 10.0 GB of 10 GB
    physical memory used; 11.8 GB of 21 GB virtual memory used. Killing container.

The container exceeded its physical memory limit, so YARN killed it.
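The process dump above also shows why the limit is so easy to hit: the JVM is started with -Xmx6912m plus -XX:MaxDirectMemorySize=3328m, and 6912 MB + 3328 MB = 10240 MB, which is exactly the 10 GB container limit, leaving no headroom for metaspace, thread stacks, or other native allocations. On Flink 1.8 one common mitigation is to enlarge the containerized heap cutoff so a smaller share of the container goes to the JVM heap. A sketch (the values are illustrative, not tuned for this job):

```yaml
# flink-conf.yaml — illustrative values, not a verified fix for this job.
# Reserve a larger fraction of each YARN container for non-heap memory
# (direct buffers, metaspace, thread stacks, native allocations).
containerized.heap-cutoff-ratio: 0.35   # Flink 1.8 default: 0.25
containerized.heap-cutoff-min: 1024     # minimum cutoff in MB (default: 600)
```

Alternatively, you could request a larger container while keeping the same heap size (e.g. via -ytm on flink run), which also restores some headroom.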

chanamper <ch...@163.com> wrote on Mon, Feb 24, 2020 at 10:33 AM:

>
> 大家好,请教一下,flink任务读取kafka数据进行聚集操作后将结果写回kafka,flink版本为1.8.0。任务运行一段时间后出现如下异常,之后flink任务异常挂掉,请问一下这个问题该如何解决呢?多谢
>
> 2020-02-19 10:45:45,314 ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue
> - Encountered error while consuming partitions
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at
> org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>         at
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>         at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>         at java.lang.Thread.run(Thread.java:748)
> 2020-02-19 10:45:45,317 INFO
> org.apache.kafka.clients.producer.KafkaProducer               - [Producer
> clientId=producer-1] Closing the Kafka producer with timeoutMillis =
> 9223372036854775807 ms.
> 2020-02-19 10:45:45,412 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Memory
> usage stats: [HEAP: 98/6912/6912 MB, NON HEAP: 81/83/-1 MB
> (used/committed/max)]
>
> 2020-02-19 10:45:45,413 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Direct
> memory stats: Count: 24596, Total Capacity: 806956211, Used Memory:
> 806956212
>
>
>
>
>
>
> 2020-02-19 10:50:31,351 WARN  akka.remote.transport.netty.NettyTransport
>                   - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: aj-flinknode01/
> 9.186.36.80:56983
> 2020-02-19 10:50:31,351 WARN  akka.remote.ReliableDeliverySupervisor
>                   - Association with remote system
> [akka.tcp://flink@aj-flinknode01:56983] has failed, address is now gated
> for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@aj-flinknode01:56983]] Caused by: [Connection refused:
> aj-flinknode01/9.186.36.80:56983]
> 2020-02-19 10:50:55,419 WARN  akka.remote.ReliableDeliverySupervisor
>                   - Association with remote system
> [akka.tcp://flink@aj-flinknode01:45703] has failed, address is now gated
> for [50] ms. Reason: [Disassociated]
> 2020-02-19 10:50:56,370 INFO  org.apache.flink.yarn.YarnResourceManager
>                  - Closing TaskExecutor connection
> container_1578492316659_0830_01_000006 because: Container
> [pid=30031,containerID=container_1578492316659_0830_01_000006] is running
> beyond physical memory limits. Current usage: 10.0 GB of 10 GB physical
> memory used; 11.8 GB of 21 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1578492316659_0830_01_000006 :
>         |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>         |- 30068 30031 30031 30031 (java) 277668 18972 12626370560 2630988
> /data/jdk1.8.0_211/bin/java -Xms6912m -Xmx6912m
> -XX:MaxDirectMemorySize=3328m -XX:+UseG1GC
> -Dlog.file=/data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.log
> -Dlogback.configurationFile=file:./logback.xml
> -Dlog4j.configuration=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
>         |- 30031 30027 30031 30031 (bash) 0 0 11001856 329 /bin/bash -c
> /data/jdk1.8.0_211/bin/java -Xms6912m -Xmx6912m
> -XX:MaxDirectMemorySize=3328m -XX:+UseG1GC
> -Dlog.file=/data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.log
> -Dlogback.configurationFile=file:./logback.xml
> -Dlog4j.configuration=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1>
> /data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.out
> 2>
> /data/hadoop-2.6.0-cdh5.16.1/logs/userlogs/application_1578492316659_0830/container_1578492316659_0830_01_000006/taskmanager.err