Posted to issues@flink.apache.org by "liuchenhong (Jira)" <ji...@apache.org> on 2022/03/11 09:14:00 UTC

[jira] [Created] (FLINK-26602) The Rocksdb task failed savepoint, and then checkpoint failed several times

liuchenhong created FLINK-26602:
-----------------------------------

             Summary: The Rocksdb task failed savepoint, and then checkpoint failed several times
                 Key: FLINK-26602
                 URL: https://issues.apache.org/jira/browse/FLINK-26602
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.11.2
            Reporter: liuchenhong


The RocksDB task failed a savepoint (2022-03-10 19:55:**), and then checkpoints failed several times (2022-03-11). The savepoint failed because of an out-of-memory error. But I'd like to know why the checkpoints fail and why the container goes "beyond physical memory limits". I checked the data sources and found nothing abnormal. Could it be that the savepoint failed, but its memory was never freed?
{code:java}
// code placeholder
job manager log
2022-03-11 00:58:24,891 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 108412 (type=CHECKPOINT) @ 1646931504738 for job d90b4aca73c5802e0dbbd50ca8af97e0.
2022-03-11 00:58:27,605 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 108412 for job d90b4aca73c5802e0dbbd50ca8af97e0 (9815989304 bytes in 2801 ms).
2022-03-11 01:00:06,603 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Closing TaskExecutor connection container_e06_1603181034156_0493_01_000023 because: Container [pid=177263,containerID=container_e06_1603181034156_0493_01_000023] is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory used; 14.3 GB of 25.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e06_1603181034156_0493_01_000023 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 177263 177261 177263 177263 (bash) 2 2 116015104 357 /bin/bash -c /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M -Dlog4j2.formatMsgNoLookups=true -Dlog.file=/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.log -Dlog4j.configuration=file:./log4j.properties -Dlog4j.configurationFile=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=1073741824b -D taskmanager.memory.network.min=1073741824b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=2652142028b -D taskmanager.memory.task.off-heap.size=536870912b --configDir . -Djobmanager.rpc.address='' -Dweb.tmpdir='/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc' -Dsecurity.kerberos.login.keytab='/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab' -Dweb.port='0' -Djobmanager.rpc.port='41239' -Drest.address='' 1> /mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.out 2> /mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.err 
    |- 177416 177263 177263 177263 (java) 484303004 122930506 15252447232 3145560 /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M -Dlog4j2.formatMsgNoLookups=true -Dlog.file=/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.log -Dlog4j.configuration=file:./log4j.properties -Dlog4j.configurationFile=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=1073741824b -D taskmanager.memory.network.min=1073741824b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=2652142028b -D taskmanager.memory.task.off-heap.size=536870912b --configDir . -Djobmanager.rpc.address= -Dweb.tmpdir=/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc -Dsecurity.kerberos.login.keytab=/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab -Dweb.port=0 -Djobmanager.rpc.port=41239 -Drest.address{code}
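As a rough cross-check of the "12.0 GB of 12 GB" figure in the dump above, the RSSMEM_USAGE(PAGES) column can be converted to bytes. This is only a sketch: it assumes 4 KiB memory pages and that YARN's "12 GB" means 12 GiB, neither of which is stated in the log. The names below are illustrative.

```python
# Assumption: 4 KiB pages (the usual default on x86-64 Linux); the log does not say.
PAGE_SIZE_BYTES = 4096
GIB = 1024 ** 3

# Assumption: the "12 GB" container limit reported by YARN is 12 GiB.
CONTAINER_LIMIT_PAGES = 12 * GIB // PAGE_SIZE_BYTES

# RSSMEM_USAGE(PAGES) values copied from the process-tree dump
# for container_e06_1603181034156_0493_01_000023.
bash_rss_pages = 357       # pid 177263 (bash wrapper)
java_rss_pages = 3145560   # pid 177416 (TaskManager JVM)

total_pages = bash_rss_pages + java_rss_pages
over_limit_pages = total_pages - CONTAINER_LIMIT_PAGES

print(f"process tree RSS: {total_pages * PAGE_SIZE_BYTES / GIB:.3f} GiB")
print(f"pages over the limit: {over_limit_pages}")
```

Under these assumptions the JVM alone sits just below the limit, and the process tree as a whole is a few hundred pages above it, which is consistent with a container killed right at the 12 GB boundary rather than one that overshot by a large margin.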
{code:java}
// job manager log
2022-03-11 07:04:54,253 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 108594 (type=CHECKPOINT) @ 1646953494183 for job d90b4aca73c5802e0dbbd50ca8af97e0.
2022-03-11 07:04:55,334 INFO  org.apache.flink.yarn.YarnResourceManager                    [] - Closing TaskExecutor connection container_e06_1603181034156_0493_01_000021 because: Container [pid=17068,containerID=container_e06_1603181034156_0493_01_000021] is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory used; 14.2 GB of 25.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e06_1603181034156_0493_01_000021 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 17068 17061 17068 17068 (bash) 1 2 116015104 356 /bin/bash -c /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M -Dlog4j2.formatMsgNoLookups=true -Dlog.file=/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.log -Dlog4j.configuration=file:./log4j.properties -Dlog4j.configurationFile=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=1073741824b -D taskmanager.memory.network.min=1073741824b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=2652142028b -D taskmanager.memory.task.off-heap.size=536870912b --configDir . -Djobmanager.rpc.address='' -Dweb.tmpdir='/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc' -Dsecurity.kerberos.login.keytab='/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab' -Dweb.port='0' -Djobmanager.rpc.port='41239' -Drest.address='' 1> /mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.out 2> /mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.err
    |- 17442 17068 17068 17068 (java) 476051309 120830693 15178711040 3145582 /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M -Dlog4j2.formatMsgNoLookups=true -Dlog.file=/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.log -Dlog4j.configuration=file:./log4j.properties -Dlog4j.configurationFile=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=1073741824b -D taskmanager.memory.network.min=1073741824b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=2652142028b -D taskmanager.memory.task.off-heap.size=536870912b --configDir . -Djobmanager.rpc.address= -Dweb.tmpdir=/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc -Dsecurity.kerberos.login.keytab=/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab -Dweb.port=0 -Djobmanager.rpc.port=41239 -Drest.address=
 
{code}
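To see how much of the 12 GB container is already spoken for by configuration, the memory flags from the TaskManager command line in the logs above can be added up. The sketch below only does that arithmetic. It assumes the "12 GB" limit is 12 GiB, and the dictionary keys are illustrative labels, not Flink option names. It does not identify which component actually grew past its budget.

```python
GIB = 1024 ** 3

# Byte values copied from the -D and -XX flags in the process-tree dump.
configured = {
    "framework heap": 134217728,        # taskmanager.memory.framework.heap.size
    "task heap": 2652142028,            # taskmanager.memory.task.heap.size
    "framework off-heap": 134217728,    # taskmanager.memory.framework.off-heap.size
    "task off-heap": 536870912,         # taskmanager.memory.task.off-heap.size
    "network": 1073741824,              # taskmanager.memory.network.min / max
    "managed": 6796786004,              # taskmanager.memory.managed.size
    "metaspace": 268435456,             # -XX:MaxMetaspaceSize
}

# Assumption: YARN's "12 GB physical memory" limit is 12 GiB.
container_limit = 12 * GIB

# The heap and direct-memory parts match the JVM flags in the same command line.
heap = configured["framework heap"] + configured["task heap"]
direct = (configured["framework off-heap"]
          + configured["task off-heap"]
          + configured["network"])
assert heap == 2786359756      # -Xmx / -Xms
assert direct == 1744830464    # -XX:MaxDirectMemorySize

accounted = sum(configured.values())
headroom = container_limit - accounted

print(f"configured total: {accounted / GIB:.2f} GiB")
print(f"left for everything else: {headroom / GIB:.2f} GiB")
```

Under these assumptions about 1.2 GiB of the container is left for anything not covered by the flags above (thread stacks, code cache, native allocations outside the managed budget, and so on). If native memory use, for example from RocksDB, exceeds what is budgeted for it, that headroom is what gets consumed before YARN kills the container. Whether that is what happened here cannot be told from these logs alone.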
 


