Posted to user-zh@flink.apache.org by Chenyu Zheng <ch...@hulu.com.INVALID> on 2021/08/10 11:13:18 UTC

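Hi developers,

I am trying to deploy a Flink cluster on k8s, but when I set the parallelism fairly high (128), I frequently hit various JobManager/TaskManager timeout errors, and my job is then cancelled automatically.

I am sure this is not a network problem, because:

  *   The problem never occurred at parallelism 32/64, but at parallelism 128 the error occurs on every run
  *   Our Flink is deployed in a production k8s cluster, and no other containers have reported network problems
  *   Increasing heartbeat.timeout (to 300s) resolves the problem

My Flink environment: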
·        Flink 1.12.5 with Java 8, Scala 2.11
·        Jobmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/jobmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=1073741824b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=15703474176b -D jobmanager.memory.jvm-overhead.max=1073741824b
·        Taskmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 -XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/taskmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=359703515b -D taskmanager.memory.network.min=359703515b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1530082070b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.jvm-metaspace.size=268435456b -D taskmanager.memory.jvm-overhead.max=429496736b -D taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf -Djobmanager.rpc.address='10.50.132.154' -Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.off-heap.size='134217728b' -Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' -Drest.address='10.50.132.154' -Djobmanager.memory.jvm-overhead.max='1073741824b' -Djobmanager.memory.jvm-overhead.min='1073741824b' -Dtaskmanager.resource-id='stream-3111167f634e41349f7195961cdb0c6c-taskmanager-1-17' -Dexecution.target='embedded' -Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='15703474176b'
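
To make the byte-valued memory flags above easier to read, here is a small sketch that converts them to MiB and sums them per process. The numbers are copied from the two start commands; the dictionary layout and the helper names (`total_bytes`, `describe`) are only illustrative. It assumes that Flink's total process memory is the sum of the listed components, and that where min and max are equal (network, jvm-overhead) the single value applies.

```python
MIB = 1024 * 1024

# Byte values copied verbatim from the -D flags in the start commands above.
JOBMANAGER = {
    "jobmanager.memory.heap.size": 15703474176,
    "jobmanager.memory.off-heap.size": 134217728,
    "jobmanager.memory.jvm-metaspace.size": 268435456,
    "jobmanager.memory.jvm-overhead (min = max)": 1073741824,
}

TASKMANAGER = {
    "taskmanager.memory.framework.heap.size": 134217728,
    "taskmanager.memory.task.heap.size": 1530082070,
    "taskmanager.memory.framework.off-heap.size": 134217728,
    "taskmanager.memory.task.off-heap.size": 0,
    "taskmanager.memory.network (min = max)": 359703515,
    "taskmanager.memory.managed.size": 1438814063,
    "taskmanager.memory.jvm-metaspace.size": 268435456,
    "taskmanager.memory.jvm-overhead (min = max)": 429496736,
}


def total_bytes(components):
    """Sum all memory components of one process, in bytes."""
    return sum(components.values())


def describe(name, components):
    """Render one process's memory components and their total in MiB."""
    lines = [f"{name}:"]
    for key, size in components.items():
        lines.append(f"  {key:<46} {size / MIB:10.1f} MiB")
    lines.append(f"  {'total':<46} {total_bytes(components) / MIB:10.1f} MiB")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe("JobManager", JOBMANAGER))
    print(describe("TaskManager", TASKMANAGER))
```

With these flags the components add up to 16 GiB for the JobManager and 4 GiB per TaskManager, and the TaskManager's two heap components sum to its -Xmx value of 1664299798 bytes (about 1.55 GiB). Each TaskManager is also configured with taskmanager.cpu.cores=1.0.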

Is this timeout behavior expected? What should I do to locate its root cause?

Thanks!

Chenyu

Re:

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

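There can be many reasons for a timeout. From your description, though, it may be resource pressure caused by the higher parallelism. Have you looked at the GC log to see whether there are any long full GCs? You can also run jstack on a TM when one of its heartbeats takes especially long and inspect the stacks. Both will help you analyze the cause.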
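
As one possible way to do the GC log check, here is a minimal sketch that scans log lines for long stop-the-world pauses. It assumes the log contains the "Total time for which application threads were stopped" lines that Java 8 prints when -XX:+PrintGCApplicationStoppedTime is set (as in the start commands in this thread). The function name `long_pauses`, the 1-second default threshold, and the sample lines are made up for illustration and are not taken from the poster's logs.

```python
import re

# Matches the pause line printed by -XX:+PrintGCApplicationStoppedTime on Java 8.
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:\.\d+)?) seconds"
)


def long_pauses(lines, threshold_seconds=1.0):
    """Return (line_number, pause_seconds) for each pause >= threshold_seconds."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds >= threshold_seconds:
            hits.append((number, seconds))
    return hits


# Hypothetical sample lines, invented only to exercise the function.
SAMPLE = [
    "2021-08-10T11:13:18.100+0000: 10.250: Total time for which application "
    "threads were stopped: 0.0125 seconds, Stopping threads took: 0.0001 seconds",
    "2021-08-10T11:14:02.500+0000: 54.650: Total time for which application "
    "threads were stopped: 12.5 seconds, Stopping threads took: 0.0002 seconds",
    "some unrelated log line",
]

if __name__ == "__main__":
    for number, seconds in long_pauses(SAMPLE):
        print(f"line {number}: threads stopped for {seconds} s")
```

To run it against a real log, pass an open file object instead of `SAMPLE`, for example `long_pauses(open(path), threshold_seconds=5.0)`, where `path` is wherever the JVM output was captured. The start commands here set no GC log file, so the output presumably ends up in the container's stdout. Pauses approaching the heartbeat timeout would be the ones worth comparing against the timeout errors.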
