Posted to user-zh@flink.apache.org by 范瑞 <83...@qq.com> on 2020/09/02 05:56:45 UTC
Re: CPU maxed out when Flink uses RocksDB

Hi, thanks to Congxian for the reply!


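1. Flink version: 1.10.
2. Data is distributed very evenly across subtasks. The get function is called quite frequently, but the call rate should be similar for every subtask.
The flame graphs of the other, healthy subtasks do not spend large amounts of time here; their time is spread across many parts of the code.
3. The problem also occurs when the SST files are small. A 60 MB SST file is not large, and each subtask's block cache gets about 1 GB.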


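Additional notes:
4. Comparing the normal and abnormal flame graphs, the problem is in the NewIndexIterator step. The abnormal flame graph spends its time in the NewIndexIterator code, whereas in the normal flame graph NewIndexIterator takes almost no time.
NewIndexIterator should be an index-reading operation. Could it be that the index no longer fits in the cache and is being evicted frequently?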

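Diff between the normal and abnormal flame graphs: https://drive.google.com/file/d/1DG4eTbaaozG_4ZxPQc_svmoyEXCdKZqb/view?usp=sharing
5. I increased some of the block cache related parameters. Before tuning, the problem appeared after the job had run for 5 minutes; after tuning, it appears after 30 minutes.
The parameters after tuning are: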
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.4
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.5
taskmanager.memory.jvm-overhead.fraction: 0.05
state.backend.incremental: true
state.backend.rocksdb.block.cache-size: 3 gb
taskmanager.memory.managed.fraction: 0.5
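
To make the two ratios above concrete, here is a rough sketch of the nominal split they imply. It follows the option descriptions in the Flink documentation (write-buffer-ratio caps the write buffers, high-prio-pool-ratio reserves cache for index and filter blocks); Flink's internal accounting may differ. The function name and the 1024 MiB of managed memory per slot are assumptions for illustration only, not values taken from this job.

```python
def nominal_rocksdb_budget(managed_mib, write_buffer_ratio, high_prio_pool_ratio):
    """Split a slot's managed memory according to the two documented ratios.

    Hypothetical helper: this is the nominal split only, not Flink's
    exact internal calculation.
    """
    if write_buffer_ratio < 0 or high_prio_pool_ratio < 0:
        raise ValueError("ratios must be non-negative")
    if write_buffer_ratio + high_prio_pool_ratio >= 1:
        raise ValueError("the two ratios must sum to less than 1")
    write_buffers = managed_mib * write_buffer_ratio
    high_prio_pool = managed_mib * high_prio_pool_ratio
    # Whatever is left is what ordinary data blocks can use.
    data_blocks = managed_mib - write_buffers - high_prio_pool
    return {
        "write_buffers": write_buffers,
        "high_prio_pool": high_prio_pool,
        "data_blocks": data_blocks,
    }


# Assumed 1024 MiB of managed memory per slot; ratios from the settings above.
budget = nominal_rocksdb_budget(1024, 0.4, 0.5)
for name, mib in budget.items():
    print(f"{name}: {mib:.1f} MiB")
```

With ratios of 0.4 and 0.5, only about a tenth of the budget remains for ordinary data blocks, which may be worth keeping in mind when reading the flame graphs.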



6. The RocksDB LOG level is now set to DEBUG. The LOG file is attached: https://drive.google.com/file/d/1JpHMjwTyW0ej-GtPbg6NLigohG7o3U4Y/view?usp=sharing
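
The eviction hypothesis in point 4 can be illustrated with a toy simulation. This is only a sketch under strong assumptions: a single plain LRU cache with equally sized blocks, uniformly random keys, and no high-priority pool. It is not RocksDB's actual block cache, and all names and sizes are hypothetical. Each simulated get touches one index block and then one data block.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that counts entries, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert the key and evict the oldest entry if full."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def index_hit_rate(capacity, n_index_blocks, n_data_blocks, n_gets, seed=0):
    """Fraction of simulated gets whose index block was already cached."""
    rng = random.Random(seed)
    cache = LRUCache(capacity)
    hits = 0
    for _ in range(n_gets):
        # A lookup first needs an index block, then the data block it points to.
        if cache.access(("index", rng.randrange(n_index_blocks))):
            hits += 1
        cache.access(("data", rng.randrange(n_data_blocks)))
    return hits / n_gets


# Cache large enough for every block vs. a cache much smaller than the working set.
roomy = index_hit_rate(capacity=2000, n_index_blocks=100, n_data_blocks=1000, n_gets=20000)
tight = index_hit_rate(capacity=50, n_index_blocks=100, n_data_blocks=1000, n_gets=20000)
print(f"index hit rate, roomy cache: {roomy:.3f}")
print(f"index hit rate, tight cache: {tight:.3f}")
```

When the cache is smaller than the working set, data blocks keep pushing index blocks out, so most lookups have to load the index block again. If index blocks are stored in the block cache, that would be consistent with NewIndexIterator dominating the abnormal flame graph, although the simulation alone does not prove it.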


Thanks
fanrui


------------------ Original message ------------------
From: "user-zh" <qcx978132955@gmail.com>;
Sent: Wednesday, September 2, 2020, 1:38 PM
To: "user-zh" <user-zh@flink.apache.org>;

Subject: Re: CPU maxed out when Flink uses RocksDB



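Hi
    Judging from the flame graph, the RocksDB#get operation takes up a large share of the time; contains calls RocksDB's get function.
    1. Which version of Flink are you using?
    2. Is the data evenly distributed across subtasks? I mainly want to know whether the frequency of RocksDB get calls matches expectations.
    3. If I understand correctly, snappy compression shows up, which implies IO (that is, loading data from disk). It may also be worth looking into why so much of this subtask's data ends up on disk.
Best,
Congxian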


fanrui <836961905@qq.com> wrote on Tuesday, September 1, 2020 at 9:14 PM:

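> A note:
> The Flink job has a parallelism of 1024. After it runs for a few minutes, four or five subtasks show the behavior described above, while the remaining subtasks are normal.
> The flame graphs of the normal subtasks look normal: every step in the code uses some share of the CPU, rather than the MapState contains operation consuming a large amount of CPU.
>
>
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/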