Posted to dev@linkis.apache.org by rita <ri...@163.com> on 2022/07/11 12:11:42 UTC

[DISCUSS] kill-yarn-jobs.sh

Dear all:

The chat records of the WeChat group "Apache Linkis Community Development
Group" are as follows:

――――― 2022-07-09 ―――――

Heisenberg 7-9 10:40:06

 

https://sm.ms/image/Z2Ce8mhuzO1wHon

 

Heisenberg 7-9 10:40:06

@peacewong@WDS Brother Ping, kill-yarn-jobs.sh under the shell engine runs
inside the engine process. If a user directly kill -9's the engine process,
the Spark applications that user launched through the shell script will not
be killed; the user can only track them down and kill them manually

peacewong@WDS  7-9 10:43:29

Yes, Brother Longping

peacewong@WDS  7-9 10:43:44

With kill -9, there's nothing we can do

Heisenberg 7-9 10:45:20

As I understand it, the kill-yarn-application logic should then live not in
the EC, but in the ECM

Heisenberg 7-9 10:45:20

https://s2.loli.net/2022/07/11/qSQPNWxEkHb5Rjm.png

Heisenberg 7-9 10:45:23

?

peacewong@WDS  7-9 10:45:41

yes

Heisenberg 7-9 10:46:08

When the ECM kills an engine, it detects the running YARN apps started by
that engine, kills those YARN apps first, and then kills the engine

Heisenberg 7-9 10:46:09

good

peacewong@WDS  7-9 10:46:16

Right, the ECM acts as a fallback there: cache the EC's application IDs

peacewong@WDS  7-9 10:46:33

Uh huh, yes

Heisenberg 7-9 10:47:14

Can the ECM know which applications an EC has started?

peacewong@WDS  7-9 10:50:16

For Spark that is easy to know; for Hive and shell we can probably only get
them by scanning the log files. If we make this into generic logic, scanning
the logs and deduplicating to obtain the application IDs feels better. But
overly large logs may also be a problem.

Heisenberg 7-9 10:50:52

Uh huh, I see the shell engine scans the logs and extracts the application ID
with a regex

peacewong@WDS  7-9 10:52:02

yes

Heisenberg 7-9 10:52:33

A key operation here is collecting the application IDs, and then calling
kill-yarn-job.sh or killing the tasks through the YARN REST API

Heisenberg 7-9 10:53:16

Let me first look at the YARN application ID collection part

peacewong@WDS  7-9 10:57:44

Yes, you can get it by scanning logs, but the performance may be poor.

Heisenberg 7-9 10:59:45

I see the log scanning is done on the EC side

Heisenberg 7-9 11:01:19

If we implement it by scanning logs, will the scanning still be handled in
the EC? That is, the EC scans out the application IDs and reports them to the
ECM.

Or should the log scanning be done in the ECM?

peacewong@WDS  7-9 11:09:11

It seems more reasonable for the ECM to do it. There is a better processing
approach: the logAppender or sendAppender on the EC side parses each line
and, if a line contains an application ID, prints it at error level, i.e. to
stderr; the ECM then parses the application IDs out of stderr. Since stderr
usually has relatively few lines, performance should not be bad
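
A minimal sketch of that EC-side idea, assuming the appender hands us each
formatted line (the object and method names below are illustrative, not
Linkis' actual logAppender/sendAppender):

    object AppIdEcho {
      // YARN application IDs look like application_<clusterTimestamp>_<sequence>.
      private val AppIdPattern = "application_\\d+_\\d+".r

      // Call per formatted log line: mirror to stderr only the rare lines that
      // carry an application ID, so stderr stays small for the ECM to parse.
      def maybeEchoToStderr(line: String): Unit =
        if (AppIdPattern.findFirstIn(line).isDefined) System.err.println(line)
    }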

peacewong@WDS  7-9 11:10:06

That way, whenever the ECM kills an engine, it just greps and deduplicates
the application IDs, checks their state, and kills them
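
The ECM-side kill path then reduces to scanning one small file. A sketch of
that grep-and-dedup step, with the stderr path passed in as a parameter (per
the discussion below, the ECM can resolve it from the EngineConn object):

    import scala.io.Source
    import scala.util.Using

    object EcmAppIdScan {
      private val AppIdPattern = "application_\\d+_\\d+".r

      // Grep the EC's stderr file and deduplicate the application IDs; each
      // surviving ID would then be state-checked and killed, e.g. via the
      // REST call sketched above or kill-yarn-job.sh.
      def distinctAppIds(stderrPath: String): Set[String] =
        Using.resource(Source.fromFile(stderrPath)) { src =>
          src.getLines().flatMap(AppIdPattern.findAllIn(_)).toSet
        }
    }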

Heisenberg 7-9 11:11:41

Uh huh, let me first look at the flow of the logAppender/sendAppender sending
logs and the ECM receiving them

peacewong@WDS  7-9 11:12:47

The ECM can directly take the stderr log path and process it; there is no
need to receive logs

Heisenberg 7-9 11:13:19

https://s2.loli.net/2022/07/11/g4yXxAkKqobWsT8.png

 

Heisenberg 7-9 11:13:31

stderr

peacewong@WDS  7-9 11:17:12

Yes, the ECM can get the concrete path from the EngineConn object via
conn.getEngineConnManagerEnv.engineConnLogDirs

Heisenberg 7-9 11:23:10

"The logAppender or sendAppender on the EC side parses each line and judges
whether it contains an application ID"

You still need to run regex parsing on every line of log output in the EC.
For the Spark engine, parsing can stop once the app ID is found; for the
shell engine it may have to keep parsing, because the shell engine can submit
multiple Spark apps and produce multiple app IDs. So is the performance cost
still in the log parsing?

Heisenberg 7-9 11:24:39

Will this feature be added in 1.2.0?

peacewong@WDS  7-9 11:25:22

Yes, it is mainly in log4j. I just looked: the logging classes each engine
uses when printing the application ID also seem inconsistent

peacewong@WDS  7-9 11:26:56

Uh huh, if the timing works out. 1.2.1 is also fine.

peacewong@WDS  7-9 11:27:29

It would be better if it were unified; then we can just add a log4j
configuration at the printing sites to route the output over
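
If the printing is unified, the stderr routing could indeed be plain log4j2
configuration, e.g. an extra console appender gated by a RegexFilter. A
sketch only; the appender name and layout are illustrative, not Linkis'
shipped configuration:

    <!-- Mirror only application-ID lines to stderr, so the ECM greps a small
         file. Attach it to the relevant logger via
         <AppenderRef ref="AppIdStderr"/>. -->
    <Console name="AppIdStderr" target="SYSTEM_ERR">
      <PatternLayout pattern="%d %p %m%n"/>
      <RegexFilter regex=".*application_\d+_\d+.*"
                   onMatch="ACCEPT" onMismatch="DENY"/>
    </Console>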

Heisenberg 7-9 11:28:33

Got it

Heisenberg 7-9 11:30:37

Configuring log4j can print the applicationId. I see the shell engine uses
YarnAppIdExtractor to regex-parse each line of log output

Heisenberg 7-9 11:30:48

https://s2.loli.net/2022/07/11/hGJBr5y82OPL9if.png

 

 

 

 

peacewong@WDS  7-9 11:31:56

Yes, for the shell engine we can parse it out here and print it to stderr

Heisenberg 7-9 11:38:06

Is the Spark engine configured via log4j? Or does it also use this approach?

peacewong@WDS  7-9 11:54:48

For Spark and Hive, configuring log4j feels better

Heisenberg 7-9 11:57:12

OK, let me see how to configure it

Heisenberg 7-9 16:26:01

https://s2.loli.net/2022/07/11/lf1KaJ7rubjRW8G.png

 

 

Heisenberg 7-9 16:26:01

When debugging locally and starting the gateway, an error is reported that
seems related to knife4j. Has anyone run into similar problems?

casion 7-9 16:28:08

The package needs to be excluded. It is removed at packaging time, but
something may have been missed where the dependency is referenced

casion 7-9 16:30:11

https://github.com/apache/incubator-linkis/pull/2434/files

Could you help check whether an exclude was missed where the POM introduces it?

Xu Ling 7-9 16:31:00

This probably got bundled into the shared package

Heisenberg 7-9 16:31:59

OK, I'll take another look later

Heisenberg 7-9 16:33:21

Previously, our colleagues could debug all microservices locally, including
ECM and EC, based on version 1.0.3

Xu Ling 7-9 16:33:22

 

https://s2.loli.net/2022/07/11/ICB4n5pxyQeTqNX.png

 

 

casion 7-9 16:39:10

It was added in 1.2.0; some dependencies of knife4j, integrated for Swagger,
conflict with the gateway. The dependencies were probably not excluded cleanly

casion 7-9 16:39:23

 

https://s2.loli.net/2022/07/11/cyVnh2QFiPIATG8.png

 

 

Xu Ling 7-9 16:43:09

This one has no effect

Xu Ling 7-9 16:44:27

It's in the shared package; excluding it here has no effect

Xu Ling 7-9 16:45:08

public-module

peacewong@WDS  7-9 16:46:02

The gateway does not include public-module

Xu Ling 7-9 16:49:08

https://s2.loli.net/2022/07/11/3wL19mnqhtfHPK8.png

 

Xu Ling 7-9 16:49:26

Judging by this, it should have been added, right?

peacewong@WDS  7-9 16:51:11

Yes, where linkis-module is introduced, that exclusion should take effect; it
may be related to the Maven version or the like. "*" exclusions do not take
effect with a lower local Maven version
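
For context, the wildcard form under discussion looks like the snippet below.
Maven honors "*" exclusions only from version 3.2.1 onward, which would
explain an older local Maven silently ignoring them (the coordinates are
illustrative):

    <dependency>
      <groupId>org.apache.linkis</groupId>
      <artifactId>linkis-module</artifactId>
      <exclusions>
        <!-- Wildcard exclusions require Maven 3.2.1+; older Mavens ignore them. -->
        <exclusion>
          <groupId>*</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>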

Xu Ling 7-9 16:54:33

I mean, the gateway does seem to have added public-module

peacewong@WDS  7-9 16:55:45

It is not added to the classpath when the gateway starts, because there is a
conflict.

casion 7-9 17:07:09

The gateway uses WebFlux, which is incompatible with the WebMVC used by the
other services, so public-module is not depended on at startup. At present
the gateway still packages quite a lot of dependencies; I feel they could be
slimmed down. If Brother Xu Ling is interested, you could look into
simplifying the gateway's dependency packages

Xu Ling 7-9 17:09:31

I want to sort out all the dependencies

peacewong@WDS  7-9 17:18:18

Sure, Brother Xu Ling
