Posted to dev@dolphinscheduler.apache.org by 王维饶 <wa...@gmail.com> on 2022/09/23 04:14:12 UTC

[DISCUSS] Improved way to collect yarn job's appIds

Hi, DolphinScheduler Community

I'm a student in SOC-2022, responsible for optimizing the way yarn
applicationIds are collected. The old way, which parses the applicationId
from the log file, causes problems in production environments [1]. It also
has other potential problems, such as wasting CPU resources and fetching the
wrong applicationId because log output in a task is uncontrollable. I've
already created an issue about it [2].

My main idea is to intercept yarn's submitApplication function with AOP and
fetch the appId from the application context. I've already verified it for
most types of yarn job (MapReduce, Hive, Spark, Flink, etc.) and modified the
related code.
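
To make the interception idea concrete, here is a minimal sketch. It is not the proposed AspectJ aspect: it uses a plain JDK dynamic proxy as a stand-in so that it runs without Hadoop or AspectJ on the classpath. The AppSubmitter interface, the String return type, and the class name are hypothetical; only the method name submitApplication comes from the proposal above.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the client type that exposes submitApplication. */
interface AppSubmitter {
    String submitApplication(String jobName);
}

class SubmitInterceptionSketch {
    /** Wraps a submitter so every id returned by submitApplication is also recorded. */
    static AppSubmitter recording(AppSubmitter target, List<String> capturedIds) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Let the real call run first, then look at what it returned.
            Object result = method.invoke(target, args);
            if ("submitApplication".equals(method.getName())) {
                capturedIds.add((String) result);
            }
            return result;
        };
        return (AppSubmitter) Proxy.newProxyInstance(
                AppSubmitter.class.getClassLoader(),
                new Class<?>[] {AppSubmitter.class},
                handler);
    }

    public static void main(String[] args) {
        List<String> captured = new ArrayList<>();
        // Fake submitter that always hands back the same made-up id.
        AppSubmitter fake = jobName -> "application_1663900000000_0001";
        AppSubmitter wrapped = recording(fake, captured);
        wrapped.submitApplication("demo-job");
        System.out.println("captured ids: " + captured);
    }
}
```

The caller's code is unchanged apart from receiving the wrapped object, which is the property the AOP approach relies on: the task itself does not need to know its id is being captured.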

To be specific, every yarn job calls submitApplication to create a new
application. The applicationId can be written to {user.dir}/appInfo.log and
then fetched directly by getAppIdsFromAppInfoFile, rather than parsed from
the log file by getAppIdsFromLogFile as in the old way. This is an efficient
way to fetch the applicationId, and it avoids the potential problems
mentioned above.
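
A rough sketch of the file-based lookup follows. The method name getAppIdsFromAppInfoFile comes from this mail, but its signature, the one-id-per-line file layout, and the helper parseAppIds are assumptions made for illustration, not the actual DolphinScheduler code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class AppInfoFileSketch {
    /** Appends one id per line (assumed layout of appInfo.log). */
    static void appendAppId(Path appInfoFile, String appId) throws IOException {
        Files.writeString(appInfoFile, appId + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Drops blank lines and duplicates while keeping first-seen order. */
    static List<String> parseAppIds(List<String> lines) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                ids.add(trimmed);
            }
        }
        return new ArrayList<>(ids);
    }

    /** Reads the ids back; no regex scan over task output is needed. */
    static List<String> getAppIdsFromAppInfoFile(Path appInfoFile) throws IOException {
        if (!Files.exists(appInfoFile)) {
            return new ArrayList<>();
        }
        return parseAppIds(Files.readAllLines(appInfoFile));
    }

    public static void main(String[] args) throws IOException {
        // A temp directory stands in for {user.dir} here.
        Path file = Files.createTempDirectory("appinfo-demo").resolve("appInfo.log");
        appendAppId(file, "application_1663900000000_0001");
        appendAppId(file, "application_1663900000000_0002");
        System.out.println(getAppIdsFromAppInfoFile(file));
    }
}
```

Because the file only ever contains ids written at submit time, whatever a task prints to its own log cannot produce a false match.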

However, this solution still has some open questions, so we held a
community meeting at 19:00, September 22 (GMT-8), organized by GabryWu. The
meeting summary follows.

🍊 *Issue1:* Evaluate & review the idea and design

*> Conclusion1:* It seems reasonable and more efficient than the old way and,
in particular, is non-intrusive to task code. Eric suggested adding a
configuration option to choose how the applicationId is fetched (old or new
way) for stability.
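
One way such a switch could look is sketched below. The enum, its value names, and the fallback to the old log-based way are all assumptions for illustration; the meeting did not settle on a configuration key or a default.

```java
import java.util.Locale;

/** Hypothetical switch between the two ways of collecting applicationIds. */
enum AppIdCollectStrategy {
    LOG, // old way: parse the ids out of the task log
    AOP; // new way: read the ids captured at submitApplication

    /** Parses a config value; unset or unknown values fall back to LOG (assumed default). */
    static AppIdCollectStrategy fromConfig(String value) {
        if (value == null) {
            return LOG;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return LOG;
        }
    }
}

class StrategyDemo {
    public static void main(String[] args) {
        System.out.println(AppIdCollectStrategy.fromConfig("aop"));
        System.out.println(AppIdCollectStrategy.fromConfig(null));
    }
}
```

Falling back to the old way on bad input keeps existing deployments working if the option is missing or misspelled.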


🍓 *Issue2:* Should a new module be created in the source code for the AOP code?

*> Conclusion2:* It's better to do so. Otherwise, the AOP code will behave
like a black box and will become invalid if a user replaces
submitApplication in secondary development.


🍈 *Issue3:* Some environment variable configuration needs to be added to
dolphinscheduler_env.sh, like:

export HADOOP_CLIENT_OPTS="-javaagent:{DOLPHINSCHEDULER_HOME}/tools/libs/aspectjweaver-1.9.7.jar"

However, I don't think it's elegant to hard-code the dependency version,
which may cause operational problems. One possible solution is to package
known-dependencies.txt in the binary package and parse the version from it.

*> Conclusion3:* Ruan thinks it's not a big deal. The version will not
change often and will not add much operational cost.
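
For reference, the known-dependencies.txt idea from Issue3 could be sketched as below. It assumes the file lists one jar file name per line (for example aspectjweaver-1.9.7.jar); the class and method names are hypothetical, and the install path in main is made up.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AgentVersionSketch {
    // Matches a whole line such as "aspectjweaver-1.9.7.jar" and captures "1.9.7".
    private static final Pattern WEAVER_JAR = Pattern.compile("aspectjweaver-(.+)\\.jar");

    /** Returns the aspectjweaver version listed in the dependency lines, if any. */
    static Optional<String> findWeaverVersion(List<String> dependencyLines) {
        for (String line : dependencyLines) {
            Matcher matcher = WEAVER_JAR.matcher(line.trim());
            if (matcher.matches()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    /** Builds the -javaagent option for a given install directory and version. */
    static String javaAgentOption(String dsHome, String version) {
        return "-javaagent:" + dsHome + "/tools/libs/aspectjweaver-" + version + ".jar";
    }

    public static void main(String[] args) {
        List<String> lines = List.of("asm-9.1.jar", "aspectjweaver-1.9.7.jar");
        findWeaverVersion(lines)
                .map(version -> javaAgentOption("/opt/dolphinscheduler", version))
                .ifPresent(System.out::println);
    }
}
```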


*> Other conclusions:* The user docs should state that AOP-related
environment variables must not be overridden while DS is running (either
manually or through configuration in the DS UI).


I'm really looking forward to other suggestions and more discussion of these
questions.

Thanks!

———————————————————————————————————————

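Hello everyone in the DolphinScheduler community,

I'm a participant in Open Source Summer 2022 (开源之夏), responsible for
optimizing how yarn applicationIds are collected. The original approach
obtains them by regex matching against the log, which causes many problems in
production environments [1]. It also leads to high CPU usage, and custom
output in user code can match an ambiguous applicationId. I have already
opened an issue for this problem [2].

The main idea is to intercept the submitApplication method with AOP. I have
verified that the yarn-scheduled compute tasks currently supported by DS can
all be intercepted by configuring environment variables, and I have made
initial changes to the corresponding code.

In detail, every yarn job requests a new application through
submitApplication, after which resources are allocated and the job is
scheduled. The applicationId that AOP intercepts from this method can be
written to the {user.dir}/appInfo.log file and fetched directly by the
getAppIdsFromAppInfoFile method, without parsing the log in the
getAppIdsFromLogFile method as before. The new method is a more efficient way
to obtain the applicationId and avoids the potential problems mentioned
above.

However, this solution still has several open questions to discuss, so we
held a community meeting at 19:00 on 9.22 (GMT-8), organized by my mentor
GabryWu. The attendees were me, GabryWu, Ruanwenjun, and Eric. A brief record
of the meeting follows:

🍊 Issue 1: Evaluate the feasibility of the idea and design

> Conclusion 1: This solution looks more efficient than the original one, and
it is not intrusive to job code at all. For a smooth transition, Eric
suggested providing a configuration option so that users can choose which way
to obtain the applicationId (the new or the old one).


🍓 Issue 2: Should a new module be created in the source code for the AOP code?

> Conclusion 2: It is best to do so. Otherwise the AOP code will be a black
box to users, and if a user has modified submitApplication in a secondary
development of the yarn code, the AOP will become invalid.


🍈 Issue 3: This solution requires adding environment variables to
dolphinscheduler_env.sh, for example:

export HADOOP_CLIENT_OPTS="-javaagent:{DOLPHINSCHEDULER_HOME}/tools/libs/aspectjweaver-1.9.7.jar"

However, I don't think hard-coding the dependency version is ideal, as it may
cause operational problems. One solution is to add dependencies.txt to the
binary package and parse the version from that file.

> Conclusion 3: Ruan thinks this is not a big problem. The dependency version
will not change easily, so it will not cause much operational trouble.


> Other conclusions:

The user documentation should state that users must not override AOP-related
environment variables while DS is running (by manual changes or by changing
the environment configuration in the DS UI).


I'm very much looking forward to more suggestions and discussion about this
solution.

Thank you all!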


*Related issues:*

[1] https://github.com/apache/dolphinscheduler/issues/11214

[2] https://github.com/apache/dolphinscheduler/issues/11262


*Meeting playback:*

[1] Google Drive

link: https://drive.google.com/file/d/1JGShE4aNl3wJEF7jX0OuQD_anEaEotE3/view?usp=sharing

[2] Baidu Netdisk

link: https://pan.baidu.com/s/1h5fmtEsOk86G9JBPGUcPDg

code: dhy3


_____________________

Best Wishes

Radeity (Aaron Wang)

_____________________

Re: [DISCUSS] Improved way to collect yarn job's appIds

Posted by wenjun <we...@apache.org>.

Hi Rick,

Good idea. If we make this a strategy, users can choose how to obtain the
application_id.

Thanks,
Wenjun

On Fri, Sep 23, 2022 at 2:34 PM Rick Cheng <ri...@gmail.com> wrote:
>
> Hi, Aaron Wang
>
> I agree that the current way of getting the yarn application id from the
> log is not elegant.
> Just for discussion, there is another way to get yarn application id as
> below:
>
> We can put a unique tag on each task submitted from DS to yarn. E.g., for
> spark tasks, we can add the configuration "--conf
> spark.yarn.tags=some_unique_tag".
> After the task is submitted, DS can query the corresponding yarn
> application id (or other info) through this unique tag.
>
> What do you think? Any comments or discussions are welcome.

Re: [DISCUSS] Improved way to collect yarn job's appIds

Posted by Rick Cheng <ri...@gmail.com>.

Hi, Aaron Wang

I agree that the current way of getting the yarn application id from the
log is not elegant.
Just for discussion, there is another way to get yarn application id as
below:

We can put a unique tag on each task submitted from DS to yarn. E.g., for
spark tasks, we can add the configuration "--conf
spark.yarn.tags=some_unique_tag".
After the task is submitted, DS can query the corresponding yarn
application id (or other info) through this unique tag.
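
A minimal sketch of the tag idea, under some assumptions: the tag naming scheme (ds-task-<id>), the class and method names, and the ResourceManager address are hypothetical. The query uses the applicationTags filter of the ResourceManager REST endpoint /ws/v1/cluster/apps. The sketch only builds the submit arguments and the query URI and does not perform the HTTP call.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

class YarnTagLookupSketch {
    /** Hypothetical tag scheme: one lowercase tag per DS task instance. */
    static String tagFor(long taskInstanceId) {
        return "ds-task-" + taskInstanceId;
    }

    /** Extra spark-submit arguments that attach the tag to the yarn application. */
    static List<String> sparkTagArgs(String tag) {
        return List.of("--conf", "spark.yarn.tags=" + tag);
    }

    /** URI that asks the ResourceManager for applications carrying the tag. */
    static URI appsByTagUri(String rmBaseUrl, String tag) {
        String encoded = URLEncoder.encode(tag, StandardCharsets.UTF_8);
        return URI.create(rmBaseUrl + "/ws/v1/cluster/apps?applicationTags=" + encoded);
    }

    public static void main(String[] args) {
        String tag = tagFor(42L);
        System.out.println(String.join(" ", sparkTagArgs(tag)));
        // "http://rm-host:8088" is a placeholder ResourceManager address.
        System.out.println(appsByTagUri("http://rm-host:8088", tag));
    }
}
```

Other task types would need their own way of passing the tag, which is something to weigh against the AOP approach.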

What do you think? Any comments or discussions are welcome.

