Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/08/02 11:02:08 UTC

[GitHub] [dolphinscheduler] Radeity opened a new issue, #11262: [Improvement][Task] Improve way to collect yarn job's appIds

Radeity opened a new issue, #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Description
   
   The current way to collect appIds is to scan log files and parse them. This is inefficient and can cause OOM when a log file is large, as mentioned in [issue#11214](https://github.com/apache/dolphinscheduler/issues/11214). The problem can only be solved permanently by switching to a new way of collecting appIds that avoids reading log files.
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1255880142

   > Hi, @Radeity @ruanwenjun
   > 
   > I agree that the current way of getting the yarn application id from the log is not elegant. Just for discussion, there is another way to get `yarn application id` as below:
   > 
   > 1. We can put some **unique tags** on tasks submitted from DS to yarn. E.g., for spark tasks, we can add the configuration `--conf spark.yarn.tags=some_unique_tag`.
   > 2. After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag.
   > 
   > What do you think? Any comments or discussions are welcome.
   
   Hi, @rickchengx 
   
   First, thanks for your idea! 
   
   However, I think this approach has two problems:
   1. Users may create a ShellTask and submit more than one yarn job from the command line, where it is hard to add configuration.
   2. The AOP approach simply fetches the applicationId and writes it into the appInfo.log file. I think that may be more efficient than querying it through a unique tag. In fact, I don't quite understand how your idea works. Would you explain more about it?
   
   My best wishes.
   



[GitHub] [dolphinscheduler] rickchengx commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

rickchengx commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1256023943

   @Radeity, thanks for the reply.
   
   Here is more info about the tagging approach:
   1. DS can add unique tags while building the commands for yarn tasks (spark, flink, sqoop, mapreduce, etc.). `ShellTask` is not included, because DS is not responsible for building the commands in a shell task. **The tag is added automatically by DS, and the user is unaware of it.**
   2. After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag, specifically through a yarn client.
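
The two steps above can be sketched in Java as follows. This is only an illustration under stated assumptions, not DS code: the class `YarnTagSketch`, the tag format `ds_task_<taskInstanceId>`, and the ResourceManager address are all hypothetical. It relies on the `spark.yarn.tags` property named in this thread and on the ResourceManager REST endpoint `/ws/v1/cluster/apps` with its `applicationTags` filter. The HTTP call itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class YarnTagSketch {

    /** Hypothetical tag format: lower case and URL-safe, unique per task instance. */
    static String tagFor(int taskInstanceId) {
        return "ds_task_" + taskInstanceId;
    }

    /**
     * Returns a copy of the command with the tag option inserted right after
     * the launcher name. Assumes element 0 is the launcher (e.g. spark-submit).
     */
    static List<String> withSparkTag(List<String> sparkSubmitArgs, String tag) {
        List<String> tagged = new ArrayList<>(sparkSubmitArgs);
        tagged.add(1, "--conf");
        tagged.add(2, "spark.yarn.tags=" + tag);
        return tagged;
    }

    /**
     * Builds the ResourceManager REST query that lists applications carrying the tag.
     * rmAddress is a placeholder such as "http://rm-host:8088".
     */
    static String lookupUrl(String rmAddress, String tag) {
        return rmAddress + "/ws/v1/cluster/apps?applicationTags=" + tag;
    }
}
```

For example, `withSparkTag(List.of("spark-submit", "app.jar"), tagFor(42))` yields the command `spark-submit --conf spark.yarn.tags=ds_task_42 app.jar`, and the application id would then be read from the response of the URL returned by `lookupUrl`.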
   
   
   
   
   
   



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1203683393

   @ruanwenjun Yeah, there may be a practicable solution; let's briefly talk about it.
   
   Before submitting a yarn job, the client first applies for the application context from the RM and gets the appId, which is then written into the NM's environment variables. We can use a Java agent to read it before the yarn job's JAR file is executed, and the agent program can also take taskInstanceId as input. However, where to store this mapping still needs further consideration.
   
   Please let me know if you have any good suggestions!
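
A rough Java sketch of such an agent, to make the idea easier to discuss. The assumptions are loud ones: the class name `AppIdAgentSketch`, the option names `taskInstanceId` and `env`, and the choice to print the mapping to stderr are all hypothetical, and the default variable name `APPLICATION_WEB_PROXY_BASE` is borrowed from a later comment in this thread. Only the `premain(String, Instrumentation)` entry point is standard `java.lang.instrument` behaviour.

```java
import java.lang.instrument.Instrumentation;
import java.util.HashMap;
import java.util.Map;

final class AppIdAgentSketch {

    /** Parses agent options of the form "key1=value1,key2=value2". */
    static Map<String, String> parseArgs(String agentArgs) {
        Map<String, String> parsed = new HashMap<>();
        if (agentArgs == null || agentArgs.isEmpty()) {
            return parsed;
        }
        for (String pair : agentArgs.split(",")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                parsed.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return parsed;
    }

    /** One mapping record, tab-separated so it is easy to split again. */
    static String mappingLine(String taskInstanceId, String envValue) {
        return taskInstanceId + "\t" + envValue;
    }

    /** Called by the JVM before the job's own main method when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        Map<String, String> options = parseArgs(agentArgs);
        String taskInstanceId = options.getOrDefault("taskInstanceId", "unknown");
        String envName = options.getOrDefault("env", "APPLICATION_WEB_PROXY_BASE");
        String envValue = System.getenv(envName);
        if (envValue != null) {
            // Where to persist the mapping is the open question; stderr is only a placeholder.
            System.err.println(mappingLine(taskInstanceId, envValue));
        }
    }
}
```

A real agent would be packaged in its own jar with a `Premain-Class` manifest entry and attached with something like `-javaagent:agent.jar=taskInstanceId=123`. How that option reaches each compute engine's JVM is exactly the per-task-type difference discussed in this thread.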



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1256380005

   @rickchengx 
   
   Thanks for your detailed explanation.
   
   Compared with the tag approach, AOP can handle shell tasks and, in addition, does not intrude into the DS task definition code. You're right that an additional jar package is required. However, the temporary appInfo log file is only for fetching the applicationId in time; when the task is done, the appId is written into TaskExecutionContext just as in the original approach.
   
   Moreover, extra maintenance is only needed when compute engines change the way they accept configuration such as java-opts, or when the yarn client changes its submit function, which I really don't think is a big deal, because these have remained unchanged for many years. Think of WeChat Pay, for example: it has been used for many years to pay by scanning a QR code, it is already in wide use, and it will not suffer a sudden change. Anyway, I have to say, the yarn client may be updated and new compute engines will come out, but for this AOP approach in DS the cost of potential maintenance is relatively small compared with other parts of the code, such as the generated command line for submitting a spark task.
   
   For the last point, I agree with you: stability is worth considering. For a smooth transition, my suggestion is to keep both the original and the new AOP approach and to provide an extra configuration option so users can choose how the applicationId is fetched. If the AOP approach proves stable enough, we can then consider whether to completely replace the original one.
   
   What do you think? I'd appreciate it if you have a more elegant idea!
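
The configuration switch proposed above could look roughly like this in Java. It is only a sketch: the property key `appId.collect`, the class `AppIdCollectConfig`, and the mode names are hypothetical and are not existing DS settings. Falling back to the log-based mode keeps today's behaviour whenever the option is missing or misspelled.

```java
import java.util.Locale;
import java.util.Properties;

final class AppIdCollectConfig {

    /** LOG is the original log-parsing way, AOP the new agent-based way. */
    enum Mode { LOG, AOP }

    /** Hypothetical property key, used here for illustration only. */
    static final String KEY = "appId.collect";

    /** Reads the mode from the properties, defaulting to LOG on absence or bad input. */
    static Mode modeFrom(Properties props) {
        String raw = props.getProperty(KEY, "log").trim().toUpperCase(Locale.ROOT);
        try {
            return Mode.valueOf(raw);
        } catch (IllegalArgumentException e) {
            return Mode.LOG;
        }
    }
}
```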



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1205251304

   > already
   
   
   
   > > @ruanwenjun Yeah, there may be a practicable solution; let's briefly talk about it.
   > > Before submitting a yarn job, the client first applies for the application context from the RM and gets the appId, which is then written into the NM's environment variables. We can use a Java agent to read it before the yarn job's JAR file is executed, and the agent program can also take taskInstanceId as input. However, where to store this mapping still needs further consideration.
   > > Please let me know if you have any good suggestions!
   > 
   > In fact, there is already an issue (#4025) that talks about using an agent to collect the appId, but I don't think it is a good way 😢. We would need to maintain an agent, and we might need to maintain different agent versions.
   
   I think there's no need to maintain different agent versions. For example, we can parse the appId from environment variables such as `APPLICATION_WEB_PROXY_BASE`. The `AM` of every yarn job maintains this environment variable; I've already verified it in Flink, Spark, Hive, MR, and Spark-SQL. The only difference is how to set the Java options, which can be defined in each type of task.
   
   So it seems that yarn jobs submitted by shell command can all get the appId this way. Anyway, there are some other design problems, like where to store the mapping, as mentioned in issue ([#4025](https://github.com/apache/dolphinscheduler/issues/4025)). I'll think about that carefully.
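
A small Java sketch of the parsing step described above. It assumes the variable holds a proxy path that contains the id, such as `/proxy/application_1659434528000_0042`; the class and method names are hypothetical, and the exact value format should be checked against the Hadoop version in use.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProxyBaseAppId {

    // YARN application ids have the form application_<clusterTimestamp>_<sequence>.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

    /** Extracts the first application id found in the value, if any. */
    static Optional<String> fromProxyBase(String proxyBase) {
        if (proxyBase == null) {
            return Optional.empty();
        }
        Matcher matcher = APP_ID.matcher(proxyBase);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Outside a YARN ApplicationMaster the variable is normally unset.
        String value = System.getenv("APPLICATION_WEB_PROXY_BASE");
        System.out.println(fromProxyBase(value).orElse("no application id found"));
    }
}
```

Returning an `Optional` keeps the caller honest about the case where the process is not running inside an `AM`, which is also where the beeline-style exceptions raised in this thread would show up.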



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1203682361

   Yeah, there may be a practicable solution; let's briefly talk about it.
   
   Before submitting a yarn job, the client first applies for the application context from the `RM` and gets the appId, which is then written into the `NM`'s environment variables. We can use a Java agent to read it before the yarn job's JAR file is executed, and the agent program can also take `taskInstanceId` as input. However, where to store this mapping still needs further consideration.
   
   Please let me know if you have any good suggestions!



[GitHub] [dolphinscheduler] gabrywu closed issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

gabrywu closed issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds
URL: https://github.com/apache/dolphinscheduler/issues/11262



[GitHub] [dolphinscheduler] ruanwenjun commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1211870492

   > > already
   > 
   > > > @ruanwenjun Yeah, there may be a practicable solution; let's briefly talk about it.
   > > > Before submitting a yarn job, the client first applies for the application context from the RM and gets the appId, which is then written into the NM's environment variables. We can use a Java agent to read it before the yarn job's JAR file is executed, and the agent program can also take taskInstanceId as input. However, where to store this mapping still needs further consideration.
   > > > Please let me know if you have any good suggestions!
   > > 
   > > 
   > > In fact, there is already an issue (#4025) that talks about using an agent to collect the appId, but I don't think it is a good way 😢. We would need to maintain an agent, and we might need to maintain different agent versions.
   > 
   > I think there's no need to maintain different agent versions. For example, we can parse the appId from environment variables such as `APPLICATION_WEB_PROXY_BASE`. The `AM` of every yarn job maintains this environment variable; I've already verified it in Flink, Spark, Hive, MR, and Spark-SQL. The only difference is how to set the Java options, which can be defined in each type of task.
   > 
   > So it seems that yarn jobs submitted by shell command can all get the appId this way. Anyway, there are some other design problems, like where to store the mapping, as mentioned in issue ([#4025](https://github.com/apache/dolphinscheduler/issues/4025)). I'll think about that carefully.
   
   You need to make sure the agent works for all yarn clients.



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1211935164

   > 
   
   @ruanwenjun I think most yarn clients can share the same agent, because in these clients AOP intercepts the function `submitApplication`. The exception is submitting a yarn job over a JDBC connection, as `beeline` does (hive server2, as mentioned in issue (https://github.com/apache/dolphinscheduler/issues/4025)). However, beeline may create an external JDBC connection, and we cannot kill an external yarn job, right? So, if we set these special cases aside, we can use the same agent for all other yarn clients.



[GitHub] [dolphinscheduler] rickchengx commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

rickchengx commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1255845285

   Hi, @Radeity @ruanwenjun 
   
   I agree that the current way of getting the yarn application id from the log is not elegant.
   Just for discussion, there is another way to get `yarn application id` as below:
   
   We can put some **unique tags** on tasks submitted from DS to yarn. E.g., for spark tasks, we can add the configuration `--conf spark.yarn.tags=some_unique_tag`.
   After the task is submitted, DS can query the corresponding yarn application id (or other info) through this unique tag.
   
   What do you think? Any comments or discussions are welcome.



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1211805872

   @ruanwenjun 
   Hi, I want to ask a possibly dumb question. On worker failover, the logic in the function `killYarnJob` is to send a view-log request to the worker and then parse the result. However, the worker is just the client that submits the yarn job, and a worker failover does not automatically kill the submitted yarn jobs. So when the worker has failed over, how can it respond with the log info?
   



[GitHub] [dolphinscheduler] ruanwenjun commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1211867162

   > @ruanwenjun Hi, I want to ask a possibly dumb question. On worker failover, the logic in the function `killYarnJob` is to send a view-log request to the worker and then parse the result. However, the worker is just the client that submits the yarn job, and a worker failover does not automatically kill the submitted yarn jobs. So when the worker has failed over, how can it respond with the log info?
   > 
   > Sorry, I don't have a production environment, so I'm not sure whether it's a bug or whether I've misunderstood it.
   
   This is a legacy issue: in the past, a LogServer was deployed on the worker's machine.



[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #11262: [Improvement][Task] Improve way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1202368549

   Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to the channel `#troubleshooting`.



[GitHub] [dolphinscheduler] ruanwenjun commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1203569535

   Do you have any good ideas? AFAIK, we can use the xx task SDK to submit the task and get the appId from the SDK, so we don't need to parse it from the log. Or we can optimize the current `parse` method to avoid OOM.
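
As a rough illustration of the second option, a log scan can read one line at a time instead of loading the whole file. This Java sketch is not the current DS `parse` method; the class `StreamingAppIdScanner` and the sample log lines are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamingAppIdScanner {

    // YARN application ids have the form application_<clusterTimestamp>_<sequence>.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

    /**
     * Collects distinct application ids in order of first appearance.
     * Only the current line and the ids found so far are held in memory.
     */
    static Set<String> collectAppIds(Reader log) {
        Set<String> appIds = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(log)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = APP_ID.matcher(line);
                while (matcher.find()) {
                    appIds.add(matcher.group());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return appIds;
    }

    public static void main(String[] args) {
        String sample = "INFO Submitted application application_1659434528000_0042\n"
                + "INFO tracking URL: http://rm:8088/proxy/application_1659434528000_0042/\n"
                + "INFO Submitted application application_1659434528000_0043\n";
        // prints [application_1659434528000_0042, application_1659434528000_0043]
        System.out.println(collectAppIds(new StringReader(sample)));
    }
}
```

In practice the `Reader` would wrap the task log file. This bounds memory use but still reads the whole log, so it addresses the OOM risk rather than the inefficiency raised in the issue description.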



[GitHub] [dolphinscheduler] ruanwenjun commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1203709389

   > @ruanwenjun Yeah, there may be a practicable solution; let's briefly talk about it.
   > 
   > Before submitting a yarn job, the client first applies for the application context from the RM and gets the appId, which is then written into the NM's environment variables. We can use a Java agent to read it before the yarn job's JAR file is executed, and the agent program can also take taskInstanceId as input. However, where to store this mapping still needs further consideration.
   > 
   > Please let me know if you have any good suggestions!
   
   In fact, there is already an issue that talks about using an agent to collect the appId, but I don't think it is a good way 😢



[GitHub] [dolphinscheduler] Radeity commented on issue #11262: [Improvement][Task] Improved way to collect yarn job's appIds

Posted by GitBox <gi...@apache.org>.

Radeity commented on issue #11262:
URL: https://github.com/apache/dolphinscheduler/issues/11262#issuecomment-1255877159

   Hi, Rick
   
   First, thanks for your idea!
   
   However, I think this approach has two problems:
   1. Users may create a ShellTask and submit more than one yarn job from the
   command line, where it is hard to add configuration.
   2. The AOP approach simply fetches the applicationId and writes it into the
   appInfo.log file. I think that may be more efficient than querying it
   through a unique tag. In fact, I don't quite understand how your idea
   works. Would you explain more about it?
   
   Best wishes
   
   rickchengx ***@***.***> wrote on Friday, September 23, 2022 at 14:35:
   
   > Hi, @Radeity <https://github.com/Radeity> @ruanwenjun
   > <https://github.com/ruanwenjun>
   >
   > I agree that the current way of getting the yarn application id from the
   > log is not elegant.
   > Just for discussion, there is another way to get yarn application id as
   > below:
   >
   > We can put some *unique tags* on tasks submitted from DS to yarn. E.g.,
   > for spark tasks, we can add the configuration --conf
   > spark.yarn.tags=some_unique_tag.
   > After the task is submitted, DS can query the corresponding yarn
   > application id (or other info) through this unique tag.
   >
   > What do you think? Any comments or discussions are welcome.
   >
   

