Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/09/14 03:38:46 UTC

[GitHub] [dolphinscheduler] SbloodyS commented on issue #11652: [Feature] DS can support task running on remote host, not just worker server.

SbloodyS commented on issue #11652:
URL: https://github.com/apache/dolphinscheduler/issues/11652#issuecomment-1246191405

   > 6. Add **taskServerInfo** (just contain ip, user, password, name) entity in dolphinscheduler-task-plugin model. And add taskServerInfo field in TaskExecutionContext.
   
   I think it's better not to use username/password authentication in SSH, since it carries security risks. Using a PEM key file or authorized_keys in SSH is a more secure way.
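   As a minimal sketch of what key-based auth could look like, assuming JSch as the SSH client library (my assumption; the host names and paths below are illustrative):
   
   ```java
   import com.jcraft.jsch.JSch;
   import com.jcraft.jsch.Session;
   
   public class KeyBasedSsh {
       public static void main(String[] args) throws Exception {
           JSch jsch = new JSch();
           // Authenticate with a private key file; no password is stored in DS
           jsch.addIdentity("/home/ds/.ssh/id_rsa");
           Session session = jsch.getSession("ds", "task-node-1", 22);
           // Demo only: in production, verify host keys against known_hosts
           session.setConfig("StrictHostKeyChecking", "no");
           session.connect();
           // ... open exec channels, run commands, etc. ...
           session.disconnect();
       }
   }
   ```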
   
   BTW, I have some different views on the implementation method.
   
   Currently, DS supports S3 and HDFS as the storage modes of the resource center. In the future, it may also support other object storage services, such as Alibaba Cloud OSS, Tencent Cloud COS, etc.
   
   1. In common usage scenarios, the masterServer/apiServer nodes usually do not have permission to use HDFS or S3; those permissions usually belong to the workerServer nodes. Using scp to transfer files to the task server would require granting those permissions on the user's masterServer/apiServer nodes as well. In addition, downloading files onto the masterServer/apiServer node and then scp-ing them to the task node wastes network and disk IO for large files or large numbers of small files, compared with letting the task node pull from storage directly (first sketch after this list).
   
   2. Using SSH to execute shell commands usually requires escaping a lot of special characters, with different rules for different task types (the second sketch after this list illustrates the quoting problem). I think this is a huge workload for subsequent maintenance.
   
   3. Using SSH means that the task's running status and running logs would have to be monitored by the masterServer. This may lead to high load on the masterServer node when the number of tasks is large.
   
   None of this is reasonable for users or maintainers.
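   To make point 1 concrete, here is a sketch of the task node pulling a resource straight from object storage with the AWS SDK for Java, so the masterServer/apiServer never relays the bytes (bucket, key, and paths are hypothetical):
   
   ```java
   import com.amazonaws.services.s3.AmazonS3;
   import com.amazonaws.services.s3.AmazonS3ClientBuilder;
   import com.amazonaws.services.s3.model.GetObjectRequest;
   import java.io.File;
   
   public class DirectResourceDownload {
       public static void main(String[] args) {
           // Credentials come from the task node's own environment or instance
           // profile; no HDFS/S3 permissions are needed on masterServer/apiServer
           AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
           s3.getObject(
               new GetObjectRequest("dolphinscheduler-resources", "udf/my_job.jar"),
               new File("/tmp/dolphinscheduler/exec/my_job.jar"));
       }
   }
   ```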
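   To illustrate point 2, the usual POSIX trick for passing an arbitrary command through ssh is to single-quote it and rewrite every embedded single quote; since each task type embeds different special characters, this logic multiplies (the script here is just an example):
   
   ```java
   public class SshEscapingExample {
       public static void main(String[] args) {
           String script = "awk '{print $1}' data.txt | grep \"rate > 0.5\"";
           // Every embedded single quote must become the 4-char sequence '\''
           String escaped = "'" + script.replace("'", "'\\''") + "'";
           String remoteCmd = "ssh task-node " + escaped;
           System.out.println(remoteCmd);
           // ssh task-node 'awk '\''{print $1}'\'' data.txt | grep "rate > 0.5"'
       }
   }
   ```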
   
   Based on all the above issues, I suggest implementing this in the following steps.
   
   1. Create a task-level callback in the masterServer that gives each single task an endpoint to report its monitoring information to (first sketch below).
   
   2. For each task, create an executable (a jar package, or a binary compiled from Golang or any other language) and transfer it to the task node for execution through asynchronous SSH. This executable contains both the actual execution of the task and the reporting of monitoring information from the task back to the master (second sketch below).
   
   3. After the task finishes, the masterServer deletes the task's executable file through SSH or any other means as cleanup (third sketch below).
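   For step 1, a rough sketch of the task-level callback on the masterServer; the port, path, and payload are hypothetical, and a real implementation would more likely reuse DS's existing RPC channel than raw HTTP:
   
   ```java
   import com.sun.net.httpserver.HttpServer;
   import java.io.InputStream;
   import java.net.InetSocketAddress;
   import java.nio.charset.StandardCharsets;
   
   public class TaskCallbackServer {
       public static void main(String[] args) throws Exception {
           HttpServer server = HttpServer.create(new InetSocketAddress(12346), 0);
           // Each remote task POSTs its status/logs here, keyed by task instance id
           server.createContext("/task/callback", exchange -> {
               try (InputStream in = exchange.getRequestBody()) {
                   String report = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                   // e.g. {"taskInstanceId":42,"status":"RUNNING","log":"..."}
                   System.out.println("task report: " + report);
               }
               exchange.sendResponseHeaders(200, -1);
               exchange.close();
           });
           server.start();
       }
   }
   ```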
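   For step 2, a sketch of the executable shipped to the task node: it runs the real task command locally (so nothing needs escaping over SSH) and reports the result to the master's callback. The URL and JSON shape match the hypothetical callback sketch above:
   
   ```java
   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;
   
   public class RemoteTaskRunner {
       // args[0] = task command, args[1] = task instance id (hypothetical contract)
       public static void main(String[] args) throws Exception {
           Process p = new ProcessBuilder("bash", "-c", args[0]).inheritIO().start();
           int exitCode = p.waitFor();
   
           String report = String.format(
               "{\"taskInstanceId\":%s,\"status\":\"%s\"}",
               args[1], exitCode == 0 ? "SUCCESS" : "FAILURE");
           HttpClient.newHttpClient().send(
               HttpRequest.newBuilder(URI.create("http://master-host:12346/task/callback"))
                   .header("Content-Type", "application/json")
                   .POST(HttpRequest.BodyPublishers.ofString(report))
                   .build(),
               HttpResponse.BodyHandlers.discarding());
           System.exit(exitCode);
       }
   }
   ```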
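   And for step 3, cleanup over an SSH exec channel, again sketched with JSch (hosts and paths are illustrative):
   
   ```java
   import com.jcraft.jsch.ChannelExec;
   import com.jcraft.jsch.JSch;
   import com.jcraft.jsch.Session;
   
   public class RemoteCleanup {
       public static void main(String[] args) throws Exception {
           JSch jsch = new JSch();
           jsch.addIdentity("/home/ds/.ssh/id_rsa");
           Session session = jsch.getSession("ds", "task-node-1", 22);
           session.setConfig("StrictHostKeyChecking", "no");
           session.connect();
           ChannelExec channel = (ChannelExec) session.openChannel("exec");
           // Remove the shipped executable once the task has finished
           channel.setCommand("rm -f /tmp/dolphinscheduler/remote-task-42.jar");
           channel.connect();
           while (!channel.isClosed()) {
               Thread.sleep(100);
           }
           channel.disconnect();
           session.disconnect();
       }
   }
   ```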
   
   In this way, all task types can be implemented seamlessly with high performance, and the task content can be executed without any escaping. It also reduces the monitoring load on the master, which is more reasonable for distributed processing.
   
   These are my humble opinions. If you have any questions, please let me know. @DarkAssassinator 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org