Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/08 05:40:40 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Ted-Jiang opened a new issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Enable periodic cleanup of work_dir directories in the Ballista executor. This introduces 3 args:
   `executor_cleanup_enable`: Enable periodic cleanup of work_dir directories.
   `executor_cleanup_interval`: Controls the interval, in seconds, at which the worker cleans up old job dirs on the local machine.
   `executor_cleanup_ttl`: Number of seconds to retain a job's work_dir on each executor. This is a Time To Live and should depend on the amount of available disk space you have.
   
   **Describe the solution you'd like**
   The executor periodically spawns a task to clean work_dir: if none of the files in a `job_dir` have been modified in the last `executor_cleanup_ttl` seconds, that `job_dir` is deleted.
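
The rule described above could be sketched roughly as follows. This is only an illustration using the Rust standard library, not the Ballista implementation: the function names `has_recent_file` and `clean_work_dir` are hypothetical, and the sketch assumes every directory directly under `work_dir` is a job dir.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Hypothetical helper: returns true if any file under `dir` (searched
/// recursively) was modified less than `ttl` ago.
fn has_recent_file(dir: &Path, ttl: Duration, now: SystemTime) -> io::Result<bool> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            if has_recent_file(&entry.path(), ttl, now)? {
                return Ok(true);
            }
        } else {
            // A modification time in the future (clock skew) is treated as age zero,
            // so the file counts as recent and the job dir is kept.
            let age = now
                .duration_since(meta.modified()?)
                .unwrap_or(Duration::ZERO);
            if age < ttl {
                return Ok(true);
            }
        }
    }
    Ok(false)
}

/// Hypothetical helper: removes every job dir directly under `work_dir` in which
/// no file was modified within `ttl`. Returns the number of job dirs removed.
/// Note that an empty job dir has no recent files, so it is removed as well.
fn clean_work_dir(work_dir: &Path, ttl: Duration) -> io::Result<usize> {
    let now = SystemTime::now();
    let mut removed = 0;
    for entry in fs::read_dir(work_dir)? {
        let entry = entry?;
        let job_dir = entry.path();
        if entry.metadata()?.is_dir() && !has_recent_file(&job_dir, ttl, now)? {
            fs::remove_dir_all(&job_dir)?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

Under these assumptions, an executor with `executor_cleanup_enable` set would call something like `clean_work_dir(work_dir, Duration::from_secs(executor_cleanup_ttl))` from a background task once every `executor_cleanup_interval` seconds.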
   
   **Describe alternatives you've considered**
   The scheduler sends an RPC call to delete files when a job is done.
   
   **Additional context**
   #1662
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032242109


   On top of a background GC task, would it make sense to also clean up job dirs on job completion preemptively?





[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032250884


   https://github.com/apache/arrow-datafusion/blob/09c67d5af32aee107e87b9ddb93226707ccaa4fb/ballista/rust/core/src/execution_plans/shuffle_writer.rs#L152-L154





[GitHub] [arrow-datafusion] alamb closed issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780


   





[GitHub] [arrow-datafusion] Ted-Jiang removed a comment on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang removed a comment on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032254580


   > preemptively
   @houqp 
   Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1 while stage 3 is running?





[GitHub] [arrow-datafusion] mingmwang commented on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1034810731


   > > preemptively
   > > @houqp
   > > Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1 while stage 3 is running?
   
   IMO, when a SQL query is finished, all the intermediate shuffle data can be cleared except for the result data.





[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032249175


   > On top of a background GC task, would it make sense to also clean up job dirs on job completion preemptively?
   
   I think the job dir is under work_dir.





[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang edited a comment on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032254580


   > preemptively
   @houqp 
   Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1 while stage 3 is running?





[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032254580


   > preemptively
   
   Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1 while stage 3 is running?






[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang edited a comment on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032249175


   > On top of a background GC task, would it make sense to also clean up job dirs on job completion preemptively?
   
   I think the job dir is under work_dir, like:
    work_dir -> job_dirs(job_id) -> stage_dirs(stage_id) -> shuffle data
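
As a rough illustration of that layout, the paths could be built as below. The helper names `job_dir` and `stage_dir` are hypothetical and only mirror the nesting described in this comment; the real shuffle writer may add further levels (for example per-partition directories).

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: the per-job directory, i.e. the unit that a cleanup
/// task would delete as a whole.
fn job_dir(work_dir: &Path, job_id: &str) -> PathBuf {
    work_dir.join(job_id)
}

/// Hypothetical helper: the directory holding the shuffle data of one stage,
/// nested under the job directory.
fn stage_dir(work_dir: &Path, job_id: &str, stage_id: usize) -> PathBuf {
    job_dir(work_dir, job_id).join(stage_id.to_string())
}
```

Because every stage dir lives under its job dir, removing a job dir also removes the shuffle data of all its stages.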





[GitHub] [arrow-datafusion] Ted-Jiang removed a comment on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang removed a comment on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1032249175









[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1780: Enable periodic cleanup of work_dir directories in ballista executor

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1780:
URL: https://github.com/apache/arrow-datafusion/issues/1780#issuecomment-1039826718


   > > > preemptively
   > > > @houqp
   > > > Sorry for my confusion. Do you mean that if a job has 3 stages, we can delete stage 1 while stage 3 is running?
   > 
   > IMO, when a SQL query is finished, all the intermediate shuffle data can be cleared except for the result data.
   
   @houqp @mingmwang That sounds very reasonable. I think this will handle some error cases, for robustness.
   IMHO, we should keep both of them and create a separate issue to capture this as a future improvement (maybe after separating shuffle data and result data).




