You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/07 06:23:48 UTC

[GitHub] [arrow-datafusion] yahoNanJing opened a new pull request #1935: Introduce query stage scheduler

yahoNanJing opened a new pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1934.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Details see #1934.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Put the stage generation logic into the channel processing.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1935: Introduce Ballista query stage scheduler

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065867302


   > 
   
   
   
   > Running some Ballista integration testing as part of CI is a long standing request and I think be a major benefit / improvement now that Ballista development is driving forward. @Ted-Jiang if you have time to look into that it would be great
   > 
   > #321
   > 
   > cc @andygrove
   
   Sure, i will check 😊, Maybe add more test case in IT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064118121


   > For work like this in the future (large refactorings broken into multiple PRs which depend on each other) it might be a good idea to do something similar to the arrow2 branch. That is, we can create a feature branch for the overall refactoring and raise the incremental work as PRs to the feature branch. It is a little more annoying to manage but it makes the individual PRs much easier to review. This basically approximates a "stacked diffs" (https://kurtisnusbaum.medium.com/stacked-diffs-keeping-phabricator-diffs-small-d9964f4dcfa6) model.
   
   I actually prefer to avoid large feature branches (like `arrow2` if possible) -- in my experience they are hard to maintain and collect conflicts quickly. Getting the changes into master as soon as possible I have found better over the long ter. 
   
   I think the sequence of PRs from @yahoNanJing  is actually a pretty good way (I think they are basically a stacked diff). I especially like keeping changes that "moving code around, but keeping the behavior the same" into their own PRs as I think they are easy to review and we can merge them quickly. In this case, I am sorry it has taken some time to get the in -- as you can see we were a little bottlenecked on review capacity with an influx of contributors. 
   
   I hope this bottleneck is reduced over time
   
   The downside is that when you end up with 4-5 PRs chained together the overall goal sometimes is harder to see. I have found making a single tracking issue that explains the overall plan and lists the individual steps works well
   
   Like
   - [ ] Step 1: Refactor XX (link to PR)
   - [ ] Step 2: Refactor YY  (link to PR)
   - [ ] Step 3: Refactor ZZ (link to PR)
   - [ ] Step 4: Change behavior (link to PR)
   
   ```
   - [ ] Step 1: Refactor XX (link to PR)
   - [ ] Step 2: Refactor YY  (link to PR)
   - [ ] Step 3: Refactor ZZ (link to PR)
   - [ ] Step 4: Change behavior (link to PR)
   ```
   
   I don't think there is any reason to have a 1-to-1 correspondence between issue and PR if the PRs are chained together. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on pull request #1935: Introduce Ballista query stage scheduler

Posted by GitBox <gi...@apache.org>.
Ted-Jiang edited a comment on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065867302


   > 
   
   
   
   > Running some Ballista integration testing as part of CI is a long standing request and I think be a major benefit / improvement now that Ballista development is driving forward. @Ted-Jiang if you have time to look into that it would be great
   > 
   > #321
   > 
   > cc @andygrove
   
   @alamb Sure, i will check 😊, Maybe add more test case in IT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065865685


   Running some Ballista integration testing as part of CI is a long standing request and I think be a major benefit / improvement now that Ballista development is driving forward. @Ted-Jiang  if you have time to look into that it would be great
   
   https://github.com/apache/arrow-datafusion/issues/321
   
   cc @andygrove 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1062992746


   It's a little hard to tell from the diff which changes are specific to this PR vs. part of previous refactoring PRs but if I understand correctly this doesn't change the existing scheduler but just moves the scheduling logic to an encapsulated `QueryStageSchduler` that uses an event-based interface? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] realno edited a comment on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
realno edited a comment on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064660949


   > These PRs basically does not change behaviors. Therefore, I didn't add additional unit tests.
   
   Yes, unit test for this PR is good. My comment was more general to Ballista overall, sorry didn't make that clear :) 
   
   > For the pipeline, I'm not so sure whether the Ballista integration test be added or not.
   
   I checked the pipeline definition the integration test is not part of it. I think at some point we should include it. @andygrove @alamb should we do that now? I guess one concern might be if we want it to block the whole DataFusion pipeline


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #1935: Introduce Ballista query stage scheduler

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065878303


   Thanks @alamb. If possible, our team will be very appreciate to present the scheduler changes in one of the meetups. However, there's still several PR under review, like #1983, #1842, #1862. It's better to finish all of them before the presentation. Then we can have a complete benchmark results and present the whole story. πŸ˜„ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1061288954


   I will try and review this tomorrow or Wednesday. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1061158388


   cc @thinkharderdev @mingmwang @realno @matthewmturner and @liukun4515 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing edited a comment on pull request #1935: Introduce Ballista query stage scheduler

Posted by GitBox <gi...@apache.org>.
yahoNanJing edited a comment on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065878303


   Thanks @alamb. Our team will be very appreciated for the chance to present the scheduler changes in one of the meetups. However, there's still several PR under review, like #1983, #1842, #1862. It's better to finish all of them before the presentation. Then we can have a complete benchmark results and present the whole story. πŸ˜„ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1935: Introduce Ballista query stage scheduler

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065939050


   >  It's better to finish all of them before the presentation. Then we can have a complete benchmark results and present the whole story. πŸ˜„
   
   Sounds like a great plan to me πŸ‘ 
   
   If you don't mind me asking, who else is on your team? Are you working with @Ted-Jiang  @liukun4515  and @mingmwang?  (I am just trying to keep all the people and groups somewhat straight in my head)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1063780420


   > Hi @thinkharderdev, you are right. For #1908, #1909, #1910 and #1934, these 4 issues are mainly focusing on the refactoring and changing the scheduler server state changing to be event-based.
   > 
   > While for #1936, I'll introduce stage and the state machine for it. With the state machine of the job stage, it will be much easier and more efficient to do error handling, speculative execution. In general, event-based processing is very suitable for handling the state machine.
   > 
   > However, the PR for #1936 is really complicated. I'm still working on it.
   
   Cool. Looks good to me in general but would like to see the other PRs merged before approving. 
   
   For work like this in the future (large refactorings broken into multiple PRs which depend on each other) it might be a good idea to do something similar to the `arrow2` branch. That is, we can create a feature branch for the overall refactoring and raise the incremental work as PRs to the feature branch. It is a little more annoying to manage but it makes the individual PRs much easier to review. This basically approximates a "stacked diffs" (https://kurtisnusbaum.medium.com/stacked-diffs-keeping-phabricator-diffs-small-d9964f4dcfa6) model. 
   
   @alamb @alamb @yjshen @realno @matthewmturner and @liukun4515 Does ^^ make sense?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1065866077


   Thanks @yahoNanJing  -- great work so far. I wonder if you might be interested in presenting (perhaps at one of the Rust meetups) some of the work you have done with the scheduler? I know the original design was listed on https://github.com/apache/arrow-datafusion/issues/321 but I am not sure if it has evolved during the implementation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] realno commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
realno commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064660949


   > These PRs basically does not change behaviors. Therefore, I didn't add additional unit tests.
   Yes, unit test for this PR is good. My comment was more general to Ballista overall, sorry didn't make that clear :) 
   
   > For the pipeline, I'm not so sure whether the Ballista integration test be added or not.
   I checked the pipeline definition the integration test is not part of it. I think at some point we should include it. @andygrove @alamb should we do that now? I guess one concern might be if we want it to block the whole DataFusion pipeline


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1063015346


   Hi @thinkharderdev, you are right. For #1908, #1909, #1910 and #1934, these 4 issues are mainly focusing on the refactoring and changing the scheduler server state changing to be event-based. 
   
   While for #1936, I'll introduce stage and the state machine for it. With the state machine of the job stage, it will be much easier and more efficient to do error handling, speculative execution. In general, event-based processing is very suitable for handling the state machine.
   
   However, the PR for #1936 is really complicated. I'm still working on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064404042


   Thanks @alamb and @thinkharderdev. Actually, I prefer step by step. Otherwise, like arrow2, it may hard to be merged.
   
   For the distributed task scheduling, there's still two PR left, #1935 and #1983. Actually, the most important PR is #1983. All of other PRs in this chain are for finally achieving #1983.
   
   I just finished it today. And I will do some load testing with my team tomorrow to see how the performance is improved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064896488


   
   > I just finished it today. And I will do some load testing with my team tomorrow to see how the performance is improved.
   > 
   > In the future, another important feature is to add task error recovering. I think it will be much easier to work it on this new task scheduling framework.
   
   ​​I think after see the big improvement in load testing #1530, Itβ€˜s worth doing such a big change! I will be glad to add UT for this change(if needed).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing edited a comment on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
yahoNanJing edited a comment on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064404042


   Thanks @alamb and @thinkharderdev. Actually, I prefer step by step. Otherwise, like arrow2, it may hard to be merged.
   
   For the distributed task scheduling, there's still two PR left, #1935 and #1983. Actually, the most important PR is #1983. All of other PRs in this chain are for finally achieving #1983.
   
   I just finished it today. And I will do some load testing with my team tomorrow to see how the performance is improved.
   
   In the future, another important feature is to add task error recovering. I think it will be much easier to work it on this new task scheduling framework.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] realno commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
realno commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064562188


   The PR looks good to me, thanks @yahoNanJing!
   
   One question I have is about testing, there will be a lot of major changes to Ballista, do we have good test coverage to ensure things are working as expected? 
   
   Here is a related PR https://github.com/apache/arrow-datafusion/pull/1982, is Ballista integration test currently part of the pipeline? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1064626001


   Thanks @realno. These PRs basically does not change behaviors. Therefore, I didn't add additional unit tests.
   
   However, for #1983, I have added unit test for both pull-staged-based task scheduling and push-staged-based task scheduling.
   
   For the pipeline, I'm not so sure whether the Ballista integration test be added or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] realno commented on pull request #1935: Introduce query stage scheduler

Posted by GitBox <gi...@apache.org>.
realno commented on pull request #1935:
URL: https://github.com/apache/arrow-datafusion/pull/1935#issuecomment-1061184832


   I will take a look later this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org