You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "andygrove (via GitHub)" <gi...@apache.org> on 2023/02/26 16:01:07 UTC

[GitHub] [arrow-ballista] andygrove opened a new pull request, #687: Add ShuffleWriter trait

andygrove opened a new pull request, #687:
URL: https://github.com/apache/arrow-ballista/pull/687

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   N/A
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Add an abstraction for shuffle writer as a step towards supporting pluggable execution engines. See https://github.com/apache/arrow-ballista/pull/685 for more context on how I intend to use this.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   - Add `ShuffleWriter` trait
   - Implement `ShuffleWriter` trait in `ShuffleWriterExec`
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   No
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] tustvold commented on a diff in pull request #687: Add ShuffleWriter trait

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#discussion_r1118605224


##########
ballista/core/src/execution_plans/shuffle_writer.rs:
##########
@@ -139,7 +154,7 @@ impl ShuffleWriterExec {
         self.shuffle_output_partitioning.as_ref()
     }
 
-    pub fn execute_shuffle_write(
+    fn execute_shuffle_write_internal(

Review Comment:
   So there are two changes here
   
   * [`async_trait`] will box the future, so polling the future now goes through a dyn dispatch. This probably doesn't matter given this method is likely performing non-trivial IO, and async tends to result in poor codegen regardless
   * The returned future borrows self, i.e. can't outlive the borrow of self. This could matter, but given the pervasive use of `Arc` and `async move` it is probably easy enough to work around



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on a diff in pull request #687: Add ExecutionEngine abstraction

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on code in PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#discussion_r1120311299


##########
ballista/core/src/execution_plans/shuffle_writer.rs:
##########
@@ -288,6 +305,22 @@ impl ShuffleWriterExec {
     }
 }
 
+#[async_trait]
+impl ShuffleWriter for ShuffleWriterExec {

Review Comment:
   @thinkharderdev I have renamed the trait and made some other naming changes. I also added the ExecutionEngine abstraction in this PR. PTAL when you have time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on a diff in pull request #687: Add ShuffleWriter trait

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on code in PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#discussion_r1118111745


##########
ballista/core/src/execution_plans/shuffle_writer.rs:
##########
@@ -139,7 +154,7 @@ impl ShuffleWriterExec {
         self.shuffle_output_partitioning.as_ref()
     }
 
-    pub fn execute_shuffle_write(
+    fn execute_shuffle_write_internal(

Review Comment:
   @tustvold You changed the signature of this method in https://github.com/apache/arrow-ballista/commit/b7bb2cfba13cc04a08c2f687102dd14a8dedc7b6 to change it from `async` to return an `impl Future`, and I am now calling it from an `async` method so just wanted to check that I am not undoing anything here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove merged pull request #687: Add ExecutionEngine abstraction

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove merged PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on a diff in pull request #687: Add ShuffleWriter trait

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on code in PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#discussion_r1118953966


##########
ballista/core/src/execution_plans/shuffle_writer.rs:
##########
@@ -288,6 +305,22 @@ impl ShuffleWriterExec {
     }
 }
 
+#[async_trait]
+impl ShuffleWriter for ShuffleWriterExec {

Review Comment:
   That's a good point. Maybe want I want for now is just an `ExecutionEngine` trait with an `execute` method that returns the metadata with shuffle file locations. I will give this some more thought.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on a diff in pull request #687: Add ShuffleWriter trait

Posted by "thinkharderdev (via GitHub)" <gi...@apache.org>.
thinkharderdev commented on code in PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#discussion_r1118642954


##########
ballista/core/src/execution_plans/shuffle_writer.rs:
##########
@@ -288,6 +305,22 @@ impl ShuffleWriterExec {
     }
 }
 
+#[async_trait]
+impl ShuffleWriter for ShuffleWriterExec {

Review Comment:
   This seems inverted to me. Shouldn't `ShuffleWriterExec` have a `Arc<dyn ShuffleWriter>` instead of impl the trait? 
   
   I know this is a step toward the pluggable execution engine, but also thinking about how this fits with allowing for different shuffle methods (write to disk, write to object store, stream to network, etc.) and seems like we would need a second layer of indirection for that. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] tustvold commented on a diff in pull request #687: Add ShuffleWriter trait

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#discussion_r1118605224


##########
ballista/core/src/execution_plans/shuffle_writer.rs:
##########
@@ -139,7 +154,7 @@ impl ShuffleWriterExec {
         self.shuffle_output_partitioning.as_ref()
     }
 
-    pub fn execute_shuffle_write(
+    fn execute_shuffle_write_internal(

Review Comment:
   So there are two changes here
   
   * [`async_trait`] will box the future, so polling the future now goes through a dyn dispatch. This probably doesn't matter given this method is likely performing non-trivial IO
   * The returned future borrows self, i.e. can't outlive the borrow of self. This could matter, but given the pervasive use of `Arc` and `async move` it is probably easy enough to work around



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on pull request #687: Add ExecutionEngine abstraction

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on PR #687:
URL: https://github.com/apache/arrow-ballista/pull/687#issuecomment-1450281168

   @Dandandan I know you already approved this PR, but it has now changed significantly. Could you take another look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org