
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/10 14:51:36 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #538: Cleanup Repartition Exec code

alamb opened a new pull request #538:
URL: https://github.com/apache/arrow-datafusion/pull/538


   # Rationale for this change
   The body of `RepartitionExec::execute` is long and deeply indented, and its metrics-related code obscures how it works, in my opinion.
   
   # What changes are included in this PR?
   
   As suggested by @tustvold in https://github.com/apache/arrow-datafusion/pull/521#discussion_r646611109, this PR attempts to make the code clearer and the error conditions easier to reason about.
   
   Changes:
   1. Factored the body of the repartition into its own async function
   2. Grouped the metrics into a `RepartitionMetrics` struct for convenience
   3. Refactored repeated code into `SQLMetric::add_elapsed`, reducing duplication
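
To illustrate changes 2 and 3, here is a minimal sketch of how grouped metrics and an `add_elapsed` helper can fit together. It is not the code from this PR: `Metric` is a hypothetical stand-in for `SQLMetric`, and the field names and the `timed` helper are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Instant;

/// Hypothetical stand-in for `SQLMetric`: a single shared counter.
#[derive(Debug, Default)]
struct Metric {
    value: AtomicUsize,
}

impl Metric {
    /// Add `n` to the counter.
    fn add(&self, n: usize) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }

    /// Add the time elapsed since `start`, in nanoseconds.
    /// Replaces a repeated "compute elapsed, then add" pattern at each call site.
    fn add_elapsed(&self, start: Instant) {
        self.add(start.elapsed().as_nanos() as usize);
    }

    /// Read the current counter value.
    fn value(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

/// Grouping the metrics lets them be cloned and passed around as one value.
/// The field names here are assumptions, not taken from the PR.
#[derive(Debug, Default, Clone)]
struct RepartitionMetrics {
    fetch_nanos: Arc<Metric>,
    repart_nanos: Arc<Metric>,
    send_nanos: Arc<Metric>,
}

/// Hypothetical helper: run `work` and charge its wall-clock time to `metric`.
fn timed<T>(metric: &Metric, work: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = work();
    metric.add_elapsed(start);
    out
}
```

A call such as `timed(&metrics.fetch_nanos, || fetch_next_batch())` (where `fetch_next_batch` is a placeholder) would then keep the timing bookkeeping out of the main control flow.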
   
   I think the metrics could still be improved, and I hope to work on that at a later date.
   
   
   
   # Are there any user-facing changes?
   No
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #538: Cleanup Repartition Exec code

Posted by GitBox <gi...@apache.org>.

alamb merged pull request #538:
URL: https://github.com/apache/arrow-datafusion/pull/538


   



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #538: Cleanup Repartition Exec code

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #538:
URL: https://github.com/apache/arrow-datafusion/pull/538#discussion_r649259579



##########
File path: datafusion/src/physical_plan/repartition.rs
##########
@@ -132,132 +160,33 @@ impl ExecutionPlan for RepartitionExec {
                 // being read yet. This may cause high memory usage if the next operator is
                 // reading output partitions in order rather than concurrently. One workaround
                 // for this would be to add spill-to-disk capabilities.
-                let (sender, receiver) = tokio::sync::mpsc::unbounded_channel::<
-                    Option<ArrowResult<RecordBatch>>,
-                >();
+                let (sender, receiver) =
+                    mpsc::unbounded_channel::<Option<ArrowResult<RecordBatch>>>();
                 channels.insert(partition, (sender, receiver));
             }
             // Use fixed random state
             let random = ahash::RandomState::with_seeds(0, 0, 0, 0);
 
             // launch one async task per *input* partition
             for i in 0..num_input_partitions {
-                let random_state = random.clone();

Review comment:
       All of this code was moved to `pull_from_input`
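
For readers unfamiliar with the structure in the hunk above (one unbounded channel per output partition, one task per input partition), the following is a rough sketch of that shape. It uses standard-library threads and channels instead of tokio, plain `u64` values instead of `RecordBatch`, and `value % num_outputs` instead of hashing with a fixed `ahash` state. Treating `None` as an end-of-input marker is my reading of the `Option<...>` channel type, not something the diff states. This `pull_from_input` is a hypothetical stand-in, not the function from the PR.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical stand-in for `pull_from_input`: read one input partition and
/// route each value to an output partition.
fn pull_from_input(input: Vec<u64>, senders: Vec<Sender<Option<u64>>>) {
    let n = senders.len() as u64;
    for v in input {
        // Ignore send errors: the receiving side may already be gone.
        let _ = senders[(v % n) as usize].send(Some(v));
    }
    // Tell every output partition that this input is finished (assumed protocol).
    for s in &senders {
        let _ = s.send(None);
    }
}

/// Fan `inputs` out into `num_outputs` partitions, one worker per input.
fn repartition(inputs: Vec<Vec<u64>>, num_outputs: usize) -> Vec<Vec<u64>> {
    assert!(num_outputs > 0, "need at least one output partition");
    let num_inputs = inputs.len();

    // One unbounded channel per *output* partition.
    let mut senders = Vec::new();
    let mut receivers = Vec::new();
    for _ in 0..num_outputs {
        let (tx, rx) = channel::<Option<u64>>();
        senders.push(tx);
        receivers.push(rx);
    }

    // One worker per *input* partition.
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            let senders = senders.clone();
            thread::spawn(move || pull_from_input(input, senders))
        })
        .collect();
    drop(senders);

    // The channels are unbounded, so workers can finish before anything is
    // read; everything is buffered in memory until the outputs are drained.
    for h in handles {
        h.join().unwrap();
    }

    receivers
        .into_iter()
        .map(|rx| {
            let mut out = Vec::new();
            let mut done = 0;
            // Stop once every input has sent its end marker.
            while done < num_inputs {
                match rx.recv() {
                    Ok(Some(v)) => out.push(v),
                    Ok(None) => done += 1,
                    Err(_) => break,
                }
            }
            out.sort(); // arrival order across inputs is not deterministic
            out
        })
        .collect()
}
```

Because the channels are unbounded, the sketch shares the memory trade-off mentioned in the code comment in the hunk: data for an output partition accumulates until something reads it.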



