You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/21 18:05:11 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #2312: Don't sort batches during plan

tustvold opened a new pull request, #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312

   # Which issue does this PR close?
   
   Closes #1939.
   
    # Rationale for this change
   
   Currently SortExec performs the sort during the query plan
   
   # What changes are included in this PR?
   
   This makes the SortExec lazy
   
   # Are there any user-facing changes?
   
   SortExec no longer computes values during query planning
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#discussion_r855467677


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -779,17 +779,21 @@ impl ExecutionPlan for SortExec {
 
         debug!("End SortExec's input.execute for partition: {}", partition);
 
-        let result = do_sort(
-            input,
-            partition,
-            self.expr.clone(),
-            self.metrics_set.clone(),
-            context,
-        )
-        .await;
-
-        debug!("End SortExec::execute for partition {}", partition);
-        result
+        Ok(Box::pin(RecordBatchBoxStream::new(
+            self.schema(),
+            futures::stream::once(
+                do_sort(

Review Comment:
   A nicer fix would be to make ExternalSort stream, or less async, or some combination of the two, but in the short term this is a quick fix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#discussion_r855486761


##########
datafusion/core/src/physical_plan/stream.rs:
##########
@@ -73,3 +74,45 @@ impl RecordBatchStream for RecordBatchReceiverStream {
         self.schema.clone()
     }
 }
+
+/// Combines a [`BoxStream`] with [`SchemaRef`] implementing
+/// [`RecordBatchStream`] for the combination

Review Comment:
   I've updated it to no longer need a boxed stream, the trade-off is more pinning nonsense :sweat_smile:   PTAL 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#discussion_r855474275


##########
datafusion/core/src/physical_plan/stream.rs:
##########
@@ -73,3 +74,45 @@ impl RecordBatchStream for RecordBatchReceiverStream {
         self.schema.clone()
     }
 }
+
+/// Combines a [`BoxStream`] with [`SchemaRef`] implementing
+/// [`RecordBatchStream`] for the combination

Review Comment:
   ```suggestion
   /// Combines a [`BoxStream`] (created by calling `boxed` on a stream that produced `RecordBatch`es) 
   /// with a [`SchemaRef`] implementing [`RecordBatchStream`] for the combination
   ```



##########
datafusion/core/src/physical_plan/stream.rs:
##########
@@ -73,3 +74,45 @@ impl RecordBatchStream for RecordBatchReceiverStream {
         self.schema.clone()
     }
 }
+
+/// Combines a [`BoxStream`] with [`SchemaRef`] implementing
+/// [`RecordBatchStream`] for the combination

Review Comment:
   I find the whole `boxed` stream concept somewhat confusing, so I am trying to leave hints around for my future self



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold merged pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#issuecomment-1105669402

   > It's really just a trick to ensure a given field is consistently either s
   
   TIL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#issuecomment-1105630161

   > It'll have one-less stack allocation per stream, and one less dyn-dispatch per poll. Whether that matters 🤷‍♂️
   
   What I was getting at was that I don't know what sort of magic occurs as part of `pin_project` there may be more allocations I am not aware of. I am not worried, just opining


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#discussion_r855466582


##########
datafusion/core/src/physical_plan/stream.rs:
##########
@@ -73,3 +74,45 @@ impl RecordBatchStream for RecordBatchReceiverStream {
         self.schema.clone()
     }
 }
+
+/// Combines a [`BoxStream`] with [`SchemaRef`] implementing
+/// [`RecordBatchStream`] for the combination
+pub(crate) struct RecordBatchBoxStream {

Review Comment:
   I'm not a massive fan of this, as it results in double dyn-dispatch, but it is probably acceptable until such a time as we rework ExecutionPlan.
   
   I've put this alongside RecordBatchReceiverStream as it is likely generally useful, but I've kept it pub(crate) as I'm not sure how long this interface will be around for



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#issuecomment-1105650670

   It's really just a trick to ensure a given field is consistently either structurally or non-structurally pinned - https://doc.rust-lang.org/nightly/core/pin/index.html#projections-and-structural-pinning. 
   
   Unlike say async_trait, there isn't that much wizardry occuring under the hood, the generated code is actually relatively straightforward
   
   ```
   pub(crate) fn project<'__pin>(
       self: ::pin_project_lite::__private::Pin<&'__pin mut Self>,
   ) -> Projection<'__pin, S> {
       unsafe {
           let Self {
               schema, stream
           } = self.get_unchecked_mut();
           Projection {
               schema: schema,
               stream
               : ::pin_project_lite::__private::Pin::new_unchecked(stream
               ),
           }
       }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#discussion_r855490014


##########
datafusion/core/src/physical_plan/stream.rs:
##########
@@ -73,3 +74,52 @@ impl RecordBatchStream for RecordBatchReceiverStream {
         self.schema.clone()
     }
 }
+
+pin_project! {
+    /// Combines a [`Stream`] with a [`SchemaRef`] implementing
+    /// [`RecordBatchStream`] for the combination
+    pub(crate) struct RecordBatchStreamAdapter<S> {

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2312: Don't sort batches during plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2312:
URL: https://github.com/apache/arrow-datafusion/pull/2312#issuecomment-1105625089

   It'll have one-less stack allocation per stream, and one less dyn-dispatch per poll. Whether that matters 🤷‍♂️


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org