Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/05/02 10:58:50 UTC

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #6163: Adaptive in-memory sort (~2x faster) (#5879)

tustvold commented on code in PR #6163:
URL: https://github.com/apache/arrow-datafusion/pull/6163#discussion_r1182397091


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -115,47 +113,37 @@ impl ExternalSorter {
         }
     }
 
-    async fn insert_batch(
-        &mut self,
-        input: RecordBatch,
-        tracking_metrics: &MemTrackingMetrics,
-    ) -> Result<()> {
-        if input.num_rows() > 0 {
-            let size = batch_byte_size(&input);
-            if self.reservation.try_grow(size).is_err() {
+    /// Appends an unsorted [`RecordBatch`] to `in_mem_batches`
+    ///
+    /// Updates memory usage metrics, and possibly triggers spilling to disk
+    async fn insert_batch(&mut self, input: RecordBatch) -> Result<()> {
+        if input.num_rows() == 0 {
+            return Ok(());
+        }
+
+        let size = batch_byte_size(&input);
+        if self.reservation.try_grow(size).is_err() {
+            let before = self.reservation.size();
+            self.in_mem_sort().await?;
+            // Sorting may have freed memory, especially if fetch is not `None`
+            //
+            // As such we check again, and if the memory usage has dropped by
+            // a factor of 2, and we can allocate the necessary capacity,
+            // we don't spill
+            //
+            // The factor of 2 aims to avoid a degenerate case where the
+            // memory required for `fetch` is just under the memory available,
+            // causing repeated resorting of data
+            if self.reservation.size() > before / 2

Review Comment:
   > needs to be called again and return successfully
   
   It is called in the loop body following the spill
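
For readers following the thread, the decision described in the code comment above could be sketched roughly as below. This is only an illustration under stated assumptions: `Reservation` and `must_spill` are hypothetical stand-ins, not DataFusion's actual types, and the second half of the condition is inferred from the comment ("and we can allocate the necessary capacity") because the quoted diff cuts off after `before / 2`.

```rust
/// Hypothetical stand-in for a memory reservation (NOT DataFusion's type).
struct Reservation {
    used: usize,
    limit: usize,
}

impl Reservation {
    /// Bytes currently reserved.
    fn size(&self) -> usize {
        self.used
    }

    /// Try to reserve `bytes` more; fails if that would exceed the limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!("cannot reserve {bytes} more bytes"));
        }
        self.used += bytes;
        Ok(())
    }
}

/// Hypothetical helper: after an in-memory sort, decide whether a spill is
/// still required. `before` is the reservation size measured before the sort,
/// `size` is the byte size of the incoming batch.
///
/// A spill is needed if usage did not drop by at least a factor of 2, or if
/// the incoming batch still cannot be reserved. On `false`, the batch's
/// memory has already been reserved as a side effect of `try_grow`.
fn must_spill(reservation: &mut Reservation, before: usize, size: usize) -> bool {
    reservation.size() > before / 2 || reservation.try_grow(size).is_err()
}
```

The `||` short-circuits, so when usage has not halved the sketch never attempts the reservation and goes straight to the spill path. That is what keeps a `fetch` that sits just under the memory limit from causing the same data to be re-sorted on every incoming batch.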


