You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/05/18 15:59:27 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #6382: Account for memory usage in SortPreservingMerge (#5885)

tustvold opened a new pull request, #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5885
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#discussion_r1198076661


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -238,6 +247,9 @@ impl ExternalSorter {
             return Ok(());
         }
 
+        // Release the memory reserved for merge
+        self.merge_reservation.free();
+
         self.in_mem_batches = self
             .in_mem_sort_stream(self.metrics.baseline.intermediate())?

Review Comment:
   I double checked that `in_mem_sort_stream` correctly respects `self.reservation` 👍  



##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -55,6 +55,9 @@ use tempfile::NamedTempFile;
 use tokio::sync::mpsc::{Receiver, Sender};
 use tokio::task;
 
+/// How much memory to reserve for performing in-memory sorts

Review Comment:
   ```suggestion
   /// How much memory to reserve for performing in-memory sorts prior to spill
   ```



##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -55,6 +55,9 @@ use tempfile::NamedTempFile;
 use tokio::sync::mpsc::{Receiver, Sender};
 use tokio::task;
 
+/// How much memory to reserve for performing in-memory sorts
+const EXTERNAL_SORTER_MERGE_RESERVATION: usize = 10 * 1024 * 1024;

Review Comment:
   The problem with this approach is that even 10MB may not be enough to correctly merge the batches prior to spilling. So some queries that today would succeed (though exceed their memory limits) might fail.
   
   It seems to me better approaches (as follow on PRs) would be:
   1. Make this a config parameter so users can avoid the error by reserving more memory up front if needed
   2. teach SortExec how to write more (smaller) spill files if it doesn't have enough memory to merge the in memory batches. 
   
   However, given the behavior on master today is to simply ignore the reservation and exceed the memory limit this behavior seems better than before. 
    
   I suggest we merge this PR as is and file a follow on ticket for the improved behavior 



##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -94,8 +97,10 @@ struct ExternalSorter {
     expr: Arc<[PhysicalSortExpr]>,
     metrics: ExternalSorterMetrics,
     fetch: Option<usize>,
+    /// Reservation for in_mem_batches
     reservation: MemoryReservation,
-    partition_id: usize,
+    /// Reservation for in memory sorting of batches

Review Comment:
   ```suggestion
       /// Reservation for in memory sorting of batches, prior to spilling.
       /// Without this reservation, when the memory budget is exhausted
       /// it might not be possible to merge the in memory batches as part 
       /// of spilling.
   ```



##########
datafusion/execution/src/memory_pool/pool.rs:
##########
@@ -145,21 +145,21 @@ impl MemoryPool for FairSpillPool {
 
     fn unregister(&self, consumer: &MemoryConsumer) {
         if consumer.can_spill {
-            self.state.lock().num_spill -= 1;
+            self.state.lock().num_spill.checked_sub(1).unwrap();

Review Comment:
   Maybe it would be worth adding some unit tests to the `MemoryReservation` now given it is growing in sophistication



##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -29,6 +30,9 @@ pub struct RowCursor {
     num_rows: usize,
 
     rows: Rows,
+
+    #[allow(dead_code)]

Review Comment:
   I think it would help to note here in comments why the code needs to keep around a field that is never read (`dead_code`). I think it is to keep the reservation around long enough?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#discussion_r1198000591


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -115,6 +120,12 @@ impl ExternalSorter {
             .with_can_spill(true)
             .register(&runtime.memory_pool);
 
+        let mut merge_reservation =
+            MemoryConsumer::new(format!("ExternalSorterMerge[{partition_id}]"))
+                .register(&runtime.memory_pool);
+
+        merge_reservation.resize(EXTERNAL_SORTER_MERGE_RESERVATION);

Review Comment:
   I take it as a positive sign that this was required to make the spill tests pass, without this the merge would exceed the memory limit and fail



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#discussion_r1198003398


##########
datafusion/execution/src/memory_pool/pool.rs:
##########
@@ -145,21 +145,21 @@ impl MemoryPool for FairSpillPool {
 
     fn unregister(&self, consumer: &MemoryConsumer) {
         if consumer.can_spill {
-            self.state.lock().num_spill -= 1;
+            self.state.lock().num_spill.checked_sub(1).unwrap();

Review Comment:
   Drive by sanity check (as first version of `MemoryReservation::split` would unregister the same consumer multiple times :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#issuecomment-1583175485

   Converted to a draft as this PR is not ready to merge yet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#discussion_r1198003398


##########
datafusion/execution/src/memory_pool/pool.rs:
##########
@@ -145,21 +145,21 @@ impl MemoryPool for FairSpillPool {
 
     fn unregister(&self, consumer: &MemoryConsumer) {
         if consumer.can_spill {
-            self.state.lock().num_spill -= 1;
+            self.state.lock().num_spill.checked_sub(1).unwrap();

Review Comment:
   Drive by sanity check (as first version of `MemoryReservation::split` would unregister the same consumer multiple times) and the debug checks are the only reason I noticed :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#issuecomment-1654533642

   Ok, i really do plan to pick this code up tomorrow and work on it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#issuecomment-1658740939

   I have a new version of this code on https://github.com/apache/arrow-datafusion/pull/7130 that I am making progress


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#issuecomment-1561969508

   I plan to try and help this PR over the line in the next day or two


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#discussion_r1198002865


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -55,6 +55,9 @@ use tempfile::NamedTempFile;
 use tokio::sync::mpsc::{Receiver, Sender};
 use tokio::task;
 
+/// How much memory to reserve for performing in-memory sorts
+const EXTERNAL_SORTER_MERGE_RESERVATION: usize = 10 * 1024 * 1024;

Review Comment:
   I'm not a massive fan of this, but this somewhat patches around the issue that once we initiate a merge we can't then spill



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6382:
URL: https://github.com/apache/arrow-datafusion/pull/6382#issuecomment-1559017203

   We have found another cause of the memory use in IOx downstream, but I still think this PR is valuable. Once we sort out downstream we'll try and get this one polished up and ready to go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen closed pull request #6382: Account for memory usage in SortPreservingMerge (#5885)

Posted by "yjshen (via GitHub)" <gi...@apache.org>.
yjshen closed pull request #6382: Account for memory usage in SortPreservingMerge (#5885)
URL: https://github.com/apache/arrow-datafusion/pull/6382


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org