You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jonahgao (via GitHub)" <gi...@apache.org> on 2024/03/15 09:25:11 UTC

[PR] feat: track memory usage for recursive CTE [arrow-datafusion]

jonahgao opened a new pull request, #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619

   ## Which issue does this PR close?
   
   
   Closes #9554.
   
   ## Rationale for this change
   - Add a `MemoryReservation` to track the memory usage of the RecursiveQuery's buffer.
   - The `MemoryReservation` will move with the buffer into the `WorkTable`.
   - The `MemoryReservation` in the `WorkTable` will be freed at the start of next iteration.
   
   
   ## What changes are included in this PR?
   - Add track memory usage for recursive CTE.
   - Enable recursive CTE support by default.
   
   ## Are these changes tested?
   Yes
   
   ## Are there any user-facing changes?
   No. Enable a new feature by default.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619#discussion_r1526400001


##########
datafusion/physical-plan/src/memory.rs:
##########
@@ -236,6 +237,8 @@ impl MemoryExec {
 pub struct MemoryStream {
     /// Vector of record batches
     data: Vec<RecordBatch>,
+    /// Optional memory reservation bound to the data, freed on drop

Review Comment:
   This concern is quite reasonable; previously there was no need to add a `MemoryReservation` to the `MemoryStream`, and the `with_reservation` function is a bit confused.
   
   I plan to no longer use `MemoryStream` for future recursive CTEs and instead use a new stream class that can read the `WorkTable` multiple times, in order to support 
   https://github.com/apache/arrow-datafusion/blob/81b0a011705b17a09f494f550a5382b0c3414597/datafusion/physical-plan/src/recursive_query.rs#L316 
   
   So that the `MemoryStream::with_reservation` API can be removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619#issuecomment-2001614045

   > Looks great, @jonahgao—thanks for taking this on. Would it be sensible to have the worktable's update method take a `ReservedBatches` as an argument, instead of constructing it internally? My thinking is that, given worktable's sole purpose is to facilitate the recursive query, tightening the integration with `ReservedBatches` could better illustrate to future contributors the coupled ownership relationship between a `RecordBatch`'s buffer's and its corresponding memory reservation. It's more of a nit than anything so feel free to ignore :)
   > 
   > And apologies that I've been a bit radio silent on this front @alamb! Been heads down on other stuff :/
   
   Make sense to me! Updated and thank you @matthewgapp 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "matthewgapp (via GitHub)" <gi...@apache.org>.
matthewgapp commented on PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619#issuecomment-2000367300

   Looks great, @jonahgao—thanks for taking this on. Would it be sensible to have the worktable's update method take a `ReservedBatches` as an argument, instead of constructing it internally? My thinking is that, given worktable's sole purpose is to facilitate the recursive query, tightening the integration with `ReservedBatches` could better illustrate to future contributors the coupled ownership relationship between a `RecordBatch`'s buffer's and its corresponding memory reservation. 
   
   And apologies that I've been a bit radio silent on this front @alamb! Been heads down on other stuff :/ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619#discussion_r1526354142


##########
datafusion/core/tests/memory_limit.rs:
##########
@@ -301,6 +301,28 @@ async fn sort_spill_reservation() {
     test.with_config(config).with_expected_success().run().await;
 }
 
+#[tokio::test]
+async fn oom_recursive_cte() {
+    TestCase::new()
+        .with_query(
+            "WITH RECURSIVE nodes AS (
+            SELECT 1 as id
+            UNION ALL
+            SELECT UNNEST(RANGE(id+1, id+1000)) as id
+            FROM nodes
+            WHERE id < 10
+        )
+        SELECT * FROM nodes;",
+        )
+        .with_expected_errors(vec![

Review Comment:
   👍 



##########
datafusion/physical-plan/src/memory.rs:
##########
@@ -236,6 +237,8 @@ impl MemoryExec {
 pub struct MemoryStream {
     /// Vector of record batches
     data: Vec<RecordBatch>,
+    /// Optional memory reservation bound to the data, freed on drop

Review Comment:
   I was a little worried at first that this optional API makes it easy to forget to provide reservation. However I see now that the reservation is used only with the recursive CTE case
   
   (not for this PR) In general I wondered if we should *always* have a memory reservation to `MemoryStream`  🤔 I think that would double count batches from a MemTable however, so it isn't an obviously good improvement



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619#issuecomment-2001951978

   Thanks @jonahgao  and @matthewgapp 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: track memory usage for recursive CTE, enable recursive CTEs by default [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #9619:
URL: https://github.com/apache/arrow-datafusion/pull/9619#issuecomment-2000382789

   > And apologies that I've been a bit radio silent on this front @alamb! Been heads down on other stuff :/
   
   No worries @matthewgapp  -- I totally understand. Good luck and thank you for contributing the original PR 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org