You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2024/03/11 10:46:14 UTC

[I] Enable recursive CTE support by default [arrow-datafusion]

alamb opened a new issue, #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554

   ### Is your feature request related to a problem or challenge?
   
   As part of https://github.com/apache/arrow-datafusion/issues/462, @matthewgapp implemented  support for recursive Common Table Expressions (aka Recursive CTEs) 
   
   Here is an example of such a query:
   
   https://github.com/apache/arrow-datafusion/blob/4cd3c433004a7a6825643d6b3911db720efe5f76/datafusion/sqllogictest/test_files/cte.slt#L44-L68
   
   At the moment, to use recursive CTEs you must enable a config option:
   ```sql
   set datafusion.execution.enable_recursive_ctes = true;
   ```
   
   
   
   ### Describe the solution you'd like
   
   I would like recursive CTEs to be enabled by default (and thus useable without a config option)
   
   The only reason I know of at the moment that they are NOT enabled by default is because they might buffer an infinite amount of data (and thus exceed the total memory available to DataFusion)
   
   ### Describe alternatives you've considered
   
   I think the basic idea would be to
   1. Extend `RecursiveQueryExec` with a `MemoryReservation`
   2. Track the memory of any buffered batches, erroring if they can not be buffered
   3. Write a test in `memory_limit.rs` showing the limit being hit: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/memory_limit.rs
   4. Change the default value of 
   
   The main PR that added this feature was https://github.com/apache/arrow-datafusion/pull/8840
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on issue #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554#issuecomment-1998873032

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #9554: Enable recursive CTE support by default
URL: https://github.com/apache/arrow-datafusion/issues/9554


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on issue #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554#issuecomment-1991600013

   > it's a bit slow.
   
   Disabling `enable_round_robin_repartition` and `coalesce_batches` will make it much faster.
   @Matthewgapp once mentioned this issue in PR #7581.
   
   ```sh
   DataFusion CLI v36.0.0
   
   ❯ set datafusion.optimizer.enable_round_robin_repartition=false;
   0 rows in set. Query took 0.002 seconds.
   
   ❯ set datafusion.execution.coalesce_batches=false;
   0 rows in set. Query took 0.002 seconds.
   
   ❯ with recursive t(a) as
   (select 1 as a
   union all
   select 1+a from t where a<10000)
   select count(*) from t;
   +----------+
   | COUNT(*) |
   +----------+
   | 10000    |
   +----------+
   1 row in set. Query took 0.194 seconds.
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554#issuecomment-1991613072

   Looks like there is a draft PR here addressing some of the performance issues / improvements: https://github.com/matthewgapp/arrow-datafusion/pull/2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on issue #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554#issuecomment-1998873145

   I am interested in implementing this memory limit feature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554#issuecomment-1999756943

   Thank you @jonahgao 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Enable recursive CTE support by default [arrow-datafusion]

Posted by "l1t1 (via GitHub)" <gi...@apache.org>.
l1t1 commented on issue #9554:
URL: https://github.com/apache/arrow-datafusion/issues/9554#issuecomment-1990789938

   it's a bit slow.
   ```
   with recursive t(a) as
   (select 1 as a
   union all
   select 1+a from t where a<10000)
   select count(*) from t;
   +----------+
   | COUNT(*) |
   +----------+
   | 10000    |
   +----------+
   1 row in set. Query took 2.347 seconds.
   ```
   duckdb 0.10 :
   ```
   D with recursive t(a) as (select 1 as a union all select 1+a from t where a<10000) select count(*) from t;
   
   Run Time (s): real 0.291 user 0.670804 sys 0.046800
   D with recursive t(a) as (select 1 as a union all select 1+a from t where a<20000) select count(*) from t;
   
   Run Time (s): real 0.602 user 1.310408 sys 0.140401
   D with recursive t(a) as (select 1 as a union all select 1+a from t where a<40000) select count(*) from t;
   
   Run Time (s): real 1.114 user 2.418016 sys 0.265202
   D with recursive t(a) as (select 1 as a union all select 1+a from t where a<80000) select count(*) from t;
   
   Run Time (s): real 2.326 user 5.163633 sys 0.374402
   ```
   hyper 0.0.18161:
   ```
   > with recursive t(a) as (select 1 as a union all select 1+a from t where a<10000) select count(*) from t;
   
   0.061 s
   
   > with recursive t(a) as (select 1 as a union all select 1+a from t where a<20000) select count(*) from t;
   
   0.046 s
   
   > with recursive t(a) as (select 1 as a union all select 1+a from t where a<40000) select count(*) from t;
   
   0.052 s
   
   > with recursive t(a) as (select 1 as a union all select 1+a from t where a<80000) select count(*) from t;
   
   0.074 s
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org