You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/04 17:55:33 UTC

[GitHub] [arrow] mattgerg12 opened a new issue #12568: Support for automatic optimizations

mattgerg12 opened a new issue #12568:
URL: https://github.com/apache/arrow/issues/12568


   @nealrichardson Does arrow R query optimizer can automatically push down filters and projections based on the `dplyr` transformation queries that follows after the load using `open_dataset()`? I am aware that if we give the `partitioning =`option then it will only put the directories that is needed. But other than this are there any options to achieve the predicate and projection pushdown ?
   
   Are there any plans to support this in future ? Or the recent integration with [duckDB](https://arrow.apache.org/docs/r/news/#2-duckdb-integration-6-0-0) is the solution Rarrow is leaning towards? But in this situation we are limited in using only duckDB verbs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson closed issue #12568: Support for automatic optimizations

Posted by GitBox <gi...@apache.org>.

nealrichardson closed issue #12568:
URL: https://github.com/apache/arrow/issues/12568


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on issue #12568: Support for automatic optimizations

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on issue #12568:
URL: https://github.com/apache/arrow/issues/12568#issuecomment-1059395801


   There is no query optimizer per se, but yes, filter and projection are pushed down in the arrow query engine. duckdb also pushes predicates down to arrow when you use it to query an arrow dataset.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] mattgerg12 commented on issue #12568: Support for automatic optimizations

Posted by GitBox <gi...@apache.org>.

mattgerg12 commented on issue #12568:
URL: https://github.com/apache/arrow/issues/12568#issuecomment-1059407707


   Thanks @nealrichardson for replying. Sorry for the misunderstanding. Is this automatic filter and projection pushdown feature with `rarrow` and `dplyr` OR in general with `arrow` ? As I was thinking it's not from this [arrow article](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/) that says
   
   _"For the comparison with Pandas, note that DuckDB runs in parallel, while pandas only support single-threaded execution. Besides that, one should note that we are comparing automatic optimizations. DuckDB’s query optimizer can automatically push down filters and projections. This automatic optimization is not supported in pandas, but it is possible for users to manually perform some of these predicate and filter pushdowns by manually specifying them them in the read_parquet() call."_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on issue #12568: Support for automatic optimizations

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on issue #12568:
URL: https://github.com/apache/arrow/issues/12568#issuecomment-1059473984


   The arrow C++ engine supports predicate pushdown. The R package wraps that and supports calling it with dplyr methods, which push down. The pandas comparison in the duckdb blog post you mention is using pandas, not arrow, to compute. It's only using arrow to read the parquet file into a pandas DataFrame.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #12568: Support for automatic optimizations

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #12568:
URL: https://github.com/apache/arrow/issues/12568#issuecomment-1059426493


   This is not an answer exactly but I wanted to mention that there are a number of PRs & JIRAs in Arrow to actively adopt [Substrait](https://substrait.io/) which should open the door for users to plug in their own optimizer.  For example, if rarrow's dplyr bindings produced Substrait plans instead of arrow plans then you could do:
   
   ```
   rarrow-dplyr -> substrait -> any optimizer you want (e.g. duckdb's optimizer, calcite, etc.) -> substrait -> arrow
   ```
   
   Though that is perhaps a bit of an oversimplification as optimizers often would want information back from arrow as well (e.g. table statistics, cost estimates, etc.) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org