Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/10 18:31:54 UTC

[GitHub] [arrow] westonpace commented on issue #33986: [Python][Rust] Create extension point in python for Dataset/Scanner

westonpace commented on issue #33986:
URL: https://github.com/apache/arrow/issues/33986#issuecomment-1542634140

@vibhatha can say more but I believe we have been playing around with Substrait and Iceberg for something similar.

> I'm basically thinking we have table formats with Python libraries: Delta Lake, Iceberg, and Lance. And we have single-node query engines such as DuckDB, Polars, and Datafusion. It would be cool if we could pass any of the table formats into any of the query engines, all with one protocol.

* Start with a query that has named tables (e.g. `SELECT foo.x, bar.y FROM foo INNER JOIN bar ON foo.id = bar.id WHERE foo.z > 20`)
* Convert the query to Substrait if it is not there already
* In some library (e.g. pyiceberg or pysubstrait), visit the Substrait query to create a new Substrait query
  * For each named table
    * Look up the table in the catalog. Figure out which files to query, how to devolve (push down) the filter / selection, what additional filters to add for row-level deletes, etc.
    * Create a `local_files` read that has the correct files, the devolved filter, and the devolved selection
* Pass the new Substrait query on to your engine of choice (e.g. DuckDB, Polars, Datafusion).
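
The visiting step above can be sketched roughly as follows. This is only an illustration, not code from pyiceberg or any Substrait library: the plan is modeled as plain nested dicts loosely following Substrait's JSON form (`read`, `namedTable`, `localFiles`, `uriFile` are assumed spellings), and `Catalog` and `resolve_named_tables` are hypothetical names. It swaps named tables for file lists only; pushing down filters and adding row-level-delete filters are left out.

```python
from typing import Any, Dict, List

# Hypothetical catalog: table name -> data file URIs. A real table format
# (Iceberg, Delta Lake, Lance) would resolve this from its own metadata.
Catalog = Dict[str, List[str]]


def resolve_named_tables(rel: Any, catalog: Catalog) -> Any:
    """Return a copy of `rel` with each named-table read replaced by a file read."""
    if isinstance(rel, list):
        return [resolve_named_tables(item, catalog) for item in rel]
    if not isinstance(rel, dict):
        return rel  # scalars and placeholders are copied through unchanged

    read = rel.get("read")
    if isinstance(read, dict) and "namedTable" in read:
        name = ".".join(read["namedTable"]["names"])
        # Keep everything else on the read (schema, filter, projection).
        new_read = {k: v for k, v in read.items() if k != "namedTable"}
        new_read["localFiles"] = {
            "items": [{"uriFile": uri, "parquet": {}} for uri in catalog[name]]
        }
        return {**rel, "read": new_read}

    # Any other relation: recurse into its children.
    return {k: resolve_named_tables(v, catalog) for k, v in rel.items()}


# Simplified stand-in for the example query's plan (assumed shape).
plan = {
    "filter": {
        "condition": "foo.z > 20",  # placeholder, not a real Substrait expression
        "input": {
            "join": {
                "left": {"read": {"namedTable": {"names": ["foo"]}}},
                "right": {"read": {"namedTable": {"names": ["bar"]}}},
            }
        },
    }
}

catalog: Catalog = {
    "foo": ["file:///data/foo/part-0.parquet"],
    "bar": ["file:///data/bar/part-0.parquet", "file:///data/bar/part-1.parquet"],
}

resolved = resolve_named_tables(plan, catalog)
```

The resulting plan no longer mentions `foo` or `bar` by name, so an engine that understands file reads would not need to know anything about the catalog.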

* Producers only need to know Substrait.
* Consumers only need to know Substrait.
* User has to register the middleware piece somewhere.
