Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/06/07 07:26:51 UTC

[GitHub] [arrow] jorisvandenbossche commented on pull request #35568: GH-33986: [Python] Sketch out a minimal protocol interface for datasets

jorisvandenbossche commented on PR #35568:
URL: https://github.com/apache/arrow/pull/35568#issuecomment-1580094236

   > > @westonpace you are correct that this doesn't define how such dataset classes are built. That's left to the consumer, who will write their own classes that conform to this API.
   > 
   > That would seem an essential API if this protocol was meant to be used by "table formats" to prepare "queries simple enough for query engines to understand". So perhaps I am misunderstanding.
   > 
   > Is this protocol meant to be used by "query engines" to "query a table format library as if it were a dataset"?
   
   My assumption was that it is the _producers_ that implement the classes conforming to this API?
   
   How are the consumer and producer supposed to interact with this protocol?
   
   Taking duckdb as an example, the user can currently create a pyarrow object manually, and duckdb can then query it automatically:
   
   ```python
   import duckdb
   import pyarrow.dataset as ds
   
   pyarrow_dataset = ds.dataset(...)
   duckdb.sql("SELECT * FROM pyarrow_dataset WHERE ..")
   ```
   
   Is the idea that something similar would then work for any object supporting this protocol (assuming duckdb relaxes its check from a pyarrow object to any object conforming to the protocol)? For example, with delta-lake:
   ```python
   import duckdb
   from deltalake import DeltaTable
   
   delta_table = DeltaTable("..")
   duckdb.sql("SELECT * FROM delta_table WHERE ..")
   ```
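
To make the "any object conforming to the protocol" idea concrete, here is a minimal sketch of what a relaxed consumer-side check could look like. Everything in it is an assumption for illustration: `DatasetLike`, its `scanner` method and signature, `ToyProducerTable`, and `consumer_can_query` are hypothetical names, not the API proposed in this PR and not anything duckdb or delta-lake actually exposes.

```python
from typing import Any, Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class DatasetLike(Protocol):
    """Hypothetical stand-in for the protocol; the method name and
    signature are assumptions, not the interface defined in the PR."""

    def scanner(self, columns: Optional[Sequence[str]] = None, filter: Any = None) -> Any:
        ...


class ToyProducerTable:
    """Producer-side class (think: a table format library). It does not
    inherit from DatasetLike; it only matches it structurally."""

    def __init__(self, rows):
        self._rows = rows

    def scanner(self, columns=None, filter=None):
        # Apply the optional row predicate, then the optional projection.
        rows = [r for r in self._rows if filter is None or filter(r)]
        if columns is None:
            return rows
        return [{c: r[c] for c in columns} for r in rows]


def consumer_can_query(obj) -> bool:
    """Consumer-side check against the protocol instead of a concrete type.
    Note: a runtime_checkable isinstance() only verifies that the method
    exists, not that its signature matches."""
    return isinstance(obj, DatasetLike)


table = ToyProducerTable([{"x": 1, "y": "a"}, {"x": 5, "y": "b"}])
result = table.scanner(columns=["y"], filter=lambda r: r["x"] > 2)
```

In this sketch the producer writes the conforming class and the consumer only checks for, and calls, the protocol methods.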
   
   But if this is the intended usage, I don't understand what the "builder API" (https://github.com/apache/arrow/pull/35568#pullrequestreview-1431322458) would be meant for.
   
   > In other words, for a table format to use a query engine, it's not enough to pass a single query (e.g. filter / columns / whatever). We need to pass a query per file.
   
   @westonpace Why is that not sufficient? I think it is up to the table format to translate the single query into a query per file (and to execute those)?
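
As a rough illustration of that last point, the sketch below shows a table format receiving a single query and turning it into per-file work itself. It assumes a toy model in which each file carries min/max statistics for one column; `DataFile`, `plan_per_file`, and `execute` are hypothetical names, and real table formats track far richer metadata.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataFile:
    """Hypothetical file entry: a path, min/max statistics for column
    'x', and the rows it holds (in memory, to keep the sketch runnable)."""
    path: str
    x_min: int
    x_max: int
    rows: List[Dict[str, int]]


def plan_per_file(files, lower_bound):
    """Translate the single query `x > lower_bound` into one task per file,
    skipping files whose statistics show they cannot contain a match."""
    return [(f, lower_bound) for f in files if f.x_max > lower_bound]


def execute(files, lower_bound):
    """Run the per-file tasks and concatenate the matching rows."""
    out = []
    for f, bound in plan_per_file(files, lower_bound):
        out.extend(r for r in f.rows if r["x"] > bound)
    return out


files = [
    DataFile("part-0", 0, 3, [{"x": 0}, {"x": 3}]),
    DataFile("part-1", 4, 9, [{"x": 4}, {"x": 9}]),
]
planned = [f.path for f, _ in plan_per_file(files, 3)]
matches = execute(files, 3)
```

Here the caller only ever passes the one filter; the split into per-file scans stays inside the table format.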

