You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/09/01 01:59:28 UTC

[GitHub] [arrow] wjones127 opened a new issue, #37504: [Python] Define a Dataset protocol based on Substrait and C Data Interface

wjones127 opened a new issue, #37504:
URL: https://github.com/apache/arrow/issues/37504

   ### Describe the enhancement requested
   
   Based on discussion in the [2023-08-30 Arrow community meeting](https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit#heading=h.k1ts4kvvl8jq). This is a continuation of https://github.com/apache/arrow/pull/35568 and https://github.com/apache/arrow/issues/33986.
   
   We'd like to have a protocol for sharing unmaterialized datasets that:
   
    1. Can be consumed as one or more streams of Arrow data
    2. Can have projections and filters pushed down to the scanner
   
   This would provide a extendible connection between scanners and query engines. Data formats might include Iceberg, Delta Lake, Lance, and PyArrow datasets (parquet, JSON, CSV). Query engines could include DuckDB, DataFusion, Polars, PyVelox, PySpark, Ray, and Dask. Such a connection would let end-users employ their preferred query engine to load any supported dataset. From their perspective, usage would might look like:
   
   ```python
   from deltalake import DeltaTable
   table = DeltaTable("path/to/table")
   
   import duckdb
   duckdb.sql("SELECT y FROM table WHERE X > 3")
   ```
   
   ## Shape of the protocol
   
   The overall shape would look roughly like:
   
   ```python
   from abc import ABC
   
   class AbstractArrowScannable(ABC):
       def __arrow_dataset__(self) -> AbstractArrowScanner
   
   
   class AbstractArrowScanner(ABC):
       def get_schema(self) -> capsule[ArrowSchema]:
           ...
   
       def get_stream(
           self,
           columns: List[str],
           filter: SubstraitExpression,
       ) -> capsule[ArrowArrayStream]:
           ...
   
       def get_partitions(self, filter: filter: SubstraitExpression) -> list[AbstractArrowScanner]:
           ...
   
   ```
   
   Data and schema are returned as C Data Interface objects (see: 35531). Predicates are passed as [Substrait extended expressions](https://substrait.io/expressions/extended_expression/). 
   
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Define a Dataset protocol based on Substrait and C Data Interface [arrow]

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1872372103

   Haven't had time to work on this, but wanted to note here a current pain point for users of the dataset API is that there aren't table statistics the caller can access, and this leads to bad join orders. Some mentions of this here:
   
   https://twitter.com/mim_djo/status/1740542585410814393
   https://github.com/delta-io/delta-rs/issues/1838


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Define a Dataset protocol based on Substrait and C Data Interface [arrow]

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1966959341

   > Are we sure a blocking API like this would be palatable?
   
   Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else?
   
   Ideally all these methods are brief.
   
   Though I haven't discussed this in depth with implementors of query engines. I'd be curious for their thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Define a Dataset protocol based on Substrait and C Data Interface [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1967453592

   An `Iterable` would probably be better indeed. It would not solve the async use case directly but we would at least allow producing results without blocking on the entire filesystem walk.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Define a Dataset protocol based on Substrait and C Data Interface [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1967451022

   Perhaps `get_partitions(...) -> Iterable[AbstractArrowScanner]` would do it? Not sure if anybody is interested in asyncio for this but an async iterator might work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] paleolimbot commented on issue #37504: [Python] Define a Dataset protocol based on Substrait and C Data Interface

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1703536917

   Is schema negotiation outside the scope of this protocol? If `get_schema()` contains a `Utf8View`, for example, is it the consumer's responsibility to do the cast, or can the consumer pass a schema with `Utf8View` columns as `Utf8` to `get_stream()` (or another method)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Define a Dataset protocol based on Substrait and C Data Interface [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1966789378

   Are we sure a blocking API like this would be palatable for existing execution engines such as Acero, DuckDB... ?
   
   Of course, at worse the various method/function calls can be offloaded to a dedicated thread pool.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wjones127 commented on issue #37504: [Python] Define a Dataset protocol based on Substrait and C Data Interface

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1706880698

   > Is schema negotiation outside the scope of this protocol?
   
   I think we can include that. I'd like to design that as part of the PyCapsule API first, so we match the semantics there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Define a Dataset protocol based on Substrait and C Data Interface [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on issue #37504:
URL: https://github.com/apache/arrow/issues/37504#issuecomment-1967182003

   > > Are we sure a blocking API like this would be palatable?
   > 
   > Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else?
   
   No, to the fact that these functions are synchronous.
   
   > Ideally all these methods are brief.
   
   I'm not sure. `get_partitions` will typically have to walk a filesystem, which can be long-ish especially on large datasets or remote filesystems.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org