Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/03 12:53:31 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #33986: [Python][Rust] Create extension point in python for Dataset/Scanner

jorisvandenbossche commented on issue #33986:
URL: https://github.com/apache/arrow/issues/33986#issuecomment-1415832284

   I have recently been thinking about this as well, triggered by our work to support the DataFrame Interchange protocol in pyarrow (https://github.com/apache/arrow/pull/14804, https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html).
   For some context, this "interchange protocol" is a Python-specific one for interchanging dataframe data, typically between different dataframe libraries (e.g. converting a dataframe of library1 into a dataframe of library2 without either library having to know about the other). Essentially, it is very similar to our C Data Interface: it specifies how to share the memory, and in the end also hands out pointers to buffers of a certain size, just like the C struct. But because it is a Python API, it gives you some functionality on top of those raw buffers (or rather, before you get to the buffers).
   
   As an illustration, using the latest pyarrow (with a pyarrow.Table, so all data is already in memory, but we _could_ add the same to something backed by a not-yet-materialized stream or data on disk):
   
   ```python
   >>> import pyarrow as pa
   >>> table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})
   # accessing this interchange object
   >>> interchange_object = table.__dataframe__()
   >>> interchange_object
   <pyarrow.interchange.dataframe._PyArrowDataFrame at 0x7ffb64f45420>
   # this doesn't yet need to materialize all the buffers, but you can inspect metadata
   >>> interchange_object.num_columns()
   2
   >>> interchange_object.column_names()
   ['a', 'b']
   # you can select a subset (i.e. simple projection)
   >>> subset = interchange_object.select_columns_by_name(['a'])
   >>> subset.num_columns()
   1
   # only when actually asking for the buffers of one chunk of a column, the data needs
   # to be in memory (to pass a pointer to the buffers)
   >>> subset.get_column_by_name("a").get_buffers()
   {'data': (PyArrowBuffer({'bufsize': 24, 'ptr': 140717840175104, 'device': 'CPU'}),
     (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': None,
    'offsets': None}
   ```
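   
   On the consumer side, the point of the protocol is that pyarrow doesn't need to know the producer's API at all. A minimal sketch of that direction (assuming a recent pyarrow that ships `pyarrow.interchange.from_dataframe`, and using pandas purely as an example producer):
   
   ```python
   >>> import pandas as pd
   >>> import pyarrow.interchange
   # pandas exposes the same __dataframe__ protocol, so pyarrow can consume its
   # data through the protocol instead of through pandas-specific conversion code
   >>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
   >>> table = pyarrow.interchange.from_dataframe(df)
   >>> table.column_names
   ['a', 'b']
   ```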
   
   To be clear, I don't propose we do something exactly like that (and personally, I think it's also a missed opportunity that the DataFrame Interchange protocol doesn't use Arrow for its memory layout specification, and doesn't give access to the data as Arrow data or through the Arrow C Interface).
   But it has some nice properties compared to our "raw" C Data Interface. It lets you first inspect the data (columns, data types) before you actually start materializing/consuming it chunk by chunk, and it lets you get the data for only a subset of the columns. In contrast, when using the C Stream interface for a Table, or generically a RecordBatchReader, you directly get the struct with all buffers for all columns. So if you don't need all columns, you have to subset the data before accessing it through the C interface, which means you need to know the API of the object you are getting the data from (partly removing the benefit of the generality of the C Interface).
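   
   For contrast, a minimal sketch of that subsetting problem with the C Stream today (plain current pyarrow, nothing new): the projection has to go through the producer's own API, here `Table.select`, before anything is exported, because the stream itself has no way to ask for a subset:
   
   ```python
   import pyarrow as pa
   
   table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})
   
   # A consumer that only needs column 'a' must project up front, using a
   # pyarrow-specific call, because exporting the stream hands over all columns:
   projected = table.select(['a'])
   reader = pa.RecordBatchReader.from_batches(projected.schema, projected.to_batches())
   
   # `reader` is what would then be exported over the C Stream interface;
   # the interface itself offers no projection or filtering hooks.
   ```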
   
   So having some standard Python interface that gives you delayed and queryable (filter/simple projection) access to Arrow C Interface data sounds really interesting; or this could be an additional C ABI (for which we can still provide such a Python interface as well), like David sketched above.
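   
   To make that a bit more concrete, a rough sketch of what such a Python-level protocol could look like (all names here are hypothetical, just to illustrate the shape, not a proposal for the actual spelling):
   
   ```python
   from abc import ABC, abstractmethod
   
   import pyarrow as pa
   import pyarrow.compute as pc
   
   
   class ArrowQueryable(ABC):
       """Hypothetical handle returned by some dunder (name to be decided)."""
   
       @abstractmethod
       def schema(self) -> pa.Schema:
           """Inspect the schema without materializing any data."""
   
       @abstractmethod
       def select_columns(self, names: list[str]) -> "ArrowQueryable":
           """Return a new handle restricted to a subset of the columns (projection)."""
   
       @abstractmethod
       def filter(self, expression: pc.Expression) -> "ArrowQueryable":
           """Return a new handle with a (simple) filter pushed down."""
   
       @abstractmethod
       def to_reader(self) -> pa.RecordBatchReader:
           """Only here does data get materialized, chunk by chunk; the reader
           can in turn be exported through the Arrow C Stream interface."""
   ```
   
   A pyarrow Dataset/Scanner could implement something like this trivially, and a consumer would only need to know this small surface instead of each producer's full API.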

