Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/01/31 14:26:00 UTC

[jira] [Commented] (ARROW-15317) [R] Expose API to create Dataset from Fragments

    [ https://issues.apache.org/jira/browse/ARROW-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484714#comment-17484714 ] 

Dewey Dunnington commented on ARROW-15317:
------------------------------------------

If I'm reading this correctly, this sounds useful for building an abstraction around arbitrary file formats (I'm thinking of geospatial formats such as shapefiles) in addition to the ones you listed above!

Where this is tested in Python: https://github.com/apache/arrow/blob/ad073b7c0fec80ce88aaf1e7d6a78104711952f2/python/pyarrow/tests/test_dataset.py#L788-L804

> [R] Expose API to create Dataset from Fragments
> -----------------------------------------------
>
>                 Key: ARROW-15317
>                 URL: https://issues.apache.org/jira/browse/ARROW-15317
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Will Jones
>            Priority: Minor
>
> Third-party packages may define dataset factories for table formats like Delta Lake and Apache Iceberg. These formats store metadata such as the schema, file lists, and file-level statistics on the side, so a dataset can be constructed without a discovery process. Python exposes enough API to do this successfully for [a Delta Lake dataset reader here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].
> I propose adding the following to the R API:
>  * Expose {{Fragment}} as an R6 object
>  * Add the {{MakeFragment}} method to various file format objects. It's key that {{partition_expression}} is included as an argument. ([See Python equivalent here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
>  * Add a dataset constructor that takes a list of {{Fragments}}
