Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/01/13 04:51:00 UTC

[jira] [Created] (ARROW-15317) [R] Expose API to create Dataset from Fragments

Will Jones created ARROW-15317:
----------------------------------

             Summary: [R] Expose API to create Dataset from Fragments
                 Key: ARROW-15317
                 URL: https://issues.apache.org/jira/browse/ARROW-15317
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 6.0.1
            Reporter: Will Jones


Third-party packages may define dataset factories for table formats such as Delta Lake and Apache Iceberg. These formats store metadata such as the schema, file lists, and file-level statistics on the side, so a dataset can be constructed without running a discovery process. The Python API already exposes enough to do this, as shown by [a Delta Lake dataset reader here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].

I propose adding the following to the R API:

 * Expose {{Fragment}} as an R6 object
 * Add the {{MakeFragment}} method to various file format objects. It's key that {{partition_expression}} is included as an argument. ([See Python equivalent here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
 * Add a dataset constructor that takes a list of {{Fragment}} objects



--
This message was sent by Atlassian Jira
(v8.20.1#820001)