Posted to issues@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2021/11/16 21:38:00 UTC

[jira] [Created] (ARROW-14730) [C++][R][Python] Support reading from Delta Lake tables

Will Jones created ARROW-14730:
----------------------------------

             Summary: [C++][R][Python] Support reading from Delta Lake tables
                 Key: ARROW-14730
                 URL: https://issues.apache.org/jira/browse/ARROW-14730
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Will Jones


[Delta Lake|https://delta.io/] is a Parquet-based table format that supports ACID transactions. It was popularized by Databricks, which uses it as the default table format on its platform. Previously it was readable only from Spark, but there is now an effort in [delta-rs|https://github.com/delta-io/delta-rs] to make it accessible from elsewhere. There is already some integration with DataFusion (see: https://github.com/apache/arrow-datafusion/issues/525).

The delta-rs Python bindings already provide [a method to read Delta Lake tables into Arrow tables|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table], including filtering by partitions.
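
For reference, a minimal sketch of what that looks like today with the delta-rs bindings (assuming the {{deltalake}} package is installed and the table is partitioned by a hypothetical {{year}} column):

{code:python}
from deltalake import DeltaTable

dt = DeltaTable("/path/to/delta-table")

# Read the entire table (at its current version) into a pyarrow.Table
table = dt.to_pyarrow_table()

# Or read only selected partitions ("year" is a hypothetical partition column)
table_2021 = dt.to_pyarrow_table(partitions=[("year", "=", "2021")])
{code}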

Is there a good way we could integrate this functionality with the Arrow C++ Dataset API and expose it in Python and R? Should that be implemented in the Arrow libraries or in delta-rs?
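
As a rough sketch of one possible integration path (not an existing API): the delta-rs bindings expose the set of Parquet files that make up the current table version, and a Dataset could be constructed over exactly those files. Note that Delta stores partition values in the transaction log and file paths rather than in the data files themselves, so a real integration would also need to attach those columns.

{code:python}
import pyarrow.dataset as ds
from deltalake import DeltaTable

dt = DeltaTable("/path/to/delta-table")

# Build an Arrow dataset over only the files the Delta log considers live.
# file_uris() is what current delta-rs releases expose for this.
dataset = ds.dataset(dt.file_uris(), format="parquet")

# From here the usual Dataset machinery (column projection, predicate
# pushdown, the R bindings) applies.
preview = dataset.head(10)
{code}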



--
This message was sent by Atlassian Jira
(v8.20.1#820001)