Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2021/12/12 21:35:00 UTC

[jira] [Commented] (ARROW-14730) [C++][R][Python] Support reading from Delta Lake tables

    [ https://issues.apache.org/jira/browse/ARROW-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458047#comment-17458047 ] 

Will Jones commented on ARROW-14730:
------------------------------------

I've thought about this a little more, and I think it might make more sense to do the Dataset implementation of Delta Lake in C++ rather than wrapping delta-rs. From a maintenance perspective, we already have the expertise to maintain the code, tests, and CI associated with Python / R / C++; trying to replicate that in delta-rs might prove difficult. And from a code perspective, a Delta Lake reader requires Filesystem (local, S3) and format reader (Parquet, JSON) implementations, so if the PyArrow and R arrow bindings wrapped delta-rs they would inevitably ship both Rust and C++ implementations of those pieces, with potentially different behaviors. I don't think we could avoid that duplication.
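To make that concrete, here is a rough Python-level sketch of what reusing the existing Dataset machinery could look like (the helper name and paths are made up, and checkpoint files, partition values, and schema metadata are skipped; a real implementation would live in the C++ Datasets layer):

    import glob
    import json
    import os
    import pyarrow.dataset as ds

    def delta_active_files(table_root):
        """Rough sketch: replay the Delta transaction log (JSON commits only,
        no checkpoint handling) to find the Parquet files currently live."""
        active = set()
        log_dir = os.path.join(table_root, "_delta_log")
        for commit in sorted(glob.glob(os.path.join(log_dir, "*.json"))):
            with open(commit) as f:
                for line in f:
                    action = json.loads(line)
                    if "add" in action:
                        active.add(action["add"]["path"])
                    elif "remove" in action:
                        active.discard(action["remove"]["path"])
        return [os.path.join(table_root, p) for p in sorted(active)]

    # The resulting file list plugs straight into the existing Dataset machinery,
    # which already provides the filesystem and Parquet reader implementations.
    dataset = ds.dataset(delta_active_files("/data/my_delta_table"), format="parquet")

The point of the sketch is that once the transaction log is parsed, everything else (filesystems, Parquet reading, predicate pushdown) is functionality the Arrow libraries already maintain.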

That said, it still might make sense to have the code live in the delta-io GitHub organization. AFAIK we don't yet have dataset implementations that live outside of the Arrow repo, but that's something we'd eventually like to support, right?

> [C++][R][Python] Support reading from Delta Lake tables
> -------------------------------------------------------
>
>                 Key: ARROW-14730
>                 URL: https://issues.apache.org/jira/browse/ARROW-14730
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Will Jones
>            Priority: Major
>
> [Delta Lake|https://delta.io/] is a Parquet-based table format that supports ACID transactions. It was popularized by Databricks, which uses it as the default table format on its platform. Previously it has only been readable from Spark, but there is now an effort in [delta-rs|https://github.com/delta-io/delta-rs] to make it accessible from elsewhere. There is already some integration with DataFusion (see: https://github.com/apache/arrow-datafusion/issues/525).
> There already exists [a method to read Delta Lake tables into Arrow tables in Python|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table] in the delta-rs Python bindings. This includes filtering by partitions.
> Is there a good way we could integrate this functionality with Arrow C++ Dataset and expose that in Python and R? Would that be something that should be implemented in Arrow libraries or in delta-rs?
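
For reference, the delta-rs entry point mentioned in the description above can already be used along these lines (the table URI and partition column below are made up; check the delta-rs docs for the exact signature):

    from deltalake import DeltaTable

    # Point at the table root (a local path or an s3:// URI).
    dt = DeltaTable("s3://my-bucket/events")

    # Read only the files in one partition; `partitions` takes
    # (column, op, value) filter tuples.
    table = dt.to_pyarrow_table(partitions=[("date", "=", "2021-12-01")])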



--
This message was sent by Atlassian Jira
(v8.20.1#820001)