Posted to jira@arrow.apache.org by "QP Hou (Jira)" <ji...@apache.org> on 2021/11/18 17:10:00 UTC

[jira] [Comment Edited] (ARROW-14730) [C++][R][Python] Support reading from Delta Lake tables

    [ https://issues.apache.org/jira/browse/ARROW-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446049#comment-17446049 ] 

QP Hou edited comment on ARROW-14730 at 11/18/21, 5:09 PM:
-----------------------------------------------------------

> From there it's simply a matter of reading parquet files. In fact you can see this in the implementation of to_pyarrow_dataset().

The current implementation is actually not complete, because we are not populating partition columns based on Delta table metadata. In the long run, we plan to change the Rust core to return enriched Arrow record batches instead. The really hard part is writing to tables backed by S3.
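
For context, here is a minimal sketch of reading a table through the current Python bindings; the table path is hypothetical, and the partition-column caveat above applies to what comes back:

    # Read a Delta table via the delta-rs Python bindings (the path is
    # hypothetical). to_pyarrow_dataset() exposes the table's parquet
    # files as a pyarrow.dataset.Dataset.
    from deltalake import DeltaTable

    dt = DeltaTable("./my_delta_table")
    ds = dt.to_pyarrow_dataset()
    # Materialize to an Arrow table; partition columns derived from the
    # Delta metadata are not yet populated here, per the caveat above.
    table = ds.to_table()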

For complex table formats like Delta, Iceberg, and Hudi, I think it's better to keep them in separate repos. For example, Delta Lake itself is designed for Spark, but is managed in its own repo outside the Spark code base.

I will be more than happy to accept and help with PRs to add a C++ binding to delta-rs ;)


> [C++][R][Python] Support reading from Delta Lake tables
> -------------------------------------------------------
>
>                 Key: ARROW-14730
>                 URL: https://issues.apache.org/jira/browse/ARROW-14730
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Will Jones
>            Priority: Major
>
> [Delta Lake|https://delta.io/] is a parquet-based table format that supports ACID transactions. It was popularized by Databricks, which uses it as the default table format on its platform. Previously, it was only readable from Spark, but now there is an effort in [delta-rs|https://github.com/delta-io/delta-rs] to make it accessible from elsewhere. There is already some integration with DataFusion (see: https://github.com/apache/arrow-datafusion/issues/525).
> The delta-rs Python bindings already provide [a method to read Delta Lake tables into Arrow tables|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table], including filtering by partitions.
> Is there a good way we could integrate this functionality with Arrow C++ Dataset and expose it in Python and R? Would that be something that should be implemented in the Arrow libraries or in delta-rs?
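
For reference, a minimal sketch of the linked method; the table path and the "year" partition column are hypothetical:

    # Read a Delta table directly into a pyarrow.Table, pruning by
    # partition (the path and partition column are hypothetical).
    from deltalake import DeltaTable

    dt = DeltaTable("path/to/table")
    # Partition filters are (column, op, value) tuples; partition values
    # are compared as strings.
    arrow_table = dt.to_pyarrow_table(partitions=[("year", "=", "2021")])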



--
This message was sent by Atlassian Jira
(v8.20.1#820001)