Posted to dev@arrow.apache.org by Daniël Heres <da...@gmail.com> on 2021/06/09 06:28:55 UTC

Delta Lake support for DataFusion

Hi all,

I would like to receive some feedback about adding Delta Lake support to
DataFusion (https://github.com/apache/arrow-datafusion/issues/525).
As you might know, Delta Lake <https://delta.io/> is a format that adds
features like ACID transactions, statistics, and storage optimization to
Parquet, and it is gaining quite some traction for managing data lakes.
It seems like a great feature to have in DataFusion as well.

The delta-rs <https://github.com/delta-io/delta-rs> project provides a
native, Apache-licensed Rust implementation of Delta Lake that already
supports a large part of the format and operations.

The first integration I would like to propose is adding read support via a
new TableProvider. There might be some work to do around dependencies as
both DataFusion and delta-rs rely on (certain versions of) Arrow and
Parquet.
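
To make the idea of read support a bit more concrete, here is a minimal,
self-contained sketch. It does not use the real DataFusion or delta-rs APIs:
the `ReadOnlyTable` trait, the `DeltaLikeTable` struct, and the file names are
all hypothetical stand-ins. The only assumption is that a provider has to
expose a schema and the list of data files that are live in the current
snapshot, which a Delta-style table derives by replaying add/remove actions
from its log.

```rust
// Hypothetical stand-in for a query engine's table abstraction.
// This is NOT DataFusion's real `TableProvider` trait, only a simplified model.
trait ReadOnlyTable {
    /// Column names of the table.
    fn schema(&self) -> Vec<String>;
    /// Data files the engine would scan for the current snapshot.
    fn files_to_scan(&self) -> Vec<String>;
}

/// Simplified log entry: a data file is either added to or removed from the table.
enum LogAction {
    Add(String),
    Remove(String),
}

/// Hypothetical Delta-style table: a schema plus an ordered log of actions.
struct DeltaLikeTable {
    columns: Vec<String>,
    log: Vec<LogAction>,
}

impl ReadOnlyTable for DeltaLikeTable {
    fn schema(&self) -> Vec<String> {
        self.columns.clone()
    }

    fn files_to_scan(&self) -> Vec<String> {
        // Replay the log in order: a file is live if it was added
        // and not removed by a later action.
        let mut live: Vec<String> = Vec::new();
        for action in &self.log {
            match action {
                LogAction::Add(path) => {
                    if !live.contains(path) {
                        live.push(path.clone());
                    }
                }
                LogAction::Remove(path) => live.retain(|p| p != path),
            }
        }
        live
    }
}

/// Small example table used to exercise the sketch (made-up file names).
fn example_table() -> DeltaLikeTable {
    DeltaLikeTable {
        columns: vec!["id".to_string(), "value".to_string()],
        log: vec![
            LogAction::Add("part-0001.parquet".to_string()),
            LogAction::Add("part-0002.parquet".to_string()),
            LogAction::Remove("part-0001.parquet".to_string()),
            LogAction::Add("part-0003.parquet".to_string()),
        ],
    }
}
```

In this model the engine never needs to know about the log itself; it only
sees the schema and the resolved file list, which is roughly the separation a
table provider is meant to give.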

Let me know if you have any further ideas or concerns.

Best regards,

Daniël Heres

Re: Delta Lake support for DataFusion

Posted by Andrew Lamb <al...@influxdata.com>.

> And probably some more I don't think of currently. I think this is useful
> work as it also would enable other "extensions" to work in a similar way

I 100% agree

Re: Delta Lake support for DataFusion

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.

Hi,

I agree with all of you. ^_^

I created https://github.com/apache/arrow-datafusion/issues/533 to track
this. I tried to encapsulate the three main use-cases for the SQL
extension. Feel free to edit at will.

Best,
Jorge

Re: Delta Lake support for DataFusion

Posted by QP Hou <qp...@scribd.com.INVALID>.

Thanks Daniël for starting the discussion!

Looks like we are on the same page about taking this as an opportunity to
make datafusion more extensible :)

I think Neville and Daniël nailed the biggest missing piece at the
moment: being able to extend the SQL parser and planner with new syntax
and map it to custom plan/expression nodes.
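
One possible shape for such a hook, sketched under heavy assumptions: the
`PlanNode` enum, the `SyntaxExtension` trait, and the `plan_statement`
function below are all hypothetical and do not correspond to anything in
DataFusion today. The idea is only that each extension gets a chance to claim
a statement and turn it into an extension plan node before the core planner
handles it.

```rust
/// Plan nodes of a hypothetical core engine, with one open-ended variant
/// that extensions can produce.
#[derive(Debug, PartialEq)]
enum PlanNode {
    /// Statement left to the core planner; kept as raw SQL in this sketch.
    Core(String),
    /// Node produced by an extension: a command name plus its arguments.
    Extension { name: String, args: Vec<String> },
}

/// Hypothetical hook: return `Some(node)` to claim a statement, `None` to pass.
trait SyntaxExtension {
    fn try_plan(&self, sql: &str) -> Option<PlanNode>;
}

/// Example extension that recognizes `OPTIMIZE <table>` and `VACUUM <table>`.
struct DeltaCommands;

impl SyntaxExtension for DeltaCommands {
    fn try_plan(&self, sql: &str) -> Option<PlanNode> {
        let tokens: Vec<&str> = sql.split_whitespace().collect();
        match tokens.as_slice() {
            [cmd, table]
                if cmd.eq_ignore_ascii_case("optimize")
                    || cmd.eq_ignore_ascii_case("vacuum") =>
            {
                Some(PlanNode::Extension {
                    name: cmd.to_ascii_lowercase(),
                    args: vec![table.to_string()],
                })
            }
            // Anything else is not ours; let other extensions or the core try.
            _ => None,
        }
    }
}

/// Ask each extension in registration order, then fall back to the core planner.
fn plan_statement(sql: &str, extensions: &[Box<dyn SyntaxExtension>]) -> PlanNode {
    extensions
        .iter()
        .find_map(|ext| ext.try_plan(sql))
        .unwrap_or_else(|| PlanNode::Core(sql.to_string()))
}
```

A real design would work on tokens or an AST rather than whitespace-split
strings, but the control flow (extensions first, core as the fallback) is the
part this sketch is meant to show.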

Another thing that I think we should do is to come up with a way to
better surface these datafusion extensions to help with discoverability.
For example, pandas has a dedicated section [1] in their official docs
for this. Perhaps we could start by adding a list of extensions to
the readme.

After thinking more on this, I feel like it's better to keep the
extension within delta-rs for now. In the future, delta-rs will likely
need to depend on ballista for processing delta table metadata using
distributed compute. So if we move the extension code into
arrow-datafusion, it might result in a circular dependency. I don't see
a lot of benefit in creating a dedicated datafusion-delta-rs repo at
the moment, but I am happy to go that route if there are compelling
reasons. My main goal is just to make sure we have a single officially
maintained datafusion extension for delta lake.
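
To spell out the circular dependency concern, here is a small sketch that
models the two layouts as dependency graphs and checks whether a node can
reach itself. The edges are assumptions made for illustration (in particular
the delta-rs to ballista edge is only a possible future dependency), and
`has_cycle_from` is a hypothetical helper, not part of any of these projects.

```rust
use std::collections::HashMap;

/// Returns true if `start` can reach itself by following dependency edges.
fn has_cycle_from(graph: &HashMap<&str, Vec<&str>>, start: &str) -> bool {
    // Depth-first walk over everything `start` depends on, directly or not.
    let mut stack: Vec<&str> = graph.get(start).cloned().unwrap_or_default();
    let mut seen: Vec<&str> = Vec::new();
    while let Some(node) = stack.pop() {
        if node == start {
            return true;
        }
        if seen.contains(&node) {
            continue;
        }
        seen.push(node);
        if let Some(deps) = graph.get(node) {
            stack.extend(deps.iter().copied());
        }
    }
    false
}

/// Assumed layout A: the Delta extension lives in arrow-datafusion.
fn extension_in_datafusion() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("arrow-datafusion", vec!["delta-rs"]), // extension code pulls in delta-rs
        ("delta-rs", vec!["ballista"]),         // possible future dependency
        ("ballista", vec!["arrow-datafusion"]), // ballista builds on datafusion
    ])
}

/// Assumed layout B: the Delta extension lives in delta-rs.
fn extension_in_delta_rs() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("arrow-datafusion", vec![]),
        ("delta-rs", vec!["arrow-datafusion", "ballista"]),
        ("ballista", vec!["arrow-datafusion"]),
    ])
}
```

Under these assumed edges, layout A contains a cycle through delta-rs while
layout B does not, which is the reason for preferring to keep the extension
in delta-rs.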

[1]: https://pandas.pydata.org/docs/ecosystem.html#io

Thanks,
QP Hou

Re: Delta Lake support for DataFusion

Posted by Daniël Heres <da...@gmail.com>.

Thanks all for the valuable input!

I agree that following the plugin model makes a lot of sense for now (either
in the arrow-datafusion repo or somewhere external, for example in delta-rs if
we're OK with it not being part of Apache right now).

In order to support certain Delta Lake features, including SQL syntax, we
probably need to make DataFusion a bit more extensible beyond what is
currently possible with the TableProvider, for example:

* Allow registering a custom data format (for supporting things like
  `create external table t stored as parquet`)
* Allow parsing and/or handling custom SQL syntax like `optimize` /
  `vacuum` / `select * from t version as of n`, etc.

And probably some more I don't think of currently. I think this is useful
work as it also would enable other "extensions" to work in a similar way
(e.g. Apache Iceberg and other formats / readers / writers / syntax) and
make DataFusion a more flexible engine.
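
As a rough sketch of the first bullet above, a format registry could map the
name used after `stored as` to a pluggable implementation. Everything here is
hypothetical (the `TableFormat` trait, `FormatRegistry`, and the two example
formats); it is only meant to show the registration and lookup steps, not how
DataFusion does or should implement them.

```rust
use std::collections::HashMap;

/// Hypothetical contract for a storage format plugin.
trait TableFormat {
    /// Describe how a table at `location` would be opened.
    fn describe(&self, location: &str) -> String;
}

struct ParquetFormat;
struct DeltaFormat;

impl TableFormat for ParquetFormat {
    fn describe(&self, location: &str) -> String {
        format!("parquet files under {}", location)
    }
}

impl TableFormat for DeltaFormat {
    fn describe(&self, location: &str) -> String {
        format!("delta table at {}", location)
    }
}

/// Registry keyed by the (case-insensitive) name used in `STORED AS <name>`.
struct FormatRegistry {
    formats: HashMap<String, Box<dyn TableFormat>>,
}

impl FormatRegistry {
    fn new() -> Self {
        FormatRegistry { formats: HashMap::new() }
    }

    /// Called by an extension crate to make its format available.
    fn register(&mut self, name: &str, format: Box<dyn TableFormat>) {
        self.formats.insert(name.to_ascii_lowercase(), format);
    }

    /// Called when planning `CREATE EXTERNAL TABLE ... STORED AS <name>`;
    /// returns `None` if no plugin has registered that format.
    fn open(&self, format_name: &str, location: &str) -> Option<String> {
        self.formats
            .get(&format_name.to_ascii_lowercase())
            .map(|format| format.describe(location))
    }
}
```

With something like this, the core would only need to know the `TableFormat`
contract, and an unknown format name would surface as a planning error rather
than requiring changes to the core.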

Best, Daniël

-- 
Daniël Heres

Re: Delta Lake support for DataFusion

Posted by Neville Dipale <ne...@gmail.com>.

The correct approach might be to improve DataFusion support in
delta-rs. TableProvider is already implemented here:
https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs

I've pinged QP to ask for their advice.

Neville

Re: Delta Lake support for DataFusion

Posted by Andrew Lamb <al...@influxdata.com>.

I think the idea of DataFusion + DeltaLake is quite compelling and likely
useful.

However, I think DataFusion is ideally an "embeddable query engine" rather
than a database system in itself, so in that mental model Delta Lake
integration belongs somewhere other than the core DataFusion crate.

My ideal structure would be a new crate (maybe not even part of the Apache
Arrow Project), perhaps called `datafusion-delta-rs`, that contains the
TableProvider and whatever else is needed to integrate DataFusion with
DeltaLake.

This structure could also start a pattern of publishing plugins for
DataFusion separately from the core.

Andrew
p.s. Now that Arrow is publishing more incrementally (e.g. 4.1.0, 4.2.0,
etc.), I think delta-rs [1] and datafusion both only specify `4.x`, so they
should work together nicely.

[1] https://github.com/delta-io/delta-rs/blame/main/rust/Cargo.toml
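
A small worked example of why a `4.x` style requirement on both sides helps.
Under Cargo's default rules, a requirement such as `4` accepts any 4.y.z
release, so two crates that both ask for major version 4 can be resolved to
one shared copy. The functions below are a hypothetical, simplified model of
that check (they are not Cargo's resolver and they treat `4.x` as a literal
"same major version" pattern); the version list is made up apart from the
releases mentioned above.

```rust
/// Parse "MAJOR.MINOR.PATCH" into numbers; returns None on malformed input.
fn parse_version(version: &str) -> Option<(u64, u64, u64)> {
    let mut parts = version.split('.').map(|p| p.parse::<u64>().ok());
    let parsed = (parts.next()??, parts.next()??, parts.next()??);
    // Reject trailing components such as "4.1.0.1".
    if parts.next().is_some() {
        return None;
    }
    Some(parsed)
}

/// True if `version` satisfies a requirement written as "MAJOR.x",
/// i.e. any release with the same major version.
fn satisfies_major_wildcard(requirement: &str, version: &str) -> bool {
    let required_major = requirement
        .strip_suffix(".x")
        .and_then(|major| major.parse::<u64>().ok());
    match (required_major, parse_version(version)) {
        (Some(required), Some((major, _, _))) => required == major,
        _ => false,
    }
}

/// Newest published version accepted by both requirements, if any.
/// Assumes `published` is sorted from oldest to newest.
fn shared_version<'a>(req_a: &str, req_b: &str, published: &[&'a str]) -> Option<&'a str> {
    published
        .iter()
        .rev()
        .copied()
        .find(|v| satisfies_major_wildcard(req_a, v) && satisfies_major_wildcard(req_b, v))
}
```

If the two crates instead pinned different major versions, no shared version
would exist and the build would end up with two incompatible copies of the
Arrow types.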


Re: Delta Lake support for DataFusion

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.

Hi,

Some questions that come to mind:

1. If we add vendor X to datafusion, will we be open to another vendor Y? How
do we compare vendors? How do we draw the line of "not sufficiently
relevant"?
2. How do we ensure that we do not distort the level playing field
that some people expect from DataFusion?
3. What is the challenge of creating a binary that uses DataFusion plus a
Delta Lake custom table provider outside of DataFusion?

I see DataFusion's plugin system,

* custom nodes
* custom table providers
* custom physical optimizers
* custom logical optimizers
* UDFs
* UDAFs

as our answer to not bundling vendor-specific implementations (e.g. s3,
azure, Oracle, IOx, IBM, google, delta lake), and instead allowing users to
build applications on top of it, with whatever vendor-specific requirements
they have. Rust lends itself really well to this, as dependencies are
maintained in Cargo.toml, and binaries compiled with DataFusion can be
built with plugins and deployed in prod environments as a single binary.
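
To illustrate the contract idea in miniature, here is a toy sketch in which
the "engine" only knows two contracts (a table source and a scalar UDF) and
an application wires in its own implementations at startup. None of this is
DataFusion's actual API: `Engine`, `TableSource`, `ScalarUdf`, and
`InMemorySource` are hypothetical names, and the single `apply` method stands
in for real query execution.

```rust
use std::collections::HashMap;

/// Hypothetical scalar UDF contract: one f64 in, one f64 out.
type ScalarUdf = Box<dyn Fn(f64) -> f64>;

/// Hypothetical table source contract: the engine only needs values back.
trait TableSource {
    fn rows(&self) -> Vec<f64>;
}

/// Toy engine core: it knows the contracts and nothing about any vendor.
#[derive(Default)]
struct Engine {
    tables: HashMap<String, Box<dyn TableSource>>,
    udfs: HashMap<String, ScalarUdf>,
}

impl Engine {
    fn register_table(&mut self, name: &str, source: Box<dyn TableSource>) {
        self.tables.insert(name.to_string(), source);
    }

    fn register_udf(&mut self, name: &str, udf: ScalarUdf) {
        self.udfs.insert(name.to_string(), udf);
    }

    /// Roughly "SELECT udf(col) FROM table" for a single-column table.
    /// Returns `None` if either the UDF or the table is not registered.
    fn apply(&self, udf: &str, table: &str) -> Option<Vec<f64>> {
        let function = self.udfs.get(udf)?;
        let source = self.tables.get(table)?;
        Some(source.rows().into_iter().map(|x| function(x)).collect())
    }
}

/// A "vendor" plugin that would normally live in its own crate.
struct InMemorySource(Vec<f64>);

impl TableSource for InMemorySource {
    fn rows(&self) -> Vec<f64> {
        self.0.clone()
    }
}

/// What an application binary might do at startup: pick its plugins
/// and register them with the engine.
fn build_engine() -> Engine {
    let mut engine = Engine::default();
    engine.register_table("t", Box::new(InMemorySource(vec![1.0, 2.0, 3.0])));
    engine.register_udf("double", Box::new(|x: f64| x * 2.0));
    engine
}
```

The plugin choice is made entirely in the application (and its Cargo.toml),
so the core never has to decide which vendors are "sufficiently relevant".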

AFAIK Delta Lake itself is not bundled with Spark, and is instead installed
separately (e.g. via the POM for Java, `pip install delta-spark` for Python)
[1]. I think that this is a sustainable model whereby we do not have to
know about Delta Lake specifics to be able to maintain the code, and
instead declare contracts for extensions, which others maintain for their
specific formats/systems.

Best,
Jorge

[1] https://docs.delta.io/1.0.0/quick-start.html