Posted to dev@arrow.apache.org by Andrew Lamb <al...@influxdata.com> on 2021/02/14 11:13:38 UTC

[Rust] [DataFusion] Topic for next Rust Sync Call

I would like to add the following item to the agenda for the next Rust
sync call:

Dependencies

Background: As the dependency stack gets larger, it will be harder to use
DataFusion as an embedded query engine, and compile / dev times will get
longer.

As we expand the set of functions DataFusion supports, this problem is
likely to get worse. For example, see
https://github.com/apache/arrow/pull/9243#discussion_r575716759 and
https://github.com/apache/arrow/pull/9139

Proposal: Add Rust "features" to the datafusion crate and make many of the
new dependencies optional, so that features such as regex, unicode, and hash
would pull in their dependencies (and provide the corresponding functions)
only when enabled. This approach has worked well for Arrow, which has only
chrono and num as required dependencies.
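
As a rough illustration of the proposal, the sketch below shows how optional
functionality is typically gated behind Cargo features in Rust. The feature
names (regex, unicode, hash) are taken from the proposal text; the function
names and the fallback behaviour are hypothetical and are not DataFusion's
actual API. It assumes the regex crate would be declared as an optional
dependency in Cargo.toml and tied to the regex feature.

```rust
/// Returns the names of the function groups compiled into this build.
/// (Hypothetical helper, for illustration only.)
pub fn enabled_function_groups() -> Vec<&'static str> {
    // `mut` is unused when no optional feature is enabled.
    #[allow(unused_mut)]
    let mut groups = vec!["core"];
    // Each push is compiled only when the matching Cargo feature is on.
    #[cfg(feature = "regex")]
    groups.push("regex");
    #[cfg(feature = "unicode")]
    groups.push("unicode");
    #[cfg(feature = "hash")]
    groups.push("hash");
    groups
}

/// With the `regex` feature on, delegate to the optional `regex` crate.
#[cfg(feature = "regex")]
pub fn regexp_match(input: &str, pattern: &str) -> Result<bool, String> {
    regex::Regex::new(pattern)
        .map(|re| re.is_match(input))
        .map_err(|e| e.to_string())
}

/// With the feature off, the function still exists but reports why it is
/// unavailable, and the `regex` crate is never compiled.
#[cfg(not(feature = "regex"))]
pub fn regexp_match(_input: &str, _pattern: &str) -> Result<bool, String> {
    Err("regexp_match requires the `regex` feature".to_string())
}

fn main() {
    println!("{:?}", enabled_function_groups());
    println!("{:?}", regexp_match("abc", "a.c"));
}
```

Whether a disabled function should be absent entirely or return an error at
run time, as assumed here, is one of the design choices the proposal would
need to settle.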

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Wes McKinney <we...@gmail.com>.
Regarding https://github.com/jorgecarleitao/arrow2 — please (looking
at the PMC members who work on Rust) be careful about IP provenance
considerations with code developed outside the Foundation.

On Wed, Mar 10, 2021 at 11:15 AM Dominik Moritz <do...@cmu.edu> wrote:
>
>  I have a talk prepared to talk about my Arrow implementation in
> WebAssembly.

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Dominik Moritz <do...@cmu.edu>.
I have prepared a talk about my Arrow implementation in
WebAssembly.

On Mar 10, 2021 at 04:38:21, Andrew Lamb <al...@influxdata.com> wrote:

> Reminder that today is the next Rust sync call
>
> Potential topics for discussion:
> * Ballista / DataFusion / etc
> * I remember that someone else was going to demo the use of Arrow but I
> can't remember exactly what that was now

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi,

If there is time available, I would like to present the status of the
experimental arrow2 <https://github.com/jorgecarleitao/arrow2> repo, and
gather feedback on what would be the best way to proceed. 10-15m?

Best,
Jorge


On Wed, Mar 10, 2021 at 1:57 PM Andrew Lamb <al...@influxdata.com> wrote:

> Also:
> *  semantics for CAST and what to do on failure (return NULL or error)
> [Mike S]

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Andrew Lamb <al...@influxdata.com>.
Also:
*  semantics for CAST and what to do on failure (return NULL or error)
[Mike S]
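
To make the two candidate CAST behaviours concrete, here is a minimal sketch
using plain Rust collections. The function names cast_or_null and
cast_or_error are hypothetical and this is not the Arrow or DataFusion cast
kernel; it only contrasts "return NULL on failure" with "return an error on
failure" for a string-to-integer cast.

```rust
/// Lenient cast: a value that fails to parse becomes NULL (`None`).
fn cast_or_null(values: &[&str]) -> Vec<Option<i32>> {
    values.iter().map(|v| v.trim().parse::<i32>().ok()).collect()
}

/// Strict cast: the first value that fails to parse aborts the whole cast.
fn cast_or_error(values: &[&str]) -> Result<Vec<i32>, String> {
    values
        .iter()
        .map(|v| {
            v.trim()
                .parse::<i32>()
                .map_err(|e| format!("cannot cast {:?} to Int32: {}", v, e))
        })
        .collect()
}

fn main() {
    let input = ["1", "oops", "3"];
    // Lenient: the bad value is kept as a NULL slot.
    println!("{:?}", cast_or_null(&input));
    // Strict: the whole cast fails with a message naming the bad value.
    println!("{:?}", cast_or_error(&input));
}
```

The lenient form never fails but can hide bad input; the strict form surfaces
the problem but makes one bad row fail the whole query.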

On Wed, Mar 10, 2021 at 7:38 AM Andrew Lamb <al...@influxdata.com> wrote:

> Reminder that today is the next Rust sync call
>
> Potential topics for discussion:
> * Ballista / DataFusion / etc
> * I remember that someone else was going to demo the use of Arrow but I
> can't remember exactly what that was now

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Andrew Lamb <al...@influxdata.com>.
Reminder that today is the next Rust sync call

Potential topics for discussion:
* Ballista / DataFusion / etc
* I remember that someone else was going to demo the use of Arrow but I
can't remember exactly what that was now

On Tue, Feb 16, 2021 at 10:59 AM Dominik Moritz <do...@cmu.edu> wrote:

>  Somewhat related, I tried to compile DataFusion to WASM and it didn’t work
> because of some dependencies:
> https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11615. I wonder
> whether DataFusion could have a feature flag for only shipping what is WASM
> compatible?

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Dominik Moritz <do...@cmu.edu>.
Somewhat related: I tried to compile DataFusion to WASM and it didn’t work
because of some dependencies; see
https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11615. I wonder
whether DataFusion could have a feature flag that ships only what is
WASM-compatible?
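
One way such a split could look in code is sketched below. Note the
assumptions: the message does not say which dependencies failed, so threading
is used here purely as a stand-in for "something unavailable on WASM", the
function name sum_partitions is hypothetical, and the sketch keys off the
compilation target (target_arch) rather than a Cargo feature flag, which is a
related but different mechanism from the one asked about.

```rust
/// Native targets: sum each partition on its own OS thread.
#[cfg(not(target_arch = "wasm32"))]
fn sum_partitions(partitions: Vec<Vec<i64>>) -> i64 {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| std::thread::spawn(move || p.iter().sum::<i64>()))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .sum()
}

/// wasm32: avoid spawning threads and fall back to a sequential loop.
#[cfg(target_arch = "wasm32")]
fn sum_partitions(partitions: Vec<Vec<i64>>) -> i64 {
    partitions.iter().map(|p| p.iter().sum::<i64>()).sum()
}

fn main() {
    let total = sum_partitions(vec![vec![1, 2], vec![3], vec![]]);
    println!("total = {}", total);
}
```

Both variants expose the same signature, so callers do not need to know which
one was compiled in.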

On Feb 15, 2021 at 12:13:04, Andrew Lamb <al...@influxdata.com> wrote:

> > Also, unrelated, is there a schedule for the sync calls? Will try and
> > carve out some free time for the next one :)
>
> It is every other Wednesday at noon EST. Here is the original announcement
> with more details:
> https://lists.apache.org/thread.html/raa72e1a8a3ad5dbb8366e9609a041eccca87f85545c3bc3d85170cfc%40%3Cdev.arrow.apache.org%3E

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Andrew Lamb <al...@influxdata.com>.
> Also, unrelated, is there a schedule for the sync calls? Will try and
> carve out some free time for the next one :)

It is every other Wednesday at noon EST. Here is the original announcement
with more details:
https://lists.apache.org/thread.html/raa72e1a8a3ad5dbb8366e9609a041eccca87f85545c3bc3d85170cfc%40%3Cdev.arrow.apache.org%3E


On Sun, Feb 14, 2021 at 8:29 AM Ruan Pearce-Authers <ru...@reservoirdb.com>
wrote:

> I'd be interested in helping spec this out, it's especially tricky atm to
> track down issues when integrating DataFusion into the same binary as other
> medium/large dependencies.
>
> Recently hit a really specific issue where DataFusion depends on Parquet,
> which supports various compression algs, including Brotli, and actix-web
> also depends on a slightly different Rust implementation of Brotli. Both of
> these Brotli libs package the same underlying C lib separately, resulting
> in multiply-defined symbols compiling using msvc (and maybe on other
> platforms? didn't test in CI in the end).
>
> Got a quick interim hack [1] in place for my use case which doesn't really
> use Parquet, so it's not pressing, but would be awesome to sort this
> properly upstream.
>
> I guess the only major tradeoff of having a comprehensive feature setup is
> that it could make testing slightly harder, in terms of making sure no-one
> breaks the build for specific feature combinations; this can always be
> mitigated with more CI though (yay, unlimited Actions minutes for public
> repos).
>
> Also, unrelated, is there a schedule for the sync calls? Will try and
> carve out some free time for the next one :)
>
> [1]
> https://github.com/reservoirdb/arrow/commit/e63e157927a552ecf1a6f63ec401f0b6157b5468

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Daniël Heres <da...@gmail.com>.
I think it's a great idea to make default / dev compile times faster and
have clear guidelines for how to use dependencies.

One piece of low-hanging fruit would be to move some development dependencies
into separate crates, to reduce compile times and avoid pulling in bigger
dependencies, i.e. those needed for criterion.

I created a PR to demonstrate this: https://github.com/apache/arrow/pull/9493
The same could be done for DataFusion, and for the dependencies needed by
the examples.


On Sun, Feb 14, 2021 at 14:29, Ruan Pearce-Authers <
ruan@reservoirdb.com> wrote:

> I'd be interested in helping spec this out, it's especially tricky atm to
> track down issues when integrating DataFusion into the same binary as other
> medium/large dependencies.
>
> Recently hit a really specific issue where DataFusion depends on Parquet,
> which supports various compression algs, including Brotli, and actix-web
> also depends on a slightly different Rust implementation of Brotli. Both of
> these Brotli libs package the same underlying C lib separately, resulting
> in multiply-defined symbols compiling using msvc (and maybe on other
> platforms? didn't test in CI in the end).
>
> Got a quick interim hack [1] in place for my use case which doesn't really
> use Parquet, so it's not pressing, but would be awesome to sort this
> properly upstream.
>
> I guess the only major tradeoff of having a comprehensive feature setup is
> that it could make testing slightly harder, in terms of making sure no-one
> breaks the build for specific feature combinations; this can always be
> mitigated with more CI though (yay, unlimited Actions minutes for public
> repos).
>
> Also, unrelated, is there a schedule for the sync calls? Will try and
> carve out some free time for the next one :)
>
> [1]
> https://github.com/reservoirdb/arrow/commit/e63e157927a552ecf1a6f63ec401f0b6157b5468


-- 
Daniël Heres

RE: [Rust] [DataFusion] Topic for next Rust Sync Call

Posted by Ruan Pearce-Authers <ru...@reservoirdb.com>.
I'd be interested in helping spec this out, it's especially tricky atm to track down issues when integrating DataFusion into the same binary as other medium/large dependencies.

Recently hit a really specific issue where DataFusion depends on Parquet, which supports various compression algs, including Brotli, and actix-web also depends on a slightly different Rust implementation of Brotli. Both of these Brotli libs package the same underlying C lib separately, resulting in multiply-defined symbols when compiling with MSVC (and maybe on other platforms? didn't test in CI in the end).

Got a quick interim hack [1] in place for my use case which doesn't really use Parquet, so it's not pressing, but would be awesome to sort this properly upstream.

I guess the only major tradeoff of having a comprehensive feature setup is that it could make testing slightly harder, in terms of making sure no-one breaks the build for specific feature combinations; this can always be mitigated with more CI though (yay, unlimited Actions minutes for public repos).
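
To give a feel for the testing cost mentioned above, the sketch below
enumerates every feature combination and emits one cargo check invocation per
combination, which a CI job could then run. The helper name feature_matrix
and the feature names are hypothetical; --no-default-features and --features
are existing cargo flags.

```rust
/// Builds one `cargo check` command line per subset of `features`.
fn feature_matrix(features: &[&str]) -> Vec<String> {
    let mut commands = Vec::new();
    // Each bit of `mask` decides whether the feature at that index is enabled.
    for mask in 0..(1u32 << features.len()) {
        let enabled: Vec<&str> = features
            .iter()
            .enumerate()
            .filter(|&(i, _)| (mask & (1u32 << i)) != 0)
            .map(|(_, f)| *f)
            .collect();
        let mut cmd = String::from("cargo check --no-default-features");
        if !enabled.is_empty() {
            cmd.push_str(" --features ");
            cmd.push_str(&enabled.join(","));
        }
        commands.push(cmd);
    }
    commands
}

fn main() {
    for cmd in feature_matrix(&["regex", "unicode", "hash"]) {
        println!("{}", cmd);
    }
}
```

The number of combinations doubles with each feature, so in practice a
project would probably check only a chosen subset (for example, no features,
each feature alone, and all features).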

Also, unrelated, is there a schedule for the sync calls? Will try and carve out some free time for the next one :)

[1] https://github.com/reservoirdb/arrow/commit/e63e157927a552ecf1a6f63ec401f0b6157b5468

-----Original Message-----
From: Andrew Lamb <al...@influxdata.com> 
Sent: 14 February 2021 11:14
To: dev <de...@arrow.apache.org>
Subject: [Rust] [DataFusion] Topic for next Rust Sync Call

I would like to add the following item to the agenda call for the next Rust sync call:

Dependencies

Background: As the dependency stack gets larger, it will be harder to use DataFusion as an embedded query engine and the compile / dev times will get higher.

As we expand the supported functions of DataFusion this problem is likely to get worse. For example
https://github.com/apache/arrow/pull/9243#discussion_r575716759 and
https://github.com/apache/arrow/pull/9139

Proposal: Add Rust "features" to the datafusion crate and make many of the new dependencies optional (so that we had features like regex and unicode and hash which would only pull in the dependencies / have those functions if the features were enabled.) This approach has worked well for Arrow (which has only chrono and num as required dependencies)