You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/02 11:26:02 UTC
[GitHub] [arrow-datafusion] alamb opened a new issue, #2679: Proposal: remove automated ballista CI checks from DataFusion
alamb opened a new issue, #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Basically the tests added in https://github.com/apache/arrow-datafusion/pull/2582 to keep ballista and datafusion in sync add significant burden to DataFusion development and I propose removing them, at least temporarily
Here is a description of the process:
https://github.com/apache/arrow-datafusion/blob/907504c/.github/pull_request_template.md?plain=1#L31-L47
I think the rationale for the new CI was to add friction on DataFusion API changes to encourage a more stable API; However, with the currently ongoing efforts to rework the object store and parquet reading, I think all we are doing with the process is slowing things down.
The alternative, to have Ballista keep up with changes in DataFusion, sounds daunting at first, but my firsthand experience suggests it is not that bad. Specifically, https://github.com/influxdata/influxdb_iox, my project, uses DataFusion similarly to Ballista (as the core query engine) and uses a DataFusion pin directly from master. Instead of impinging on the DataFusion development process, we keep IOx up with DataFusion by [manually updating the DataFusion pin in IOX about once a week](https://github.com/influxdata/influxdb_iox/pulls?q=is%3Apr+update+datafusion+is%3Aclosed) , and sorting out any API changes.
This does take time, but it is mostly mechanical. We do occasionally find bugs that were introduced into DataFusion such as when we tried most recently with https://github.com/influxdata/influxdb_iox/pull/4743 and we then contribute a fix back upstream (e.g. https://github.com/apache/arrow-datafusion/pull/2674)
I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)
**Describe the solution you'd like**
I propose removing the Ballista CI check in DataFusion
Specifically this check: https://github.com/apache/arrow-datafusion/blob/907504c5aa768601f9d70ad2c8f928bedfa9b069/.github/workflows/rust.yml#L128-L172
And writing up instructions (maybe even automation) on how to upgrade the datafusion pin in Ballista manually
**Describe alternatives you've considered**
* Do nothing
* Bring ballsita back into DataFusion repositoru
**Additional context**
The move of Ballista to a new repo is tracked in: https://github.com/apache/arrow-datafusion/issues/2502
There are several discussions about this pain:
* https://github.com/apache/arrow-ballista/pull/48#discussion_r885486298
cc @andygrove @thinkharderdev @ming535 @Ted-Jiang @xudong963 @tustvold @korowa
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1145203577
> Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny smile
If it's any consolation, once https://github.com/apache/arrow-datafusion/pull/2675 and https://github.com/apache/arrow-datafusion/pull/2682 are merged I have no more refactoring planned other than removing some of the re-exports in the crate (https://github.com/apache/arrow-datafusion/issues/2683) in an effort to stabilize the public API for DataFusion.
We're also moving to monthly releases now, with the next release planned for ~2 weeks from now (https://github.com/apache/arrow-datafusion/issues/2676) so re-basing monthly might be worth considering.
With the refactoring work complete I will have more time for reviews and to help get PRs merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] tustvold commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144960511
> Rebasing PRs/Forks when someone loves to move functions/doing breaking changes
Yeah, I do not envy you this task... I have enough difficulty keeping some of my longer-running PRs up to date :sweat_smile:
FWIW if there is some way I could help you move onto tracking master more directly I would be happy to help out. In particular I notice your fork of arrow-rs is about 9 months out of date, which is likely to become quite painful as DataFusion comes to rely on it even more... :sweat_smile:
> It's hard to contribute back to the upstream
I'm sorry to hear this has been your experience. As mentioned above I'll happily help get contributions shepherded through, but the further things get from arrow-rs the less context I have to be able to meaningfully review things...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1145222032
Also @ovr in IOx, we have found that by using the extension mechanisms in DataFusion (user defined Exprs, User Defined Plan Nodes, etc) we have been able to add IOx specific functionality / plans without a fork of DataFusion, which minimizes the amount of maintenance effort we need to expend to use master datafusion directly
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] ovr commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
ovr commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144922939
> I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)
We have two separate projects in the company which use DF as a Query Engine, and we came to the decision to maintain two different versions in the forked repository because it's impossible to use the latest version, there are some problems:
It's hard to contribute back to the upstream. We had subqueries before DF introduced it, We have ANY and UDTFs, etc. We tried to push it back to the upstream, but It's not merged and probably, will not be merged soon.
https://github.com/apache/arrow-datafusion/pull/2549
https://github.com/apache/arrow-datafusion/pull/2177
Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny 😄
Don't forget that DF relies on the SQL parser & Arrow which we already forked. The same story, it requires rebasing each time :D
For my project, we rebase every 1-2 months. While rebasing we choose the latest commit from the `master` branch (We don't wait for releases).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] thinkharderdev commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1145296693
Seems reasonable. It does create a sort of cyclic dependency :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove closed issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
andygrove closed issue #2679: Proposal: remove automated ballista CI checks from DataFusion
URL: https://github.com/apache/arrow-datafusion/issues/2679
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] tustvold commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144761883
I think it is worth also highlighting that the arrow releases every 2 weeks are often breaking, at least on paper, and currently require going through this process.
I put some thoughts on this - https://github.com/apache/arrow-datafusion/issues/2583#issuecomment-1139477308
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion
Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144912807
I'm fine with removing these checks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org