You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/02 11:26:02 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #2679: Proposal: remove automated ballista CI checks from DataFusion

alamb opened a new issue, #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Basically the tests added in https://github.com/apache/arrow-datafusion/pull/2582 to keep ballista and datafusion in sync add significant burden to DataFusion development and I propose removing them, at least temporarily
   
   Here is a description of the process:
   https://github.com/apache/arrow-datafusion/blob/907504c/.github/pull_request_template.md?plain=1#L31-L47
   
   I think the rationale for the new CI  was to add friction on DataFusion API changes to encourage a more stable API; However, with the currently ongoing efforts to rework the object store and parquet reading, I think all we are doing with the process is slowing things down. 
   
   The alternative, to have Ballista keep up with changes in DataFusion, sounds daunting at first, but my firsthand experience suggests it is not that bad.  Specifically,  https://github.com/influxdata/influxdb_iox, my project, uses DataFusion similarly to Ballista (as the core query engine) and uses a DataFusion pin directly from master. Instead of impinging on the DataFusion development process, we keep IOx up with DataFusion by [manually updating the DataFusion pin in IOX about once a week](https://github.com/influxdata/influxdb_iox/pulls?q=is%3Apr+update+datafusion+is%3Aclosed) , and sorting out any API changes. 
   
   This does take time, but it is mostly mechanical. We do occasionally find bugs that were introduced into DataFusion such as when we tried most recently with https://github.com/influxdata/influxdb_iox/pull/4743 and we then contribute a fix back upstream (e.g. https://github.com/apache/arrow-datafusion/pull/2674)
   
   I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)
   
   **Describe the solution you'd like**
   I propose removing the Ballista CI check in DataFusion 
   
   Specifically this check: https://github.com/apache/arrow-datafusion/blob/907504c5aa768601f9d70ad2c8f928bedfa9b069/.github/workflows/rust.yml#L128-L172
   
   And writing up instructions (maybe even automation) on how to upgrade the datafusion pin in Ballista manually
   
   **Describe alternatives you've considered**
   * Do nothing
   * Bring ballsita back into DataFusion repositoru
   
   **Additional context**
   The move of Ballista to a new repo is tracked in: https://github.com/apache/arrow-datafusion/issues/2502
   
   There are several discussions about this pain:
   * https://github.com/apache/arrow-ballista/pull/48#discussion_r885486298
   
   
   
   cc @andygrove @thinkharderdev @ming535 @Ted-Jiang @xudong963 @tustvold @korowa 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1145203577

   > Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny smile
   
   If it's any consolation, once https://github.com/apache/arrow-datafusion/pull/2675 and https://github.com/apache/arrow-datafusion/pull/2682 are merged I have no more refactoring planned other than removing some of the re-exports in the crate (https://github.com/apache/arrow-datafusion/issues/2683) in an effort to stabilize the public API for DataFusion.
   
   We're also moving to monthly releases now, with the next release planned for ~2 weeks from now (https://github.com/apache/arrow-datafusion/issues/2676) so re-basing monthly might be worth considering.
   
   With the refactoring work complete I will have more time for reviews and to help get PRs merged.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144960511

   > Rebasing PRs/Forks when someone loves to move functions/doing breaking changes
   
   Yeah, I do not envy you this task... I have enough difficulty keeping some of my longer-running PRs up to date :sweat_smile: 
   
   FWIW if there is some way I could help you move onto tracking master more directly I would be happy to help out. In particular I notice your fork of arrow-rs is about 9 months out of date, which is likely to become quite painful as DataFusion comes to rely on it even more... :sweat_smile: 
   
   > It's hard to contribute back to the upstream
   
   I'm sorry to hear this has been your experience. As mentioned above I'll happily help get contributions shepherded through, but the further things get from arrow-rs the less context I have to be able to meaningfully review things...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1145222032

   Also @ovr in IOx, we have found that by using the extension mechanisms in DataFusion (user defined Exprs, User Defined Plan Nodes, etc) we have been able to add IOx specific functionality / plans without a fork of DataFusion, which minimizes the amount of maintenance effort we need to expend to use master datafusion directly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ovr commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
ovr commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144922939

   > I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)
   
   We have two separate projects in the company which use DF as a Query Engine, and we came to the decision to maintain two different versions in the forked repository because it's impossible to use the latest version, there are some problems:
   
   It's hard to contribute back to the upstream. We had subqueries before DF introduced it, We have ANY and UDTFs, etc. We tried to push it back to the upstream, but It's not merged and probably, will not be merged soon.
   
   https://github.com/apache/arrow-datafusion/pull/2549
   https://github.com/apache/arrow-datafusion/pull/2177
   
   Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny 😄 
   Don't forget that DF relies on the SQL parser & Arrow which we already forked. The same story, it requires rebasing each time :D
   
   For my project, we rebase every 1-2 months. While rebasing we choose the latest commit from the `master` branch (We don't wait for releases).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1145296693

   Seems reasonable. It does create a sort of cyclic dependency :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove closed issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
andygrove closed issue #2679: Proposal: remove automated ballista CI checks from DataFusion
URL: https://github.com/apache/arrow-datafusion/issues/2679


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144761883

   I think it is worth also highlighting that the arrow releases every 2 weeks are often breaking, at least on paper, and currently require going through this process.
   
   I put some thoughts on this - https://github.com/apache/arrow-datafusion/issues/2583#issuecomment-1139477308
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2679: Proposal: remove automated ballista CI checks from DataFusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2679:
URL: https://github.com/apache/arrow-datafusion/issues/2679#issuecomment-1144912807

   I'm fine with removing these checks. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org