Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 19:17:18 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #1916: Discussion: Is Ballista a standalone system or framework

alamb opened a new issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916

> I feel ideally they should use the same programming interface (SQL or DataFrame): DataFusion provides computation on a single node and Ballista adds a distributed layer. With this assumption, DF is the compute core, so wouldn't it make sense to have udf support in DF?

I may be misunderstanding, but I have always thought of DF as a computing library that is not deployed to production on its own. People who use DF add it as a dependency of their project and then build their own computing engine on top of it. For example, Ballista is a distributed computing engine built on DF. Ballista is a full-fledged computing engine, like Presto or Spark: its users only need to download Ballista and deploy it to their machines to start the Ballista service. They rarely care how Ballista is implemented, so a udf plugin that supports dynamic loading lets these people define their own udfs without modifying Ballista's source code.

> I feel ideally they should use the same programming interface (SQL or DataFrame): DataFusion provides computation on a single node and Ballista adds a distributed layer. With this assumption, DF is the compute core, so wouldn't it make sense to have udf support in DF?

Yes, it is important, and indeed required, for DF to support udfs. But people who use DF do not need a `udf plugin` that loads udfs dynamically, because they use DF as a dependency to build their own computing engine, such as Ballista. Imagine that Ballista and DF were not in the same repository but were two separate projects. As a Ballista developer, I need to add my own udfs to meet my particular analysis needs. Most likely I would manage those udfs myself, for example by writing their implementation directly in the Ballista crate. Or I would add a `udf plugin` to Ballista, as this PR does, which supports dynamic loading of udfs developed by Ballista users (not Ballista developers). Then I decide when to call DF's `register_udf` method to register these udfs in the `ExecutionContext` so that DF can use them in computation. Of course, we could put the udf plugin directly in DF, but DF does not need this feature, and doing so would make the `register_udf` method look redundant and make the design of DF's udf support harder to understand.

So I would say that the people who need the `udf plugin` most are those who use Ballista as a full-fledged computing engine and simply download and deploy it. They don't modify the source code of Ballista and DF, because that would require a deeper understanding of both projects. And once they modify that source code, they have to spend more effort merging and building every time they upgrade Ballista. But today, if users just download and deploy Ballista, they have no way to register their udfs in DF. The core goal of the udf plugin is to give udfs that have not been compiled into the project a way to be discovered and registered in DF.

Finally, if we define Ballista's goal as a distributed implementation of DataFusion, that is, a library to be used as a dependency of other projects rather than a distributed computing engine (like Presto or Spark) that can be downloaded, deployed, and used directly, then it seems to me that the udf plugin is not necessary. Its core goal is to give udfs that have not been compiled into the project a way to be discovered and registered in DF, and projects that use Ballista as a dependency can manage their own udfs and decide when to register them in DF.
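
A minimal Rust sketch of the registration flow described above. It is not DataFusion's or Ballista's actual API: `ScalarUdf`, `Context`, `UdfPlugin`, `MyPlugin`, and `load_plugins` are hypothetical stand-ins, udfs are simplified to functions over `f64` slices instead of Arrow arrays, and plugins are plain trait objects rather than dynamically loaded libraries. It only illustrates that the host engine, not the library, discovers externally supplied udfs and decides when to call `register_udf`.

```rust
use std::collections::HashMap;

/// Hypothetical scalar udf: a name plus a function over f64 values.
/// (Assumption: real engines operate on Arrow arrays; this is simplified.)
pub struct ScalarUdf {
    pub name: String,
    pub fun: Box<dyn Fn(&[f64]) -> f64 + Send + Sync>,
}

/// Hypothetical stand-in for the library's execution context.
#[derive(Default)]
pub struct Context {
    udfs: HashMap<String, ScalarUdf>,
}

impl Context {
    /// Plays the role of the `register_udf` method mentioned in the discussion.
    pub fn register_udf(&mut self, udf: ScalarUdf) {
        self.udfs.insert(udf.name.clone(), udf);
    }

    /// Looks up a registered udf by name and evaluates it.
    pub fn call(&self, name: &str, args: &[f64]) -> Option<f64> {
        self.udfs.get(name).map(|udf| (udf.fun)(args))
    }
}

/// Hypothetical plugin interface that a deployed engine looks for at startup.
pub trait UdfPlugin {
    fn udfs(&self) -> Vec<ScalarUdf>;
}

/// Example plugin written by an end user, outside the engine's source tree.
pub struct MyPlugin;

impl UdfPlugin for MyPlugin {
    fn udfs(&self) -> Vec<ScalarUdf> {
        vec![ScalarUdf {
            name: "sum_of_squares".to_string(),
            fun: Box::new(|args: &[f64]| -> f64 {
                args.iter().map(|x| x * x).sum::<f64>()
            }),
        }]
    }
}

/// The host engine decides when to register everything the plugins expose.
pub fn load_plugins(ctx: &mut Context, plugins: &[Box<dyn UdfPlugin>]) {
    for plugin in plugins {
        for udf in plugin.udfs() {
            ctx.register_udf(udf);
        }
    }
}
```

In this sketch the host would build the plugin list at startup, call `load_plugins` once, and afterwards `ctx.call("sum_of_squares", &[1.0, 2.0, 3.0])` evaluates the user's function without the engine having been recompiled against it.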

_Originally posted by @gaojun2048 in https://github.com/apache/arrow-datafusion/issues/1881#issuecomment-1057712514_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

yjshen commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058815495


   I agree Ballista should be a standalone system like Spark SQL or Presto. 
   
   > I think Ballista should act as a distributed computing framework like Spark Core. Like Spark is based on the RDD, Ballista is based on the ExecutionPlan for the DAG.
   
   So it's not Spark Core on RDDs but Spark Core on a SQL physical plan, right?
   


[GitHub] [arrow-datafusion] thinkharderdev commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058424820


   I agree that the goal should be for Ballista to be a standalone system (or at least something that CAN be deployed as a standalone system), but one of the unique and valuable aspects of DataFusion is its extensibility. It is a really important differentiator from other solutions in this space, and I think we should ensure that the extensibility of DataFusion extends to Ballista. 


[GitHub] [arrow-datafusion] realno edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1066188955

> Also, as a standalone system, Ballista will compete with the heavyweights in the category (Spark, Presto, ...). That is an interesting but very ambitious goal 😄

I feel some opportunities/differentiators for Ballista are the following:
1. Non-JVM - this brings a lot of benefits, such as a lower footprint, memory efficiency, and no GC cost (Rust specific)
2. A chance for more modern design principles - for example, Spark was originally architected to be deployed on bare metal, and it is hard to change it to be more cloud friendly
3. Use of modern resource management and orchestration technologies - reusing mature tools like k8s will simplify Ballista's implementation (it probably doesn't need a very complex resource management system anymore) and make it easy to integrate with modern systems (cloud native, with a simpler scaling model and multi-tenancy)
4. Using Arrow as the backbone opens doors for more advanced use cases such as ML - it may be efficiently integrated with Pandas or TensorFlow through Arrow.

We use systems like Spark heavily for analytics and ML, and the above points are pain points that make switching worth considering.

[GitHub] [arrow-datafusion] alamb commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1065943124

> Also, as a standalone system, Ballista will compete with the heavyweights in the category (Spark, Presto, ...). That is an interesting but very ambitious goal 😄

DataFusion is not JVM based, which could be an interesting differentiator.

I think making a generic embedded distributed framework will be challenging, as there are many dimensions (catalog structure, local caching, etc.) along which systems may differ.

By comparison, I think a single-node, column-oriented analytic query engine is a fairly well understood pattern (though I do think the DataFusion implementation is very good :bowtie: )

One thing I personally hope is that Ballista drives features into DataFusion so that making a new distributed engine using DataFusion becomes easier over time.

Some examples of this technical flow, I think, are:
1. The extraction of `datafusion-proto` struct serialization by @carols10cents
2. The object store abstraction from @yjshen
3. The listing table provider from @rdettai
4. Making planning `async`
5. The work that @mingmwang is doing to enable intra-process concurrency

[GitHub] [arrow-datafusion] andygrove commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058488662


   I also agree that Ballista should be a standalone system, with a client API that can be used as a library.


[GitHub] [arrow-datafusion] Igosuki commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058712220


   Extensibility is a major selling point; the only thing I'd be worried about is having stable interfaces.
   As for Ballista, my only problem right now is that I have to pull in the entirety of DataFusion as a library to be able to send even the simplest SQL query using the client.


[GitHub] [arrow-datafusion] realno edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058871647


   +1 Ballista should be a standalone system.
   
   I think there is another side to this question: do we consider DataFusion a pure library? Would this change how things are organized? 
   If so, what is the API for interfacing with other clients: the logical plan or the physical plan? And where would SQL parsing, the optimizer, and the planner go?
   
    
   
   
   


[GitHub] [arrow-datafusion] yahoNanJing commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

yahoNanJing commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058769722


   Actually, I think Ballista should act as a distributed computing framework like Spark Core. Just as Spark is based on the RDD, Ballista is based on the ExecutionPlan for the DAG.
   
   Based on this framework, Ballista should also support several kinds of deployment. Currently, only standalone mode is provided. In the future, it would be possible to introduce more resource managers, like YARN, Mesos, etc.
   
   As for SQL, I think it should be an independent part. The core of Ballista should not depend on SQL. 
   ![Picture1](https://user-images.githubusercontent.com/90197956/156688633-6789dc67-a2c6-444b-9e9d-d4bf415d1c87.png)
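
To illustrate what "based on the ExecutionPlan for the DAG" could mean for a scheduler, here is a minimal sketch in Rust. It does not describe Ballista's actual scheduler: `Stage` and `schedule` are hypothetical names, a stage is reduced to an id plus the ids it depends on, and the ordering uses Kahn's algorithm for topological sorting as one plausible way to make sure a stage only runs after its inputs.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stage of a distributed job: an id plus the stages it reads from.
pub struct Stage {
    pub id: u32,
    pub depends_on: Vec<u32>,
}

/// Returns stage ids in an order where every stage comes after its inputs
/// (Kahn's algorithm). Returns None if the graph has a cycle or a stage
/// depends on an id that is not in `stages`.
pub fn schedule(stages: &[Stage]) -> Option<Vec<u32>> {
    // Number of unfinished inputs per stage, and the reverse edges.
    let mut pending: HashMap<u32, usize> = HashMap::new();
    let mut dependents: HashMap<u32, Vec<u32>> = HashMap::new();
    for stage in stages {
        pending.insert(stage.id, stage.depends_on.len());
        for dep in &stage.depends_on {
            dependents.entry(*dep).or_default().push(stage.id);
        }
    }

    // Start with the stages that have no inputs, in ascending id order.
    let mut ready: Vec<u32> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(id, _)| *id)
        .collect();
    ready.sort_unstable();
    let mut queue: VecDeque<u32> = ready.into();

    let mut order = Vec::new();
    while let Some(id) = queue.pop_front() {
        order.push(id);
        if let Some(children) = dependents.get(&id) {
            for child in children {
                let n = pending.get_mut(child)?;
                *n -= 1;
                if *n == 0 {
                    queue.push_back(*child);
                }
            }
        }
    }

    // Stages left unscheduled indicate a cycle or a missing dependency.
    if order.len() == stages.len() {
        Some(order)
    } else {
        None
    }
}
```

Nothing in this sketch refers to SQL: the scheduler only sees stages and their dependencies, which is the separation the comment above argues for.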


[GitHub] [arrow-datafusion] rdettai edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

rdettai edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1064876968


   I understand that Ballista is currently heading toward being a standalone system, but I wonder whether that is what the ecosystem needs. 
   
   I feel that being a pluggable library is a big part of DataFusion's success. But what about the projects that embed DataFusion today as a single-node compute engine: are they not going to need to be distributed tomorrow? If Ballista is really designed as a standalone system, those growing projects might use it as an example of how to distribute the DataFusion query plan, but they might not be able to reuse much code. 
   
   Also, as a standalone system, Ballista will compete with the heavyweights in the category (Spark, Presto, ...). That is an interesting but very ambitious goal 😄 


[GitHub] [arrow-datafusion] liukun4515 commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058916365


   > One possibility, I'm sure, is that once we have #1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.
   
   good point.
   


[GitHub] [arrow-datafusion] andygrove commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1064103628


   I am now back working on Ballista after a bit of a break and can now contribute a bit more to this discussion.
   
   Although I see Ballista as a standalone system that users will likely install as Docker images or from a tarball, I think it is still important to publish the crates to crates.io so that `cargo install` is another installation option. 
   
   It is also important that we publish the `ballista` crate, which provides the client API, so that we can call it from other projects, such as the Ballista Python bindings (which I am starting work on now, based on the work in datafusion-python).
   
   I noticed that we cannot publish the Ballista crates from the recent 7.0.0 release tarball and have filed https://github.com/apache/arrow-datafusion/issues/1980 to fix that for the next release.


[GitHub] [arrow-datafusion] gaojun2048 commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

gaojun2048 commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058848293


   > I also agree that Ballista should be a standalone system, with a client API that can be used as a library.
   
   I agree with you.


[GitHub] [arrow-datafusion] alamb commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058410023


   For what it is worth, my mental model is similar to @gaojun2048's: DataFusion is mostly designed to be a library for building other systems. For example, `datafusion-cli` and the `datafusion` Python bindings are in my mind "systems" built on DataFusion. I realize the line is a little blurry here, as pointed out by @realno, and that DataFusion itself has some things I would normally expect to live outside such a library (such as `CREATE TABLE` support).
   
   I have always thought of Ballista as intended to be a standalone system (one that people would deploy without having to build it from source). 
   
   I am curious to hear what others think
   
   cc @edrevo @andygrove @thinkharderdev @matthewmturner @liukun4515 @Ted-Jiang @yjshen 
   
   


[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1059968000


   standalone system +1 
   Look forward to it going into production one day.


[GitHub] [arrow-datafusion] yjshen commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

yjshen commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058889293


   > I think there is another side of this question- do we consider DataFusion as a pure library? Would this change how things are organized? If so, what is the API to interface with other clients, logical plan? physical plan? And where would SQL parsing, optimizer and planner go?
   
   Yes, I think so. And actually, we are already heading in that direction. The first step toward making each module substitutable is to put each part (SQL parsing, optimizing, and executing) in its own crate. @Jimexist is leading the effort in https://github.com/apache/arrow-datafusion/issues/1750. Once this separation of functionality is done, the next step would be to introduce interfaces between adjacent parts, depending on how we would want to substitute them. 
   
   One possibility, I'm sure, is that once we have https://github.com/apache/arrow-datafusion/pull/1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.
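
As a rough illustration of "interfaces between adjacent functionalities", the following Rust sketch puts each stage behind its own trait so that any one of them can be swapped out. None of this is DataFusion's real API: `LogicalPlan`, `Parser`, `Optimizer`, `Executor`, `Engine`, and the three toy implementations are hypothetical, and a plan is reduced to a list of operator names.

```rust
/// Hypothetical, heavily simplified plan: just a list of operator names.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalPlan(pub Vec<String>);

/// One trait per stage, so each stage could live in its own crate.
pub trait Parser {
    fn parse(&self, query: &str) -> LogicalPlan;
}

pub trait Optimizer {
    fn optimize(&self, plan: LogicalPlan) -> LogicalPlan;
}

pub trait Executor {
    fn execute(&self, plan: &LogicalPlan) -> usize;
}

/// Toy parser: every whitespace-separated token becomes an operator.
pub struct WhitespaceParser;

impl Parser for WhitespaceParser {
    fn parse(&self, query: &str) -> LogicalPlan {
        LogicalPlan(query.split_whitespace().map(str::to_string).collect())
    }
}

/// Toy optimizer: drops consecutive duplicate operators.
pub struct DedupOptimizer;

impl Optimizer for DedupOptimizer {
    fn optimize(&self, mut plan: LogicalPlan) -> LogicalPlan {
        plan.0.dedup();
        plan
    }
}

/// Toy executor: "runs" the plan by counting its operators.
pub struct CountingExecutor;

impl Executor for CountingExecutor {
    fn execute(&self, plan: &LogicalPlan) -> usize {
        plan.0.len()
    }
}

/// The engine only knows the interfaces, not the implementations.
pub struct Engine {
    pub parser: Box<dyn Parser>,
    pub optimizer: Box<dyn Optimizer>,
    pub executor: Box<dyn Executor>,
}

impl Engine {
    pub fn run(&self, query: &str) -> usize {
        let plan = self.parser.parse(query);
        let plan = self.optimizer.optimize(plan);
        self.executor.execute(&plan)
    }
}
```

Under this kind of layout, replacing the front end with an external planner would mean supplying a different `Parser` (and possibly `Optimizer`) implementation while keeping the same `Executor`.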
   
   
   


[GitHub] [arrow-datafusion] liukun4515 commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058917961


   > I also agree that Ballista should be a standalone system, with a client API that can be used as a library.
   
   I agree with that, and I expressed the same opinion in https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057594304


[GitHub] [arrow-datafusion] realno commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058905885


   > Yes, I think so. And actually, we are already heading in that direction. The first step toward making each module substitutable is to put each part (SQL parsing, optimizing, and executing) in its own crate. @Jimexist is leading the effort in https://github.com/apache/arrow-datafusion/issues/1750. Once this separation of functionality is done, the next step would be to introduce interfaces between adjacent parts, depending on how we would want to substitute modules.
   
   I think this is a good direction. 


[GitHub] [arrow-datafusion] Igosuki commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

Igosuki commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1066201917


   DataFusion has a huge advantage, but the user story needs to be improved. Also, horizontal scalability and using it for distributed ML are still an issue, because it needs to be able to map partitions and run multi-stage jobs.
   


[GitHub] [arrow-datafusion] realno edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1066188955

> Also, as a standalone system, Ballista will compete with the heavyweights in the category (Spark, Presto, ...). That is an interesting but very ambitious goal 😄

We heavily use systems like Spark for analytics and ML, and the points above are pain points that make switching worth considering. I feel the pivot point for having a new reference distributed compute platform is getting closer. :)


[GitHub] [arrow-datafusion] realno edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1066188955

> Also, as a standalone system, Ballista will compete with the heavyweights in the category (Spark, Presto, ...). That is an interesting but very ambitious goal 😄

I feel some opportunities/differentiators for Ballista are the following:
1. non-JVM - this brings a lot of benefits, such as a lower footprint, memory efficiency, and no GC cost (Rust specific)
2. A chance for more modern design principles - for example, Spark was originally architected to be best deployed on bare metal, so it is hard to change it to be more cloud friendly
3. Utilize modern resource management and orchestration technologies - reusing mature tools like k8s will simplify Ballista's implementation (it probably doesn't need a very complex resource management system any more) and make it easy to integrate with modern systems (cloud native and simpler multi-tenancy)
4. Using Arrow as the backbone opens doors for more advanced use cases such as ML - it may be efficiently integrated with Pandas or TensorFlow through Arrow.

We heavily use systems like Spark for analytics and ML, and the points above are pain points that make switching worth considering.


[GitHub] [arrow-datafusion] rdettai commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

rdettai commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1066162937


   > I think making a generic embedded distributing framework will be challenging as there are so many differing dimensions to consider (catalog structure, local caching, etc) that may be different
   
   Agreed! I was thinking more about designing the engine as modules that can be reused more easily by other distributed systems. For instance, using https://github.com/substrait-io/substrait to serialize the tasks for the workers, instead of the current custom protos, could help other systems reuse Ballista workers with minimal effort.
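
One way to picture this kind of modularity is a worker that only depends on a small codec interface, so the wire format can be swapped without touching the worker. The sketch below is an illustration under assumptions, not Ballista's actual design: `Task`, `TaskCodec`, `LineCodec`, `parse_u32` and `run_worker` are hypothetical names, and the toy line-based format merely stands in for a real serialization such as protobuf or Substrait.

```rust
/// A unit of work sent from a scheduler to a worker (heavily simplified).
#[derive(Debug, Clone, PartialEq)]
struct Task {
    stage_id: u32,
    partition: u32,
    /// Stand-in for a serialized query plan fragment.
    plan: String,
}

/// The only thing a worker needs to know about the wire format.
trait TaskCodec {
    fn encode(&self, task: &Task) -> Vec<u8>;
    fn decode(&self, bytes: &[u8]) -> Result<Task, String>;
}

/// Toy codec: stage id, partition and plan separated by newlines.
struct LineCodec;

/// Parse one numeric header field, reporting which field was missing or bad.
fn parse_u32(field: Option<&str>, name: &str) -> Result<u32, String> {
    field
        .ok_or_else(|| format!("missing {}", name))?
        .parse::<u32>()
        .map_err(|e| format!("bad {}: {}", name, e))
}

impl TaskCodec for LineCodec {
    fn encode(&self, task: &Task) -> Vec<u8> {
        format!("{}\n{}\n{}", task.stage_id, task.partition, task.plan).into_bytes()
    }

    fn decode(&self, bytes: &[u8]) -> Result<Task, String> {
        let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
        // Split into at most three parts so the plan itself may contain newlines.
        let mut parts = text.splitn(3, '\n');
        let stage_id = parse_u32(parts.next(), "stage id")?;
        let partition = parse_u32(parts.next(), "partition")?;
        let plan = parts.next().ok_or("missing plan")?.to_string();
        Ok(Task { stage_id, partition, plan })
    }
}

/// A worker that is generic over the codec: it never inspects the bytes itself.
fn run_worker<C: TaskCodec>(codec: &C, message: &[u8]) -> Result<String, String> {
    let task = codec.decode(message)?;
    Ok(format!(
        "stage {} partition {}: executing {}",
        task.stage_id, task.partition, task.plan
    ))
}
```

With this shape, moving to a shared plan format would mean adding another `TaskCodec` implementation, while `run_worker` stays unchanged.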
   



[GitHub] [arrow-datafusion] realno edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058871647


   +1 Ballista should be a standalone system.
   
   I think there is another side to this question: do we consider DataFusion a pure library? Would this change how things are organized? 
   If so, what is the API for interfacing with other clients: the logical plan or the physical plan? And where would SQL parsing, the optimizer and the planner go?
   
   The first thought that came to mind is to make the physical plan the interface, so that different clients can extend the SQL parser, optimizer and planner as needed. Most can reuse the existing implementation; an example for Ballista is extending the planner to support UDFs. 
   
   
   



[GitHub] [arrow-datafusion] realno edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

realno edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058871647


   +1 Ballista should be a standalone system.
   
   I think there is another side to this question: do we consider DataFusion a pure library? Would this change how things are organized? 
   If so, what is the API for interfacing with other clients: the logical plan or the physical plan? And where would SQL parsing, the optimizer and the planner go?
   
   The first thought that came to mind is to make the physical plan the interface, so that different clients have the option to extend the SQL parser, optimizer and planner as needed. Most can reuse the existing implementation; an example for Ballista is extending the planner to support UDFs. 
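
As a rough illustration of "physical plan as the interface, with a client-extended planner", the sketch below uses only the Rust standard library. It does not use DataFusion's real types: `PhysicalPlan`, `Values`, `Apply`, `ScalarFn`, `Planner`, `register_function` and `plan_call` are all hypothetical names chosen for this example.

```rust
use std::collections::HashMap;

/// Stand-in for a physical plan node: something that can produce rows.
trait PhysicalPlan {
    fn execute(&self) -> Vec<f64>;
}

/// Leaf node that returns in-memory values.
struct Values(Vec<f64>);

impl PhysicalPlan for Values {
    fn execute(&self) -> Vec<f64> {
        self.0.clone()
    }
}

/// A scalar user-defined function applied to each row.
type ScalarFn = fn(f64) -> f64;

/// Node that applies a scalar function to every row of its input.
struct Apply {
    func: ScalarFn,
    input: Box<dyn PhysicalPlan>,
}

impl PhysicalPlan for Apply {
    fn execute(&self) -> Vec<f64> {
        self.input.execute().into_iter().map(self.func).collect()
    }
}

/// A planner that a client can extend by registering functions by name.
#[derive(Default)]
struct Planner {
    functions: HashMap<String, ScalarFn>,
}

impl Planner {
    /// Make `func` available to queries under `name`.
    fn register_function(&mut self, name: &str, func: ScalarFn) {
        self.functions.insert(name.to_string(), func);
    }

    /// Plan the call `name(input)`; fails if `name` was never registered.
    fn plan_call(
        &self,
        name: &str,
        input: Box<dyn PhysicalPlan>,
    ) -> Result<Box<dyn PhysicalPlan>, String> {
        let func = *self
            .functions
            .get(name)
            .ok_or_else(|| format!("unknown function: {}", name))?;
        Ok(Box::new(Apply { func, input }))
    }
}
```

The execution side only ever sees `PhysicalPlan` values, so a client such as a distributed engine could register its own functions on the planner without changes to how plans are executed.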
   
   
   



[GitHub] [arrow-datafusion] matthewmturner commented on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058887921


   I think I agree with @realno. For example, the DataFusion Python bindings effectively reuse the Rust implementation, which so far has worked well for my (limited) use cases. The Python bindings also have UDFs, but I haven't had a chance to look into how they're implemented and how that may contrast with Ballista's needs.



[GitHub] [arrow-datafusion] yjshen edited a comment on issue #1916: Discussion: Is Ballista a standalone system or framework

Posted by GitBox <gi...@apache.org>.

yjshen edited a comment on issue #1916:
URL: https://github.com/apache/arrow-datafusion/issues/1916#issuecomment-1058889293


   > I think there is another side of this question- do we consider DataFusion as a pure library? Would this change how things are organized? If so, what is the API to interface with other clients, logical plan? physical plan? And where would SQL parsing, optimizer and planner go?
   
   Yes, I think so. And actually, we are already heading in that direction. The first step towards making each module substitutable is to put each part (SQL parsing, optimizing, and executing) in its own crate. @Jimexist is leading the effort in https://github.com/apache/arrow-datafusion/issues/1750. Once this separation of functionality is done, the next step would be introducing interfaces between adjacent functionalities, depending on how we would substitute modules. 
   
   One possibility, I'm sure, is that once we have https://github.com/apache/arrow-datafusion/pull/1887 or even https://github.com/andygrove/substrait-rs, we can link Apache Calcite to DataFusion's physical plan execution layer.
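
A toy sketch of what "interfaces between adjacent functionalities" could look like: a shared plan type sits between interchangeable front ends and a single execution layer. This is standard-library Rust for illustration only; `Plan`, `Frontend`, `KeywordFrontend`, `PrebuiltFrontend`, `parse_num`, `execute` and `run` are hypothetical names and do not correspond to DataFusion, Substrait or Calcite APIs.

```rust
/// A tiny shared plan representation: the interface between stages.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(Vec<i64>),
    Filter { min: i64, input: Box<Plan> },
    Limit { n: usize, input: Box<Plan> },
}

/// Any front end only has to produce a `Plan`.
trait Frontend {
    fn plan(&self, query: &str) -> Result<Plan, String>;
}

/// The execution layer only has to consume a `Plan`.
fn execute(plan: &Plan) -> Vec<i64> {
    match plan {
        Plan::Scan(rows) => rows.clone(),
        Plan::Filter { min, input } => {
            execute(input).into_iter().filter(|v| v >= min).collect()
        }
        Plan::Limit { n, input } => execute(input).into_iter().take(*n).collect(),
    }
}

/// Parse a numeric argument, turning any parse failure into a message.
fn parse_num<T: std::str::FromStr>(s: &str) -> Result<T, String> {
    s.parse::<T>().map_err(|_| format!("not a number: {}", s))
}

/// Front end A: a toy command language ("min <v>", "limit <n>") over a fixed table.
struct KeywordFrontend {
    table: Vec<i64>,
}

impl Frontend for KeywordFrontend {
    fn plan(&self, query: &str) -> Result<Plan, String> {
        let mut plan = Plan::Scan(self.table.clone());
        let mut tokens = query.split_whitespace();
        while let Some(op) = tokens.next() {
            let arg = tokens
                .next()
                .ok_or_else(|| format!("missing argument for {}", op))?;
            plan = match op {
                "min" => Plan::Filter { min: parse_num(arg)?, input: Box::new(plan) },
                "limit" => Plan::Limit { n: parse_num(arg)?, input: Box::new(plan) },
                other => return Err(format!("unknown operator: {}", other)),
            };
        }
        Ok(plan)
    }
}

/// Front end B: ignores the query text and returns a plan built elsewhere,
/// standing in for a plan handed over by an external planner.
struct PrebuiltFrontend {
    plan: Plan,
}

impl Frontend for PrebuiltFrontend {
    fn plan(&self, _query: &str) -> Result<Plan, String> {
        Ok(self.plan.clone())
    }
}

/// Glue: works with any front end because both sides agree on `Plan`.
fn run(frontend: &dyn Frontend, query: &str) -> Result<Vec<i64>, String> {
    Ok(execute(&frontend.plan(query)?))
}
```

Both front ends feed the same `execute` function, which is the property that would let an external planner drive the execution layer once a shared plan format exists.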
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org