Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 04:14:42 UTC

[GitHub] [arrow-ballista] realno commented on issue #30: [Discuss] Ballista Future Direction

realno commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1133527214

   @thinkharderdev Thanks for the nice write-up. Among the three options, I am leaning towards option 3, as it seems the most realistic to achieve. That is, the community can focus on making it work really well for a specific set of use cases first, which will hopefully grow the community further.
   
   For streaming vs. batch, I don't have a strong opinion at this point; I believe whoever can use it in a real use case should try to drive the project forward.
   
   We are not ready yet to do anything serious, though we have two major use cases in mind:
   1. Replace Spark for batch processing - 100s of billions of rows regularly with mega plans (100s of millions of expressions and projections)
   2. High-concurrency, medium-latency (sub-second) queries from data storage of 100s of billions of rows - result sets are not large
   
   Use case 2 seems similar to your use case. I am curious: when you said fully streaming execution, did you mean something like Flink? I think there is value in supporting operators/algorithms that need to see the entire dataset/partition multiple times (for example ML), so a hybrid model would be good. For example, the compiler could analyze the query and make part of it "fully streaming" where possible.
   
   Some other requirements for us are:
   1. easier and cheaper than Spark to operate 
   2. natively support k8s - delegate resource and cluster management entirely to k8s


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org