You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/14 20:44:12 UTC

[GitHub] [arrow-ballista] dariusgm commented on issue #30: [Discuss] Ballista Future Direction

dariusgm commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1155694167

   What are the downsides of apache spark, why somebody should use ballista?
   
   imo as far as I read posts and watched a talk on yt, the memory consumption of spark is huge. Even a hello world in spark will consume a lot of memory. The reduced memory consumption can be a big advantage for ballista. I always have to increase the memory of our production spark jobs - sometimes to even 32GB per executor.
   
   What can do apache spark good?
   
   Its integration very good with the hadoop distributed file system. This is from my perspective the big advantage: the computation is taken place as close to the data as possible, moving only data when really required. Again, as far as I understand, this is currently not possible with ballista?
   
   And, as I am a data engineer and doing a lot of analytics, I really like the spark-shell to play around with the data. 
   
   I just started using rust, so maybe I am not the biggest help for implementing features, but I know spark from a developer perspective quite well as I am using it daily.
   
   What could be a direction: Have the data mostly/only in ram in the cluster, reducing the (slow) HDD/SSD reads and running the computation than on the arrow In-memory data frames. That maybe a niche to fit into. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org