Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 17:51:24 UTC

[GitHub] [arrow-ballista] GavinRay97 commented on issue #30: [Discuss] Ballista Future Direction

GavinRay97 commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1133730265

   > To be transparent, my team is building a query engine which is sensitive to time-to-first-result latency so we are very interested in fully streaming execution (and hoping to upstream as much as we can) but want to make sure that this is in line for the desired direction of Ballista for the rest of the community.
   
   I also have major use cases for latency-sensitive, potentially multi-source queries.
   It boils down to being able to use it for end-user/interactive applications.
   
   One of the biggest bummers to me about Spark is that its architecture cripples it for latency-sensitive workloads.
   I wanted to see what the latency was like for a basic two-DB join query between in-memory databases:
   - https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L80-L112
   
   Something like:
   ```
   SELECT ... FROM db1.foo JOIN db2.bar ON ... LIMIT 1
   ```
   
   Using the latest Spark nightly snapshot, this takes 150-200ms on my personal machine.
   
   A significant portion of this is spent on things that are relevant to multi-node computation but not required for in-memory execution on a single node (serialization, broadcasts, scheduling/coordination).
   
   The codegen + execution time isn't that bad.
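
For a rough single-process point of comparison, the same query shape can be timed with no cluster machinery involved. This is only a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in (not Spark or Ballista), and the `foo`/`bar` schemas, the join column, and the `time_first_row` helper are hypothetical names made up for illustration. It measures only the time from issuing the query to fetching the first row.

```python
import sqlite3
import time


def time_first_row(rows_per_table=1000):
    """Return (first_row, elapsed_ms) for a two-database join with LIMIT 1."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH of ':memory:' creates a separate in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS db1")
    conn.execute("ATTACH DATABASE ':memory:' AS db2")

    # Hypothetical schemas; the original comment does not specify them.
    conn.execute("CREATE TABLE db1.foo (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE db2.bar (id INTEGER PRIMARY KEY, foo_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO db1.foo VALUES (?, ?)",
        [(i, f"name-{i}") for i in range(rows_per_table)],
    )
    conn.executemany(
        "INSERT INTO db2.bar VALUES (?, ?, ?)",
        [(i, i, i * 1.5) for i in range(rows_per_table)],
    )
    conn.commit()

    # Time only query submission plus retrieval of the first result row.
    start = time.perf_counter()
    cursor = conn.execute(
        "SELECT f.id, f.name, b.amount "
        "FROM db1.foo AS f JOIN db2.bar AS b ON b.foo_id = f.id "
        "LIMIT 1"
    )
    first_row = cursor.fetchone()
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    conn.close()
    return first_row, elapsed_ms


if __name__ == "__main__":
    row, ms = time_first_row()
    print(f"first row: {row}, time to first result: {ms:.3f} ms")
```

The numbers from a sketch like this are not directly comparable to the Spark figure above, since the engine and data sources differ; it just shows the kind of time-to-first-result measurement being discussed.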
   
   Understandably, Spark isn't tailored for this. But there's a lot of great technology in there (Catalyst, Tungsten) that is state-of-the-art for query optimization and performance, and it's a bummer that you can't (to my knowledge) configure Spark for a "local" mode, or interact directly with just the pieces you need to manually evaluate expressions or do query optimization.
   
   It would be great if the future of Ballista accommodated this. It opens up interesting possibilities.
   
   

