Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/03 21:13:31 UTC

[GitHub] [spark] rednaxelafx commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets

URL: https://github.com/apache/spark/pull/24515#issuecomment-489241706
 
 
   Thanks for your work, @aokolnychyi and @dbtsai !
   I'm super excited about this PR as a concrete place to start a discussion on improving the performance of the existing typed Dataset operations.
   
   I worked on a continuation of @JoshRosen 's [prototype](https://github.com/apache/spark/compare/master...JoshRosen:expression-analysis?diff=unified&name=expression-analysis) about two years ago, so I have some first-hand experience with both the implementation details and the applicability of this direction.
   I'll be sharing my thoughts on this topic in the coming couple of days. It might end up being a long write-up, but please stay tuned!
   
   In the meantime, though, I'd really like to call on the community to share their use cases for the existing typed Dataset operations, so that we can better evaluate how much benefit this project will bring to real-world queries.
   
   In many cases, simply moving uses of the typed Dataset operations to the equivalent untyped DataFrame operations can substantially speed up queries; third-party solutions like [Quill](https://github.com/getquill/quill) can provide a typed API for Scala while directly generating untyped operations that are fast to begin with (in Quill's case, the generated code is SQL). So people who are not content with the current performance of the typed Dataset operations already have two directions they can pursue:
   1. Just use the untyped DataFrame API. **Pros**: fast; **Cons**: not statically typed in the host language (Scala).
   2. Use third-party bindings like Quill. **Pros**: fast and typed; **Cons**: doesn't cover all the use cases of the typed Dataset API.
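   To make direction 1 concrete, here's a minimal sketch of the kind of rewrite I mean (the `Person` case class and the data are made up for illustration): the typed version runs lambdas over deserialized objects, while the untyped version expresses the same query as `Column` expressions that Catalyst can optimize and codegen against the internal row format, with no per-row (de)serialization.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._

   object TypedVsUntyped {
     case class Person(name: String, age: Int)

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
       import spark.implicits._

       val people = Seq(Person("a", 20), Person("b", 35)).toDS()

       // Typed Dataset operations: each row is deserialized into a Person
       // so the lambdas can run, then serialized back afterwards.
       val typed = people.filter(p => p.age > 30).map(p => p.name)

       // Equivalent untyped DataFrame operations: pure Column expressions,
       // fully visible to the Catalyst optimizer and whole-stage codegen.
       val untyped = people.filter(col("age") > 30).select(col("name").as[String])

       assert(typed.collect().sameElements(untyped.collect()))
       spark.stop()
     }
   }
   ```

   Both versions return the same result; the difference is only in how much of the computation is opaque to the optimizer.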
   
   Or, if a query uses bulk operations like `mapPartitions`, then the per-record overhead of the typed API *could* be negligible.
   
   There are a few cases where users may be forced to use the typed Dataset operations, e.g. when they need to use Structured Streaming APIs like `mapGroupsWithState` and `flatMapGroupsWithState`. In such scenarios, it is indeed very important to be able to speed up typed Dataset operations because there may not be a good alternative.
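   As a sketch of the kind of code that has no untyped equivalent (the `Event`/`UserTotal` case classes and the data are made up for illustration; a real job would use `readStream`, but `mapGroupsWithState` also runs on batch Datasets, which keeps the sketch self-contained): the stateful lambda carries arbitrary per-group state between invocations, which cannot be expressed as `Column` expressions.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

   object StatefulSketch {
     case class Event(user: String, clicks: Long)
     case class UserTotal(user: String, total: Long)

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
       import spark.implicits._

       val events = Seq(Event("a", 1L), Event("a", 2L), Event("b", 5L)).toDS()

       // The user-supplied function sees deserialized Events and manages
       // opaque per-key state; this is inherently a typed operation.
       val totals = events
         .groupByKey(_.user)
         .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
           (user: String, evs: Iterator[Event], state: GroupState[UserTotal]) =>
             val prev = state.getOption.map(_.total).getOrElse(0L)
             val next = UserTotal(user, prev + evs.map(_.clicks).sum)
             state.update(next)
             next
         }

       assert(totals.collect().map(t => t.user -> t.total).toMap == Map("a" -> 3L, "b" -> 5L))
       spark.stop()
     }
   }
   ```

   Every row crossing into and out of that lambda pays the serialization cost, which is exactly why speeding up typed operations matters here.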
   
   So questions to everybody:
   - How much are you using typed Dataset operations?
   - Which operations?
   - What kind of code are you putting into the lambdas for the typed Dataset operations?
   
   Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
