Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/10 22:28:58 UTC

[GitHub] [spark] Victsm commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets

URL: https://github.com/apache/spark/pull/24515#issuecomment-491447769
 
 
   We also have a reasonable collection of Dataset API use cases at LinkedIn, especially centered around offline feature engineering pipelines, which rely on complex transformation logic that is not straightforward to express using DataFrame operations. We are also working on a similar prototype to address the Dataset performance issue. We are trying to find a balance between bringing the benefits of bytecode analysis and dealing with its complexity. Instead of trying to fully convert the lambda function into a Catalyst expression, which might run into many corner cases, we are focused on identifying which fields of the domain objects are being accessed, and leveraging that information in the column pruning optimization to cut down serde and IO overhead. We want to chime in on this discussion, provide our two cents, and work with the community to see how we can push this Dataset performance enhancement effort forward.
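   The core idea above (determine which fields a user lambda reads, then prune the unread columns before deserialization) can be illustrated with a small sketch. This is not the prototype described in the comment, which analyzes JVM bytecode statically; it is a hypothetical dynamic analogue in Python that traces one call of the lambda against a recording proxy. All names here (RecordingRow, accessed_fields, prune_columns) are made up for illustration. Note that a single dynamic trace can miss fields read only on untaken branches, which is one reason the real approach favors static bytecode analysis for a sound over-approximation.

   ```python
   class RecordingRow:
       """Proxy standing in for a domain object; records which fields get read."""

       def __init__(self, field_names):
           # Instance attributes resolve normally, so __getattr__ below
           # only fires for the (unknown) domain-object field names.
           self._fields = set(field_names)
           self.accessed = set()

       def __getattr__(self, name):
           if name in self._fields:
               self.accessed.add(name)
               return 0  # dummy value, just enough to let the lambda run
           raise AttributeError(name)


   def accessed_fields(fn, schema):
       """Run fn once against a probe object and report the fields it touched."""
       probe = RecordingRow(schema)
       try:
           fn(probe)  # single tracing run; dummy values may raise downstream
       except Exception:
           pass  # a partial trace may under-report fields read after the error
       return probe.accessed


   def prune_columns(rows, fn, schema):
       """Keep only the columns fn actually reads, mimicking column pruning."""
       keep = accessed_fields(fn, schema)
       return [{k: row[k] for k in keep} for row in rows]
   ```

   For example, `prune_columns(rows, lambda r: r.name + str(r.age), ["name", "age", "city"])` would drop the `city` column before the rows ever reach the lambda, which is the serde/IO saving the comment refers to.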

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org