Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/05 01:14:22 UTC

[GitHub] [spark] rxin commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets

rxin commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets
URL: https://github.com/apache/spark/pull/24515#issuecomment-489378015
 
 
   As the person who initially filed the ticket, I actually no longer believe in it, for the following reasons:
   
   1. Usage of the typed Dataset API is very small. On Databricks, which covers thousands of organizations, roughly 1% of workloads use the typed Dataset API. We didn't particularly encourage users to go one way or the other; they just ended up mostly using the untyped DataFrame API. So the number of users this would benefit is small.
   
   2. It is really difficult to get this working well. Collectively, I think we have sunk over two person-years into this with some pretty strong engineers, and the prototype we had was still bad enough that we decided not to ship it in production and eventually deleted all the code from our codebase. It is very easy to make a couple of simple programs work to demo this feature, but users get confused when they hit a performance cliff: a very simple addition to their program breaks the optimization (see the sketch after this list).
   
   3. There's significant maintenance overhead, maybe the highest in all of Spark. Taken to the extreme, this part alone would amount to building a brand new JVM. The number of people who will be able to understand the codebase will be tiny.
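   
   To make the typed-vs-untyped distinction in point 1 and the performance cliff in point 2 concrete, here is a rough sketch (the Person class, the isAdult helper, and the data are made up for illustration; this is not code from this PR):
   
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.col
      
      // Made-up case class for the sketch.
      case class Person(name: String, age: Int)
      
      object TypedVsUntyped {
        // Hypothetical helper: trivially simple, but assume the bytecode
        // analyzer cannot (or does not yet) see through the extra call.
        def isAdult(p: Person): Boolean = p.age > 21
      
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("typed-vs-untyped")
            .master("local[*]")
            .getOrCreate()
          import spark.implicits._
      
          val ds = Seq(Person("a", 20), Person("b", 30)).toDS()
      
          // Typed Dataset API: the predicate is an arbitrary JVM closure.
          // Catalyst only sees an opaque function, so each row is deserialized
          // into a Person before the closure runs, and the predicate cannot be
          // pushed down or code-generated.
          val typed = ds.filter(p => p.age > 21)
      
          // Untyped DataFrame API: the predicate is a Catalyst expression, so
          // it can be optimized, code-generated, and pushed to the source.
          val untyped = ds.toDF().filter(col("age") > 21)
      
          // The bytecode analyzer aims to translate simple closures like the
          // typed one above into expressions automatically. The cliff: a small,
          // innocent-looking refactor, such as routing the comparison through a
          // helper the analyzer cannot handle, silently drops the query back
          // onto the slow, unanalyzed path.
          val cliff = ds.filter(p => isAdult(p))
      
          typed.explain()
          untyped.explain()
          cliff.explain()
          spark.stop()
        }
      }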
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org