Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/07/15 05:46:28 UTC

[GitHub] [spark] felixcheung edited a comment on issue #24991: [SPARK-28188] Materialize Dataframe API

URL: https://github.com/apache/spark/pull/24991#issuecomment-511280363
 
 
   > @rxin, this runs the query up to the point where `materialize` is called. The underlying RDD can then pick up from the last shuffle the next time it is used. This works better than caching in most cases when using dynamic allocation because executors are not sitting idle, but work can be resumed and shared across queries. We could rename the method if that would be clearer.
   > 
   > @srowen, I've seen this suggested on the dev list a few times and I think it is a good idea to add it. There is no guarantee that `count` does the same thing -- it could be optimized -- and it is a little tricky to get this to work with the Dataset API. This version creates a new DataFrame from the underlying RDD so that the work is reused from the last shuffle, instead of allowing the planner to re-optimize with later changes (usually projections) and discard the intermediate result. We have found this really useful for better control over the planner, as well as to cache data using the shuffle system.
   
   I have to agree with this. I've seen `count()` or `cache()` misused too many times, and too often people have to go back and remove all the calls to `count()`. So much so that I'm planning to write an optimizer rule to remove them. I'm only partly kidding.
   
   Maybe this isn't the right API for it, and that's OK. Let's improve it then, and make good suggestions to the community and contributors.
   
   I'm not sure `df.write.format("noop").save` is a good suggestion for the general Spark user.
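
The quoted point that `count` carries no guarantee of running the whole query, while an explicit materialize step does, can be illustrated with a toy model. The sketch below is plain Python and does not use or imitate any Spark API: `LazyFrame`, its `map`, `count`, `materialize`, and `collect` methods, and the `expensive` function are all hypothetical names invented for illustration. It only assumes that an optimizer is allowed to skip per-row work that cannot change a row count.

```python
# Toy model only: this is NOT Spark code and calls no Spark APIs.
# All names here (LazyFrame, materialize, expensive) are hypothetical.


class LazyFrame:
    """A tiny lazy pipeline: a row source plus per-row functions."""

    def __init__(self, source, steps=()):
        self._source = source        # callable returning an iterable of rows
        self._steps = tuple(steps)   # per-row functions, applied in order

    def map(self, fn):
        # Lazy: record the step, do no work yet.
        return LazyFrame(self._source, self._steps + (fn,))

    def _run(self):
        rows = []
        for row in self._source():
            for fn in self._steps:
                row = fn(row)
            rows.append(row)
        return rows

    def count(self):
        # "Optimized" count: per-row maps cannot change the number of rows,
        # so this toy planner skips them entirely.
        return sum(1 for _ in self._source())

    def materialize(self):
        # Run every recorded step once, then return a new frame whose
        # source is the stored result, so later work starts from there.
        rows = self._run()
        return LazyFrame(lambda: rows)

    def collect(self):
        return self._run()


calls = []  # records every invocation of the expensive step


def expensive(x):
    calls.append(x)
    return x * 2


frame = LazyFrame(lambda: [1, 2, 3]).map(expensive)

frame.count()
calls_after_count = len(calls)        # count() never ran expensive()

stored = frame.materialize()
calls_after_materialize = len(calls)  # expensive() ran once per row

first = stored.collect()
second = stored.map(lambda x: x + 1).collect()
calls_after_reuse = len(calls)        # no recomputation on reuse
```

In this toy, calling `count()` to force evaluation silently does nothing useful, whereas `materialize()` does the work once and later queries build on the stored rows. Whether a real planner behaves this way for a given query depends on the engine and version.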
   
