Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/07/03 23:51:28 UTC

[GitHub] [spark] rdblue commented on issue #24991: [SPARK-28188] Materialize Dataframe API

URL: https://github.com/apache/spark/pull/24991#issuecomment-508290694
 
 
   @rxin, this runs the query up to the point where `materialize` is called. The underlying RDD can then pick up from the last shuffle the next time it is used. In most cases with dynamic allocation this works better than caching, because executors are not left sitting idle, yet the work can still be resumed and shared across queries. We could rename the method if that would be clearer.
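
   The intended usage might look like the following rough sketch. Note that `materialize` is the method name proposed in this PR and does not exist in released Spark; the table name and aggregation here are made up for illustration, and a `SparkSession` named `spark` is assumed:

   ```scala
   import org.apache.spark.sql.functions.count

   // An expensive query ending in a shuffle (the groupBy aggregation).
   val expensive = spark.table("events")
     .groupBy($"user_id")
     .agg(count("*").as("n"))

   // Proposed API: run the query now, up to the final shuffle. Under
   // dynamic allocation, executors can then be released while the
   // shuffle files remain available to later jobs.
   val materialized = expensive.materialize()

   // Later queries pick up from the last shuffle instead of
   // recomputing the aggregation from scratch.
   materialized.filter($"n" > 10).show()
   materialized.orderBy($"n".desc).show()
   ```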
   
   @srowen, I've seen this suggested on the dev list a few times and I think it is a good idea to add it. There is no guarantee that `count` does the same thing -- it could be optimized away -- and it is a little tricky to get this to work with the Dataset API. This version creates a new DataFrame from the underlying RDD so that the work is reused from the last shuffle, instead of allowing the planner to re-optimize with later changes (usually projections) and discard the intermediate result. We have found this really useful for better control over the planner, as well as for caching data using the shuffle system.
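
   A sketch of the trick described above, using only public APIs (the actual PR may use internal plan-level APIs instead, so treat this as an approximation of the idea, not the implementation):

   ```scala
   import org.apache.spark.sql.DataFrame

   // Materialize a DataFrame by running its job once, then wrap the
   // underlying RDD in a new DataFrame. Because the planner sees the
   // new DataFrame as an opaque RDD scan, it cannot push later
   // projections or filters below the boundary and discard the
   // intermediate result; subsequent jobs resume from the completed
   // shuffle stages instead of recomputing them.
   def materializeViaRdd(df: DataFrame): DataFrame = {
     val rdd = df.rdd      // RDD[Row] with the query's shuffle lineage
     rdd.count()           // run the job so shuffle files are written
     df.sparkSession.createDataFrame(rdd, df.schema)
   }
   ```

   The `rdd.count()` action here is what forces execution; any reuse of `rdd` afterward skips the already-computed shuffle stages via Spark's stage-skipping, which is what makes this cheaper than `cache` when executors come and go under dynamic allocation.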

