Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/20 22:05:04 UTC

[GitHub] [spark] holdenk opened a new pull request #25515: [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalItr

URL: https://github.com/apache/spark/pull/25515
 
 
   ### What changes were proposed in this pull request?
   
   This PR allows Python `toLocalIterator` to prefetch the next partition while the current partition is being collected. The PR also adds a demo microbenchmark in the examples directory; we may or may not wish to keep it.
   
   ### Why are the changes needed?
   
   In https://issues.apache.org/jira/browse/SPARK-23961 / 5e79ae3b40b76e3473288830ab958fc4834dcb33 we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute, the consumer spends more time blocked waiting for the next partition.
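
To illustrate the trade-off, here is a minimal pure-Python sketch of prefetching one partition ahead. It is not the code in this PR and does not use Spark: `local_iterator`, `slow_partition` and `consume` are hypothetical names, a worker thread stands in for the job that computes a partition, and `time.sleep` stands in for both partition compute time and per-element work on the driver.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def local_iterator(compute_partition, num_partitions, prefetch=False):
    """Yield elements one partition at a time.

    compute_partition(i) is a hypothetical stand-in for running a job on
    partition i and returning its rows as a list.
    """
    if not prefetch:
        # Compute each partition only when the consumer reaches it.
        for i in range(num_partitions):
            yield from compute_partition(i)
        return

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(compute_partition, 0) if num_partitions else None
        for i in range(num_partitions):
            rows = pending.result()
            # Start the next partition before handing out the current rows,
            # so at most two partitions are held in memory at once.
            if i + 1 < num_partitions:
                pending = pool.submit(compute_partition, i + 1)
            yield from rows


def slow_partition(i, delay=0.05):
    """Pretend partition i takes `delay` seconds to compute."""
    time.sleep(delay)
    return [i * 10 + j for j in range(3)]


def consume(prefetch):
    """Iterate over 4 partitions, doing a little work per element."""
    start = time.perf_counter()
    out = []
    for x in local_iterator(slow_partition, 4, prefetch=prefetch):
        time.sleep(0.02)  # simulated per-element work on the consumer side
        out.append(x)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    _, regular = consume(prefetch=False)
    _, prefetched = consume(prefetch=True)
    print("Regular time:", regular)
    print("Prefetch time:", prefetched)
```

In this sketch the results are identical in both modes; only the waiting between partitions overlaps with the consumer's own work, at the cost of holding one extra partition in memory.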
   
   ### Does this PR introduce any user-facing change?
   
   A new parameter is added to `toLocalIterator`.
   
   ### How was this patch tested?
   
   A new unit test inside of `test_rdd.py` checks the time at which the elements are evaluated. Another test, added to `test_dataframe.py`, checks that the results remain the same.
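
The idea behind a timing-based check can be sketched without Spark as follows. This is not the test in `test_rdd.py`; `iterate_with_prefetch` and `evaluation_times` are hypothetical helpers, and a worker thread stands in for partition evaluation. The sketch records when each partition's evaluation starts and compares the two modes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def iterate_with_prefetch(compute, n):
    """Yield rows from n partitions, computing partition i+1 while i is consumed."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(compute, 0) if n > 0 else None
        for i in range(n):
            rows = future.result()
            if i + 1 < n:
                future = pool.submit(compute, i + 1)
            yield from rows


def evaluation_times(prefetch, n=3, consume_delay=0.05):
    """Return {partition index: seconds after start when its evaluation began}."""
    started = {}
    t0 = time.monotonic()

    def compute(i):
        # Record the moment this (hypothetical) partition is evaluated.
        started[i] = time.monotonic() - t0
        return [i]

    if prefetch:
        rows = iterate_with_prefetch(compute, n)
    else:
        # Lazy: partition i is evaluated only when the consumer reaches it.
        rows = (row for i in range(n) for row in compute(i))
    for _ in rows:
        time.sleep(consume_delay)  # simulated work per element
    return started
```

With prefetching, the second partition starts evaluating while the first element is still being processed, so its recorded start time is earlier than in the non-prefetching run.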
   
   I also ran the microbenchmark `prefetch.py` from the examples directory, which shows an improvement of ~40% in this specific use case.
   
   > 
   > 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   > Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
   > Setting default log level to "WARN".
   > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   > Running timers:                                                                 
   > 
   > [Stage 32:>                                                         (0 + 1) / 1]
   > Results:
   > 
   > Prefetch time: 100.228110831
   > 
   > Regular time: 188.341721614
