You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/27 02:16:53 UTC

[GitHub] [spark] HyukjinKwon commented on a change in pull request #24714: [SPARK-27846][CORE] Eagerly compute Configuration.properties in sc.hadoopConfiguration

HyukjinKwon commented on a change in pull request #24714: [SPARK-27846][CORE] Eagerly compute Configuration.properties in sc.hadoopConfiguration
URL: https://github.com/apache/spark/pull/24714#discussion_r287629486
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/SparkContext.scala
 ##########
 @@ -285,7 +285,18 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
    * plan to set some global configurations for all Hadoop RDDs.
    */
-  def hadoopConfiguration: Configuration = _hadoopConfiguration
+  def hadoopConfiguration: Configuration = {
+    // Performance optimization: this dummy call to .size() triggers eager evaluation of
+    // Configuration's internal  `properties` field, guaranteeing that it will be computed and
+    // cached before SessionState.newHadoopConf() uses `sc.hadoopConfiguration` to create
+    // a new per-session Configuration. If `properties` has not been computed by that time
+    // then each newly-created Configuration will perform its own expensive IO and XML
+    // parsing to load configuration defaults and populate its own properties. By ensuring
+    // that we've pre-computed the parent's properties, the child Configuration will simply
+    // clone the parent's properties.
+    _hadoopConfiguration.size()
 
 Review comment:
   @JoshRosen, out of curiosity, how much did this improve the performance? - just rough estimation is fine. I thought here' in a non-critical path so I thought better readability is prioritized if the gain is rather minor.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org