You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/27 05:31:31 UTC

[GitHub] [spark] JoshRosen commented on a change in pull request #24714: [SPARK-27846][CORE] Eagerly compute Configuration.properties in sc.hadoopConfiguration

JoshRosen commented on a change in pull request #24714: [SPARK-27846][CORE] Eagerly compute Configuration.properties in sc.hadoopConfiguration
URL: https://github.com/apache/spark/pull/24714#discussion_r287650756
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/SparkContext.scala
 ##########
 @@ -285,7 +285,18 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
    * plan to set some global configurations for all Hadoop RDDs.
    */
-  def hadoopConfiguration: Configuration = _hadoopConfiguration
+  def hadoopConfiguration: Configuration = {
+    // Performance optimization: this dummy call to .size() triggers eager evaluation of
+    // Configuration's internal  `properties` field, guaranteeing that it will be computed and
+    // cached before SessionState.newHadoopConf() uses `sc.hadoopConfiguration` to create
+    // a new per-session Configuration. If `properties` has not been computed by that time
+    // then each newly-created Configuration will perform its own expensive IO and XML
+    // parsing to load configuration defaults and populate its own properties. By ensuring
+    // that we've pre-computed the parent's properties, the child Configuration will simply
+    // clone the parent's properties.
+    _hadoopConfiguration.size()
 
 Review comment:
   It's a little hard to get a precise end-to-end performance measurement here, unfortunately. In steady state this is't huge, but fixing it cleans up a ton of noise in Java profiler output: after this change I no longer see a bunch of frames from `org.apache.xerces` / `ClassLoader.getResource`, etc.
   
   I submitted this mostly for the sake upstreaming changes from my local fuzz-testing branch (which runs tons of short queries through the ThriftServer).  

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org