Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/26 23:47:25 UTC

[GitHub] [spark] JoshRosen opened a new pull request #24714: [SPARK-27846] Eagerly compute Configuration.properties in sc.hadoopConfiguration

URL: https://github.com/apache/spark/pull/24714
 
 
   ## What changes were proposed in this pull request?
   
   Hadoop `Configuration` has an internal `properties` map which is lazily initialized. Initialization of this field, done in the private `Configuration.getProps()` method, is rather expensive because it ends up parsing XML configuration files. When cloning a `Configuration`, this `properties` field is cloned if it has been initialized.
   
   In some cases, `sc.hadoopConfiguration` never computes this `properties` field. This causes performance problems when the configuration is cloned in `SessionState.newHadoopConf()`, because each cloned `Configuration` must re-parse the configuration XML files from disk.
   
   To avoid this problem, we can call `Configuration.size()` to trigger a call to `getProps()`, ensuring that this expensive computation is cached and re-used when cloning configurations.
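
To illustrate the mechanism, here is a minimal, self-contained sketch. `LazyConf` is a hypothetical stand-in for Hadoop's `Configuration`, not the real class: it only models lazy initialization and copy-if-initialized cloning, and a counter stands in for the XML parsing. With the real API, the corresponding calls would be `conf.size()` and the `Configuration(Configuration)` copy constructor.

```java
import java.util.HashMap;
import java.util.Map;

public class LazyConfDemo {
    /** Hypothetical stand-in for Hadoop's Configuration; NOT the real class. */
    static class LazyConf {
        // Counts how often the simulated "XML parse" runs, across all instances.
        static int parseCount = 0;

        // Stays null until the first lookup, mirroring the lazy `properties` field.
        private Map<String, String> properties;

        LazyConf() {}

        // Copy constructor: properties are copied only if the source has loaded them.
        LazyConf(LazyConf other) {
            if (other.properties != null) {
                this.properties = new HashMap<>(other.properties);
            }
        }

        private Map<String, String> getProps() {
            if (properties == null) {
                parseCount++; // stands in for parsing XML files from disk
                properties = new HashMap<>();
                properties.put("fs.defaultFS", "file:///"); // illustrative entry
            }
            return properties;
        }

        int size() { return getProps().size(); }

        String get(String key) { return getProps().get(key); }
    }

    /** Clones base n times, reads one key per clone, and returns the parses triggered. */
    static int parsesForClones(LazyConf base, int n) {
        int before = LazyConf.parseCount;
        for (int i = 0; i < n; i++) {
            new LazyConf(base).get("fs.defaultFS");
        }
        return LazyConf.parseCount - before;
    }

    public static void main(String[] args) {
        // Base never initialized: every clone has to parse on its own.
        LazyConf cold = new LazyConf();
        System.out.println("cold base: " + parsesForClones(cold, 3) + " parses");

        // Base initialized eagerly via size(): clones reuse the copied properties.
        LazyConf warm = new LazyConf();
        warm.size();
        System.out.println("warm base: " + parsesForClones(warm, 3) + " parses");
    }
}
```

In this model, three clones of the uninitialized base trigger three parses, while three clones of the eagerly initialized base trigger none. The one parse paid up front by `size()` is shared by every later clone.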
   
   I discovered this problem while profiling the performance of the Spark ThriftServer under a SQL fuzzing workload.
   
   ## How was this patch tested?
   
   Examined YourKit profiles before and after my change.
