You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sahilTakiar <gi...@git.apache.org> on 2017/10/02 21:03:57 UTC

[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...

GitHub user sahilTakiar opened a pull request:

    https://github.com/apache/spark/pull/19413

    [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

    ## What changes were proposed in this pull request?
    
    Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call.
    
    ## How was this patch tested?
    
    Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sahilTakiar/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19413
    
----
commit 680f32c311e33784e11763c109488d528178efc8
Author: Sahil Takiar <st...@cloudera.com>
Date:   2017-10-02T20:44:23Z

    [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82399/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    **[Test build #82399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82399/testReport)** for PR 19413 at commit [`fd1df3d`](https://github.com/apache/spark/commit/fd1df3d7c4ded4bb431153f1195f7dbb0e811491).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19413#discussion_r142418306
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -157,20 +157,23 @@ class HadoopRDD[K, V](
           if (conf.isInstanceOf[JobConf]) {
             logDebug("Re-using user-broadcasted JobConf")
             conf.asInstanceOf[JobConf]
    -      } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
    -        logDebug("Re-using cached JobConf")
    -        HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
           } else {
    -        // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
    -        // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
    -        // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
    -        // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
    -        HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    -          logDebug("Creating new JobConf and caching it for later re-use")
    -          val newJobConf = new JobConf(conf)
    -          initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
    -          HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
    -          newJobConf
    +        val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
    --- End diff --
    
    You are using two val with the same name here, normally we don't recommend that this way. How about follow vanzin's suggestion in the JIRA issue, like this:
    ```
    Option(HadoopRDD.getCachedMetadata(jobConfCacheKey)).getOrElse(
         HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
              ......
    })
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    **[Test build #82429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82429/testReport)** for PR 19413 at commit [`079a4e2`](https://github.com/apache/spark/commit/079a4e26ec900745411fdd60e4cd242b9a43320b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82397/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19413#discussion_r142470965
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -157,20 +157,23 @@ class HadoopRDD[K, V](
           if (conf.isInstanceOf[JobConf]) {
             logDebug("Re-using user-broadcasted JobConf")
             conf.asInstanceOf[JobConf]
    -      } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
    -        logDebug("Re-using cached JobConf")
    -        HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
           } else {
    -        // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
    -        // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
    -        // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
    -        // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
    -        HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    -          logDebug("Creating new JobConf and caching it for later re-use")
    -          val newJobConf = new JobConf(conf)
    -          initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
    -          HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
    -          newJobConf
    +        val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
    --- End diff --
    
    You could do that without the val:
    
    ```
    Option(HadoopRDD.getCachedMetadata(jobConfCacheKey))
      .map { conf =>
         logDebug(...)
         conf.asInstanceOf[...]
      }
      .getOrElse {
         // code in the "else" block
      }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82429/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...

Posted by sahilTakiar <gi...@git.apache.org>.
Github user sahilTakiar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19413#discussion_r142466295
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -157,20 +157,23 @@ class HadoopRDD[K, V](
           if (conf.isInstanceOf[JobConf]) {
             logDebug("Re-using user-broadcasted JobConf")
             conf.asInstanceOf[JobConf]
    -      } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
    -        logDebug("Re-using cached JobConf")
    -        HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
           } else {
    -        // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
    -        // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
    -        // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
    -        // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
    -        HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    -          logDebug("Creating new JobConf and caching it for later re-use")
    -          val newJobConf = new JobConf(conf)
    -          initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
    -          HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
    -          newJobConf
    +        val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
    --- End diff --
    
    I could do that, but then I would have to remove the `logDebug("Re-using cached JobConf")` statement. Unless there is some way to get around that in Scala.
    
    Otherwise, I can remove the val name to `cachedJobConf`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    **[Test build #82399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82399/testReport)** for PR 19413 at commit [`fd1df3d`](https://github.com/apache/spark/commit/fd1df3d7c4ded4bb431153f1195f7dbb0e811491).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    **[Test build #82429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82429/testReport)** for PR 19413 at commit [`079a4e2`](https://github.com/apache/spark/commit/079a4e26ec900745411fdd60e4cd242b9a43320b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    **[Test build #82397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82397/testReport)** for PR 19413 at commit [`680f32c`](https://github.com/apache/spark/commit/680f32c311e33784e11763c109488d528178efc8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    LGTM. Merging to master and back to 2.0 unless I hit a conflict.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...

Posted by sahilTakiar <gi...@git.apache.org>.
Github user sahilTakiar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19413#discussion_r142499531
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -157,20 +157,23 @@ class HadoopRDD[K, V](
           if (conf.isInstanceOf[JobConf]) {
             logDebug("Re-using user-broadcasted JobConf")
             conf.asInstanceOf[JobConf]
    -      } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
    -        logDebug("Re-using cached JobConf")
    -        HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
           } else {
    -        // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
    -        // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
    -        // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
    -        // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
    -        HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    -          logDebug("Creating new JobConf and caching it for later re-use")
    -          val newJobConf = new JobConf(conf)
    -          initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
    -          HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
    -          newJobConf
    +        val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
    --- End diff --
    
    Thanks for the tip. Updated the patch.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    **[Test build #82397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82397/testReport)** for PR 19413 at commit [`680f32c`](https://github.com/apache/spark/commit/680f32c311e33784e11763c109488d528178efc8).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    (FYI, didn't merge to 2.0.)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...

Posted by sahilTakiar <gi...@git.apache.org>.
Github user sahilTakiar commented on the issue:

    https://github.com/apache/spark/pull/19413
  
    Fixed the style check.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19413


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org