You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sahilTakiar <gi...@git.apache.org> on 2017/10/02 21:03:57 UTC
[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...
GitHub user sahilTakiar opened a pull request:
https://github.com/apache/spark/pull/19413
[SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE
## What changes were proposed in this pull request?
Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call.
## How was this patch tested?
Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sahilTakiar/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19413.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19413
----
commit 680f32c311e33784e11763c109488d528178efc8
Author: Sahil Takiar <st...@cloudera.com>
Date: 2017-10-02T20:44:23Z
[SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82399/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/19413
ok to test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19413
**[Test build #82399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82399/testReport)** for PR 19413 at commit [`fd1df3d`](https://github.com/apache/spark/commit/fd1df3d7c4ded4bb431153f1195f7dbb0e811491).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...
Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19413#discussion_r142418306
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -157,20 +157,23 @@ class HadoopRDD[K, V](
if (conf.isInstanceOf[JobConf]) {
logDebug("Re-using user-broadcasted JobConf")
conf.asInstanceOf[JobConf]
- } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
- logDebug("Re-using cached JobConf")
- HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
} else {
- // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
- // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
- // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
- // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
- HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
- logDebug("Creating new JobConf and caching it for later re-use")
- val newJobConf = new JobConf(conf)
- initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
- HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
- newJobConf
+ val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
--- End diff --
You are using two val with the same name here, normally we don't recommend that this way. How about follow vanzin's suggestion in the JIRA issue, like this:
```
Option(HadoopRDD.getCachedMetadata(jobConfCacheKey)).getOrElse(
HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
......
})
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19413
**[Test build #82429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82429/testReport)** for PR 19413 at commit [`079a4e2`](https://github.com/apache/spark/commit/079a4e26ec900745411fdd60e4cd242b9a43320b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82397/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...
Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19413#discussion_r142470965
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -157,20 +157,23 @@ class HadoopRDD[K, V](
if (conf.isInstanceOf[JobConf]) {
logDebug("Re-using user-broadcasted JobConf")
conf.asInstanceOf[JobConf]
- } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
- logDebug("Re-using cached JobConf")
- HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
} else {
- // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
- // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
- // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
- // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
- HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
- logDebug("Creating new JobConf and caching it for later re-use")
- val newJobConf = new JobConf(conf)
- initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
- HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
- newJobConf
+ val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
--- End diff --
You could do that without the val:
```
Option(HadoopRDD.getCachedMetadata(jobConfCacheKey))
.map { conf =>
logDebug(...)
conf.asInstanceOf[...]
}
.getOrElse {
// code in the "else" block
}
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82429/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...
Posted by sahilTakiar <gi...@git.apache.org>.
Github user sahilTakiar commented on a diff in the pull request:
https://github.com/apache/spark/pull/19413#discussion_r142466295
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -157,20 +157,23 @@ class HadoopRDD[K, V](
if (conf.isInstanceOf[JobConf]) {
logDebug("Re-using user-broadcasted JobConf")
conf.asInstanceOf[JobConf]
- } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
- logDebug("Re-using cached JobConf")
- HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
} else {
- // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
- // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
- // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
- // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
- HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
- logDebug("Creating new JobConf and caching it for later re-use")
- val newJobConf = new JobConf(conf)
- initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
- HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
- newJobConf
+ val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
--- End diff --
I could do that, but then I would have to remove the `logDebug("Re-using cached JobConf")` statement. Unless there is some way to get around that in Scala.
Otherwise, I can remove the val name to `cachedJobConf`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19413
**[Test build #82399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82399/testReport)** for PR 19413 at commit [`fd1df3d`](https://github.com/apache/spark/commit/fd1df3d7c4ded4bb431153f1195f7dbb0e811491).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19413
**[Test build #82429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82429/testReport)** for PR 19413 at commit [`079a4e2`](https://github.com/apache/spark/commit/079a4e26ec900745411fdd60e4cd242b9a43320b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19413
**[Test build #82397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82397/testReport)** for PR 19413 at commit [`680f32c`](https://github.com/apache/spark/commit/680f32c311e33784e11763c109488d528178efc8).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/19413
LGTM. Merging to master and back to 2.0 unless I hit a conflict.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...
Posted by sahilTakiar <gi...@git.apache.org>.
Github user sahilTakiar commented on a diff in the pull request:
https://github.com/apache/spark/pull/19413#discussion_r142499531
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -157,20 +157,23 @@ class HadoopRDD[K, V](
if (conf.isInstanceOf[JobConf]) {
logDebug("Re-using user-broadcasted JobConf")
conf.asInstanceOf[JobConf]
- } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
- logDebug("Re-using cached JobConf")
- HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
} else {
- // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
- // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
- // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
- // Synchronize to prevent ConcurrentModificationException (SPARK-1097, HADOOP-10456).
- HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
- logDebug("Creating new JobConf and caching it for later re-use")
- val newJobConf = new JobConf(conf)
- initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
- HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
- newJobConf
+ val newJobConf = HadoopRDD.getCachedMetadata(jobConfCacheKey)
--- End diff --
Thanks for the tip. Updated the patch.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19413
**[Test build #82397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82397/testReport)** for PR 19413 at commit [`680f32c`](https://github.com/apache/spark/commit/680f32c311e33784e11763c109488d528178efc8).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/19413
(FYI, didn't merge to 2.0.)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19413
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration thro...
Posted by sahilTakiar <gi...@git.apache.org>.
Github user sahilTakiar commented on the issue:
https://github.com/apache/spark/pull/19413
Fixed the style check.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19413: [SPARK-20466][CORE] HadoopRDD#addLocalConfigurati...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19413
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org