You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2014/10/07 00:54:18 UTC

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/2683

    [SPARK-2585] Remove special handling of Hadoop JobConf

    > Previously we broadcast JobConf for HadoopRDD because it is large. Now we always broadcast RDDs and task closures so it should no longer be necessary to broadcast the JobConf anymore.
    
    This is a resubmission of @rxin's #1648 (closes #1648).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-2585

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2683
    
----
commit 1980f5ec1375385fe6b35bc755e3de3ff6c88472
Author: Reynold Xin <rx...@apache.org>
Date:   2014-07-30T04:39:14Z

    [SPARK-2585] Remove special handling of Hadoop JobConf.
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
    	sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

commit d16102d97a958396c99f159b913efbd33eb2a105
Author: Reynold Xin <rx...@apache.org>
Date:   2014-07-30T23:27:45Z

    Remove JobConf broadcast comment.

commit 1d67d9d8283d0d64949f48b2e3a15c9a23d3b169
Author: Josh Rosen <jo...@apache.org>
Date:   2014-10-06T22:52:02Z

    Add comment to address Aaron's review comment in #1648.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2683#issuecomment-58113372
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21350/consoleFull) for   PR 2683 at commit [`1d67d9d`](https://github.com/apache/spark/commit/1d67d9d8283d0d64949f48b2e3a15c9a23d3b169).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2683#discussion_r18491950
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -133,26 +106,17 @@ class HadoopRDD[K, V](
       private val createTime = new Date()
     
       // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
    -  protected def getJobConf(): JobConf = {
    -    val conf: Configuration = broadcastedConf.value.value
    -    if (conf.isInstanceOf[JobConf]) {
    -      // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
    -      conf.asInstanceOf[JobConf]
    -    } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
    -      // getJobConf() has been called previously, so there is already a local cache of the JobConf
    -      // needed by this RDD.
    -      HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
    -    } else {
    -      // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
    -      // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
    -      // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
    -      // Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456).
    -      HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    --- End diff --
    
    @rxin I'm not sure that it's safe to remove this synchronization around the `JobConf` constructor.  This synchronization was originally added to address HADOOP-10456, which affects a number of Hadoop versions that we still want to be compatible with.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen closed the pull request at:

    https://github.com/apache/spark/pull/2683


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2683#issuecomment-58269418
  
    Due to the `CONFIGURATION_INSTANTIATION_LOCK` thread-safety issue, I think that we'll still end up having to serialize the Configuration separately.  If we didn't, then we'd have to hold `CONFIGURATION_INSTANTIATION_LOCK` while deserializing each task, which could have a huge performance penalty (it's fine to hold the lock while loading the Configuration, since that doesn't take too long).
    
    Therefore, I don't think that we'll be able to safely remove this complexity, so I'm going to recommend closing this PR in favor of applying the small #2684 fix in 1.2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2683#issuecomment-58121986
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21350/consoleFull) for   PR 2683 at commit [`1d67d9d`](https://github.com/apache/spark/commit/1d67d9d8283d0d64949f48b2e3a15c9a23d3b169).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2683#issuecomment-58121989
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21350/Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org