Posted to reviews@spark.apache.org by koeninger <gi...@git.apache.org> on 2014/12/02 03:58:42 UTC

[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

GitHub user koeninger opened a pull request:

    https://github.com/apache/spark/pull/3543

    Closes SPARK-4229 Create hadoop configuration in a consistent way

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/koeninger/spark-1 SPARK-4229-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3543
    
----
commit c41a4b4d0752b1a5b057611c796e367c5a806be6
Author: cody koeninger <co...@koeninger.org>
Date:   2014-11-04T22:40:17Z

    SPARK-4229 use SparkHadoopUtil.get.conf so that hadoop properties are copied from spark config
    Resolved conflicts in favor of master.
    
    Conflicts:
    	streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
    	streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

commit b48ad63fb9c31de90c8b5b0541129e2c71bd3478
Author: cody koeninger <co...@koeninger.org>
Date:   2014-11-04T22:41:07Z

    SPARK-4229 document handling of spark.hadoop.* properties

commit 413f916bafc5b218ab334cb9d66b67f3dbc117f7
Author: cody koeninger <co...@koeninger.org>
Date:   2014-11-05T03:26:26Z

    SPARK-4229 fix broken table in documentation, make hadoop doc formatting match that of runtime env

----
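
The first two commits are about carrying `spark.hadoop.*` properties from the Spark config over into the Hadoop configuration. As a rough illustration only, assuming the prefix is stripped when the entry is copied, the idea can be sketched with plain maps. `HadoopPropsSketch` and `hadoopProps` are hypothetical names used here for illustration, not code from this patch:

```scala
// Hypothetical sketch: plain Maps stand in for SparkConf and the Hadoop Configuration.
object HadoopPropsSketch {
  private val Prefix = "spark.hadoop."

  // Keep only the entries whose key starts with the prefix,
  // and strip the prefix from each key that is kept.
  def hadoopProps(sparkConf: Map[String, String]): Map[String, String] =
    sparkConf.collect {
      case (key, value) if key.startsWith(Prefix) =>
        key.stripPrefix(Prefix) -> value
    }
}
```

With this sketch, an entry such as `spark.hadoop.fs.defaultFS` would surface as `fs.defaultFS`, while unrelated keys such as `spark.app.name` are left out.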


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.

Github user koeninger commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21192361
  
    --- Diff: docs/configuration.md ---
    @@ -664,6 +665,24 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +    <td><code>spark.executor.heartbeatInterval</code></td>
    --- End diff --
    
    Pretty sure that's just the diff getting confused by where the Hadoop doc changes were inserted; the same lines are marked as removed lower in the diff.



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68079375
  
      [Test build #24789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24789/consoleFull) for PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-96770045
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67258780
  
      [Test build #24512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24512/consoleFull) for PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21571338
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala ---
    @@ -789,7 +790,7 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
           keyClass: Class[_],
           valueClass: Class[_],
           outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    -      conf: Configuration = new Configuration) {
    --- End diff --
    
    This default should also come from `sparkContext.hadoopConfiguration`.
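
A minimal sketch of what this suggestion could look like, using stand-in types rather than the real Spark and Hadoop classes. `Conf`, `Context`, `Writer`, and `effectiveConf` are all hypothetical names for illustration. The only point shown is that a Scala default argument can read from the enclosing object's context instead of constructing a fresh configuration:

```scala
import scala.collection.mutable

object DefaultConfSketch {
  // Hypothetical stand-in for a Hadoop Configuration.
  final class Conf(val settings: mutable.Map[String, String] = mutable.Map.empty)

  // Hypothetical stand-in for a context that owns one shared configuration.
  final class Context(val hadoopConfiguration: Conf)

  final class Writer(ctx: Context) {
    // The default is taken from the context rather than built fresh,
    // so settings already applied to the context are visible to the caller.
    def effectiveConf(conf: Conf = ctx.hadoopConfiguration): Conf = conf
  }
}
```

A caller that passes no `conf` gets the context's shared instance, and a caller that passes one explicitly keeps full control.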



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21571170
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSQLContext.scala ---
    @@ -84,7 +85,7 @@ class JavaSQLContext(val sqlContext: SQLContext) extends UDFRegistration {
           beanClass: Class[_],
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): JavaSchemaRDD = {
    --- End diff --
    
    Same comment as I made in SQLContext



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67395790
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24551/
    Test FAILed.



[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21658128
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    If we're going to use `CONFIGURATION_INSTANTIATION_LOCK` in multiple places, then I think it makes sense to move `CONFIGURATION_INSTANTIATION_LOCK` into `SparkHadoopUtil`, since that seems like a more logical place for it to live than `HadoopRDD`.  I like the idea of hiding the synchronization logic behind a method like `SparkHadoopUtil.newConfiguration`.
    
    Regarding whether `SparkContext.hadoopConfiguration` will lead to thread-safety issues: I did a bit of research on this while developing a workaround for the other configuration thread-safety issues and wrote [a series of comments](https://issues.apache.org/jira/browse/SPARK-2546?focusedCommentId=14160790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160790) citing cases of code "in the wild" that depend on mutating `SparkContext.hadoopConfiguration`.  For example, there are a lot of snippets of code that look like this:
    
    ```scala
    sc.hadoopConfiguration.set("es.resource", "syslog/entry")
    output.saveAsHadoopFile[ESOutputFormat]("-")
    ```
    
    In Spark 1.x, I don't think we'll be able to safely transition away from using the shared `SparkContext.hadoopConfiguration` instance since there's so much existing code that relies on the current behavior.
    
    However, I think that there's much less risk of running into thread-safety issues as a result of this.  It seems fairly unlikely that you'll have multiple threads mutating the shared configuration in the driver JVM.  In executor JVMs, most Hadoop `InputFormats` (and other classes) don't mutate configurations, so we shouldn't run into issues; for those that do mutate, users can always enable the `cloneConf` setting.
    
    In a nutshell, I don't think that the shared `sc.hadoopConfiguration` is a good design that we would choose if we were redesigning it, but using it here seems consistent with the behavior that we have elsewhere in Spark as long as we're stuck with this for 1.x.
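
The idea of hiding the synchronization behind a single factory method might look roughly like the sketch below. It uses a stand-in `Conf` class rather than Hadoop's `Configuration`. The names `ConfFactorySketch`, `instantiationLock`, and `cloneConfiguration` are assumptions made for illustration and are not taken from the patch. The copy method only illustrates why cloning protects a shared instance from mutation; it is not the implementation behind the `cloneConf` setting:

```scala
import scala.collection.mutable

object ConfFactorySketch {
  // Hypothetical stand-in for a configuration class whose construction must be serialized.
  final class Conf(initial: Map[String, String] = Map.empty) {
    private val settings = mutable.Map.empty[String, String]
    settings ++= initial

    def set(key: String, value: String): Unit = settings.update(key, value)
    def get(key: String): Option[String] = settings.get(key)
    def snapshot: Map[String, String] = settings.toMap
  }

  // A single lock owned by the factory; callers never see it.
  private val instantiationLock = new Object

  // All construction goes through here, so the locking lives in one place.
  def newConfiguration(): Conf =
    instantiationLock.synchronized { new Conf() }

  // Builds a copy under the same lock, so that later mutations of the copy
  // do not leak into the shared original.
  def cloneConfiguration(shared: Conf): Conf =
    instantiationLock.synchronized { new Conf(shared.snapshot) }
}
```

In this sketch, code that mutates the shared instance keeps working as before, while code that needs isolation asks the factory for a copy.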

