You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by koeninger <gi...@git.apache.org> on 2014/12/02 03:58:42 UTC

[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

GitHub user koeninger opened a pull request:

    https://github.com/apache/spark/pull/3543

    Closes SPARK-4229 Create hadoop configuration in a consistent way

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/koeninger/spark-1 SPARK-4229-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3543
    
----
commit c41a4b4d0752b1a5b057611c796e367c5a806be6
Author: cody koeninger <co...@koeninger.org>
Date:   2014-11-04T22:40:17Z

    SPARK-4229 use SparkHadoopUtil.get.conf so that hadoop properties are copied from spark config
    Resolved conflicts in favor of master.
    
    Conflicts:
    	streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
    	streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

commit b48ad63fb9c31de90c8b5b0541129e2c71bd3478
Author: cody koeninger <co...@koeninger.org>
Date:   2014-11-04T22:41:07Z

    SPARK-4229 document handling of spark.hadoop.* properties

commit 413f916bafc5b218ab334cb9d66b67f3dbc117f7
Author: cody koeninger <co...@koeninger.org>
Date:   2014-11-05T03:26:26Z

    SPARK-4229 fix broken table in documentation, make hadoop doc formatting match that of runtime env

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21192361
  
    --- Diff: docs/configuration.md ---
    @@ -664,6 +665,24 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +    <td><code>spark.executor.heartbeatInterval</code></td>
    --- End diff --
    
    Pretty sure that's just diff getting confused based on where the hadoop doc changes were inserted, same lines are marked as removed lower in the diff


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68079375
  
      [Test build #24789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24789/consoleFull) for   PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-96770045
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67258780
  
      [Test build #24512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24512/consoleFull) for   PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21571338
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala ---
    @@ -789,7 +790,7 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
           keyClass: Class[_],
           valueClass: Class[_],
           outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    -      conf: Configuration = new Configuration) {
    --- End diff --
    
    This should also be the configuration from the `sparkContext.hadoopConfiguration`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21571170
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSQLContext.scala ---
    @@ -84,7 +85,7 @@ class JavaSQLContext(val sqlContext: SQLContext) extends UDFRegistration {
           beanClass: Class[_],
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): JavaSchemaRDD = {
    --- End diff --
    
    Same comment as I made in SQLContext


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67395790
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24551/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21658128
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    If we're going to use `CONFIGURATION_INSTANTIATION_LOCK` in multiple places, then I think it makes sense to move `CONFIGURATION_INSTANTIATION_LOCK` into `SparkHadoopUtil`, since that seems like a more logical place for it to live than `HadoopRDD`.  I like the idea of hiding the synchronization logic behind a method like `SparkHadoopUtil.newConfiguration`.
    
    Regarding whether `SparkContext.hadoopConfiguration` will lead to thread-safety issues: I did a bit of research on this while developing a workaround for the other configuration thread-safety issues and wrote [a series of comments](https://issues.apache.org/jira/browse/SPARK-2546?focusedCommentId=14160790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160790) citing cases of code "in the wild" that depend on mutating `SparkContext.hadoopConfiguration`.  For example, there are a lot of snippets of code that look like this:
    
    ```scala
    sc.hadoopConfiguration.set("es.resource", "syslog/entry")
    output.saveAsHadoopFile[ESOutputFormat]("-")
    ```
    
    In Spark 1.x, I don't think we'll be able to safely transition away from using the shared `SparkContext.hadoopConfiguration` instance since there's so much existing code that relies on the current behavior.
    
    However, I think that there's much less risk of running into thread-safety issues as a result of this.  It seems fairly unlikely that you'll have multiple threads mutating the shared configuration in the driver JVM.  In executor JVMs, most Hadoop `InputFormats` (and other classes) don't mutate configurations, so we shouldn't run into issues; for those that do mutate, users can always enable the `cloneConf` setting.
    
    In a nutshell, I don't think that the shared `sc.hadoopConfiguration` is a good design that we would choose if we were redesigning it, but using it here seems consistent with the behavior that we have elsewhere in Spark as long as we're stuck with this for 1.x.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-65288815
  
      [Test build #24053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24053/consoleFull) for   PR 3543 at commit [`413f916`](https://github.com/apache/spark/commit/413f916bafc5b218ab334cb9d66b67f3dbc117f7).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67379497
  
    I'll look at those tests' code in a little bit to see if I can figure out whether they're prone to random flakiness.  I don't recall seeing flakiness from these tests before, so this seems like it's worth investigating.  FYI, I have an open PR that tries to address some of the causes of streaming test flakiness: #3687


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67337516
  
    Jenkins is failing
    
    org.apache.spark.scheduler.SparkListenerSuite.local metrics
    org.apache.spark.streaming.flume.FlumeStreamSuite.flume input compressed stream
    
    I can't reproduce those test failures locally.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21192103
  
    --- Diff: docs/configuration.md ---
    @@ -664,6 +665,24 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +    <td><code>spark.executor.heartbeatInterval</code></td>
    --- End diff --
    
    Unrelated to this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r22438855
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala ---
    @@ -789,7 +790,7 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
           keyClass: Class[_],
           valueClass: Class[_],
           outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    -      conf: Configuration = new Configuration) {
    --- End diff --
    
    The scope of this PR is pretty wide in terms of the number of classes it touches, causing issues as different places needs to be handled differently. If you considered moving this sort of changes (`new Configuration` to `sparkContext.hadoopConfiguration`) into a different PR that might be easier to get in. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-137335223
  
    OK, makes sense. Can you close this PR for now then? If there's interest we can always reopen it against the latest master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r22446683
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala ---
    @@ -789,7 +790,7 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
           keyClass: Class[_],
           valueClass: Class[_],
           outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    -      conf: Configuration = new Configuration) {
    --- End diff --
    
    Based on what Marcelo Vanzin said on the dev list when I brought this issue
    up, the only reason the problem was still around for me to run into is
    because he changed some of the uses of new Configuration but not all of
    them.
    
    I agree it's used in a lot of different places, but I'm not sure how
    piecemeal fixes to only some of the places is helpful to users. Were there
    still specific concerns about particular classes?
    
    On Sun, Jan 4, 2015 at 6:28 AM, Tathagata Das <no...@github.com>
    wrote:
    
    > In
    > streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala
    > <https://github.com/apache/spark/pull/3543#discussion-diff-22438855>:
    >
    > > @@ -789,7 +790,7 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
    > >        keyClass: Class[_],
    > >        valueClass: Class[_],
    > >        outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    > > -      conf: Configuration = new Configuration) {
    >
    > The scope of this PR is pretty wide in terms of the number of classes it
    > touches, causing issues as different places needs to be handled
    > differently. If you considered moving this sort of changes (new
    > Configuration to sparkContext.hadoopConfiguration) into a different PR
    > that might be easier to get in.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/3543/files#r22438855>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67379543
  
    (Might as well have Jenkins run this again just to see whether the failure is nondeterministic)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-65304552
  
      [Test build #24053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24053/consoleFull) for   PR 3543 at commit [`413f916`](https://github.com/apache/spark/commit/413f916bafc5b218ab334cb9d66b67f3dbc117f7).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68076300
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21638810
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    So let me see if I have things straight
    
    - Currently, the code is using new Configuration() as a default, which may have some thread safety issues due to the constructor
    
    - my original patch uses SparkHadoopUtil.get.conf, which is a singleton, so should decrease the constructor thread safety problem, but increase the problems if the hadoop configuration is modified.  It also won't do the right thing for people who have altered the sparkConf, which makes it no good (I haven't run into this in personal usage of the patched version, because I always pass in a complete sparkConf via properties rather than setting it in code)
    
    - @tdas suggested to use this.sparkContext.hadoopConfiguration.  This will use the "right" spark config, but may have thread safety issues both at construction the time the spark context is created, and if the configuration is modified.
    
    So....
    
    Use tdas' suggestion, add a HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized block to SparkHadoopUtil.newConfiguration?  And people are out of luck if they have code that used to work because they were modifying new blank instances of Configuration, rather than the now-shared one? 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21571141
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    I think this should be using the "hadoopConfiguration" object in the SparkContext. That has all the hadoop related configuration already setup and should be what is automatically used. @marmbrus should have a better idea.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67395784
  
      [Test build #24551 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24551/consoleFull) for   PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger closed the pull request at:

    https://github.com/apache/spark/pull/3543


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67258785
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24512/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68079378
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24789/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-65176806
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-65304557
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24053/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-137253615
  
    @andrewor14 master has diverged sufficiently from this PR that I don't think it's useful to keep it merge-able.  If we think someone's willing to accept the changes to core and sql those subtasks should be revisited with this general approach as a basis.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-82923505
  
    I think its mostly a question of whether committers are comfortable with a
    PR that changes all of the uses of new Configuration.
    
    At this point it'd probably need another audit of the code to see if there
    are more uses, but that's mostly mechanical.
    
    On Tue, Mar 17, 2015 at 10:41 PM, Michael Armbrust <notifications@github.com
    > wrote:
    
    > ping. Whats the status here?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/3543#issuecomment-82725434>.
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-82725434
  
    ping.  Whats the status here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68758185
  
    Just for posterity: I think a couple of changes sneaked in between my original change being sent as a PR and it being commited, making my code miss some `Configuration` instantiations.
    
    At that time, I explicitly avoided changing default arguments to a few methods (since my thinking was that since it's an argument, the user should know what he's doing). But I don't really have an opinion about what's the right approach there, and changing it is fine with me.
    
    I also don't have enough background to comment on the thread-safety issues (others have looked at it in much more depth than I have)...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68076317
  
    @JoshRosen I leave it to you to figure out changes related to the `SparkHadoopUtil`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21571364
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala ---
    @@ -545,7 +546,7 @@ object StreamingContext extends Logging {
       def getOrCreate(
           checkpointPath: String,
           creatingFunc: () => StreamingContext,
    -      hadoopConf: Configuration = new Configuration(),
    --- End diff --
    
    I approve this change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-65306213
  
    LGTM. Could you rephrase the title to just say "[SPARK-4229]" instead of "Closes blah", to follow the convention used for PR titles?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21622115
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    @koeninger The issue that you linked is concerned with thread-safety issues when multiple threads concurrently modify the same `Configuration` instance.
    
    It turns out that there's another, older thread-safety issue related to `Configuration`'s constructor not being thread-safe due to non-thread-safe static state: https://issues.apache.org/jira/browse/HADOOP-10456.  This has been fixed in some newer Hadoop releases, but since it was only reported in April I don't think we can ignore it.  As a result, https://issues.apache.org/jira/browse/SPARK-1097 implements a workaround which synchronizes on an object before calling `new Configuration`.  Currently, I think the extra synchronization logic is only implemented in `HadoopRDD`, but it should probably be used everywhere just to be safe.  I think that `HadoopRDD` was the "highest-risk" place where we might have many threads creating Configurations at the same time, which is probably why that patch's author didn't add the synchronization everywhere.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by koeninger <gi...@git.apache.org>.
Github user koeninger commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21610025
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    I seem to recall there being potential thread safety issues related to
    hadoop configuration objects, resulting in the need to create / clone them.
    
    Quick search turned up e.g.
    
    https://issues.apache.org/jira/browse/SPARK-2546
    
    I'm not sure how relevant that is to all of these existing situations where
    new Configuration() is being called.
    
    On Tue, Dec 9, 2014 at 5:07 PM, Tathagata Das <no...@github.com>
    wrote:
    
    > In sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
    > <https://github.com/apache/spark/pull/3543#discussion-diff-21571141>:
    >
    > > @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
    > >    def createParquetFile[A <: Product : TypeTag](
    > >        path: String,
    > >        allowExisting: Boolean = true,
    > > -      conf: Configuration = new Configuration()): SchemaRDD = {
    >
    > I think this should be using the "hadoopConfiguration" object in the
    > SparkContext. That has all the hadoop related configuration already setup
    > and should be what is automatically used. @marmbrus
    > <https://github.com/marmbrus> should have a better idea.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/3543/files#r21571141>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-68076406
  
      [Test build #24789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24789/consoleFull) for   PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67249414
  
      [Test build #24512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24512/consoleFull) for   PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Closes SPARK-4229 Create hadoop configuration ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-65288067
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3543#discussion_r21658152
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
       def createParquetFile[A <: Product : TypeTag](
           path: String,
           allowExisting: Boolean = true,
    -      conf: Configuration = new Configuration()): SchemaRDD = {
    --- End diff --
    
    >  And people are out of luck if they have code that used to work because they were modifying new blank instances of Configuration, rather than the now-shared one?
    
    I don't think that users were able to access the old `new Configuration()` instance; I think that the only code that could have modified this would be the Parquet code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67379826
  
      [Test build #24551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24551/consoleFull) for   PR 3543 at commit [`bfc550e`](https://github.com/apache/spark/commit/bfc550ef0b7b535adb0aa019f30dd4771c24aece).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-136891685
  
    @koeninger would you mind updating this patch per @tdas' suggestion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3543#issuecomment-67379507
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org