You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ericl <gi...@git.apache.org> on 2016/04/08 01:47:22 UTC

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/12248

    [SPARK-14475] Propagate user-defined context from driver to executors

    ## What changes were proposed in this pull request?
    
    This adds a new API call `TaskContext.getLocalProperty` for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.
    
    ## How was this patch tested?
    
    Unit tests.
    
    cc @JoshRosen 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark sc-2813

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12248
    
----
commit a8b07f1163a92072c1bc71cf63f804df1f430060
Author: Eric Liang <ek...@databricks.com>
Date:   2016-04-07T22:30:39Z

    Thu Apr  7 15:30:39 PDT 2016

commit 9104f0cd77c30dd0acf8e2a1c53e9b57e9241c02
Author: Eric Liang <ek...@databricks.com>
Date:   2016-04-07T23:33:56Z

    tests

commit b2fc541368e30ffd455c068b709d68251d09abd5
Author: Eric Liang <ek...@databricks.com>
Date:   2016-04-07T23:38:01Z

    Thu Apr  7 16:38:01 PDT 2016

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966811
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    +    val propBytes = Utils.serialize(task.localProperties)
    +    dataOut.writeInt(propBytes.length)
    +    dataOut.write(propBytes, 0, propBytes.length)
    --- End diff --
    
    Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207152833
  
    **[Test build #55287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55287/consoleFull)** for PR 12248 at commit [`82646a9`](https://github.com/apache/spark/commit/82646a9abd0f10e874292de649b2dbed0db2db57).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207152922
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55285/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207223641
  
    **[Test build #55320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55320/consoleFull)** for PR 12248 at commit [`0e46c58`](https://github.com/apache/spark/commit/0e46c58685ac8a605d2d98630488083cbef95c2a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207152847
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207350278
  
    Backing up, how does this differ from using broadcast variables for data, or for simply sending small properties objects in a closure? does it need all this complexity of yet another mechanism?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208656977
  
    Going to merge this in maser. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207557823
  
    > Your change is about passing around a Properties, right? You can simply access such an object anywhere you need to and it will be sent around as needed. There is nothing to do actually, not even explicitly setting it as some context property.
    
    That's not true, if you access a static `Properties` object within an executor node it won't have the value you set in the driver, since closures only capture variables in lexical scope.
    
    > Your example however seems to be about configuring some global per-function behavior, not sending props. In this example, why would the library not call setLogLevel internally, either in static initialization or as needed when any method is invoked -- why would the caller have to do it?
    
    It's more about configuring behavior based on some property set by some upstream caller of the function. The idea is that the user wants to configure loglevel just for this job, without impacting any other jobs potentially running on the cluster.
    
    > But, how is this helped by adding an additional Properties parameter?
    
    Sorry, I should have made the example more explicit. setLogLevel would be implemented in the driver side as `sc.setLocalProperty("mylib.loglevel", level)`. On the executor side the library would query `TaskContext.getLocalProperty("mylib.loglevel")` to determine the verbosity of debug logs.
    
    I think more generally that this adds a mechanism for passing values implicitly without requiring the user (that is writing Spark code) to manually reference it in each of their closures. You are right that this can be achieved via other mechanisms, but those may not be convenient or practical for the use case e.g. if you want to integrate with something like [X-trace](http://www.x-trace.net/wiki/doku.php) (which out of the scope of this PR, but would be easy to add once we have the mechanism).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207543635
  
    Your change is about passing around a `Properties`, right? You can simply access such an object anywhere you need to and it will be sent around as needed. There is nothing to do actually, not even explicitly setting it as some context property.
    
    Your example however seems to be about configuring some global per-function behavior, not sending props. In this example, why would the library not call `setLogLevel` internally, either in static initialization or as needed when any method is invoked -- why would the caller have to do it? But, how is this helped by adding an additional `Properties` parameter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207562982
  
    I mean making a `Properties` object in the driver, and using a reference to it in a function that is executed on the executors. That's certainly in scope. For the example you give that seems equally simple and can be bottled up inside the library, still.
    
    I understand Josh's use case more. There are certainly tasks and RDDs entirely internal to some Spark process. But those also won't know anything about what to do with some custom user properties. Maybe eventually they invoke a UDF that could use these properties. In many cases that UDF could still just refer to whatever config you like directly (right?) but I'm probably not thinking of some case where this fails to work.
    
    I take the point about this already being an API for the caller anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207143176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55276/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207143159
  
    **[Test build #55276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55276/consoleFull)** for PR 12248 at commit [`b2fc541`](https://github.com/apache/spark/commit/b2fc541368e30ffd455c068b709d68251d09abd5).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207149426
  
    **[Test build #55285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55285/consoleFull)** for PR 12248 at commit [`37f269e`](https://github.com/apache/spark/commit/37f269e0dd08f0acbff552bbb9d27b163d85d86f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207559966
  
    @srowen, I think that the main use-case for this feature is associating metadata associated with a Spark action / execution and making that metadata accessible in that action's tasks. 
    
    For instance, let's say that I run a Spark SQL query and want to propagate some metadata related to that query execution from the driver to the executors for use in tracing / debugging / instrumentation. Maybe I want to propagate a label associated with all tasks launched from the job, such as a job group name, and read that label in a custom log appender so that my log messages from those tasks contain that metadata.
    
    In this case, the actual RDD code isn't controlled by the user and they don't really have a place to interpose broadcast variables or other custom code for propagating this metadata.
    
    Even the user's library code were to use broadcast variables and define thread-local variables, etc., then they'd have to worry about some subtleties related to Spark's internal threading model: for example, thread-locals need to be handled carefully to make sure that they're correctly propagated across thread-boundaries in PythonRDD, RRDD, ScriptTransformation, PipedRDD, etc., and the set of places where you'd need to do that propagation corresponds exactly to the set of places where we already happen to be propagating the TaskContext thread-local.
    
    Given that `localProperties` is already a stable public API, I think it makes sense to make those properties accessible in tasks, since it seems like a small and logical extension of an existing API.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r59074967
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
    @@ -42,20 +43,22 @@ import org.apache.spark.shuffle.ShuffleWriter
      * @param _initialAccums initial set of accumulators to be used in this task for tracking
      *                       internal metrics. Other accumulators will be registered later when
      *                       they are deserialized on the executors.
    + * @param localProperties copy of thread-local properties set by the user on the driver side.
      */
     private[spark] class ShuffleMapTask(
         stageId: Int,
         stageAttemptId: Int,
         taskBinary: Broadcast[Array[Byte]],
         partition: Partition,
         @transient private var locs: Seq[TaskLocation],
    -    _initialAccums: Seq[Accumulator[_]])
    -  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums)
    +    _initialAccums: Seq[Accumulator[_]],
    +    localProperties: Properties)
    +  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums, localProperties)
       with Logging {
     
       /** A constructor used only in test suites. This does not require passing in an RDD. */
       def this(partitionId: Int) {
    -    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null)
    +    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null, new Properties)
    --- End diff --
    
    I wonder if we can avoid making empty `Properties` all over ... an `Option[Properties]`? a setter that is called only where needed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207217913
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55291/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207155748
  
    **[Test build #55291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55291/consoleFull)** for PR 12248 at commit [`964ee4b`](https://github.com/apache/spark/commit/964ee4b00cb5928162ad893c4bc863ec934eb2e9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207275679
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207217771
  
    **[Test build #55291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55291/consoleFull)** for PR 12248 at commit [`964ee4b`](https://github.com/apache/spark/commit/964ee4b00cb5928162ad893c4bc863ec934eb2e9).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207152919
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966802
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    --- End diff --
    
    Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207217912
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207160207
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r59076114
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
    @@ -42,20 +43,22 @@ import org.apache.spark.shuffle.ShuffleWriter
      * @param _initialAccums initial set of accumulators to be used in this task for tracking
      *                       internal metrics. Other accumulators will be registered later when
      *                       they are deserialized on the executors.
    + * @param localProperties copy of thread-local properties set by the user on the driver side.
      */
     private[spark] class ShuffleMapTask(
         stageId: Int,
         stageAttemptId: Int,
         taskBinary: Broadcast[Array[Byte]],
         partition: Partition,
         @transient private var locs: Seq[TaskLocation],
    -    _initialAccums: Seq[Accumulator[_]])
    -  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums)
    +    _initialAccums: Seq[Accumulator[_]],
    +    localProperties: Properties)
    +  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums, localProperties)
       with Logging {
     
       /** A constructor used only in test suites. This does not require passing in an RDD. */
       def this(partitionId: Int) {
    -    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null)
    +    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null, new Properties)
    --- End diff --
    
    It seemed safer to make it required. I can change this to an option if you think creating a Properties each time is too much overhead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207581308
  
    @srowen, suppose you have a existing service running Spark jobs that read from a custom datasource. You want to add log4j trace annotations in order to attribute datasource logs back to the original caller of the service. However you want to avoid invasive changes to the existing code. This is a two-line change with the proposed API.
    
    ```
    // in RPC server running as driver
    def receive(request: RPC) {
        sc.setLocalProperty("traceId", request.traceId)  // add this line
        ...
    }
    
    // in datasource library running on executors
    def handleRead(...) {
        log4j.MDC.put("traceId", TaskContext.getLocalProperty("traceId"))  // add this line
        ...
    }
    ```
    
    The alternative is to explicitly reference `traceId` in each of the tasks, but this would clutter application code with many references to diagnostics info, discouraging the use of diagnostic tools.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207534620
  
    It's not though, if you want to propagate something new without manually passing it through all your closures this cannot be done today.
    
    For example, consider a spark library that wants to implement a per-job `myLib.setLogLevel()` call. With context propagation you library author can provide semantics like this:
    
    ```
    myLib.setLogLevel(INFO)
    sc.parallelize(...).map(myLib.f1).filter(myLib.f2).collect()
    ```
    
    What you have to do now is something more like:
    ```
    sc.parallelize(...).map { x =>
      myLib.setLogLevel(INFO)
      myLib.f1(x)
    }.filter { y =>
      myLib.setLogLevel(INFO)
      myLib.f2(y)
    }.collect()
    ```
    
    which is more verbose and hard to maintain.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207507173
  
    Propagating references from a closure is already transparent; you just reference whatever you want like a Properties object and it goes with the task. What's the use case for something more than this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207153790
  
    **[Test build #55290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55290/consoleFull)** for PR 12248 at commit [`2542d01`](https://github.com/apache/spark/commit/2542d01efde219937ab3a7773488bcd98740acd6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207146732
  
    The MiMa failure is because `TaskContext` is a public abstract class, so adding a new method to it breaks binary compatibility for implementors of that class. However, `TaskContext` is not intended to actually be implemented by users; rather, it functions more as a public interface. Therefore, it should be safe to ignore this MiMa error:
    
    ```
    [error]  * abstract method getLocalProperty(java.lang.String)java.lang.String in class org.apache.spark.TaskContext is present only in current version
    [error]    filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.TaskContext.getLocalProperty")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207506665
  
    The main difference is that propagation is transparent to user code. For
    example, this could be used to implement something like X-trace without
    requiring manual instrumentation of closures
    
    On Fri, Apr 8, 2016, 2:40 AM Sean Owen <no...@github.com> wrote:
    
    > Backing up, how does this differ from using broadcast variables for data,
    > or for simply sending small properties objects in a closure? does it need
    > all this complexity of yet another mechanism?
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12248#issuecomment-207350278>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r59077967
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
    @@ -42,20 +43,22 @@ import org.apache.spark.shuffle.ShuffleWriter
      * @param _initialAccums initial set of accumulators to be used in this task for tracking
      *                       internal metrics. Other accumulators will be registered later when
      *                       they are deserialized on the executors.
    + * @param localProperties copy of thread-local properties set by the user on the driver side.
      */
     private[spark] class ShuffleMapTask(
         stageId: Int,
         stageAttemptId: Int,
         taskBinary: Broadcast[Array[Byte]],
         partition: Partition,
         @transient private var locs: Seq[TaskLocation],
    -    _initialAccums: Seq[Accumulator[_]])
    -  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums)
    +    _initialAccums: Seq[Accumulator[_]],
    +    localProperties: Properties)
    +  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums, localProperties)
       with Logging {
     
       /** A constructor used only in test suites. This does not require passing in an RDD. */
       def this(partitionId: Int) {
    -    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null)
    +    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null, new Properties)
    --- End diff --
    
    Fair enough, I suppose allocating the empty map/properties object isn't that expensive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207143175
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207149663
  
    To fix MiMa, add the ignores to `MimaExcludes`; I'd follow the existing convention and create a new section at https://github.com/apache/spark/blob/49fb237081bbca0d811aa48aa06f4728fea62781/project/MimaExcludes.scala#L613 and reference this JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207160209
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55290/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208594890
  
    **[Test build #55545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55545/consoleFull)** for PR 12248 at commit [`0e46c58`](https://github.com/apache/spark/commit/0e46c58685ac8a605d2d98630488083cbef95c2a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207150129
  
    **[Test build #55287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55287/consoleFull)** for PR 12248 at commit [`82646a9`](https://github.com/apache/spark/commit/82646a9abd0f10e874292de649b2dbed0db2db57).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208594533
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208650403
  
    **[Test build #55545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55545/consoleFull)** for PR 12248 at commit [`0e46c58`](https://github.com/apache/spark/commit/0e46c58685ac8a605d2d98630488083cbef95c2a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207141407
  
    **[Test build #55276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55276/consoleFull)** for PR 12248 at commit [`b2fc541`](https://github.com/apache/spark/commit/b2fc541368e30ffd455c068b709d68251d09abd5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208650713
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207152902
  
    **[Test build #55285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55285/consoleFull)** for PR 12248 at commit [`37f269e`](https://github.com/apache/spark/commit/37f269e0dd08f0acbff552bbb9d27b163d85d86f).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207913175
  
    Yes exactly, this is for implementing functionality such as tracing, where to users *any* existing code modification may be too burdensome due to e.g. too much plumbing or libraries they cannot modify.
    
    It's the same argument for thread-locals, but in this case spanning driver -> worker interactions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207160179
  
    **[Test build #55290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55290/consoleFull)** for PR 12248 at commit [`2542d01`](https://github.com/apache/spark/commit/2542d01efde219937ab3a7773488bcd98740acd6).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207274639
  
    **[Test build #55320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55320/consoleFull)** for PR 12248 at commit [`0e46c58`](https://github.com/apache/spark/commit/0e46c58685ac8a605d2d98630488083cbef95c2a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207588437
  
    This can as easily be ...
    
    ```
    Properties p = ...
    p.put("traceId", "foo")
    ...
    def handleRead(...) {
      log4j.MDC.put("traceId", p.get("traceId"))
      ...
    }
    ```
    
    I get that if `handleRead` is buried somewhere in a library function you have to plumb through access to Properties explicitly in the library. Going via static methods on a thread-local task context is a little less transparent, but it is more convenient. That's really the win, that anything in your code has direct magic access to context props; I don't think anything else actually gets simpler.
    
    I think the fact it's already an API reduces the cost of a change like this in comparison, so I can see the argument for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r59076091
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
    @@ -42,20 +43,22 @@ import org.apache.spark.shuffle.ShuffleWriter
      * @param _initialAccums initial set of accumulators to be used in this task for tracking
      *                       internal metrics. Other accumulators will be registered later when
      *                       they are deserialized on the executors.
    + * @param localProperties copy of thread-local properties set by the user on the driver side.
      */
     private[spark] class ShuffleMapTask(
         stageId: Int,
         stageAttemptId: Int,
         taskBinary: Broadcast[Array[Byte]],
         partition: Partition,
         @transient private var locs: Seq[TaskLocation],
    -    _initialAccums: Seq[Accumulator[_]])
    -  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums)
    +    _initialAccums: Seq[Accumulator[_]],
    +    localProperties: Properties)
    +  extends Task[MapStatus](stageId, stageAttemptId, partition.index, _initialAccums, localProperties)
       with Logging {
     
       /** A constructor used only in test suites. This does not require passing in an RDD. */
       def this(partitionId: Int) {
    -    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null)
    +    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null, new Properties)
    --- End diff --
    
    Properties objects are kind of analogous to `Maps` and I think that `Option[Map]` would be kind of a weird type in the same sense that `Option[Set]` (or any other collection type) is usually kind a weird code-smell So, this is fine with me as is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207275684
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55320/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966987
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    +    val propBytes = Utils.serialize(task.localProperties)
    --- End diff --
    
    Hmm, good point. `Utils.serialize` is fine here, since it doesn't matter whether we use a custom serializer here and because today it's always going to be `JavaSerializer` anyways.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12248


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966806
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    +    val propBytes = Utils.serialize(task.localProperties)
    --- End diff --
    
    Wasn't sure how to deserialize on the Executor side. Perhap env.serializer there?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966470
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    +    val propBytes = Utils.serialize(task.localProperties)
    +    dataOut.writeInt(propBytes.length)
    +    dataOut.write(propBytes, 0, propBytes.length)
    --- End diff --
    
    Here, I think you can simply do `dataOut.write(propBytes)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966245
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    --- End diff --
    
    Since the properties aren't transient in `Task`, I guess this means that we'll write them out twice. If we want to avoid this, we can make `localProperties` into a `@transient` `var` which is `private[spark]` then re-set the field after deserializing the task. Tasks are send to executors using broadcast variables, so the extra space only makes a different for the first task from a stage that's run on an executor.
    
    As a result, if we think that these serialized properties will typically be small then the extra space savings probably aren't a huge deal, but if we want to heavily optimize then we can do the `var` trick.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-207152849
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55287/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208650715
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55545/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12248#issuecomment-208594528
  
    It seems like we agree that this API is easy-to-support in Spark and hard/impossible to implement as cleanly in client code. As a result, I think this is okay to merge, so I'm going to run this one more time and will merge a bit after Jenkins passes. If anyone thinks that we need more discussion before accepting this API, let me know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14475] Propagate user-defined context f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12248#discussion_r58966411
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -206,6 +210,11 @@ private[spark] object Task {
           dataOut.writeLong(timestamp)
         }
     
    +    // Write the task properties separately so it is available before full task deserialization.
    +    val propBytes = Utils.serialize(task.localProperties)
    --- End diff --
    
    Just curious, why not `serializer.serialize(..)`? This is fine, but just wondeirng.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org