You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by vanzin <gi...@git.apache.org> on 2014/09/10 00:29:39 UTC

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/2337

    [SPARK-3446] Expose underlying job ids in FutureAction.

    FutureAction is the only type exposed through the async APIs, so
    for job IDs to be useful they need to be exposed there. The complication
    is that some async jobs run more than one job (e.g. takeAsync),
    so the exposed ID has to actually be a list of IDs that can actually
    change over time. So the interface doesn't look very nice, but...
    
    Change is actually small, I just added a basic test to make sure
    it works.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-3446

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2337.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2337
    
----
commit 1fed2bcbf7e68c342c23b154f7843be0010b9467
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2014-09-09T22:26:31Z

    [SPARK-3446] Expose underlying job ids in FutureAction.
    
    FutureAction is the only type exposed through the async APIs, so
    for job IDs to be useful they need to be exposed there. The complication
    is that some async jobs run more than one job (e.g. takeAsync),
    so the exposed ID has to actually be a list of IDs that can actually
    change over time. So the interface doesn't look very nice, but...
    
    Change is actually small, I just added a basic test to make sure
    it works.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-55047494
  
    I don't understand this claim: "...for job IDs to be useful they need to be exposed there."  Could you clarify, please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56207872
  
    It would be good to test the complex case with multiple job ids, but overall looks good. @rxin you added this interface - can you take a look (this is a very small patch)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56407988
  
    > This is kind of messy (this isn't a documented / stable API, though).
    
    More than that, it's `private[spark]`, which means I have to hardcode the string in my code and hope it never changes...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17796741
  
    --- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
    @@ -83,6 +83,15 @@ trait FutureAction[T] extends Future[T] {
        */
       @throws(classOf[Exception])
       def get(): T = Await.result(this, Duration.Inf)
    +
    +  /**
    +   * Returns the job IDs run by the underlying async operation.
    +   *
    +   * This returns the current snapshot of the job list. Certain operations may run multiple
    +   * job, so multiple calls to this method may return different lists.
    --- End diff --
    
    multiple jobs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56216044
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20583/consoleFull) for   PR 2337 at commit [`e166a68`](https://github.com/apache/spark/commit/e166a680575ae96032d7ca03aba4566105cdb388).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2337


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-55048286
  
    The point of adding the "jobId" method to SimpleFutureAction was so that code calling these async methods knew the IDs of the jobs they were triggering (see SPARK-2636). Except the job ID is not really exposed at all since SimpleFutureAction is not exposed through the async APIs.
    
    (Sure you could cast the result, but that's ugly, and that also does not cover ComplexFutureAction.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56408573
  
    > More than that, it's private[spark], which means I have to hardcode the string in my code and hope it never changes...
    
    Yeah, I wasn't suggesting that as a substitute for a real public API.
    
    > Just to be clear, I'm ok with switching to using job groups to achieve what HoS needs (and close this PR/bug), but even that path seems like it could use some changes to make the lives of people using the API easier.
    
    I think there are two separate design issues here:
    
    1. How do I get the jobId associated with a job that I've launched?
    2. Given that I know the jobId, how do I find out more information about the status of that job?
    
    Let's keep this open for now, since this PR sounds like an okay way to address 1) and these two concerns are largely orthogonal.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17797122
  
    --- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
    @@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
       // is cancelled before the action was even run (and thus we have no thread to interrupt).
       @volatile private var _cancelled: Boolean = false
     
    +  @volatile private var jobs: Seq[Int] = Nil
    --- End diff --
    
    Just wondering - any reason to make this a `var` instead of a `val` ListBuffer? And then we could return an immutable `Seq` in jobIds?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56280786
  
    My initial thought was that a "job group"-based approach might be a bit cleaner, but there are a few subtleties with that proposal that we need to consider.
    
    What if we had an API that accepts a "job group" and returns the ids of jobs in that group?  e.g.
    
    ```scala
    // in SparkContext:
    def getJobsForGroup(jobGroup: String): Seq[Int]
    ```
    
    What should we return here?  The list of _active_ jobs?  All jobs, including failed and completed ones?  
    
    It looks like this pull request addresses a case where you run some asynchronous action and want to retrieve the ids of all jobs associated with that action.  If `getJobsForGroup` returned all jobs, then it would be the caller's responsibility to "diff" the results before and after invoking the action to figure out which jobs were created as a result of that action.  This sort of works if you only have one thread launching jobs in that job group; if you had multiple threads, then you wouldn't be able to separate the jobs added by each thread (short of using a unique job group per-thread).  There might be use-cases where I want to fire off a bunch of asynchronous actions (using these async. APIs) then periodically poll the status of all my FutureActions and it seems like this could get really messy if I had to switch the job group before launching each action.
    
    Another subtle problem with this `getJobsForGroup` method: eventually, we'll have to garbage-collect entries from the jobGroup to jobIds mapping, which could lead to weird results when diffing two return values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56224763
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20583/consoleFull) for   PR 2337 at commit [`e166a68`](https://github.com/apache/spark/commit/e166a680575ae96032d7ca03aba4566105cdb388).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56107566
  
    Ping.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17796804
  
    --- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import scala.concurrent.Await
    +import scala.concurrent.duration.Duration
    +
    +import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
    +
    +import org.apache.spark.SparkContext._
    +
    +class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers with LocalSparkContext {
    +
    +  before {
    +    sc = new SparkContext("local", "FutureActionSuite")
    --- End diff --
    
    can you add a test here for the case when multiple job id's are used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2337#issuecomment-56406159

Lots of questions, let's go one by one.

##### Motivation

This is discussed in SPARK-2636 (and probably a coupe of others), but I'll try to summarize it quickly here. Hive-on-Spark generates multiple jobs for a single query, and needs to monitor and collect metrics for each of those jobs - separately. The way to do this in Spark is through the use of `SparkListener`. But the missing piece is that when you call an action such as `collect()` or `saveAsHadoopFile()`, that does not return a job ID in any way. So HoS was using the async API, since that was the recommended workaround, and the fix for SPARK-2636 added the job's ID to the `FutureAction` API. The problem is that it did not expose the job IDs correctly, which is why I filed this bug and sent this PR.

##### Job Groups

I was not familiar with the API and it sounds great to me. It would make monitoring jobs in my remote API prototype (SPARK-3215) much cleaner. The only missing piece from looking at the API is that I don't see "job group" anywhere in the events sent to listeners. e.g.:

case class SparkListenerJobStart(jobId: Int, stageIds: Seq[Int], properties: Properties = null)
extends SparkListenerEvent

Unless `properties` contains the job group info somehow, which would look a little brittle to me but I can work with, that's something that would need to be fixed for HoS to be able to gather the information it needs.

##### Async API vs. Something Else

I'm not sold on using the async API, and in fact its use in my remote client prototype looks sort of hacky and ugly. But currently that's the only way to gather the information HoS needs. Any substitute needs to allow the caller to match events to the job that was submitted, which is not possible via other means today (or, at least, not that I can see).

I assume that job groups still work OK with the current async API, since the thread local data is using an `InheritableThreadLocal`.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56309633
  
    I've opened #2482 , a pull request (WIP) illustrating my proposal to remove `AsyncRDDActions` and replace it with a more general mechanism for asynchronously launching and monitoring Spark jobs. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-57572402
  
    I've given it some thought and I don't think that we should merge the more general async. mechanism that I described in #2482.  It had some confusing semantics surrounding cancellation (see the discussion of Thread.interrupt) and was probably more general than what most users need.
    
    Given that we should probably keep the current async APIs, this PR's change looks good.  I'm going to merge this into `master`.  Thanks for this commit and sorry for the long review delay!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-55053465
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20055/consoleFull) for   PR 2337 at commit [`1fed2bc`](https://github.com/apache/spark/commit/1fed2bcbf7e68c342c23b154f7843be0010b9467).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-55048739
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20055/consoleFull) for   PR 2337 at commit [`1fed2bc`](https://github.com/apache/spark/commit/1fed2bcbf7e68c342c23b154f7843be0010b9467).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17821379
  
    --- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
    @@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
       // is cancelled before the action was even run (and thus we have no thread to interrupt).
       @volatile private var _cancelled: Boolean = false
     
    +  @volatile private var jobs: Seq[Int] = Nil
    --- End diff --
    
    cool, makes sense


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56278413
  
    The API is slightly awkward as you suggested. Is this intended to get job progress? If yes, maybe we can do that through the "job group" to get the list of job ids?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-57666617
  
    Thanks Josh. I meant to comment on your other PR (also about the weird cancellation semantics), but life got in the way. :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17800549
  
    --- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
    @@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
       // is cancelled before the action was even run (and thus we have no thread to interrupt).
       @volatile private var _cancelled: Boolean = false
     
    +  @volatile private var jobs: Seq[Int] = Nil
    --- End diff --
    
    I'm trying to avoid synchonization. Having a mutable list here means I'd have to synchronize when returning the immutable Seq in `jobIds`; with the volatile var, I'm only doing read operations on the `Seq`s themselves, so I don't need to synchronize.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56281339
  
    @rxin @pwendell Since we have job groups and the ability to cancel all jobs running in a job group (`sc.cancelJobGroup()`), then why do we need FutureAction?  It looks like the only benefit that it offers over regular Scala `Future` is cancellation.
    
    I imagine that many developers would like to be able to fire off an entire workflow, potentially comprising multiple actions, monitor its overall progress, and cancel the whole thing.  It seems like job groups offer a strictly more powerful set of features that allows users to perform progress-monitoring and cancellation on entire workflows, not just individual actions.
    
    If the motivation for FutureAction is that job groups are inconvenient for simple things, then I think we can address that by adding convenience wrappers that act like Python context managers and make it easy to run a block of code inside of a particular job group.  Or, we could add an API that executes an arbitrary user-defined code block using a specified job group and returns a cancelable future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17821378
  
    --- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import scala.concurrent.Await
    +import scala.concurrent.duration.Duration
    +
    +import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
    +
    +import org.apache.spark.SparkContext._
    +
    +class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers with LocalSparkContext {
    +
    +  before {
    +    sc = new SparkContext("local", "FutureActionSuite")
    --- End diff --
    
    ah got it, my bad.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2337#discussion_r17800470
  
    --- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark
    +
    +import scala.concurrent.Await
    +import scala.concurrent.duration.Duration
    +
    +import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
    +
    +import org.apache.spark.SparkContext._
    +
    +class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers with LocalSparkContext {
    +
    +  before {
    +    sc = new SparkContext("local", "FutureActionSuite")
    --- End diff --
    
    Isn't that the test on L41 ("complex async action")?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56283750
  
    @vanzin it would be helpful to hear what the needs are for Hive on Spark. Other applications I've seen have been using the job group for this purpose. And this will actually work even if a query involves multiple jobs (which using this Future interface would make that much harder).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56407152
  
    > Unless properties contains the job group info somehow.
    
    It does, actually; the property is named `SparkContext.SPARK_JOB_GROUP_ID`.  Since `properties` can technically be `null`, you could use something like
    
    
    ```scala
    val jobGroupId = Option(properties).map(_.get(SparkContext.SPARK_JOB_GROUP_ID)).orNull)
    ```
    
    to check the job group.  This is kind of messy (this isn't a documented / stable API, though).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56408155
  
    Just to be clear, I'm ok with switching to using job groups to achieve what HoS needs (and close this PR/bug), but even that path seems like it could use some changes to make the lives of people using the API easier.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org