Posted to reviews@spark.apache.org by erikerlandson <gi...@git.apache.org> on 2014/07/31 16:27:05 UTC

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

GitHub user erikerlandson opened a pull request:

    https://github.com/apache/spark/pull/1689

    [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/erikerlandson/spark spark-1021-pr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1689.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1689
    
----
commit fa9bbca42d423b3cc9b10063073f7ab7507b1922
Author: Erik Erlandson <ee...@redhat.com>
Date:   2014-07-30T22:59:27Z

    [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55457077
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20236/consoleFull) for   PR 1689 at commit [`50b6da6`](https://github.com/apache/spark/commit/50b6da6234188a147508654b08e6b67cbf01fbec).
     * This patch merges cleanly.



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-61508261
  
    @marmbrus, FWIW, the `correlationoptimizer14` test appears to be working for me.   I ran it using: `env _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true ./dev/run-tests > ~/run-tests-1021.txt 2>&1`
    
    Not sure why, but running `sbt -Dspark.hive.whitelist=correlationoptimizer14 hive/test` was not causing the test to run in my environment.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r15919352
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       }
     
       @throws(classOf[IOException])
    -  private def readObject(in: ObjectInputStream) {
    +  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
    +    if (valRB != null) return
    --- End diff --
    
    Do we not want to deserialize valRB if it is not null? Are you worried rangeBounds might be called while the deserialization is happening? 
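For context on the guard being discussed: it follows the double-checked lazy-initialization idiom. A minimal plain-Scala sketch (independent of Spark; `LazyBounds` and `compute` are hypothetical stand-ins for the partitioner and its expensive bounds computation):

```scala
// Sketch of double-checked lazy init: the @volatile read avoids taking the
// lock once the value is published, and the re-check inside synchronized
// prevents the expensive computation from running twice.
class LazyBounds {
  @volatile private var cached: Array[Int] = null
  private var computeCount = 0 // guarded by this.synchronized

  def bounds: Array[Int] = {
    if (cached == null) {
      this.synchronized {
        if (cached == null) cached = compute()
      }
    }
    cached
  }

  // Hypothetical stand-in for the expensive, data-driven bounds computation.
  private def compute(): Array[Int] = {
    computeCount += 1
    Array(10, 20, 30)
  }

  def computations: Int = this.synchronized { computeCount }
}
```

Under this pattern, returning early when the field is already non-null is what keeps deserialization (or any later caller) from clobbering an already-published value.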




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-51424177
  
    QA results for PR 1689:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r15931609
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       private var ordering = implicitly[Ordering[K]]
     
       // An array of upper bounds for the first (partitions - 1) partitions
    -  private var rangeBounds: Array[K] = {
    -    if (partitions <= 1) {
    +  @volatile private var valRB: Array[K] = null
    --- End diff --
    
    My understanding of getPartitions was that it executes once, and is therefore "allowed to be expensive".  Also, isn't rangeBounds generally only returning a reference to the array?  (except for the first time, where it's computed)




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-50765803
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55458226
  
    @erikerlandson  thanks for looking at this.
    
    A few questions:
    
    1. After this pull request, does anything still use SimpleFutureAction?
    2. If I understand this correctly, this could potentially block the single-threaded scheduler from doing anything else while waiting for the rangeBounds to be computed. Any comment on this?
    3. This is still not always lazy, right? See this test case:
    ```scala
    sc.parallelize(1 to 1000).map(x => (x, x)).sortByKey().join(sc.parallelize(1 to 10).map(x => (x, x)))
    ```
    
    





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1689




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55628401
  
    Or, maybe just look into playing the same game with the cogrouped RDDs that I did with sortByKey.   Don't get into invoking `defaultPartitioner` until somebody asks for the output RDD's partitioner, etc.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-52336202
  
    Latest push updates the RangePartitioner sampling job to be async, and updates the async action functions so that they properly enclose the sampling job induced by calling `partitions`.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55456438
  
    Jenkins, test this please.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-54694535
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57043862
  
    BTW one thing that would be great to add is a test that makes sure we don't block the main dag scheduler thread. The reason I think we don't block is that we call rdd.partitions.length in submitJob:
    
    ```scala
    
      /**
       * Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
   * can be used to block until the job finishes executing or can be used to cancel the job.
       */
      def submitJob[T, U](
          rdd: RDD[T],
          func: (TaskContext, Iterator[T]) => U,
          partitions: Seq[Int],
          callSite: CallSite,
          allowLocal: Boolean,
          resultHandler: (Int, U) => Unit,
          properties: Properties = null): JobWaiter[U] =
      {
        // Check to make sure we are not launching a task on a partition that does not exist.
        val maxPartitions = rdd.partitions.length
    ```




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55797086
  
    Yea I don't think we need to fully solve 3 here.
    
    My main concern with this set of changes is 2, since a single badly behaved RDD can potentially block the (unfortunately single-threaded) scheduler forever. Let me think about this a little bit and get back to you.
    
    If you have an idea about how to fix that, feel free to suggest it.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55805772
  
    So far the best idea I have for (2) is to set some kind of time-out on the evaluation.  The bounds computation uses subsampling that will (when all goes well) cap the computation at constant time.  If the timeout triggers, some sub-optimal fallback partitioning might be used, or the entire evaluation could simply fail.
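That timeout idea could look roughly like this in plain Scala (a sketch only; `boundsOrFallback` is a hypothetical helper, not part of the patch):

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Wait a bounded time for the data-driven bounds; if the evaluation
// overruns, fall back to a caller-supplied (sub-optimal) partitioning.
def boundsOrFallback[T](bounds: Future[T], timeout: Duration, fallback: => T): T =
  try Await.result(bounds, timeout)
  catch { case _: TimeoutException => fallback }
```

Blocking in `Await.result` has the same scheduler-thread caveat as point (2) above, so where the wait happens matters as much as the bound itself.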




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55627362
  
    Hi @rxin,
    
    1) SimpleFutureAction is still referred to in the submitJob method, but that doesn't appear to be invoked anywhere.  I was reluctant to get rid of it, as it's all experimental, and I could envision use cases for it.
    
    2) I see your point.  I don't currently have any clever ideas to avoid that scenario when it happens.
    
    3) Very interesting -- so this scenario is triggered because `defaultPartitioner` starts examining input RDD partitioners, which sets off the job when it trips over the data driven partitioning computation from `sortByKey`.
    
    My impression is that this whack-a-mole with non-laziness stems from a combination of (a) data-dependent partitioners, and (b) methods that refer to input partitioners as part of the construction of new RDDs.   It *might* be possible to thread some design changes through so that references to partitioning are consistently encapsulated in a Future.  Functions such as `defaultPartitioner` would then also have to return a Future, etc.  Or, even more generally, somehow encapsulate *all* RDD initialization in a Future, with the idea that these futures would finally unwind when some Action was invoked.
    
    However it seems (imo) outside the scope of this particular Jira/PR.  Maybe we could start another umbrella Jira to track possible solutions along these lines.
    
    Another orthogonal thought -- you can short circuit all this by providing a partitioner instead of forcing it to be computed from data.  That's not as sexy, or widely applicable, as some deeper fix to the problem, but users can do it now as a workaround when it's feasible.
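The Future-encapsulation idea above might look like this in plain Scala (hypothetical `DeferredRangePartitioner`; real Spark partitioners are not structured this way):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical sketch: a partitioner whose bounds are a Future, so that
// constructing downstream RDDs need not force the data-driven computation;
// the future only unwinds when an action finally asks for a partition.
class DeferredRangePartitioner(boundsF: Future[Array[Int]]) {
  def getPartition(key: Int): Int = {
    val rb = Await.result(boundsF, 10.seconds) // forced only here
    rb.count(key > _)                          // index of first bound >= key
  }
}
```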





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57108427
  
    I reverted this commit. @erikerlandson mind taking a look at this problem?





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-50824343
  
    Jenkins, this is ok to test.



[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r15919599
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       private var ordering = implicitly[Ordering[K]]
     
       // An array of upper bounds for the first (partitions - 1) partitions
    -  private var rangeBounds: Array[K] = {
    -    if (partitions <= 1) {
    +  @volatile private var valRB: Array[K] = null
    --- End diff --
    
    Any idea on volatile's impact on read performance? rangeBounds is read multiple times in getPartition.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-52397817
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18675/consoleFull) for   PR 1689 at commit [`f3448e4`](https://github.com/apache/spark/commit/f3448e47b671570cc95c99f809b68b9382c5cd1b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r18122212
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       private var ordering = implicitly[Ordering[K]]
     
       // An array of upper bounds for the first (partitions - 1) partitions
    -  private var rangeBounds: Array[K] = {
    -    if (partitions <= 1) {
    +  @volatile private var valRB: Array[K] = null
    --- End diff --
    
    this is going to be called once for every record on workers actually. 




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-55464403
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20236/consoleFull) for   PR 1689 at commit [`50b6da6`](https://github.com/apache/spark/commit/50b6da6234188a147508654b08e6b67cbf01fbec).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r15920203
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       private var ordering = implicitly[Ordering[K]]
     
       // An array of upper bounds for the first (partitions - 1) partitions
    -  private var rangeBounds: Array[K] = {
    -    if (partitions <= 1) {
    +  @volatile private var valRB: Array[K] = null
    --- End diff --
    
    It wouldn't surprise me if this performance figure varied with different combinations of hardware and Java version; but for at least one such combination, volatile reads are roughly 2-3x as costly as non-volatile reads as long as they are uncontended -- much more expensive when there are concurrent writes to contend with. http://brooker.co.za/blog/2012/09/10/volatile.html 
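A common mitigation for that per-record cost, sketched here in plain Scala (hypothetical `RangeLookup`, not the actual patch): copy the volatile field into a local once per call, so the hot search loop performs ordinary reads rather than a volatile read per comparison.

```scala
// Sketch: cache the @volatile array in a local so the search loop
// pays the volatile-read cost once per getPartition call.
class RangeLookup(@volatile private var bounds: Array[Int]) {
  def getPartition(key: Int): Int = {
    val rb = bounds // single volatile read per call
    var i = 0
    while (i < rb.length && key > rb(i)) i += 1
    i
  }
}
```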




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-50824621
  
    QA tests have started for PR 1689. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull



[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r15900503
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -113,8 +113,13 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       private var ordering = implicitly[Ordering[K]]
     
       // An array of upper bounds for the first (partitions - 1) partitions
    -  private var rangeBounds: Array[K] = {
    -    if (partitions <= 1) {
    +  private var valRB: Array[K] = Array()
    --- End diff --
    
    Can we perhaps make this thread safe? 




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-52336221
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for   PR 1689 at commit [`09f0637`](https://github.com/apache/spark/commit/09f0637ac5ff986701d76c874b6567313022a0ab).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57106886
  
    Since this PR was merged, the correlationoptimizer14 test has been hanging.  We might want to consider rolling back.  You can reproduce the problem as follows: `sbt -Dspark.hive.whitelist=correlationoptimizer14 hive/test`




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57043930
  
    Have either of you thought about how to coordinate this with Josh's work on SPARK-3626? https://github.com/apache/spark/pull/2482




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57043822
  
    @erikerlandson i'm going to merge this first. Maybe we can do the cleanup later.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57043705
  
    Actually I looked at it again. I don't think it would block the scheduler because we compute partitions outside the scheduler thread. This approach looks good to me! 




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-50829158
  
    QA results for PR 1689:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test output:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull



[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-57110142
  
    @rxin @marmbrus I will check it out




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-51421389
  
    QA tests have started for PR 1689. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-52339006
  
    Excellent!  I'll try to find some time to review this soon.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-52342401
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for   PR 1689 at commit [`09f0637`](https://github.com/apache/spark/commit/09f0637ac5ff986701d76c874b6567313022a0ab).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1689#issuecomment-52400243
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18675/consoleFull) for   PR 1689 at commit [`f3448e4`](https://github.com/apache/spark/commit/f3448e47b671570cc95c99f809b68b9382c5cd1b).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r18122214
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       private var ordering = implicitly[Ordering[K]]
     
       // An array of upper bounds for the first (partitions - 1) partitions
    -  private var rangeBounds: Array[K] = {
    -    if (partitions <= 1) {
    +  @volatile private var valRB: Array[K] = null
    --- End diff --
    
    Maybe we can rename valRB to _rangeBounds and use this directly in getPartitions?
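
    For context, the pattern under discussion can be sketched as follows. This is a minimal illustration, not the actual Spark code: the class and field names (`LazyRangePartitioner`, `_rangeBounds`, `sample`) are hypothetical, and the bounds computation is simplified to a sort-and-take. The point is the `@volatile` backing field plus a synchronized, double-checked accessor, so the expensive data-driven computation is deferred until the bounds are first needed.

    ```scala
    import scala.reflect.ClassTag

    // Hypothetical sketch of a deferred-bounds partitioner: the data-driven
    // range bounds are computed lazily on first access rather than eagerly
    // in the constructor.
    class LazyRangePartitioner[K: Ordering: ClassTag](
        partitions: Int,
        sample: => Array[K]) {  // by-name: not evaluated until bounds are needed

      // Backing field; null means "not yet computed". @volatile makes the
      // published array visible to other threads without locking on reads.
      @volatile private var _rangeBounds: Array[K] = null

      // Double-checked locking: only the first caller pays for the computation.
      private def rangeBounds: Array[K] = {
        if (_rangeBounds == null) {
          this.synchronized {
            if (_rangeBounds == null) {
              _rangeBounds =
                if (partitions <= 1) Array.empty[K]
                else sample.sorted.take(partitions - 1)
            }
          }
        }
        _rangeBounds
      }

      def numPartitions: Int = rangeBounds.length + 1
    }
    ```

    Renaming `valRB` to `_rangeBounds`, as suggested, would make the backing-field role of the variable explicit while the public-facing accessor keeps the original `rangeBounds` name.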




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r18122197
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       }
     
       @throws(classOf[IOException])
    -  private def readObject(in: ObjectInputStream) {
    +  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
    +    if (valRB != null) return
    --- End diff --
    
    that's not possible




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1689#discussion_r15931660
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       }
     
       @throws(classOf[IOException])
    -  private def readObject(in: ObjectInputStream) {
    +  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
    +    if (valRB != null) return
    --- End diff --
    
    also was assuming readObject might be called in multiple threads.   Can that happen?
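
    The guard in the diff above can be illustrated in isolation. This is a hedged sketch with hypothetical names (`Bounds`, `cached`), not the Spark implementation: `writeObject` forces the lazy computation so the receiver gets concrete bounds, and `readObject` is defensively synchronized and returns early if the field was somehow already materialized.

    ```scala
    import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

    // Hypothetical class showing the custom-serialization guard pattern.
    class Bounds(@transient private var data: Array[Int]) extends Serializable {

      // Lazily computed; transient because writeObject ships it explicitly.
      @volatile @transient private var cached: Array[Int] = null

      private def bounds: Array[Int] = {
        if (cached == null) this.synchronized {
          if (cached == null) cached = data.sorted
        }
        cached
      }

      @throws(classOf[IOException])
      private def writeObject(out: ObjectOutputStream): Unit = {
        out.defaultWriteObject()
        out.writeObject(bounds) // force computation before shipping
      }

      @throws(classOf[IOException])
      private def readObject(in: ObjectInputStream): Unit = this.synchronized {
        if (cached != null) return // already materialized; nothing to do
        in.defaultReadObject()
        cached = in.readObject().asInstanceOf[Array[Int]]
      }

      def first: Int = bounds.head
    }
    ```

    In practice the JVM deserializes each object instance on a single thread, so the synchronization here is belt-and-braces rather than strictly required, which matches the reviewer's reply.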

