Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2015/02/16 22:31:35 UTC

[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/4629

    [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark

    Currently, PySpark does not use a narrow dependency for cogroup/join even when the two RDDs share the same partitioner, so an unnecessary extra shuffle stage is introduced.
    
    The Python implementation of cogroup/join differs from the Scala one: it is built on union() and partitionBy(). This patch uses PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also fixes `preservesPartitioning` in all the map() and mapPartitions() calls, so that partitionBy() can skip the unnecessary shuffle stage.
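
    A minimal usage sketch (illustrative only, not part of the patch) of the kind of job this change speeds up; with the patch, joining two co-partitioned RDDs should no longer trigger an extra shuffle stage:

        from pyspark import SparkContext

        # Hypothetical example: two RDDs partitioned by the same hash partitioner.
        sc = SparkContext("local[2]", "narrow-join-sketch")
        a = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(4)
        b = sc.parallelize(range(100)).map(lambda x: (x, x * x)).partitionBy(4)
        # join() is built on union() + partitionBy(); since a and b already share a
        # partitioner, the union can be partitioner-aware and the final partitionBy()
        # becomes a narrow (no extra shuffle) dependency.
        print(a.join(b).count())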

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark narrow

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4629
    
----
commit eb26c62f4a3dc5920df2d2624918826d32d97bb5
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-16T21:17:11Z

    narrow dependency in PySpark

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24778143
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -330,6 +331,15 @@ private[spark] object PythonRDD extends Logging {
       }
     
       /**
    +   * Return an RDD of values from an RDD of (Long, Array[Byte]), with preservePartitions=true
    +   *
    +   * This is useful for PySpark to have the partitioner after partitionBy()
    +   */
    +  def valueOfPair(pair: JavaPairRDD[Long, Array[Byte]]): JavaRDD[Array[Byte]] = {
    --- End diff --
    
    I think that `JavaPairRDD.values` should do the same thing; is there a reason why we can't call that directly from Python?




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74621490
  
      [Test build #27612 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27612/consoleFull) for   PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Partitioner(object):`





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24861247
  
    --- Diff: python/pyspark/tests.py ---
    @@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
             converted_rdd = RDD(data_python_rdd, self.sc)
             self.assertEqual(2, converted_rdd.count())
     
    +    def test_narrow_dependency_in_join(self):
    +        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
    --- End diff --
    
    nice!




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24778859
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -330,6 +331,15 @@ private[spark] object PythonRDD extends Logging {
       }
     
       /**
    +   * Return an RDD of values from an RDD of (Long, Array[Byte]), with preservePartitions=true
    +   *
    +   * This is useful for PySpark to have the partitioner after partitionBy()
    +   */
    +  def valueOfPair(pair: JavaPairRDD[Long, Array[Byte]]): JavaRDD[Array[Byte]] = {
    --- End diff --
    
    In the Scala/Java API, RDD.values() changes an RDD of (K, V) into an RDD of V, so `preservePartitions` should not be `true`.
    
    For PySpark, it changes the RDD from (hash, [(K, V)]) to (K, V), so `preservePartitions` should be `true`.
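
    A minimal PySpark-level sketch of the visible effect (assuming a live SparkContext `sc`); the (hash, [(K, V)]) pairing is internal to partitionBy(), but the user-facing result is that the partitioner survives the conversion back to an RDD of (K, V):

        pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")]).partitionBy(2)
        # partitionBy() shuffles records by hash(K); valueOfPair then unwraps the
        # internal (hash, serialized batch) pairs with preservePartitions=true, so
        # the Python RDD still reports its partitioner afterwards.
        print(pairs.partitioner)   # a Partitioner object, not None
        print(pairs.collect())     # plain (K, V) pairs again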




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74583747
  
      [Test build #27573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27573/consoleFull) for   PR 4629 at commit [`eb26c62`](https://github.com/apache/spark/commit/eb26c62f4a3dc5920df2d2624918826d32d97bb5).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74581079
  
      [Test build #27587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27587/consoleFull) for   PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24778077
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -961,7 +961,14 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /** Build the union of a list of RDDs. */
    -  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
    +  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
    +    val partitioners = rdds.map(_.partitioner).toSet
    +    if (partitioners.size == 1 && partitioners.head.isDefined) {
    +      new PartitionerAwareUnionRDD(this, rdds)
    +    } else {
    +      new UnionRDD(this, rdds)
    +    }
    +  }
     
       /** Build the union of a list of RDDs passed as variable-length arguments. */
       def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
    --- End diff --
    
    Can we change this method to call the `union` method that you modified so the change will take effect here, too?




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24857765
  
    --- Diff: python/pyspark/tests.py ---
    @@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
             converted_rdd = RDD(data_python_rdd, self.sc)
             self.assertEqual(2, converted_rdd.count())
     
    +    def test_narrow_dependency_in_join(self):
    +        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
    --- End diff --
    
    I've merged #3027, so I think we can now test this by setting a job group, running a job, then querying the statusTracker to determine how many stages were actually run as part of that job.
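
    A rough sketch of such a check (assuming the PySpark status API from #3027: setJobGroup(), statusTracker(), getJobIdsForGroup(), getJobInfo()):

        sc.setJobGroup("narrow_join_check", "check narrow dependency in join")
        a = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(2)
        b = sc.parallelize(range(100)).map(lambda x: (x, 2 * x)).partitionBy(2)
        a.join(b).count()
        tracker = sc.statusTracker()
        job_id = tracker.getJobIdsForGroup("narrow_join_check")[0]
        # With the narrow dependency, the join's internal partitionBy() should not add
        # a shuffle stage beyond the two partitionBy() stages and the result stage.
        print(len(tracker.getJobInfo(job_id).stageIds))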




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24780473
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -961,7 +961,14 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /** Build the union of a list of RDDs. */
    -  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
    +  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
    +    val partitioners = rdds.map(_.partitioner).toSet
    +    if (partitioners.size == 1 && partitioners.head.isDefined) {
    +      new PartitionerAwareUnionRDD(this, rdds)
    +    } else {
    +      new UnionRDD(this, rdds)
    +    }
    +  }
     
       /** Build the union of a list of RDDs passed as variable-length arguments. */
       def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
    --- End diff --
    
    fixed




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24852684
  
    --- Diff: python/pyspark/tests.py ---
    @@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
             converted_rdd = RDD(data_python_rdd, self.sc)
             self.assertEqual(2, converted_rdd.count())
     
    +    def test_narrow_dependency_in_join(self):
    +        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
    --- End diff --
    
    do these tests actually check for a narrow dependency at all?  I think they will pass even without it.
    
    I'm not sure of a better suggestion, though.  I had to use `getNarrowDependencies` in another PR to check this:
    https://github.com/apache/spark/pull/4449/files#diff-4bc3643ce90b54113cad7104f91a075bR582
    
    but I don't think that is even exposed in pyspark ...




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74628437
  
      [Test build #611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/611/consoleFull) for   PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74590835
  
      [Test build #27587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27587/consoleFull) for   PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Partitioner(object):`
      * `case class ParquetRelation2(`





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74585242
  
      [Test build #27582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27582/consoleFull) for   PR 4629 at commit [`ff5a0a6`](https://github.com/apache/spark/commit/ff5a0a6b5dd408f2a177459e6b5498ea72f57b85).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Partitioner(object):`





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74776715
  
      [Test build #27657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27657/consoleFull) for   PR 4629 at commit [`dffe34e`](https://github.com/apache/spark/commit/dffe34ee262aa098c12323fd27995ce9f542fa95).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Partitioner(object):`
      * `class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):`
      * `class SparkStageInfo(namedtuple("SparkStageInfo",`
      * `class StatusTracker(object):`





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24787685
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -961,11 +961,18 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       }
     
       /** Build the union of a list of RDDs. */
    -  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
    +  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
    +    val partitioners = rdds.map(_.partitioner).toSet
    --- End diff --
    
    If `_.partitioner` is an `Option`, then I think this can be simplified by using `flatMap` instead of `map`, since that would let you check whether `partitioners.size == 1` on the next line without needing the `isDefined` check as well.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74789577
  
    Thanks for adding the test.
    
    LGTM, so I'm going to merge this into `master` (1.4.0) and `branch-1.3` (1.3.0).  Thanks!




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74621421
  
      [Test build #27612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27612/consoleFull) for   PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74579783
  
      [Test build #27583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27583/consoleFull) for   PR 4629 at commit [`940245e`](https://github.com/apache/spark/commit/940245e37bf08492d6b5cd7cd82f8f0886f6f8ca).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74590143
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27583/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24857907
  
    --- Diff: python/pyspark/tests.py ---
    @@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
             converted_rdd = RDD(data_python_rdd, self.sc)
             self.assertEqual(2, converted_rdd.count())
     
    +    def test_narrow_dependency_in_join(self):
    +        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
    --- End diff --
    
    done!




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4629




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74575564
  
      [Test build #27573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27573/consoleFull) for   PR 4629 at commit [`eb26c62`](https://github.com/apache/spark/commit/eb26c62f4a3dc5920df2d2624918826d32d97bb5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74579230
  
      [Test build #27582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27582/consoleFull) for   PR 4629 at commit [`ff5a0a6`](https://github.com/apache/spark/commit/ff5a0a6b5dd408f2a177459e6b5498ea72f57b85).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4629#discussion_r24853424
  
    --- Diff: python/pyspark/tests.py ---
    @@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
             converted_rdd = RDD(data_python_rdd, self.sc)
             self.assertEqual(2, converted_rdd.count())
     
    +    def test_narrow_dependency_in_join(self):
    +        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
    --- End diff --
    
    This test is only for correctness; I will add more checks for the narrow dependency based on the Python progress API (#3027).




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74622588
  
      [Test build #611 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/611/consoleFull) for   PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74583752
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27573/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74585248
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27582/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74607728
  
    LGTM overall; this is tricky logic, though, so I'll take one more pass through when I get home.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74589214
  
      [Test build #610 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/610/consoleFull) for   PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74590132
  
      [Test build #27583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27583/consoleFull) for   PR 4629 at commit [`940245e`](https://github.com/apache/spark/commit/940245e37bf08492d6b5cd7cd82f8f0886f6f8ca).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Partitioner(object):`





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74597976
  
      [Test build #610 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/610/consoleFull) for   PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74763180
  
      [Test build #27657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27657/consoleFull) for   PR 4629 at commit [`dffe34e`](https://github.com/apache/spark/commit/dffe34ee262aa098c12323fd27995ce9f542fa95).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74590851
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27587/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74621492
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27612/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4629#issuecomment-74776721
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27657/
    Test PASSed.

