You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sohum2002 <gi...@git.apache.org> on 2017/10/08 10:25:05 UTC

[GitHub] spark pull request #19454: Added flatten functions for RDD and Dataset

GitHub user sohum2002 opened a pull request:

    https://github.com/apache/spark/pull/19454

    Added flatten functions for RDD and Dataset

    ## What changes were proposed in this pull request?
    This PR creates a _flatten_ function in two places: RDD and Dataset classes. This PR resolves the following issues: SPARK-22152 and SPARK-18855.
    
    Author: Sohum Sachdev <so...@hotmail.com>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sohum2002/spark SPARK-18855_SPARK-18855

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19454
    
----
commit 075e7ef3f27af91c5190d039770cf15b08a66c81
Author: Sachathamakul, Patrachai (Agoda) <pa...@agoda.com>
Date:   2017-10-08T10:24:44Z

    Added flatten functions for RDD and Dataset

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Would appreciate some help in the Python implementation of the `flatten` function as I have never used pyspark. Could someone help me out?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19454#discussion_r143608933
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -63,6 +63,7 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         assert(nums.map(_.toString).collect().toList === List("1", "2", "3", "4"))
         assert(nums.filter(_ > 2).collect().toList === List(3, 4))
         assert(nums.flatMap(x => 1 to x).collect().toList === List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
    +    assert(sc.makeRDD(Array(Array(1,2,3,4), Array(1,2,3,4))).flatten == List(1,2,3,4,1,2,3,4))
    --- End diff --
    
    `.flatten.collect().toList`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82542/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Let's fix up the PR title from `[SPARK-18855 ][SQL]` to `[SPARK-18855][SQL]` BTW.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82550/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Honestly I don't think it is worth doing this.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    **[Test build #82541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82541/testReport)** for PR 19454 at commit [`075e7ef`](https://github.com/apache/spark/commit/075e7ef3f27af91c5190d039770cf15b08a66c81).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19454#discussion_r143612478
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2543,6 +2543,14 @@ class Dataset[T] private[sql](
         mapPartitions(_.flatMap(func))
     
       /**
    +    * Returns a new Dataset by by flattening a traversable collection into a collection itself.
    +    *
    --- End diff --
    
    (and `by by` -> by` I guess)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Is this worth doing?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten fu...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19454#discussion_r143351442
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2543,6 +2543,11 @@ class Dataset[T] private[sql](
         mapPartitions(_.flatMap(func))
     
       /**
    +    * Returns a new Dataset by by flattening a traversable collection into a collection itself.
    +    */
    --- End diff --
    
    Could you please add `@since 2.3.0`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    I think @srowen requested to fix it in a more performant way as well, for example, referring https://github.com/apache/spark/pull/16276, if I understood correctly and otherwise closing it.
    
    I don't feel strongly about adding this but I was thinking that we might have to go ahead given this API has been required multiple times without explicit objection IIUC and, looks consistent with Scala's [`flatten`](
     https://github.com/scala/scala/blob/05016d9035ab9b1c866bd9f12fdd0491f1ea0cbb/src/library/scala/collection/generic/GenericTraversableTemplate.scala#L169). However, IMHO, it might be worthwhile _only if_ this PR gives a clean shot.
    
    I'd suggest to close this if we (you and other reviewers here) have to spend a lot of time. Workaround is quite easy anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    **[Test build #82550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82550/testReport)** for PR 19454 at commit [`cc08623`](https://github.com/apache/spark/commit/cc08623519f4ddfdfcc883557c4cc53f11e6f0f7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Thank you all for your comments. I hope to improve in my future PRs. Cheers!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    This is missing from Python and Java. It also doesn't bother to implement this more efficiently than flatMap(identity). I am not sure this is worth while?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 closed the pull request at:

    https://github.com/apache/spark/pull/19454


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    **[Test build #82542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82542/testReport)** for PR 19454 at commit [`261e45a`](https://github.com/apache/spark/commit/261e45a9a2298df2d4d1f9adc1ca1ced22e90b60).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    **[Test build #82542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82542/testReport)** for PR 19454 at commit [`261e45a`](https://github.com/apache/spark/commit/261e45a9a2298df2d4d1f9adc1ca1ced22e90b60).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Could you please add test cases?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    **[Test build #82541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82541/testReport)** for PR 19454 at commit [`075e7ef`](https://github.com/apache/spark/commit/075e7ef3f27af91c5190d039770cf15b08a66c81).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    **[Test build #82550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82550/testReport)** for PR 19454 at commit [`cc08623`](https://github.com/apache/spark/commit/cc08623519f4ddfdfcc883557c4cc53f11e6f0f7).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82541/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    I actually think this can be confusing on Dataset[T], when the Dataset is just untyped and a DataFrame. Do we throw a runtime exception there?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...

Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    @HyukjinKwon - Thank you for your comments and analysis of this PR. I will also try to improve the `flatMap(identity)` as mentioned by @srowen. Also, will add a python implementation. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19454#discussion_r143607680
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -382,6 +382,13 @@ abstract class RDD[T: ClassTag](
       }
     
       /**
    +    * Return a new RDD by flattening a traversable collection into a collection itself.
    +    */
    --- End diff --
    
    Please follow existing comment style like line 392.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19454
  
    BTW, for the answer to https://github.com/apache/spark/pull/19454#issuecomment-335138642, I think you should take a look at, for example, `flatMap` as a reference in `rdd.py` and related tests, for example, see `cd ./python/pyspark && grep -r "flatMap" tests.py` and Python [doctest](https://docs.python.org/2/library/doctest.html).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19454#discussion_r143606572
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2543,6 +2543,14 @@ class Dataset[T] private[sql](
         mapPartitions(_.flatMap(func))
     
       /**
    +    * Returns a new Dataset by by flattening a traversable collection into a collection itself.
    +    *
    --- End diff --
    
    @group typedrel?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org