You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sohum2002 <gi...@git.apache.org> on 2017/10/08 10:25:05 UTC
[GitHub] spark pull request #19454: Added flatten functions for RDD and Dataset
GitHub user sohum2002 opened a pull request:
https://github.com/apache/spark/pull/19454
Added flatten functions for RDD and Dataset
## What changes were proposed in this pull request?
This PR creates a _flatten_ function in two places: RDD and Dataset classes. This PR resolves the following issues: SPARK-22152 and SPARK-18855.
Author: Sohum Sachdev <so...@hotmail.com>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sohum2002/spark SPARK-18855_SPARK-18855
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19454.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19454
----
commit 075e7ef3f27af91c5190d039770cf15b08a66c81
Author: Sachathamakul, Patrachai (Agoda) <pa...@agoda.com>
Date: 2017-10-08T10:24:44Z
Added flatten functions for RDD and Dataset
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 commented on the issue:
https://github.com/apache/spark/pull/19454
Would appreciate some help in the Python implementation of the `flatten` function as I have never used pyspark. Could someone help me out?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19454#discussion_r143608933
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -63,6 +63,7 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
assert(nums.map(_.toString).collect().toList === List("1", "2", "3", "4"))
assert(nums.filter(_ > 2).collect().toList === List(3, 4))
assert(nums.flatMap(x => 1 to x).collect().toList === List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
+ assert(sc.makeRDD(Array(Array(1,2,3,4), Array(1,2,3,4))).flatten == List(1,2,3,4,1,2,3,4))
--- End diff --
`.flatten.collect().toList`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82542/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19454
ok to test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19454
Let's fix up the PR title from `[SPARK-18855 ][SQL]` to `[SPARK-18855][SQL]` BTW.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82550/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/19454
Honestly I don't think it is worth doing this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19454
**[Test build #82541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82541/testReport)** for PR 19454 at commit [`075e7ef`](https://github.com/apache/spark/commit/075e7ef3f27af91c5190d039770cf15b08a66c81).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19454#discussion_r143612478
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2543,6 +2543,14 @@ class Dataset[T] private[sql](
mapPartitions(_.flatMap(func))
/**
+ * Returns a new Dataset by by flattening a traversable collection into a collection itself.
+ *
--- End diff --
(and `by by` -> by` I guess)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/19454
Is this worth doing?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten fu...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/19454#discussion_r143351442
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2543,6 +2543,11 @@ class Dataset[T] private[sql](
mapPartitions(_.flatMap(func))
/**
+ * Returns a new Dataset by by flattening a traversable collection into a collection itself.
+ */
--- End diff --
Could you please add `@since 2.3.0`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19454
I think @srowen requested to fix it in a more performant way as well, for example, referring https://github.com/apache/spark/pull/16276, if I understood correctly and otherwise closing it.
I don't feel strongly about adding this but I was thinking that we might have to go ahead given this API has been required multiple times without explicit objection IIUC and, looks consistent with Scala's [`flatten`](
https://github.com/scala/scala/blob/05016d9035ab9b1c866bd9f12fdd0491f1ea0cbb/src/library/scala/collection/generic/GenericTraversableTemplate.scala#L169). However, IMHO, it might be worthwhile _only if_ this PR gives a clean shot.
I'd suggest to close this if we (you and other reviewers here) have to spend a lot of time. Workaround is quite easy anyway.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19454
**[Test build #82550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82550/testReport)** for PR 19454 at commit [`cc08623`](https://github.com/apache/spark/commit/cc08623519f4ddfdfcc883557c4cc53f11e6f0f7).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 commented on the issue:
https://github.com/apache/spark/pull/19454
Thank you all for your comments. I hope to improve in my future PRs. Cheers!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19454
This is missing from Python and Java. It also doesn't bother to implement this more efficiently than flatMap(identity). I am not sure this is worth while?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...
Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 closed the pull request at:
https://github.com/apache/spark/pull/19454
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19454
**[Test build #82542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82542/testReport)** for PR 19454 at commit [`261e45a`](https://github.com/apache/spark/commit/261e45a9a2298df2d4d1f9adc1ca1ced22e90b60).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19454
**[Test build #82542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82542/testReport)** for PR 19454 at commit [`261e45a`](https://github.com/apache/spark/commit/261e45a9a2298df2d4d1f9adc1ca1ced22e90b60).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19454
Could you please add test cases?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19454
**[Test build #82541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82541/testReport)** for PR 19454 at commit [`075e7ef`](https://github.com/apache/spark/commit/075e7ef3f27af91c5190d039770cf15b08a66c81).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19454
**[Test build #82550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82550/testReport)** for PR 19454 at commit [`cc08623`](https://github.com/apache/spark/commit/cc08623519f4ddfdfcc883557c4cc53f11e6f0f7).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82541/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/19454
I actually think this can be confusing on Dataset[T], when the Dataset is just untyped and a DataFrame. Do we throw a runtime exception there?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19454
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten functions ...
Posted by sohum2002 <gi...@git.apache.org>.
Github user sohum2002 commented on the issue:
https://github.com/apache/spark/pull/19454
@HyukjinKwon - Thank you for your comments and analysis of this PR. I will also try to improve the `flatMap(identity)` as mentioned by @srowen. Also, will add a python implementation.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19454#discussion_r143607680
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -382,6 +382,13 @@ abstract class RDD[T: ClassTag](
}
/**
+ * Return a new RDD by flattening a traversable collection into a collection itself.
+ */
--- End diff --
Please follow existing comment style like line 392.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19454: [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19454
BTW, for the answer to https://github.com/apache/spark/pull/19454#issuecomment-335138642, I think you should take a look at, for example, `flatMap` as a reference in `rdd.py` and related tests, for example, see `cd ./python/pyspark && grep -r "flatMap" tests.py` and Python [doctest](https://docs.python.org/2/library/doctest.html).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19454: [SPARK-22152][SPARK-18855][SQL] Added flatten fun...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19454#discussion_r143606572
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2543,6 +2543,14 @@ class Dataset[T] private[sql](
mapPartitions(_.flatMap(func))
/**
+ * Returns a new Dataset by by flattening a traversable collection into a collection itself.
+ *
--- End diff --
@group typedrel?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org