You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by kanzhang <gi...@git.apache.org> on 2014/04/19 00:28:30 UTC
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
GitHub user kanzhang opened a pull request:
https://github.com/apache/spark/pull/448
[SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...
... that do not change schema
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kanzhang/spark SPARK-1460
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/448.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #448
----
commit 79aeef800d5c1c12acc02ecf5822d9814e792599
Author: Kan Zhang <kz...@apache.org>
Date: 2014-04-18T22:23:26Z
[SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations that do not change schema
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42264403
I don't think it's very obvious :)
My guess would be that this.type is included in a scala specific signature somewhere, and the scala compiler adds casts after all the calls when the method is called on a refined type.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42249441
Yeah I agree they look the same. I guess this.type translates to the the current class and any other specialization for child classes is done by scala magic.
Other than the minor comments about removing unneeded code and avoiding too much scala in the java api I think this is about ready to merge.
Thanks again for investigating this in such detail! I've personally been bumping into this issue quite a lot.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42253939
Build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41230336
Hey, just pushed an update on Scala and Java API. Wanted to get some feedback before I move on to Python. Pls pay attention to signatures of filter, intersection and subtract on the Java API. Thx.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42360847
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14731/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40964336
Oh, I see. Thx.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11816864
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
Isn't the base impl lazy also (till compute() is called)? It's kind of hard to tell the difference for users. What's your thinking in keeping the base implementation around (not overriding it)?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40850347
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12347808
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala ---
@@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag](
override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect()
- override def persist(newLevel: StorageLevel): EdgeRDD[ED] = {
+ override def persist(newLevel: StorageLevel): this.type = {
partitionsRDD.persist(newLevel)
this
}
- /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
- override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY)
-
- /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
- override def cache(): EdgeRDD[ED] = persist()
--- End diff --
Got it, thanks
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40916074
@marmbrus you are right, I can't override randomSplit() due to invariance of Array.
How about cache(), persist(), unpersist()?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12351929
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
@mridulm agree with Patrick, you have to return ```this``` for ```this.type``` return type.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42360844
Merged build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12349982
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
@mridulm if you look at this patch, it explicitly overrides those for `SchemaRDD`. You can't use `this.type` there because the return type is actually a new RDD class (`FilteredRDD` and so on).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42377607
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42350425
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42385119
@pwendell the build didn't seem to start?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12347791
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
Neat, thanks !
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42381666
Merged build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41965083
Build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42366216
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42253947
Build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42382227
Jenkins, retest this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41966458
@marmbrus @mateiz Here's an update on using this.type as return type. See following shell output. The result type for setName() method has changed from org.apache.spark.rdd.RDD[Record] to rdd.type. Similar things happened for SchemaRDD and JavaSchemaRDD (not a subclass). However, object types for rdd, srdd, and jsrdd still print the same. Are these changes of concern? The benefit is we don't have to reimplement those methods in subclasses.
Also, any tip on quickly checking Scaladoc and Javadoc impact? Thx.
{code}
scala> rdd.setName("RDD")
res0: rdd.type = RDD ParallelCollectionRDD[0] at parallelize at <console>:16
scala> rdd
res1: org.apache.spark.rdd.RDD[Record] = RDD ParallelCollectionRDD[0] at parallelize at <console>:16
scala> srdd.setName("SCHEMA RDD")
res2: srdd.type =
SCHEMA RDD SchemaRDD[2] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
scala> srdd
res3: org.apache.spark.sql.SchemaRDD =
SCHEMA RDD SchemaRDD[2] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
scala> jsrdd.setName("JAVA SCHEMA RDD")
res4: jsrdd.type =
JAVA SCHEMA RDD SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
scala> jsrdd
res5: org.apache.spark.sql.api.java.JavaSchemaRDD =
JAVA SCHEMA RDD SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
{code}
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40880763
I agree with leaving union out and adding repartition, coalesce and the other version of distinct. Also these should definitely be added to Java and Python too.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42397473
Merged build finished. All automated tests passed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41971131
Build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11816456
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
+ applySchema(super.sample(withReplacement, fraction, seed))
+
+ override def subtract(other: RDD[Row], p: Partitioner): SchemaRDD =
+ applySchema(super.subtract(other, p))
+
+ override def union(other: RDD[Row]): SchemaRDD =
--- End diff --
done
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40909814
Thanks for your suggestions. I'll update.
Btw, I don't see PythonSchemaRDD in the code base yet, can I leave out Python for now?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42397475
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14762/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42248502
@marmbrus my untrained eyes couldn't spot any difference (you may want to verify it yourself just to make sure :). For example,
RDD.unpersist() with RDD[T] return type:
```
public org.apache.spark.rdd.RDD<T> unpersist(boolean);
flags: ACC_PUBLIC
Code:
stack=4, locals=2, args_size=2
0: aload_0
1: new #208 // class org/apache/spark/rdd/RDD$$anonfun$unpersist$1
4: dup
5: aload_0
6: invokespecial #209 // Method org/apache/spark/rdd/RDD$$anonfun$unpersist$1."<init>":(Lorg/apache/spark/rdd/RDD;)V
9: invokevirtual #211 // Method logInfo:(Lscala/Function0;)V
12: aload_0
13: invokespecial #108 // Method sc:()Lorg/apache/spark/SparkContext;
16: aload_0
17: invokevirtual #213 // Method id:()I
20: iload_1
21: invokevirtual #217 // Method org/apache/spark/SparkContext.unpersistRDD:(IZ)V
24: aload_0
25: getstatic #157 // Field org/apache/spark/storage/StorageLevel$.MODULE$:Lorg/apache/spark/storage/StorageLevel$;
28: invokevirtual #160 // Method org/apache/spark/storage/StorageLevel$.NONE:()Lorg/apache/spark/storage/StorageLevel;
31: invokespecial #186 // Method storageLevel_$eq:(Lorg/apache/spark/storage/StorageLevel;)V
34: aload_0
35: areturn
LocalVariableTable:
Start Length Slot Name Signature
0 36 0 this Lorg/apache/spark/rdd/RDD;
0 36 1 blocking Z
LineNumberTable:
line 167: 0
line 168: 12
line 169: 24
line 170: 34
Signature: #1819 // (Z)Lorg/apache/spark/rdd/RDD<TT;>;
public boolean unpersist$default$1();
flags: ACC_PUBLIC
Code:
stack=1, locals=1, args_size=1
0: iconst_1
1: ireturn
LocalVariableTable:
Start Length Slot Name Signature
0 2 0 this Lorg/apache/spark/rdd/RDD;
LineNumberTable:
line 166: 0
```
RDD.unpersist() with this.type return type:
```
public org.apache.spark.rdd.RDD<T> unpersist(boolean);
flags: ACC_PUBLIC
Code:
stack=4, locals=2, args_size=2
0: aload_0
1: new #208 // class org/apache/spark/rdd/RDD$$anonfun$unpersist$1
4: dup
5: aload_0
6: invokespecial #209 // Method org/apache/spark/rdd/RDD$$anonfun$unpersist$1."<init>":(Lorg/apache/spark/rdd/RDD;)V
9: invokevirtual #211 // Method logInfo:(Lscala/Function0;)V
12: aload_0
13: invokespecial #108 // Method sc:()Lorg/apache/spark/SparkContext;
16: aload_0
17: invokevirtual #213 // Method id:()I
20: iload_1
21: invokevirtual #217 // Method org/apache/spark/SparkContext.unpersistRDD:(IZ)V
24: aload_0
25: getstatic #157 // Field org/apache/spark/storage/StorageLevel$.MODULE$:Lorg/apache/spark/storage/StorageLevel$;
28: invokevirtual #160 // Method org/apache/spark/storage/StorageLevel$.NONE:()Lorg/apache/spark/storage/StorageLevel;
31: invokespecial #186 // Method storageLevel_$eq:(Lorg/apache/spark/storage/StorageLevel;)V
34: aload_0
35: areturn
LocalVariableTable:
Start Length Slot Name Signature
0 36 0 this Lorg/apache/spark/rdd/RDD;
0 36 1 blocking Z
LineNumberTable:
line 172: 0
line 173: 12
line 174: 24
line 175: 34
Signature: #1705 // (Z)Lorg/apache/spark/rdd/RDD<TT;>;
public boolean unpersist$default$1();
flags: ACC_PUBLIC
Code:
stack=1, locals=1, args_size=1
0: iconst_1
1: ireturn
LocalVariableTable:
Start Length Slot Name Signature
0 2 0 this Lorg/apache/spark/rdd/RDD;
LineNumberTable:
line 171: 0
```
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12303586
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
@@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
override protected def getPreferredLocations(s: Partition): Seq[String] =
partitionsRDD.preferredLocations(s)
- override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
+ override def persist(newLevel: StorageLevel): this.type = {
--- End diff --
@rxin can we just remove these functions if RDD returns this.type instead? It looks like you were only overloading them for the covariant return type.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42115311
Build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42350963
@pwendell did a rebase just now and pushed
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11800221
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
Ah, I see the execution eventually calls the base method, so it is equivalent?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11800178
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
Yes, I noticed it. It has a different signature and doesn't override the base sample() method. I thought this was done on purpose - you wanted to keep both implementations, right?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42393027
Jenkins, retest this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12023710
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
@@ -45,4 +50,146 @@ class JavaSchemaRDD(
override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
val rdd = baseSchemaRDD.map(new Row(_))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
+ def cache(): JavaSchemaRDD = {
+ baseSchemaRDD.cache()
+ this
+ }
+
+ /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
+ def persist(): JavaSchemaRDD = {
+ baseSchemaRDD.persist()
+ this
+ }
+
+ /**
+ * Set this RDD's storage level to persist its values across operations after the first time
+ * it is computed. This can only be used to assign a new storage level if the RDD does not
+ * have a storage level set yet..
+ */
+ def persist(newLevel: StorageLevel): JavaSchemaRDD = {
+ baseSchemaRDD.persist(newLevel)
+ this
+ }
+
+ /**
+ * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
+ *
+ * @param blocking Whether to block until all blocks are deleted.
+ * @return This RDD.
+ */
+ def unpersist(blocking: Boolean = true): JavaSchemaRDD = {
+ baseSchemaRDD.unpersist(blocking)
+ this
+ }
+
+ /** Assign a name to this RDD */
+ def setName(name: String): JavaSchemaRDD = {
+ baseSchemaRDD.setName(name)
+ this
+ }
+
+ // Transformations (return a new RDD)
+
+ /**
+ * Return a new RDD that is reduced into `numPartitions` partitions.
+ */
+ def coalesce(numPartitions: Int, shuffle: Boolean = false): JavaSchemaRDD =
+ baseSchemaRDD.coalesce(numPartitions, shuffle)
+
+ /**
+ * Return a new RDD containing the distinct elements in this RDD.
+ */
+ def distinct(): JavaSchemaRDD = baseSchemaRDD.distinct()
+
+ /**
+ * Return a new RDD containing the distinct elements in this RDD.
+ */
+ def distinct(numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.distinct(numPartitions)
+
+ /**
+ * Return a new RDD containing only the elements that satisfy a predicate.
+ */
+ def filter(f: JFunction[Row, java.lang.Boolean]): JavaSchemaRDD =
+ baseSchemaRDD.filter(x => f.call(new Row(x)).booleanValue())
+
+ /**
+ * Return the intersection of this RDD and another one. The output will not contain any
+ * duplicate elements, even if the input RDDs did.
+ *
+ * Note that this method performs a shuffle internally.
+ */
+ def intersection(other: JavaSchemaRDD): JavaSchemaRDD =
+ baseSchemaRDD.intersection(other)
+
+ /**
+ * Return the intersection of this RDD and another one. The output will not contain any
+ * duplicate elements, even if the input RDDs did.
+ *
+ * Note that this method performs a shuffle internally.
+ *
+ * @param partitioner Partitioner to use for the resulting RDD
+ */
+ def intersection(other: JavaSchemaRDD, partitioner: Partitioner): JavaSchemaRDD =
+ baseSchemaRDD.intersection(other, partitioner)
+
+ /**
+ * Return the intersection of this RDD and another one. The output will not contain any
+ * duplicate elements, even if the input RDDs did. Performs a hash partition across the cluster
+ *
+ * Note that this method performs a shuffle internally.
+ *
+ * @param numPartitions How many partitions to use in the resulting RDD
+ */
+ def intersection(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.intersection(other, numPartitions)
+
+ /**
+ * Return a new RDD that has exactly `numPartitions` partitions.
+ *
+ * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
+ * a shuffle to redistribute data.
+ *
+ * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
+ * which can avoid performing a shuffle.
+ */
+ def repartition(numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.repartition(numPartitions)
+
+ /**
+ * Return an RDD with the elements from `this` that are not in `other`.
+ *
+ * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
+ * RDD will be <= us.
+ */
+ def subtract(other: JavaSchemaRDD): JavaSchemaRDD =
+ baseSchemaRDD.subtract(other)
+
+ /**
+ * Return an RDD with the elements from `this` that are not in `other`.
+ */
+ def subtract(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.subtract(other, numPartitions)
+
+ /**
+ * Return an RDD with the elements from `this` that are not in `other`.
+ */
+ def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD =
+ baseSchemaRDD.subtract(other, p)
+}
+
+object JavaSchemaRDD {
+
+ implicit def fromSchemaRDD(rdd: SchemaRDD): JavaSchemaRDD =
--- End diff --
Point taken, will update. Thx
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41230295
Build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12346651
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
It's to allow child classes to not have to override functions like `persist` and `cache` that are used for chaining:
http://scalada.blogspot.com/2008/02/thistype-for-chaining-method-calls.html
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11813669
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
+ applySchema(super.sample(withReplacement, fraction, seed))
+
+ override def subtract(other: RDD[Row], p: Partitioner): SchemaRDD =
+ applySchema(super.subtract(other, p))
+
+ override def union(other: RDD[Row]): SchemaRDD =
--- End diff --
don't forget to remove this one.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12346057
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala ---
@@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag](
override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect()
- override def persist(newLevel: StorageLevel): EdgeRDD[ED] = {
+ override def persist(newLevel: StorageLevel): this.type = {
partitionsRDD.persist(newLevel)
this
}
- /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
- override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY)
-
- /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
- override def cache(): EdgeRDD[ED] = persist()
--- End diff --
Why were these removed ?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12304083
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
@@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
override protected def getPreferredLocations(s: Partition): Seq[String] =
partitionsRDD.preferredLocations(s)
- override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
+ override def persist(newLevel: StorageLevel): this.type = {
--- End diff --
Oh right, I guess I was actually looking at the functions `cache` and `persist` below. Those could be dropped though?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12303651
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
@@ -45,4 +48,141 @@ class JavaSchemaRDD(
override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
val rdd = baseSchemaRDD.map(new Row(_))
+
+ override def toString: String = baseSchemaRDD.toString
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
+ def cache(): this.type = {
--- End diff --
It might be safer, clearer to use the explicit type here. Only for the Java stuff though.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42257630
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14687/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41971132
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14611/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41234925
Build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12021763
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +315,82 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to derived RDD. Typically used to wrap return value
+ * of base RDD functions that do not change schema.
+ *
+ * @param rdd RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(rdd: RDD[Row]): SchemaRDD = {
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, rdd)))
+ }
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ override def cache(): SchemaRDD = {
--- End diff --
I should have mentioned this before, but we could consider using `this.type` instead in the Base RDD class for these methods. I'm not sure if that is breaking API or too much scala magic though. @mateiz ?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40854403
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42393186
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40854278
ok to test
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42365832
Jenkins, retest this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40856764
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14248/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42272515
@kanzhang this is out of date with master - mind merging it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11813558
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
Yeah, you are going to end up getting the same thing. I'd say we drop this one and leave the other. Right now it probably doesn't matter, but the other one is lazy and gives the optimizer a chance to possibly improve things before actually executing.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11818451
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
Oh, you are right. The base impl is probably lazy too. The distinction I was trying to make is that while normal RDD operations are lazy, they are not holistically optimized before execution. Where as if we create a logical operator and defer the creation of RDDs, there may be some extra chances for optimization (at some point in the future). We definitely want to override the base impl, but we don't need to have multiple redundant methods for creating samples.
Also note that you might need to sync with the changes being made in #462 .
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41452343
Looking pretty good. Thanks again for working on this!
Jenkins, test this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42115307
Build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42249374
Actually, let me run with -s option shortly.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12022387
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +315,82 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to derived RDD. Typically used to wrap return value
+ * of base RDD functions that do not change schema.
+ *
+ * @param rdd RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(rdd: RDD[Row]): SchemaRDD = {
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, rdd)))
+ }
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ override def cache(): SchemaRDD = {
--- End diff --
I'd be okay trying it, my questions then are what it looks like in Scaladoc and what it looks like in Java. We should also double-check that Scala expects binary compatibility for this kind of return type.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42268877
LGTM
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41968161
PS. Subclasses that override those methods may have to be updated and recompiled (like what I did in EdgeRDD, VertexRDD). Better ideas?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12305806
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
@@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
override protected def getPreferredLocations(s: Partition): Seq[String] =
partitionsRDD.preferredLocations(s)
- override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
+ override def persist(newLevel: StorageLevel): this.type = {
--- End diff --
Ah, right.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12346614
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala ---
@@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag](
override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect()
- override def persist(newLevel: StorageLevel): EdgeRDD[ED] = {
+ override def persist(newLevel: StorageLevel): this.type = {
partitionsRDD.persist(newLevel)
this
}
- /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
- override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY)
-
- /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
- override def cache(): EdgeRDD[ED] = persist()
--- End diff --
@mridulm It's not necessary to overload these anymore.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42224442
@kanzhang, Good point about the public APIs. I talked with @pwendell and we are a little concerned that this.type will make binary computability hard long term, mostly because we don't understand the bytecode being produced. Could you possibly run `javap` on some of the RDD classes and let us know what the function signatures look like for the changed methods?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42248461
Yeah, let's look at that. If it's just Object we may be fine.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42381669
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14746/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42377614
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12304047
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
@@ -45,4 +48,141 @@ class JavaSchemaRDD(
override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
val rdd = baseSchemaRDD.map(new Row(_))
+
+ override def toString: String = baseSchemaRDD.toString
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
+ def cache(): this.type = {
--- End diff --
Ah, yes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42257628
Build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42262939
@marmbrus @kanzhang I'm a bit confused how the bytecode could be totally unchanged.
Let's say I create a new class `FooRDD.scala` in a totally different package and I depend on `RDD` from Spark only via binary (e.g. I'm linking against Spark). How does the compiler know whether `new FooRDD().cache()` should return `FooRDD` or `RDD` if it's not included in the byte-code somewhere?
I think I may be missing something obvious ;)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11819317
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
Thanks for the heads-up. So #462 is already doing it, I'll skip it (I meant overriding it with the query one too).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42115441
Added Python API changes. Skipped filter() as I don't see a way to translate Python functions into Java land. Py4j supports implementing Java interfaces in Python callbacks, but that's not applicable here.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12345988
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
I am fairly ignorent of scala; I am not sure I follow, where is type coming from ? And what is it exactly ?
Also , does this change mean it is an incompat interface change ?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40856762
Merged build finished. All automated tests passed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12023704
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +315,82 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to derived RDD. Typically used to wrap return value
+ * of base RDD functions that do not change schema.
+ *
+ * @param rdd RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(rdd: RDD[Row]): SchemaRDD = {
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, rdd)))
+ }
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ override def cache(): SchemaRDD = {
--- End diff --
Thanks for pointing it out, will verify this.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42377588
@pwendell thanks for the heads-up. made those changes, let's see how it goes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42264759
@marmbrus I did some local byte code inspection and I think it's injecting this information into the scalasig of the class (in our case, the RDD class). I made a simple class and ran scalap and javap. To java it looks the same but scalap can figure it out.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42371682
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14738/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r11791420
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -313,4 +314,46 @@ class SchemaRDD(
}
}
}
+
+ /**
+ * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
+ * RDD functions that do not change schema.
+ *
+ * @param childRDD RDD derived from this one and has same schema
+ *
+ * @group schema
+ */
+ private def applySchema(childRDD: RDD[Row]): SchemaRDD =
+ new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
+ applySchema(super.coalesce(numPartitions, shuffle))
+
+ override def distinct(numPartitions: Int): SchemaRDD =
+ applySchema(super.distinct(numPartitions))
+
+ override def filter(f: Row => Boolean): SchemaRDD =
+ applySchema(super.filter(f))
+
+ override def intersection(other: RDD[Row]): SchemaRDD =
+ applySchema(super.intersection(other))
+
+ override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
+ applySchema(super.intersection(other, partitioner))
+
+ override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
+ applySchema(super.intersection(other, numPartitions))
+
+ override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
--- End diff --
There is already a sample method that returns SchemaRDD above.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41230289
Build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40854395
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42264854
@marmbrus anyways I don't see this tying our hands much in the future, so seems like a good idea. It will potentially force people to re-compile coming from pre-1.0 since the scala type signature has changed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42373741
@kanzhang hey you'll need to silence some of the binary compatibility checks in project/MimaBuild.scala:
```
excludeSparkClass("org.apache.spark.graphx.VertexRDD")
excludeSparkClass("org.apache.spark.graphx.EdgeRDD")
```
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41234927
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14414/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12357557
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
Thanks for clarifying, in retrospect that looks obvious !
On 07-May-2014 2:52 am, "Patrick Wendell" <no...@github.com> wrote:
> In core/src/main/scala/org/apache/spark/rdd/RDD.scala:
>
> > @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
> > * it is computed. This can only be used to assign a new storage level if the RDD does not
> > * have a storage level set yet..
> > */
> > - def persist(newLevel: StorageLevel): RDD[T] = {
> > + def persist(newLevel: StorageLevel): this.type = {
> > // TODO: Handle changes of StorageLevel
>
> @mridulm <https://github.com/mridulm> if you look at this patch, it
> explicitly overrides those for SchemaRDD. You can't use this.type there
> because the return type is actually a new RDD class (FilteredRDD and so
> on).
>
> —
> Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/448/files#r12349982>
> .
>
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42268837
LGTM then - I looked around a bit and it seems like this use case is exactly what `this.type` is for in Scala.
@marmbrus @mateiz any further feedback?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12021785
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
@@ -45,4 +50,146 @@ class JavaSchemaRDD(
override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
val rdd = baseSchemaRDD.map(new Row(_))
+
+ // =======================================================================
+ // Base RDD functions that do NOT change schema
+ // =======================================================================
+
+ // Common RDD functions
+
+ /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
+ def cache(): JavaSchemaRDD = {
+ baseSchemaRDD.cache()
+ this
+ }
+
+ /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
+ def persist(): JavaSchemaRDD = {
+ baseSchemaRDD.persist()
+ this
+ }
+
+ /**
+ * Set this RDD's storage level to persist its values across operations after the first time
+ * it is computed. This can only be used to assign a new storage level if the RDD does not
+ * have a storage level set yet..
+ */
+ def persist(newLevel: StorageLevel): JavaSchemaRDD = {
+ baseSchemaRDD.persist(newLevel)
+ this
+ }
+
+ /**
+ * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
+ *
+ * @param blocking Whether to block until all blocks are deleted.
+ * @return This RDD.
+ */
+ def unpersist(blocking: Boolean = true): JavaSchemaRDD = {
+ baseSchemaRDD.unpersist(blocking)
+ this
+ }
+
+ /** Assign a name to this RDD */
+ def setName(name: String): JavaSchemaRDD = {
+ baseSchemaRDD.setName(name)
+ this
+ }
+
+ // Transformations (return a new RDD)
+
+ /**
+ * Return a new RDD that is reduced into `numPartitions` partitions.
+ */
+ def coalesce(numPartitions: Int, shuffle: Boolean = false): JavaSchemaRDD =
+ baseSchemaRDD.coalesce(numPartitions, shuffle)
+
+ /**
+ * Return a new RDD containing the distinct elements in this RDD.
+ */
+ def distinct(): JavaSchemaRDD = baseSchemaRDD.distinct()
+
+ /**
+ * Return a new RDD containing the distinct elements in this RDD.
+ */
+ def distinct(numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.distinct(numPartitions)
+
+ /**
+ * Return a new RDD containing only the elements that satisfy a predicate.
+ */
+ def filter(f: JFunction[Row, java.lang.Boolean]): JavaSchemaRDD =
+ baseSchemaRDD.filter(x => f.call(new Row(x)).booleanValue())
+
+ /**
+ * Return the intersection of this RDD and another one. The output will not contain any
+ * duplicate elements, even if the input RDDs did.
+ *
+ * Note that this method performs a shuffle internally.
+ */
+ def intersection(other: JavaSchemaRDD): JavaSchemaRDD =
+ baseSchemaRDD.intersection(other)
+
+ /**
+ * Return the intersection of this RDD and another one. The output will not contain any
+ * duplicate elements, even if the input RDDs did.
+ *
+ * Note that this method performs a shuffle internally.
+ *
+ * @param partitioner Partitioner to use for the resulting RDD
+ */
+ def intersection(other: JavaSchemaRDD, partitioner: Partitioner): JavaSchemaRDD =
+ baseSchemaRDD.intersection(other, partitioner)
+
+ /**
+ * Return the intersection of this RDD and another one. The output will not contain any
+ * duplicate elements, even if the input RDDs did. Performs a hash partition across the cluster
+ *
+ * Note that this method performs a shuffle internally.
+ *
+ * @param numPartitions How many partitions to use in the resulting RDD
+ */
+ def intersection(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.intersection(other, numPartitions)
+
+ /**
+ * Return a new RDD that has exactly `numPartitions` partitions.
+ *
+ * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
+ * a shuffle to redistribute data.
+ *
+ * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
+ * which can avoid performing a shuffle.
+ */
+ def repartition(numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.repartition(numPartitions)
+
+ /**
+ * Return an RDD with the elements from `this` that are not in `other`.
+ *
+ * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
+ * RDD will be <= us.
+ */
+ def subtract(other: JavaSchemaRDD): JavaSchemaRDD =
+ baseSchemaRDD.subtract(other)
+
+ /**
+ * Return an RDD with the elements from `this` that are not in `other`.
+ */
+ def subtract(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
+ baseSchemaRDD.subtract(other, numPartitions)
+
+ /**
+ * Return an RDD with the elements from `this` that are not in `other`.
+ */
+ def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD =
+ baseSchemaRDD.subtract(other, p)
+}
+
+object JavaSchemaRDD {
+
+ implicit def fromSchemaRDD(rdd: SchemaRDD): JavaSchemaRDD =
--- End diff --
While I'm generally a fan of scala trickery. I have found massaging type with implicits often comes back to bite you. You may be safe here, but I'd still prefer we just call a wrap method explicitly.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42350396
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41965097
Build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12348272
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet..
*/
- def persist(newLevel: StorageLevel): RDD[T] = {
+ def persist(newLevel: StorageLevel): this.type = {
// TODO: Handle changes of StorageLevel
--- End diff --
So I guess this cant be applied to checkpointRDD and randomSplit ?
What about things like filter, distinct, repartition, sample, filterWith etc ?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42254028
Pushed an update based on @marmbrus comments.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:
https://github.com/apache/spark/pull/448#discussion_r12304013
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
@@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
override protected def getPreferredLocations(s: Partition): Seq[String] =
partitionsRDD.preferredLocations(s)
- override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
+ override def persist(newLevel: StorageLevel): this.type = {
--- End diff --
@marmbrus it actually has its own logic.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40855861
Thanks for doing this!
I think we are actually okay for `intersect` and `subtract` as anything in the result must be a row that was in the original RDD and thus must have a correct schema. If you intersect with a different schema you will get back an empty rdd. If you subtract with an different schema the subtraction will be a no-op and you'll get back the original RDD.
Union is a little more troublesome. We could check the schema and throw an error if they don't match, but that is kinda changing the semantics relative the the standard `union` call on RDD. Also, when we do a SQL union we do type widening, so just calling RDD union and returning a `SchemaRDD` is a little weird.
So, I'd propose we leave union out, as users that want SQL semantics here can already call unionAll. @mateiz might have thoughts here too.
A few other methods we can add that also don't change the schema:
- `distinct()` with no `numPartitions`
- `repartition`
- `setName(...)` ?
- `randomSplit` (not sure if this is okay since `Array` is invariant)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42371680
Merged build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/448
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40949024
> How about cache(), persist(), unpersist()?
Good catch!
> Btw, I don't see PythonSchemaRDD in the code base yet, can I leave out Python for now?
It is just called `SchemaRDD` and is located in `python/pyspark/sql.py`.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-41972942
@marmbrus Instead of current JavaSchemaRDD.fromSchemaRDD, I'm tempted to add a toJavaSchemaRDD method in SchemaRDD to allow better method chaining. Thoughts?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42116989
Build finished.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42217278
@marmbrus @mateiz Just a heads-up, changing to use this.type as return type may break 1.0 public API if we don't get this change in (see comment above).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42252573
The -s option isn't very interesting. The couple lines added are identical in both cases.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42266085
@pwendell yes, scalap shows it.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42116990
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14634/
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42451285
Thanks for updating this. I'm merging it.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42393192
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-42366235
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40856180
Also, we should make the same changes to the Java and Python API if possible.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...
Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40850230
First try, pls comment. Not very comfortable with methods that take other RDDs, like intersect, subtract and union, since caller has to make sure they of the same schema.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---